How To Get The Most Frequent K-mers Of A String?


A k-mer is a substring of length k taken from a longer string. A string of length n contains n - k + 1 overlapping k-mers, and some of them may occur more than once. The term comes from bioinformatics, where DNA and RNA sequences are written as strings of characters, for example ‘ACTGTGACT’.

In this string, several k-mers repeat. For k = 2, the sequences ‘AC’, ‘CT’, and ‘TG’ each occur twice. For k = 3, the sequence ‘ACT’ occurs twice. In this article, we will learn different techniques to find the most frequent k-mers of a string in Python.
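The counts quoted for ‘ACTGTGACT’ can be checked with a few lines of Python. This is only a minimal sketch: the helper name count_kmers is made up for illustration, and it uses a plain dictionary.

```python
dna = "ACTGTGACT"  # example string from the text

def count_kmers(s, k):
    """Count every overlapping substring of length k in s (hypothetical helper)."""
    counts = {}
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

two_mers = count_kmers(dna, 2)
three_mers = count_kmers(dna, 3)

print(two_mers["AC"], two_mers["CT"], two_mers["TG"])  # 2 2 2
print(three_mers["ACT"])  # 2
```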

How To Get K-mers In Python?

To get the k-mers of a string, we slide a window of k characters across it, one position at a time. Here, ‘k’ is the number of characters in each sequence, and we set its value when writing the code. First, we implement a function that loops over the string and collects every substring of length k.

Let’s see the code to get k-mers from the given string in Python.

my_string = "ABCDBCABDABCHKABCDBC"
k_value = 3

def cal_kmers(s, k):
    count_kmers = []
    for i in range(len(s) - k + 1):
        kmer = s[i:i+k]
        count_kmers.append(kmer)
    return count_kmers
    
Result = cal_kmers(my_string, k_value)
print(Result)

In this code, we first define two things, i.e., the string and the length of the k-mers. Then, the cal_kmers function is defined to extract the k-mers from the string. Inside it, an empty list, count_kmers, is initialized to store the k-mers. The for loop visits every start position from 0 to len(s) - k, the slice s[i:i+k] takes the k characters starting at that position, and .append() adds the slice to the list. In the end, we print all the k-mers of the provided string, including repeats, as a list.

[Output image: K-mers in Python]

In the output, we can see the list of all the k-mers present in the provided string.
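As a rough check on that output, the same function can be written as a list comprehension and its result inspected. The sketch below is a variation on the code above, not a replacement: the check for k <= 0 is an added assumption about how invalid input should be handled.

```python
def cal_kmers(s, k):
    """Return all overlapping substrings of length k, in order of position."""
    if k <= 0:
        # Assumption: reject non-positive k instead of returning empty strings.
        raise ValueError("k must be a positive integer")
    return [s[i:i + k] for i in range(len(s) - k + 1)]

my_string = "ABCDBCABDABCHKABCDBC"
kmers = cal_kmers(my_string, 3)

print(len(kmers))          # 18, i.e. len(my_string) - 3 + 1
print(kmers[:4])           # ['ABC', 'BCD', 'CDB', 'DBC']
print(cal_kmers("AB", 3))  # [] because the string is shorter than k
```

A string of length n always yields n - k + 1 k-mers when k is between 1 and n, and none when k is larger than n.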

How to Count the Frequency of K-mers in Python?

To count how many times each k-mer of length k occurs in the string, we can use the Counter class, which is part of the collections module in Python. Counter treats each distinct k-mer as a key and stores the number of times it appears as the value.

from collections import Counter
my_string = "ABDABCABD"
value_of_k = 3

def kmers_freq(s, k):
    kmers = [s[i:i+k] for i in range(len(s) - k + 1)]
    no_of_kmers = Counter(kmers)
    return no_of_kmers

result = kmers_freq(my_string, value_of_k)
print(result)

Here, Counter is applied to the list of k-mers in the statement no_of_kmers = Counter(kmers), which counts the frequency of each k-mer present in the string. Let’s see the results.

[Output image: K-mers frequency]

In the output, we can see the number of times each k-mer occurs in the string.
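A Counter behaves like a dictionary, so individual counts can be looked up directly. The short sketch below reuses the same string and k as above and shows a few lookups; the variable names are chosen only for illustration.

```python
from collections import Counter

s = "ABDABCABD"
k = 3
counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))

print(counts["ABD"])         # 2
print(counts["XYZ"])         # 0, a missing k-mer counts as zero instead of raising KeyError
print(sum(counts.values()))  # 7, the total number of windows: len(s) - k + 1
print(len(counts))           # 6, the number of distinct k-mers
```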

How To Get the Most Frequent K-mers of The String?

To solve this problem, we again use the Counter class from the collections module. Its most_common() method returns the most frequent k-mers of the provided string. Let’s try out this method to solve the problem.

The most_common(n) method returns a list of the n most common elements together with their counts, ordered from the most common to the least. If n is omitted, it returns all elements. In this code, it is used to return the most frequent k-mers.

from collections import Counter
my_string = "ABDABCABD"
value_of_k = 3
value_of_n = 1

def kmers_freq(s, k, n):
    kmers = [s[i:i+k] for i in range(len(s) - k + 1)]
    no_of_kmers = Counter(kmers)
    most_freq_kmers = no_of_kmers.most_common(n)
    return most_freq_kmers

result = kmers_freq(my_string, value_of_k, value_of_n)
print(result)

This code uses the same logic for finding the k-mers in the provided string. Here, .most_common() is used to get the most frequent k-mers. Because value_of_n is set to 1, only the single most frequent k-mer is returned as the result. Let’s see the final output.

[Output image: Most frequent k-mers from string]

The ‘ABD’ k-mer occurs two times in the string and every other 3-mer occurs only once, so ‘ABD’ is returned with a count of 2 as the most frequent k-mer.
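One caveat: most_common(1) returns a single entry even when several k-mers share the highest count. If all tied k-mers are wanted, one possible approach is to find the maximum count first and then collect every k-mer that reaches it. The function name most_frequent_kmers below is hypothetical, and this is a sketch rather than the only way to do it.

```python
from collections import Counter

def most_frequent_kmers(s, k):
    """Return (list of k-mers sharing the highest count, that count)."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    if not counts:
        # No k-mers at all, e.g. when the string is shorter than k.
        return [], 0
    top = max(counts.values())
    return [kmer for kmer, c in counts.items() if c == top], top

print(most_frequent_kmers("ABDABCABD", 3))  # (['ABD'], 2)
print(most_frequent_kmers("ACTGTGACT", 2))  # (['AC', 'CT', 'TG'], 2)
```

For ‘ACTGTGACT’ with k = 2, three k-mers tie with a count of 2, and this version reports all of them.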

Why Do We Need K-mers?

Biological Feature and Pattern Detection

Models that rely on feature detection or pattern recognition in DNA and RNA sequences use k-mers to find repeated subsequences. This helps to identify biological features and extract information from the sequences. It is also an important technique for measuring the similarity and dissimilarity between two sequences.
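As a rough illustration of comparing two sequences, the sketch below measures how many distinct k-mers they share. The Jaccard similarity used here (shared k-mers divided by all k-mers seen in either sequence) is an assumption chosen for simplicity, not a method taken from this article, and the function names are hypothetical.

```python
def kmer_set(s, k):
    """Return the set of distinct k-mers in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard_similarity(a, b, k):
    """Share of distinct k-mers common to both strings, between 0.0 and 1.0."""
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    union = set_a | set_b
    if not union:
        # Both strings are shorter than k, so there is nothing to compare.
        return 0.0
    return len(set_a & set_b) / len(union)

print(jaccard_similarity("ACTGTGACT", "ACTGTGACT", 3))  # 1.0, identical sequences
print(jaccard_similarity("ACTG", "TTTT", 2))            # 0.0, no shared 2-mers
```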

Data Reduction Techniques

K-mers can also be used to find repeated information in a dataset. A large amount of data may contain content that is repeated several times, and this redundancy consumes storage space and memory. Counting k-mers helps to locate the repeated parts so that they can be stored or processed more compactly.

Machine Learning and Data-Related Fields

Machine learning models can use k-mer counts as features to classify sequence data. Many models in the machine learning domain require some form of pattern search for classification, and k-mers help to detect similar patterns in large datasets. This can make prediction easier when the input is sequence data.
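One common way to feed k-mers to a model is to turn each sequence into a vector of k-mer counts over a fixed vocabulary. The sketch below assumes a DNA alphabet of ‘ACGT’ and builds the vocabulary of all possible k-mers; the function name kmer_features and the choice of alphabet are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def kmer_features(seq, k, alphabet="ACGT"):
    """Return (vocabulary, count vector) for seq over every possible k-mer."""
    # All len(alphabet) ** k possible k-mers, in a fixed order.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Counter returns 0 for k-mers that never occur in seq.
    return vocabulary, [counts[kmer] for kmer in vocabulary]

vocab, vector = kmer_features("ACTGTGACT", 2)
print(len(vector))                # 16 features, one per possible 2-mer
print(vector[vocab.index("AC")])  # 2
```

Every sequence then maps to a vector of the same length, which is the input shape most classifiers expect.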

Summary

This article introduced k-mers and covered how to extract the k-mers of a string, how to count the number of times each k-mer occurs, and how to print the most frequent k-mers. It also covered the use of k-mers in different fields. We hope you enjoyed this article.

References

Read the official Python documentation for more details about the collections module and the Counter class.