(M,N)-mer - A Simple Statistical Feature for Sequence Classification
In mnmer: '(m,n)-mer' - A Simple Statistical Feature for Sequence Classification

\section{Introduction to the (m,n)-mer concept}

The (m,n)-mer is a k-mer variant that expresses conditional probabilities instead of unconditional ones (Figure 1). Our proposal suggests using the relative frequency of an n-mer conditioned to its preceding m-mer, rather than the unconditional k-mer frequency. When m+n=k, this new feature has an identical size and range to k-mer.

\begin{figure} \includegraphics[keepaspectratio,width=12cm]{\jobname-Fig1.png} \raggedright \caption{\label{fig:wide}Comparing k-mer to mn-mer relative frequency.} \end{figure}

Here, we consider k-mers as an particular case of (m,n)-mers, where k-mers are specifically (0,k)-mers (Figure 2). All other (m,n)-mers that share the same size and range as a k-mer can be categorized as follows: (1,k-1), (2,k-2), up to (k-1,1). For instance, the following three m-n mers: (1,3)-mer, (2,2)-mer and (3, 1)-mer all have the same characteristics as their related k-mer, which is in this case the (0,4)-mer.

\begin{figure} \includegraphics[keepaspectratio,width=12cm]{\jobname-Fig2.png} \raggedright \caption{\label{fig:wide}Numeric representation.} \end{figure}

For more methodological details and feature performance comparison in binary, multiclass and clustering classifications, please see Andrade et al., 2022 (in press). \newpage

\section{The (m,n)-mer as an R package} \begin{figure} \includegraphics[keepaspectratio,width=12cm]{\jobname-Fig4.png} \raggedright \caption{\label{fig:wide}Package logo} \end{figure}

The \Githubpkg{labinfo-lncc/mnmer} R package is used to summarize biological data by generating a table that shows the conditional frequency distribution of selected (m,n)-mers in FASTA files. This is an alternative to using k-mers for numerical characterization. By combining this output with class information, a feature matrix suitable for classification is created.

The output table (Figure 4) includes the FASTA file accession numbers as an ID column, the relative frequency of (m,n)-mers up to $4.k$ columns, and class information.

\begin{figure} \includegraphics[keepaspectratio,width=12cm]{\jobname-Fig3.png} \raggedright \caption{\label{fig:wide}Output example.} \end{figure}

\subsection{Dependencies}

The package needs R(>= 4.2.0), Biostrings(>= 3.1) and Utils(>= 2.0.0).

\subsection{Instalation}

The user should install the package from the GitHub repository. It can be done by using the \CRANpkg{devtools} package.

library(devtools)
install_github("labinfo-lncc/mnmer", ref="main")

\subsection{The readNumFASTA function}

This function employs Biostrings for reading FASTA files into the R system. It enables users to limit the number of sequences loaded, select sequences at random, and set a non-ACTG base cutoff percentage.

The parameters are:

FASTAfile = It could be a multiFASTA.

size = Number of sequences to be loaded.

rand = Select sequences randomly or not. Set TRUE or FALSE

pni = Percentage of non-ACTG (default = 0.20)

As default, all sequences more than 20% of N + IUPAC bases will be removed from further analysis given the little informative nature of those bases. In that case, the mnmer function prints the following warning:

score <- 30
paste0("Warning: Sequence has a proportion of N + IUPAC bases = ", score, "%")

If the user would like to accept these sequences, the pni parameter should be set to 0.00

The readNumFASTA function returns an DNAStringSet data structure, used by the mnmer in further analysis. To learn more about DNAStringSet please check Biostrings documentation.

\subsection{The mnmer function}

The main function of this package is the mnmer function. It creates dataframes with the conditional probability. By invoking the function cmmer from the C++ script, this function can create both k-mers and (m,n)-mers.

The parameters receives:

seqs = DNAStringSet object

m = Value of k for k-mer generation. Needs to be different from zero.

n = Value of m for (m,n)-mer generation in the format of (m, k-m). In case of k-mer generation, m should be zero as (0,k).

\subsection{Pratical example}

Assume we need to classify viruses that infect human from viruses that infect plants. The corresponding FASTA files can be found in the extdata folder.

After package installation, the user should run:

library("mnmer")
dir <-system.file("extdata", package="mnmer")

\subsubsection{Load sequences in to R system}

From the readNumFASTA function we load biological sequences into R system as an DNAStringSet structure. In our practical example, we would like to load 1,000 random sequences per FASTA. These sequences must have less than 50% of non-ACTG bases. We should run:

human <-readNumFASTA((file.path(dir, "human_vir.fasta")), 10,TRUE,0.50)
plant <-readNumFASTA((file.path(dir, "plant_vir.fasta")), 10,TRUE,0.50)

\subsubsection{Produce k-mer distributions}

For k-mer generation, the parameter k is set to choice, while the parameter m is set to zero. Given that the k-mers have been conditioned to zero bases.

human_02mer <- mnmer(human,2,0)
plant_02mer <- mnmer(plant,2,0)

\subsubsection{Produce (m,n)-mer distributions}

The user specifies the k and m parameters for (mn)-mer generation.

For example, k = 2 and m = 1 produce the (1,1)-mer, in which one base is conditioned on the frequency of the base before it. Bases other than A, C, T, and G were disregarded.

human_21mer <- mnmer(human,2,1)
plant_21mer <- mnmer(plant,2,1)

Here, we utilize the (1,1)-mer feature matrices generated by the mnmer to run an classification using Caret.

Caret (https://topepo.github.io/caret/) is a library of functions for building predictive models from R systems. We utilized the createDataPartition, trainControl, and train functions in this example. The createDataPartition method splits the feature matrix and creates the train and test datasets. The trainControl function generates parameters that further regulate how the train function creates models.

To execute the example, enter the following code:

library(caret)
# Add class information 
classes <- replicate(nrow(human_21mer), "human.vir")
featureMatrix_human_21mer <- cbind(human_21mer,classes)
classes <- replicate(nrow(plant_21mer), "plant.vir")
featureMatrix_plant_21mer <- cbind(plant_21mer,classes)
featureMatrix <- rbind(featureMatrix_human_21mer, featureMatrix_plant_21mer)
featureMatrix <- subset(featureMatrix, select = -c(seqid))

# Machine Learning
train_index <- caret::createDataPartition(featureMatrix$classes, p=0.8, list=FALSE)
train <- featureMatrix[train_index, ]
test <- featureMatrix[-train_index, ]
control <- caret::trainControl(method="cv", 
                               summaryFunction=twoClassSummary, 
                               classProbs=TRUE, 
                               savePredictions = TRUE)
roc <- caret::train(classes ~ ., 
                    data=train, 
                    method="rf", 
                    preProc=c("center"), 
                    trControl=control)