In mathornton01/afgencomp: Alignment Free Genetic Comparisons

```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```r
evenlyScaleEnsemble <- function(spectraList) {
  # Length of every spectrum, and the common length all spectra are scaled to.
  spectraLengths <- lengths(spectraList)
  maxLength <- max(spectraLengths)
  # Preallocate one slot per spectrum.
  scaledSpectra <- vector("list", length(spectraList))
  for (i in seq_along(spectraList)) {
    if (spectraLengths[i] == maxLength) {
      # Spectra that already have the maximum length are kept unchanged,
      # including the case where several spectra tie for the longest.
      scaledSpectra[[i]] <- spectraList[[i]]
    } else {
      scaledSpectra[[i]] <- evenlyScaleSingle(spectraList[[i]], maxLength)
    }
  }
  return(scaledSpectra)
}
```
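The function above relies on `evenlyScaleSingle()`, which is not shown in this section. As a rough illustration of what stretching a spectrum to a common length can look like, the sketch below uses plain linear interpolation from base R. The helper name `stretchSpectrumSketch` and the toy spectra are hypothetical, and linear interpolation is an assumption here; the package's own scaling rule may differ.

```r
# Hypothetical stand-in for evenlyScaleSingle(): linear interpolation of a
# spectrum onto `targetLength` evenly spaced points. This is an assumption,
# not the package's actual implementation.
stretchSpectrumSketch <- function(spectrum, targetLength) {
  stats::approx(x = seq_along(spectrum), y = spectrum, n = targetLength)$y
}

# Toy spectra of unequal length (made-up numbers).
toySpectra <- list(c(1, 2, 3), c(0, 10, 20, 30, 40), c(5, 5))
targetLength <- max(lengths(toySpectra))

# Stretch every spectrum to the length of the longest one.
stretched <- lapply(toySpectra, stretchSpectrumSketch, targetLength = targetLength)
stretchedLengths <- lengths(stretched)
```

After this step every element of `stretched` has the same length, so the spectra can be stacked into a matrix and compared row by row.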

Background & Description

This document is the package vignette for the afgencomp package in R. The code in this package was originally distributed on GitHub under the name YinGenomicDFTDistances. afgencomp is the most current version of the software from that package and of all associated works.

This software is developed independently of the enhanced data types available in the Bioconductor software suite, because it is intended to be a lightweight standalone package for quickly producing distances between genomic sequences using alignment-free methods. Accordingly, the primary data type used for storing sequences is the string.
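Because sequences are stored as plain strings, base R string functions are enough to inspect them. A minimal sketch with two made-up sequences (the names `seqA` and `seqB` are illustrative only):

```r
# Two made-up sequences held in an ordinary named character vector.
sequences <- c(seqA = "ACCTCGCGG", seqB = "ACCTTGCGGCGN")

# Number of bases in each sequence.
sequenceLengths <- nchar(sequences)

# Base composition of one sequence, including the ambiguity code N.
baseCounts <- table(strsplit(sequences[["seqB"]], "")[[1]])
```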

Package Installation

This vignette is distributed and maintained along with the software in afgencomp. If you have found your way to this vignette and want to install the afgencomp R package, you may do so via one of several routes:

Install the Software from A Source Distribution using install.packages()
Install the Software from the Github Repository using devtools::install_github()
(Coming Soon) Install the Software from CRAN using install.packages()

Installing The Software from Source

The source distribution of the software is written in R, and hosted on Github. The source code can be downloaded using the git software to clone the afgencomp repository. This can be accomplished by simply typing the following git command in a terminal window where you would like to keep the package files or by downloading the package manually using a web-browser and navigating to the github page.

```bash
git clone https://github.com/mathornton01/afgencomp.git
```

Once the package has been downloaded it can be installed for R by navigating to 
the `afgencomp` directory from within R and using the `install.packages()`
function. 

```r
afgencomp.pkg.dir <- "full/path/to/directory/goes/here";
install.packages(afgencomp.pkg.dir, repos=NULL, type="source");
```

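Or, if you would rather pick the package source with the file manager, you can run:

```r
install.packages(file.choose(), repos = NULL, type = "source")
```

and select a source archive of the package (for example, a `.tar.gz` file built with `R CMD build`). On most platforms the `file.choose()` dialog selects files rather than directories, so to install from the unpacked top-level directory (the one that contains the R/, data/, man/, and vignettes/ folders), pass its path to `install.packages()` directly.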
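Before installing from a local directory, it can help to confirm that the path really points at the top level of a package source tree. The sketch below checks for the `DESCRIPTION` file that every R package carries and for the `R/` folder. The helper name `looksLikePackageSource` and the temporary demo directory are hypothetical and only serve the illustration.

```r
# Hypothetical helper: TRUE when `dir` looks like the top level of an
# R package source tree (it has a DESCRIPTION file and an R/ folder).
looksLikePackageSource <- function(dir) {
  file.exists(file.path(dir, "DESCRIPTION")) && dir.exists(file.path(dir, "R"))
}

# Build a throwaway directory that mimics a package layout.
demoDir <- file.path(tempdir(), "demoPkgSource")
dir.create(file.path(demoDir, "R"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demoDir, "DESCRIPTION")))

# An empty sibling directory that does not mimic a package.
emptyDir <- file.path(tempdir(), "notAPackage")
dir.create(emptyDir, showWarnings = FALSE)

demoIsPackage <- looksLikePackageSource(demoDir)
emptyIsPackage <- looksLikePackageSource(emptyDir)
```

In practice the same check could be run on the cloned `afgencomp` directory before calling `install.packages()`.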

Install The Software from Its Official Github Repository Directly

The devtools library in R allows developers to quickly and easily share their packages with R users via Github. The package can be installed directly from its repository with the install_github() function of the devtools package. Be sure to specify that you would like the package vignette (this document) to be built when you run this, so that the vignette is available via the browseVignettes() or ?? functions.

```r
library(devtools);
# install_github() takes the repository as "user/repo".
devtools::install_github("mathornton01/afgencomp", build_vignettes = TRUE);
```
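Once installation has finished, a quick check along the following lines can confirm that the package is available and open its vignettes. This is only a sketch: it uses base R and `utils` functions, and it prints a message instead of failing when the package is not installed.

```r
pkg <- "afgencomp"

# TRUE when the package can be loaded from the current library paths.
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  message(pkg, " version ", as.character(utils::packageVersion(pkg)))
  # Opening a browser only makes sense in an interactive session.
  if (interactive()) utils::browseVignettes(pkg)
} else {
  message(pkg, " is not installed yet")
}
```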

Alignment-Free Genomic Distances

When comparing genomic sequences, most procedures first determine a set of mutations required to transform one sequence into another; these are referred to as 'post-alignment' procedures in this work. When doing this for a large group of sequences simultaneously, it can become unwieldy to align every sequence to every other sequence, and Multiple Sequence Alignment (MSA) is frequently very slow on large datasets. This is why alignment-free procedures can be useful. In the afgencomp (previously YinGenomicDFTDistance) package, two alignment-free approaches are implemented. However, prior to actually computing distances and comparing sequences, the data must be processed in an appropriate manner.
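To make the scaling concern concrete, the following sketch counts how many pairwise alignments an all-against-all comparison would need for a few ensemble sizes. The ensemble sizes are illustrative values and are not taken from the package.

```r
# Illustrative ensemble sizes (number of sequences to compare).
nSequences <- c(10, 100, 1000)

# Number of distinct sequence pairs, n * (n - 1) / 2.
pairwiseComparisons <- choose(nSequences, 2)

comparisonTable <- data.frame(nSequences, pairwiseComparisons)
```

The pair count grows quadratically with the number of sequences, so any per-pair cost as heavy as an alignment is multiplied accordingly.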

Quickstart Guide

A quickstart guide is provided here for ease of adoption and implementation, but it is expected that for most procedures, the researcher may use the `?` functionality in R as usual to retrieve runnable examples for each of the functions in the package.

```r
library(afgencomp)
# Create a character vector with some example sequences.
sequencelist.example <- c("ACCTCGCGGCGGCGCTCTCGAGAGNNCGCGTGAGAGCTCGCN",
                          "ACCTTGCGGCGGCGCTCTCCGTAGNNCGCGTGAGAGCTCGCN",
                          "ACCACGGGCGGGGGCGCGTTNNNTGAGAGTNCCCGCGCGCGG",
                          "ACCTCGCGGCGGCGCTCTCGAGAGNNCGCGTGATCGCTCGCN",
                          "ACCTCGCGGCGGCGCTCTCGAGAGNNCGCG",
                          "ACCTCGCGGCGGCGCTCTCGAGAGNNCGCGTGATCGCTCGCAGAGGAGGN");

# Encode the ensemble and create a 2D encoded genomic string ensemble.
encoded.sequences <- encodeGenomes(sequencelist.example);

# Display the first encoded sequence as an example: row 1 in blue, row 2 in red.
plot(encoded.sequences[[1]][1,], col='blue', type='l',
     main="Encoded Genome 1", xlab="Genomic Loci", ylab="Encoding");
lines(encoded.sequences[[1]][2,], col='red');
```

Once the genomes are encoded, their power spectra can be computed, the even scaling procedure can be applied to bring the spectra to equal length, and distances between the scaled spectra can be used to produce phylogenies and dendrograms.

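```r
library(rlist)

# Power spectrum of every encoded sequence.
power.spectra.sample <- getPowerSpectraEnsemble(encoded.sequences)
# Scale all spectra to the length of the longest one.
scaled.spectra <- evenlyScaleEnsemble(power.spectra.sample)
# One row per sequence, then standardize each column.
sspecmat <- list.rbind(scaled.spectra)
scaled.sspecmat <- scale(sspecmat)
# Pairwise Euclidean distances and a hierarchical clustering dendrogram.
scaled.sspecmat.dist <- dist(scaled.sspecmat)
plot(hclust(scaled.sspecmat.dist))
```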
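The former package name (YinGenomicDFTDistances) suggests that the spectra are based on the discrete Fourier transform. Assuming so, the sketch below shows in base R why sequences of different length yield spectra of different length, which is what the even scaling step has to reconcile. The helpers `toIndicatorSketch` and `powerSpectrumSketch`, the single-base indicator encoding, and the toy sequences are all hypothetical; they are not the package's `encodeGenomes()` or `getPowerSpectraEnsemble()`.

```r
# Hypothetical encoding: 1 where the sequence has `base`, 0 elsewhere.
toIndicatorSketch <- function(sequence, base = "A") {
  as.numeric(strsplit(sequence, "")[[1]] == base)
}

# Hypothetical helper: power spectrum as the squared modulus of the DFT.
powerSpectrumSketch <- function(signal) {
  Mod(stats::fft(signal))^2
}

# Two toy sequences of different length.
shortSignal <- toIndicatorSketch("ACCA")
longSignal <- toIndicatorSketch("ACCTAGGA")

shortSpectrum <- powerSpectrumSketch(shortSignal)
longSpectrum <- powerSpectrumSketch(longSignal)

# The spectra inherit the lengths of their sequences (4 and 8 here).
spectrumLengths <- c(length(shortSpectrum), length(longSpectrum))
```

Because each spectrum has as many entries as its sequence has bases, a distance such as `dist()` cannot be taken until the spectra are brought to a common length.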