library("abcd")
abcdversion = "0.9-x"

Markus Gumbel, Ali Karpuzoglu

Mannheim University of Applied Sciences

This tutorial is for abcd version r abcdversion.

Overview

This document introduces the abcd (Analysis of Base Composition of long DNA sequences) package which belongs to a familiy of R packages known as GCAT R packages (Genetic Code Analysis Toolkit). The R package abcd can be used for the analysis of n-plets and the skew. More information about GCAT can be found at http://wwww.gcat.bio.

Installation of package abcd

Note: We are working on a simpler installation procedure - this is work in progress.

The R package abcd requires

We recommend Rstudio as a workbench.

abcd internally calls Java routines; before the package itself is going to be installed, you need to install rJava. Type in the R console:

install.packages("rJava")

Next install seqinr, the package for accessing genetic sequences and more.

install.packages("seqinr")

The steps above only need to be done once. Anytime you update the abcd package these steps are not required anymore.

Finally, we need to prepare the JVM anytime we start a R session, e.g. after launching Rstudio. Important: The JVM must be of the same architecture than R itself. If R is 32 bit (x86) you also need a x86 JVM. On the other hand, if R is 64 bit (x64) the JVM has to be a x64 version, too.

Run the following command to indicate R where the JVM is or set JAVA_HOME as an environment variable in your operating system. This differs from computer to computer and operating system to operating system. Under windows it looks something like this where <NUMBER> indicates a specific JVM version:

Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk<NUMBER>")

Under Mac it is typically:

Sys.setenv(JAVA_HOME=""/Library/Java/JavaVirtualMachines/jdk<NUMBER>")

Now the abcd package itself can be installed or updated. You might skip this step if abcd is up-to-date. We assume that abcd_0.9-0.tar.gz is the package - certainly there will be other versions in the future. Type:

install.packages("abcd_0.9-0.tar.gz", 
     INSTALL_opts = "--no-multiarch", # Ignore architecture
     repos = NULL) # allows us to use a file; no download

We are almost done. Just type

library("abcd")

to load the library. You should now be ready to go!

Note that you can browse the manual where each function is described in detail. In RStudio: go to Packages, click on the library abcd.

Skew

Skew for single nucleotides

Let us use the following short sequence as an example:

seq = "AGTACGCAGTGGCCAATA"

The base distribution over a sequence can be calculated with:

dist = sequence.basedist(seq)
dist

It returns a vector with labels for each base.

The skews $S_{A\sim T}(1,1)$ and $S_{C\sim G}(1,1)$ is calculated with:

skew(seq)

Internally, the base distribution is calculated first.

Skew for n-plets

Next we want to calculate the skew for n-plet. The following example gives us the skews for a n-plet size of $n=3$.

nplet.skew(seq, 3)

An n-plet of size 3 has 3 positions ($k = 1,2,3$) where the skew is calculated. This is returned by this function. The names of the skew (header) are the positions.

There is also an alternative implementation of the skew function that is based on Java. It computes the skew much faster than the R version. Before Java can be used we need to prepare R to work with Java. The abcd package will automatically do this for you at start time.

The Java based skew is calculated with:

nplet.skewj(seq, 3)

Clearly, we obtain the same result.

The nplet.skew function can be generalized by passing a vector of n-plet sizes to the nplet.skews function (note "skews" not "skew" in the function name).

skews123 = nplet.skews(seq, c(1,2,3))
skews123

As we can see, the return value (data structure) is more complex: it consists of two nested lists:

If we want to access the C-G skews for n-plet size 2 we simply can query:

skews123$cg$`2`

By default, this functions uses the Java-based version. The parameter skewF can be set to change the function: skewF = nplet.skew or skewF = nplet.skewj.

Boxplot for n-plet skews

Let us load a slightly larger DNA sequence of about 2800 bases in the fasta format that is stored in data/demo.fasta. The function seq.fastaread reas a fasta file and returns the first entry.

library("seqinr") # This library provides a read fasta function.
filename = "../data/demo.fasta"
demoseq = sequence.fastaread(filename)
sequence.basedist(demoseq)

In the following sections we use the n-plet sizes nsizes = c(2, 3, 10, 30, 50) by default.

nsizes = c(2, 3, 10, 30, 50) # default n-plet sizes
demoskews = nplet.skews(demoseq, nsizes)
demoskews

There is a convenience command nplet.skewplot which plots the skews for different n-plet sizes.

nplet.skewplot(demoskews,
               seqId = "demo sequence",
               yMax = .5)        # Range of y-axis

Standard deviation of skews

The following R command calculates the standard deviation of the skews in different positions for iven n-plet sizes.

skews = nplet.skews(demoseq, nSizes = nsizes)
unlist(lapply(skews$at, sd)) # std. deviation of AT skews
unlist(lapply(skews$cg, sd)) # std. deviation of CG skews

The list of standard deviations can be plotted with the function nplet.sdskewplot. Note that this function calculates the standard deviation, i.e. a parameter is the skews:

nplet.sdskewplot(skews, nSizes = nsizes, seqId = "demo sequence",
                 yMax = .2)

Normalized skews

Calculate the sample-size normalized skews with the function nplet.sizenormskews and plot them with nplet.skewplot. Note that we use the optional parameter ylabPrefix in nplet.skewplot to modify the y-axis label. A sufficient cover factor is $cf > 50$, optimal is $c > 100$. (It takes more than a couple of minutes if $cf = 25$, so we set $cf = 10$.)

ndemoskews = nplet.sizenormskews(demoseq, nsizes, cf = 10)
nplet.skewplot(ndemoskews, seqId = "demo sequence", yMax = 0.5, 
               ylabPrefix = "size norm. ")

Also plot the standard deviations; again with the the ylabPrefix parameter:

nplet.sdskewplot(ndemoskews, nSizes = nsizes, seqId = "demo sequence",
                 yMax = .25, ylabPrefix = "size norm.")

n-plet skews of sequences with different lengths

We'd like to analyze now how the n-plet skews differ if we have sequences of different lengths. Let us take the demo sequence again and calculate the skews for n-plets of size 20 for length 100 and 1000. The demo sequence contains about 2800 bases:

sdemoskews = nplet.nskewpersize(demoseq, sizes = c(500, 1000), n = 20)
sdemoskews

Now we calculate the standard deviation of these values with the convenience function nplet.stddevpersize and plot them with nplet.stddevplotpersize. We add "n = 20" to the plot's title to indicate that the data is for a n-plet size of 20:

sdsdemoskews = nplet.stddevpersize(sdemoskews)
sdsdemoskews
nplet.stddevplotpersize(sdsdemoskews, seqId = "demo sequence; n = 20", 
                        yMax = 0.5, xtickssize = 2)

As we can see, the A~T skews of the short seqence (|X|=100) is not a number.

n-plet skews in partitions of sequences

n-plet skews may not only be calculated for the entire sequence but also for a partition of a sequence, i.e. a subsequence. First, we introduce the subseqpart function which simply divides a sequence into m partitions of the same size $N / m$ where $N$ is the length of the sequence. Example:

seq
subseqpart(seq, 3 , 1)
subseqpart(seq, 3 , 2)
subseqpart(seq, 3 , 3)

The function nplet.partskew calculates the skews for all partitions. Let us use the longer sequence demoseq:

part = nplet.partskew(demoseq, 
                      nSize = 3,  # n-plet size
                      pCount = 4) # Number of partitions
part

The result is (again) a complex data structure:

Finally, we can plot the results as a boxplot with nplet.partskewplot:

nplet.partskewplot(demoseq, 
                   nSize = 4,  # n-plet size
                   pCount = 4, # Number of partitions
                   seqId = "demo sequence",
                   yMax = .5)

Note that the x-axis shows now the partition (not the n-plet size). Once again with two times more partitions:

nplet.partskewplot(demoseq, 
                   nSize = 4,  # n-plet size
                   pCount = 2 * 4, # Number of partitions
                   seqId = "demo sequence",
                   yMax = .5)

Random sequences

(under construction)

sequence.condbaseprob calculates the conditional probabilities of bases for a DNA or RNA sequence. The following example does this for the demo sequence. As output we calculate the values in percentage and round them to one digit.

cp = sequence.condbaseprob(demoseq)
round(cp * 100, digits = 1)

Note that there is a special base N that represents an unknown base.

The function sequence.rndbycondprob generates a random DNA sequence of a given length using conditional probabilities:

seqrnd = sequence.rndbycondprob(50, cp)
seqrnd
sequence.basedist(seqrnd)
sequence.basedist(seqrnd) / sum(sequence.basedist(seqrnd))

or for a longer random sequence:

seqrnd2 = sequence.rndbycondprob(100000, cp)
sequence.basedist(seqrnd2)
sequence.basedist(seqrnd2) / sum(sequence.basedist(seqrnd2))


informatik-mannheim/abcd-R-package documentation built on May 5, 2019, 9:18 a.m.