library("abcd") abcdversion = "0.9-x"
Markus Gumbel, Ali Karpuzoglu
Mannheim University of Applied Sciences
This tutorial is for abcd version r abcdversion
.
This document introduces the abcd (Analysis of Base Composition of long DNA sequences) package which belongs to a familiy of R packages known as GCAT R packages (Genetic Code Analysis Toolkit). The R package abcd can be used for the analysis of n-plets and the skew. More information about GCAT can be found at http://wwww.gcat.bio.
Note: We are working on a simpler installation procedure - this is work in progress.
The R package abcd requires
rJava
and seqinr
We recommend Rstudio as a workbench.
abcd internally calls Java routines; before the package itself is going to be installed, you need to install rJava
. Type in the R console:
install.packages("rJava")
Next install seqinr
, the package for accessing genetic sequences and more.
install.packages("seqinr")
The steps above only need to be done once. Anytime you update the abcd package these steps are not required anymore.
Finally, we need to prepare the JVM anytime we start a R session, e.g. after launching Rstudio. Important: The JVM must be of the same architecture than R itself. If R is 32 bit (x86) you also need a x86 JVM. On the other hand, if R is 64 bit (x64) the JVM has to be a x64 version, too.
Run the following command to indicate R where the JVM is or set JAVA_HOME
as an environment variable in your operating system. This differs from computer to computer and operating system to operating system. Under windows it looks something like this where <NUMBER>
indicates a specific JVM version:
Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk<NUMBER>")
Under Mac it is typically:
Sys.setenv(JAVA_HOME=""/Library/Java/JavaVirtualMachines/jdk<NUMBER>")
Now the abcd package itself can be installed or updated. You might skip this step if abcd is up-to-date. We assume that abcd_0.9-0.tar.gz
is the package - certainly there will be other versions in the future. Type:
install.packages("abcd_0.9-0.tar.gz", INSTALL_opts = "--no-multiarch", # Ignore architecture repos = NULL) # allows us to use a file; no download
We are almost done. Just type
library("abcd")
to load the library. You should now be ready to go!
Note that you can browse the manual where each function is described in detail. In RStudio: go to Packages, click on the library abcd.
Let us use the following short sequence as an example:
seq = "AGTACGCAGTGGCCAATA"
The base distribution over a sequence can be calculated with:
dist = sequence.basedist(seq) dist
It returns a vector with labels for each base.
The skews $S_{A\sim T}(1,1)$ and $S_{C\sim G}(1,1)$ is calculated with:
skew(seq)
Internally, the base distribution is calculated first.
Next we want to calculate the skew for n-plet. The following example gives us the skews for a n-plet size of $n=3$.
nplet.skew(seq, 3)
An n-plet of size 3 has 3 positions ($k = 1,2,3$) where the skew is calculated. This is returned by this function. The names of the skew (header) are the positions.
There is also an alternative implementation of the skew function that is based on Java. It computes the skew much faster than the R version. Before Java can be used we need to prepare R to work with Java. The abcd package will automatically do this for you at start time.
The Java based skew is calculated with:
nplet.skewj(seq, 3)
Clearly, we obtain the same result.
The nplet.skew
function can be generalized by passing a vector of n-plet sizes to the nplet.skews
function (note "skews" not "skew" in the function name).
skews123 = nplet.skews(seq, c(1,2,3)) skews123
As we can see, the return value (data structure) is more complex: it consists of two nested lists:
If we want to access the C-G skews for n-plet size 2 we simply can query:
skews123$cg$`2`
By default, this functions uses the Java-based version. The parameter skewF
can be set to change the function: skewF = nplet.skew
or skewF = nplet.skewj
.
Let us load a slightly larger DNA sequence of about 2800 bases in the fasta format that is stored in data/demo.fasta
. The function seq.fastaread
reas a fasta file and returns the first entry.
library("seqinr") # This library provides a read fasta function. filename = "../data/demo.fasta" demoseq = sequence.fastaread(filename) sequence.basedist(demoseq)
In the following sections we use the n-plet sizes nsizes = c(2, 3, 10, 30, 50)
by default.
nsizes = c(2, 3, 10, 30, 50) # default n-plet sizes demoskews = nplet.skews(demoseq, nsizes) demoskews
There is a convenience command nplet.skewplot
which plots the skews for different n-plet sizes.
nplet.skewplot(demoskews, seqId = "demo sequence", yMax = .5) # Range of y-axis
The following R command calculates the standard deviation of the skews in different positions for iven n-plet sizes.
skews = nplet.skews(demoseq, nSizes = nsizes) unlist(lapply(skews$at, sd)) # std. deviation of AT skews unlist(lapply(skews$cg, sd)) # std. deviation of CG skews
The list of standard deviations can be plotted with the function nplet.sdskewplot
. Note that this function calculates the standard deviation, i.e. a parameter is the skews:
nplet.sdskewplot(skews, nSizes = nsizes, seqId = "demo sequence", yMax = .2)
Calculate the sample-size normalized skews with the function nplet.sizenormskews
and plot them with nplet.skewplot
. Note that we use the optional parameter ylabPrefix
in nplet.skewplot
to modify the y-axis label. A sufficient cover factor is $cf > 50$, optimal is $c > 100$. (It takes more than a couple of minutes if $cf = 25$, so we set $cf = 10$.)
ndemoskews = nplet.sizenormskews(demoseq, nsizes, cf = 10) nplet.skewplot(ndemoskews, seqId = "demo sequence", yMax = 0.5, ylabPrefix = "size norm. ")
Also plot the standard deviations; again with the the ylabPrefix
parameter:
nplet.sdskewplot(ndemoskews, nSizes = nsizes, seqId = "demo sequence", yMax = .25, ylabPrefix = "size norm.")
We'd like to analyze now how the n-plet skews differ if we have sequences of different lengths. Let us take the demo sequence again and calculate the skews for n-plets of size 20 for length 100 and 1000. The demo sequence contains about 2800 bases:
sdemoskews = nplet.nskewpersize(demoseq, sizes = c(500, 1000), n = 20) sdemoskews
Now we calculate the standard deviation of these values with the convenience function nplet.stddevpersize
and plot them with nplet.stddevplotpersize
. We add "n = 20" to the plot's title to indicate that the data is for a n-plet size of 20:
sdsdemoskews = nplet.stddevpersize(sdemoskews) sdsdemoskews nplet.stddevplotpersize(sdsdemoskews, seqId = "demo sequence; n = 20", yMax = 0.5, xtickssize = 2)
As we can see, the A~T skews of the short seqence (|X|=100) is not a number.
n-plet skews may not only be calculated for the entire sequence but also for a partition of a sequence, i.e. a subsequence. First, we introduce the subseqpart
function which simply divides a sequence into m partitions of the same size $N / m$ where $N$ is the length of the sequence. Example:
seq subseqpart(seq, 3 , 1) subseqpart(seq, 3 , 2) subseqpart(seq, 3 , 3)
The function nplet.partskew
calculates the skews for all partitions. Let us use the longer sequence demoseq
:
part = nplet.partskew(demoseq, nSize = 3, # n-plet size pCount = 4) # Number of partitions part
The result is (again) a complex data structure:
The first entry encapsulated the A-T skews
then there is a list of partitions (here 1 to 4)
followed by the number of skews for each position (here 1,2,3) in the n-plet of size 3
The second entry is the same for the C-G skew.
Finally, we can plot the results as a boxplot with nplet.partskewplot
:
nplet.partskewplot(demoseq, nSize = 4, # n-plet size pCount = 4, # Number of partitions seqId = "demo sequence", yMax = .5)
Note that the x-axis shows now the partition (not the n-plet size). Once again with two times more partitions:
nplet.partskewplot(demoseq, nSize = 4, # n-plet size pCount = 2 * 4, # Number of partitions seqId = "demo sequence", yMax = .5)
(under construction)
sequence.condbaseprob
calculates the conditional probabilities of bases for a DNA or RNA sequence. The following example does this for the demo sequence. As output we calculate the values in percentage and round them to one digit.
cp = sequence.condbaseprob(demoseq) round(cp * 100, digits = 1)
Note that there is a special base N
that represents an unknown base.
The function sequence.rndbycondprob
generates a random DNA sequence of a given length using conditional probabilities:
seqrnd = sequence.rndbycondprob(50, cp) seqrnd sequence.basedist(seqrnd) sequence.basedist(seqrnd) / sum(sequence.basedist(seqrnd))
or for a longer random sequence:
seqrnd2 = sequence.rndbycondprob(100000, cp) sequence.basedist(seqrnd2) sequence.basedist(seqrnd2) / sum(sequence.basedist(seqrnd2))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.