Description Preliminaries to R/eqtl Walk-through of an analysis with R/eqtl Author(s)
A brief introduction to the R/eqtl package, with a walk-through of a typical analysis.
In order to use the R/eqtl package, you must type (within R) library(eqtl)
. This function will automatically load the R/qtl library required. You may want to include this in a .Rprofile
file.
Documentation and several tutorials are available from the R archive : http://cran.r-project.org.
Use the help.start
function to start the html version of the R help.
Type library(help=qtl)
to get a list of the functions in R/qtl.
Type library(help=eqtl)
to get a list of the functions in R/eqtl.
Download the latest version of R/qtl and R/eqtl.
Here I briefly describe how to use R/eqtl to analyze an experimental cross. R/eqtl contains functions which required Karl Broman's R/qtl functions. This tutorial takes in consideration prior knowledge of R/qtl. Therefore, it is highly recommended that you read the R/qtl documentation and tutorials before you perform any analysis.
A difficult first step in the use of most data-analysis software is to import the data in an adequate format. This step is perfectly described in R/qtl tutorials. With R/eqtl you should import some extra data in addition to the data needed for R/qtl. We will not discuss data import at this point. This step is described in the chapter “Importing the data”.
We consider the example data seed10
, an experiment on gene expression in Arabidopsis~thaliana. Use the data
function to load the data.
1 |
seed10
data is formatted by read.cross
function. This data object has class cross
and riself
and describes an experiment on an A.~thaliana RIL population. The function summary.cross
gives summary information on the data, and checks the data for internal consistency. A lot of utility functions are available in R/qtl and are widely described in Karl's tutorials. Please note~: seed10
is too large to be viewed in the R window. What is shown is the average phenotypes. Is is possible to use the attributes
function later to get a closer look.
To project our results on the physical map, we also need to load the physical position of the genetic markers and the genomic physical coordinates of the probes used to estimate expression traits described in seed10
. For information, BSpgmap
and ATH.coord
are simple data frames with specific column names.
1 2 3 4 5 |
Before running the QTL analysis, intermediate calculations need to be performed. The function calc.genoprob
is used to compute the conditional probabilities at each pseudo-marker while sim.geno
simulates sequences of genotypes from their joint probabilities. See R/qtl manual for details. These steps have already been performed on seed10 and you do not need to run them again. Here, pseudo-markers have been defined every 0.5 centimorgan by defining the parameter step=0.5
as described in the following lines.
1 2 3 4 5 6 7 |
The microarray probes usually contain data for which we don't want to perform any QTL analysis like the buffers, the controls or some missed probes. The function cleanphe
cleans the seed10
data for undesired phenotypes.
1 2 3 |
In this example, dropped data comes from probes named "Buffer"
and "Ctrl"
found within CATMA data. This function is based on the grep
function of R. Thus it can be used to remove all the data defined within a specific word (for example "CHLORO"
will remove all items that contain "CHLORO"
within it).
Note that the function cleanphe
can also be run on scanone
object. This is useful in case of you forget to clean your cross
object before running scanone
. For this, you have to be very careful to what you're doing. It is indeed important for the next steps of the analysis to keep a cross
object whose phenotypes fit perfectly with those described in scanone
object.
Use the scanone
function to perform an interval mapping.
1 2 3 | BaySha.em <- scanone(seed10.clean,method='em',
pheno.col=1:nphe(seed10.clean), model='normal')
|
Keep in mind that BaySha.em
is obtained from seed10.clean
which has been removed of some phenotypes from seed10
. Thus, the dataframe from now on corresponding to BaySha.em
is always seed10.clean
.
Here start the main differences with R/qtl. One of the major problematic steps for genome-wide expression QTL analysis, is to read all the LOD curves and sytematically define the QTLs. Because of the amount of results, it is not feasible to read the LOD curves by hand. R/eqtl allows to detect several QTL by chromosome with drop LOD support interval and a genetic exclusionnary window.
Use define.peak
function to define QTL with drop LOD support interval from the scanone results, here the interval mapping results BaySha.em
.
1 2 3 | BaySha.peak <- define.peak(BaySha.em,locdolumn='all');
class(BaySha.peak);
|
The parameter lodcolumn='all'
specifies to analyze all LOD columns (all the traits) of the scanone object BaySha.em
. By using lodcolumn='CATrck'
, it specifies to analyze only the the scanone LOD column CATrck
, which is supposed to be the interval mapping result of the trait CATrck
.
We call peak
object, the results of the define.peak
function. The peak
object is used to store the QTL definition. The QTL are defined by several features decribed in the peak
objects attributes. At this step, a QTL is only defined by its LOD score, location and the subjective quality of the LOD peak. See define.peak
function for details.
1 | attributes(BaySha.peak);
|
Within the BaySha.peak
attributes, you can see the "scanone"
that it record from which the scanone
object the QTL was defined.
Back to the define.peak
parameters. graph=TRUE
specifies to draw the LOD curves with LOD support interval. The curves showing a QTL detected will be drawn on different charts for each chromosome. Note that, no graphical setup has been defined and therefore all graphs generated will appear one above the others. You should specify the graphical parameter mfrow
of the R function par()
before running define.peak
to draw all charts in the same window. You may not want to set the parameters graph=TRUE
and lodcolumn='all'
at the same time, depending on the amount of traits analyzed.
The following command lines is an example to define QTL and draw chart for a unique trait CATrck
. Because A.~thaliana genome contains 5 chromosomes, 5 charts will be drawn for a unique trait.
1 2 3 4 5 6 |
png()
and dev.off()
are classical R functions which indicates here to print the graph generated as a png file 'CATrck.png'
. By using these functions, you can page set the graph as you would like it. By adding save.pict=TRUE
, to define.peak
function parameters, will systematically save all single LOD curves generated for each chromosome as png files. The files generated will be named with the names of the trait and the chromosomes where the QTLs are located. Pay attention to the amount of data you're analysing before setting the parameters save.pict=TRUE
.
The way to access QTL results within peak
object is quite simple:
1 2 3 | BaySha.peak
BaySha.peak$CATrck
|
BaySha.peak
will give you the define.peak
results ordered by trait and chromosomes, respectively. BaySha.peak\$CATrck
will give you the results for the trait 'CATrck'
and so on for other trait names. If no QTL had been detected for a trait, the result will be the value NA
. To avoid to save all charts, I first run define.peak
for all traits ( lodcolumn='all'
) and save the results as a peak
object. Then, when I need to check how look like the LOD curve of a specific trait, I run define.peak
again on this trait by setting graph=TRUE
without saving the peak
object obtained.
To complete the QTL analysis, use the functions calc.adef
, localize.qtl
and classify.qtl
to compute, for each QTL previously detected in peak
object, the additive effect, the estimated physical location and the estimated acting-type in case of eQTL, respectively. All of these functions will add peak features to the peak
object.
1 2 3 4 5 6 7 8 | BaySha.peak <- localize.qtl(cross=seed10.clean, peak=BaySha.peak,
data.gmap=BSpgmap);
BaySha.peak <- calc.adef(cross=seed10.clean, scanone=BaySha.em,
peak=BaySha.peak);
BaySha.peak <- classify.qtl(cross=seed10.clean, peak=BaySha.peak,
etrait.coord=ATH.coord, data.gmap=BSpgmap);
attributes(BaySha.peak);
|
For each of these functions you have to specify the peak
object. You also need to specify the related cross
object and scanone
results, the related genetic map physical data BSpgmap
and the expression traits physical data ATH.coord
. Note that, the expression trait physical data (here ATH.coord
) may contain more traits than those studied. Conversely, all traits studied within the peak
, the scanone
or the cross
objects must be described in ATH.coord
. As you can see, the name of the cross object has been recorded in the attributes of the BaySha.peak
object
Use calc.Rsq
function to compute, from the peak
object, the contribution of the individual QTLs to the phenotypic variation. At the same time this function tests and computes the contribution of significant epistatic interactions between QTLs. By default the significant threshold is set to th=0.001
. In case you wanted to take all QTL interactions whatever the significance, you must set th=1
.
1 2 3 4 |
The function peak.2.array
will format all QTL results in a simple array. The column names are the names of the peak features described in peak
object. This array has the class peak.array
. Rsq.2.array
adds the R square column to the QTL array. Formatting the results as a simple array allows the use of all basic and complex R functions (statistical, summary, graphical, histograms...) which will allow us to customize the data in the simplest way. This format also allows to write the results in a file (like text or CSV) to save out the data.
1 2 3 | BaySha.array <- peak.2.array(BaySha.peak);
BaySha.array <- Rsq.2.array(rsq=BaySha.Rsq,BaySha.array);
|
R/eqtl provides useful functions that give an overview of the QTLs results stored in peak.array
. The peaksummary
function gives a variety of summary information and an overview of peak distribution. Summary graphs are available by setting graph=TRUE
. Like define.peak
, no graphical parameters had been set and therefore all graphs generated will appear one above the others in the same R graph window. You may define mfrow
before running peaksummary
to draw all charts in the same R window.
Whole QTL summary with graphs:
1 2 3 4 5 6 |
QTL summary with graphs excluding QTL localized on the chromosome 3 between 5000 and 6000 bp:
1 2 3 4 5 6 7 | par(mfrow=c(3,4));
BaySha.sum_exc <- peaksummary( BaySha.array, seed10.cleaned,
exc=data.frame(inf=5000, sup=6000, chr=3), graph=TRUE);
par(mfrow=c(1,1));
names(BaySha.sum_exc);
BaySha.sum_exc;
|
The function genoplot
provides basic information and an overview about genome-wide eQTL parameters.
1 2 3 4 |
The parameter chr.size
is the size of the chromosomes in base pair (here A. thaliana). These sizes are used to delimit the chromosomes for genome-wide graphs. For this function, the page setting has already been specified, save.pict=TRUE
will save all graphs in different files within the current folder.
Use the function cim.peak
to systematically perform composite interval mapping by running a single genome scan scanone
with previously defined QTL as additives covariates. The additive covariates are defined from the peak
object as the closest flanking marker of LOD peaks with the function map.peak
. The cim.peak
function returns an object of the scanone
class and therefore can be analyzed by the define.peak
function. The results can then be analyzed by calc.adef
, localize.qtl
, calc.Rsq
, etc... Due to the model, the LOD curves present a high (artefactual) LOD peak at the additive covariates locations which will be incorrectly detected as a strong QTL by the define.peak
function. To avoid this, use the wash.covar
function which will set the LOD score at the covariates location to 0 LOD. This function takes care of a genetic window size which specifies the size of the region to “wash”.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | BaySha.cem <- cim.peak(seed10.clean,BaySha.peak);
covar <- map.peak(BaySha.peak) ;
covar;
BaySha.cem <- wash.covar(BaySha.cem, covar, window.size=20);
BaySha.cem.peak <- define.peak(BaySha.cem, lodcolumn='all');
BaySha.cem.peak <- calc.adef(cross=seed10.clean, scanone=BaySha.cem,
peak=BaySha.cem.peak);
BaySha.cem.peak <- localize.qtl(cross=seed10.clean, peak=BaySha.cem.peak,
data.gmap=BSpgmap);
BaySha.cem.peak <- classify.qtl(cross=seed10.clean, peak=BaySha.cem.peak,
etrait.coord=ATH.coord, data.gmap=BSpgmap);
BaySha.cem.Rsq <- calc.Rsq(cross=seed10.clean, peak=BaySha.cem.peak);
plot.Rsq(BaySha.cem.Rsq);
BaySha.cem.array <- peak.2.array(BaySha.cem.peak);
BaySha.cem.array <- Rsq.2.array(BaySha.cem.Rsq,BaySha.cem.array);
|
You now have two peak.array
. BaySha.array
which contained the results from IM analysis and BaySha.cem.array
which contained the results from CIM. You may want to merge these two peak.array
in one to run the genoplot
function using all QTLs from IM and CIM. Note that you may have to add manually the class peak.array
to the merge array obtained.
1 2 3 | BaySha.em.cem.array <- rbind(BaySha.em.array,BaySha.cem.array);
attributes(BaySha.em.cem.array)$class<-c("peak.array","data.frame");
|
Hamid A Khalili, hamid.khalili@gmail.com
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.