A brief introduction to the R/eqtl package, with a walk-through of a typical analysis.
In order to use the R/eqtl package, you must type (within R)
library(eqtl). This function will automatically load the R/qtl library required. You may want to include this in a
Documentation and several tutorials are available from the R archive : http://cran.r-project.org.
help.start function to start the html version of the R help.
library(help=qtl) to get a list of the functions in R/qtl.
library(help=eqtl) to get a list of the functions in R/eqtl.
Download the latest version of R/qtl and R/eqtl.
Here I briefly describe how to use R/eqtl to analyze an experimental cross. R/eqtl contains functions which required Karl Broman's R/qtl functions. This tutorial takes in consideration prior knowledge of R/qtl. Therefore, it is highly recommended that you read the R/qtl documentation and tutorials before you perform any analysis.
A difficult first step in the use of most data-analysis software is to import the data in an adequate format. This step is perfectly described in R/qtl tutorials. With R/eqtl you should import some extra data in addition to the data needed for R/qtl. We will not discuss data import at this point. This step is described in the chapter “Importing the data”.
We consider the example data
seed10, an experiment on gene expression in Arabidopsis~thaliana. Use the
data function to load the data.
seed10 data is formatted by
read.cross function. This data object has class
riself and describes an experiment on an A.~thaliana RIL population. The function
summary.cross gives summary information on the data, and checks the data for internal consistency. A lot of utility functions are available in R/qtl and are widely described in Karl's tutorials. Please note~:
seed10 is too large to be viewed in the R window. What is shown is the average phenotypes. Is is possible to use the
attributes function later to get a closer look.
To project our results on the physical map, we also need to load the physical position of the genetic markers and the genomic physical coordinates of the probes used to estimate expression traits described in
seed10. For information,
ATH.coord are simple data frames with specific column names.
1 2 3 4 5
Before running the QTL analysis, intermediate calculations need to be performed. The function
calc.genoprob is used to compute the conditional probabilities at each pseudo-marker while
sim.geno simulates sequences of genotypes from their joint probabilities. See R/qtl manual for details. These steps have already been performed on seed10 and you do not need to run them again. Here, pseudo-markers have been defined every 0.5 centimorgan by defining the parameter
step=0.5 as described in the following lines.
1 2 3 4 5 6 7
The microarray probes usually contain data for which we don't want to perform any QTL analysis like the buffers, the controls or some missed probes. The function
cleanphe cleans the
seed10 data for undesired phenotypes.
1 2 3
In this example, dropped data comes from probes named
"Ctrl" found within CATMA data. This function is based on the
grep function of R. Thus it can be used to remove all the data defined within a specific word (for example
"CHLORO" will remove all items that contain
"CHLORO" within it).
Note that the function
cleanphe can also be run on
scanone object. This is useful in case of you forget to clean your
cross object before running
scanone. For this, you have to be very careful to what you're doing. It is indeed important for the next steps of the analysis to keep a
cross object whose phenotypes fit perfectly with those described in
scanone function to perform an interval mapping.
1 2 3
BaySha.em <- scanone(seed10.clean,method='em', pheno.col=1:nphe(seed10.clean), model='normal')
Keep in mind that
BaySha.em is obtained from
seed10.clean which has been removed of some phenotypes from
seed10. Thus, the dataframe from now on corresponding to
BaySha.em is always
Here start the main differences with R/qtl. One of the major problematic steps for genome-wide expression QTL analysis, is to read all the LOD curves and sytematically define the QTLs. Because of the amount of results, it is not feasible to read the LOD curves by hand. R/eqtl allows to detect several QTL by chromosome with drop LOD support interval and a genetic exclusionnary window.
define.peak function to define QTL with drop LOD support interval from the scanone results, here the interval mapping results
1 2 3
lodcolumn='all' specifies to analyze all LOD columns (all the traits) of the scanone object
BaySha.em. By using
lodcolumn='CATrck', it specifies to analyze only the the scanone LOD column
CATrck, which is supposed to be the interval mapping result of the trait
peak object, the results of the
define.peak function. The
peak object is used to store the QTL definition. The QTL are defined by several features decribed in the
peak objects attributes. At this step, a QTL is only defined by its LOD score, location and the subjective quality of the LOD peak. See
define.peak function for details.
BaySha.peak attributes, you can see the
"scanone" that it record from which the
scanone object the QTL was defined.
Back to the
graph=TRUE specifies to draw the LOD curves with LOD support interval. The curves showing a QTL detected will be drawn on different charts for each chromosome. Note that, no graphical setup has been defined and therefore all graphs generated will appear one above the others. You should specify the graphical parameter
mfrow of the R function
par() before running
define.peak to draw all charts in the same window. You may not want to set the parameters
lodcolumn='all' at the same time, depending on the amount of traits analyzed.
The following command lines is an example to define QTL and draw chart for a unique trait
CATrck. Because A.~thaliana genome contains 5 chromosomes, 5 charts will be drawn for a unique trait.
1 2 3 4 5 6
dev.off() are classical R functions which indicates here to print the graph generated as a png file
'CATrck.png'. By using these functions, you can page set the graph as you would like it. By adding
define.peak function parameters, will systematically save all single LOD curves generated for each chromosome as png files. The files generated will be named with the names of the trait and the chromosomes where the QTLs are located. Pay attention to the amount of data you're analysing before setting the parameters
The way to access QTL results within
peak object is quite simple:
1 2 3
BaySha.peak will give you the
define.peak results ordered by trait and chromosomes, respectively.
BaySha.peak\$CATrck will give you the results for the trait
'CATrck' and so on for other trait names. If no QTL had been detected for a trait, the result will be the value
NA. To avoid to save all charts, I first run
define.peak for all traits (
lodcolumn='all') and save the results as a
peak object. Then, when I need to check how look like the LOD curve of a specific trait, I run
define.peak again on this trait by setting
graph=TRUE without saving the
peak object obtained.
To complete the QTL analysis, use the functions
classify.qtl to compute, for each QTL previously detected in
peak object, the additive effect, the estimated physical location and the estimated acting-type in case of eQTL, respectively. All of these functions will add peak features to the
1 2 3 4 5 6 7 8
BaySha.peak <- localize.qtl(cross=seed10.clean, peak=BaySha.peak, data.gmap=BSpgmap); BaySha.peak <- calc.adef(cross=seed10.clean, scanone=BaySha.em, peak=BaySha.peak); BaySha.peak <- classify.qtl(cross=seed10.clean, peak=BaySha.peak, etrait.coord=ATH.coord, data.gmap=BSpgmap); attributes(BaySha.peak);
For each of these functions you have to specify the
peak object. You also need to specify the related
cross object and
scanone results, the related genetic map physical data
BSpgmap and the expression traits physical data
ATH.coord. Note that, the expression trait physical data (here
ATH.coord) may contain more traits than those studied. Conversely, all traits studied within the
scanone or the
cross objects must be described in
ATH.coord. As you can see, the name of the cross object has been recorded in the attributes of the
calc.Rsq function to compute, from the
peak object, the contribution of the individual QTLs to the phenotypic variation. At the same time this function tests and computes the contribution of significant epistatic interactions between QTLs. By default the significant threshold is set to
th=0.001. In case you wanted to take all QTL interactions whatever the significance, you must set
1 2 3 4
peak.2.array will format all QTL results in a simple array. The column names are the names of the peak features described in
peak object. This array has the class
Rsq.2.array adds the R square column to the QTL array. Formatting the results as a simple array allows the use of all basic and complex R functions (statistical, summary, graphical, histograms...) which will allow us to customize the data in the simplest way. This format also allows to write the results in a file (like text or CSV) to save out the data.
1 2 3
R/eqtl provides useful functions that give an overview of the QTLs results stored in
peaksummary function gives a variety of summary information and an overview of peak distribution. Summary graphs are available by setting
define.peak, no graphical parameters had been set and therefore all graphs generated will appear one above the others in the same R graph window. You may define
mfrow before running
peaksummary to draw all charts in the same R window.
Whole QTL summary with graphs:
1 2 3 4 5 6
QTL summary with graphs excluding QTL localized on the chromosome 3 between 5000 and 6000 bp:
1 2 3 4 5 6 7
genoplot provides basic information and an overview about genome-wide eQTL parameters.
1 2 3 4
chr.size is the size of the chromosomes in base pair (here A. thaliana). These sizes are used to delimit the chromosomes for genome-wide graphs. For this function, the page setting has already been specified,
save.pict=TRUE will save all graphs in different files within the current folder.
Use the function
cim.peak to systematically perform composite interval mapping by running a single genome scan
scanone with previously defined QTL as additives covariates. The additive covariates are defined from the
peak object as the closest flanking marker of LOD peaks with the function
cim.peak function returns an object of the
scanone class and therefore can be analyzed by the
define.peak function. The results can then be analyzed by
calc.Rsq, etc... Due to the model, the LOD curves present a high (artefactual) LOD peak at the additive covariates locations which will be incorrectly detected as a strong QTL by the
define.peak function. To avoid this, use the
wash.covar function which will set the LOD score at the covariates location to 0 LOD. This function takes care of a genetic window size which specifies the size of the region to “wash”.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
BaySha.cem <- cim.peak(seed10.clean,BaySha.peak); covar <- map.peak(BaySha.peak) ; covar; BaySha.cem <- wash.covar(BaySha.cem, covar, window.size=20); BaySha.cem.peak <- define.peak(BaySha.cem, lodcolumn='all'); BaySha.cem.peak <- calc.adef(cross=seed10.clean, scanone=BaySha.cem, peak=BaySha.cem.peak); BaySha.cem.peak <- localize.qtl(cross=seed10.clean, peak=BaySha.cem.peak, data.gmap=BSpgmap); BaySha.cem.peak <- classify.qtl(cross=seed10.clean, peak=BaySha.cem.peak, etrait.coord=ATH.coord, data.gmap=BSpgmap); BaySha.cem.Rsq <- calc.Rsq(cross=seed10.clean, peak=BaySha.cem.peak); plot.Rsq(BaySha.cem.Rsq); BaySha.cem.array <- peak.2.array(BaySha.cem.peak); BaySha.cem.array <- Rsq.2.array(BaySha.cem.Rsq,BaySha.cem.array);
You now have two
BaySha.array which contained the results from IM analysis and
BaySha.cem.array which contained the results from CIM. You may want to merge these two
peak.array in one to run the
genoplot function using all QTLs from IM and CIM. Note that you may have to add manually the class
peak.array to the merge array obtained.
1 2 3
Hamid A Khalili, firstname.lastname@example.org