AA.brief.introduction: Introductory comments on R/eqtl

Description Preliminaries to R/eqtl Walk-through of an analysis with R/eqtl Author(s)

Description

A brief introduction to the R/eqtl package, with a walk-through of a typical analysis.

Preliminaries to R/eqtl

Walk-through of an analysis with R/eqtl

Here I briefly describe how to use R/eqtl to analyze an experimental cross. R/eqtl contains functions which required Karl Broman's R/qtl functions. This tutorial takes in consideration prior knowledge of R/qtl. Therefore, it is highly recommended that you read the R/qtl documentation and tutorials before you perform any analysis.

The data

A difficult first step in the use of most data-analysis software is to import the data in an adequate format. This step is perfectly described in R/qtl tutorials. With R/eqtl you should import some extra data in addition to the data needed for R/qtl. We will not discuss data import at this point. This step is described in the chapter “Importing the data”.

We consider the example data seed10, an experiment on gene expression in Arabidopsis~thaliana. Use the data function to load the data.

1

seed10 data is formatted by read.cross function. This data object has class cross and riself and describes an experiment on an A.~thaliana RIL population. The function summary.cross gives summary information on the data, and checks the data for internal consistency. A lot of utility functions are available in R/qtl and are widely described in Karl's tutorials. Please note~: seed10 is too large to be viewed in the R window. What is shown is the average phenotypes. Is is possible to use the attributes function later to get a closer look.

To project our results on the physical map, we also need to load the physical position of the genetic markers and the genomic physical coordinates of the probes used to estimate expression traits described in seed10. For information, BSpgmap and ATH.coord are simple data frames with specific column names.

1
2
3
4
5

The Interval Mapping

Before running the QTL analysis, intermediate calculations need to be performed. The function calc.genoprob is used to compute the conditional probabilities at each pseudo-marker while sim.geno simulates sequences of genotypes from their joint probabilities. See R/qtl manual for details. These steps have already been performed on seed10 and you do not need to run them again. Here, pseudo-markers have been defined every 0.5 centimorgan by defining the parameter step=0.5 as described in the following lines.

1
2
3
4
5
6
7
#DO NOT RUN
seed10 <- calc.genoprob(seed10, step=0.5, off.end=0, error.prob=0, 
	map.function='kosambi', stepwidth='fixed');
#DO NOT RUN
seed10 <- sim.geno(seed10, step=0.5, off.end=0, error.prob=0,
	map.function='kosambi', stepwidth='fixed');
	

The microarray probes usually contain data for which we don't want to perform any QTL analysis like the buffers, the controls or some missed probes. The function cleanphe cleans the seed10 data for undesired phenotypes.

1
2
3
seed10.clean <- cleanphe(seed10,"Buffer");
seed10.clean <- cleanphe(seed10,"Ctrl");
	

In this example, dropped data comes from probes named "Buffer" and "Ctrl" found within CATMA data. This function is based on the grep function of R. Thus it can be used to remove all the data defined within a specific word (for example "CHLORO" will remove all items that contain "CHLORO" within it).

Note that the function cleanphe can also be run on scanone object. This is useful in case of you forget to clean your cross object before running scanone. For this, you have to be very careful to what you're doing. It is indeed important for the next steps of the analysis to keep a cross object whose phenotypes fit perfectly with those described in scanone object.

Use the scanone function to perform an interval mapping.

1
2
3
BaySha.em <- scanone(seed10.clean,method='em',
        pheno.col=1:nphe(seed10.clean),	model='normal')
	

Keep in mind that BaySha.em is obtained from seed10.clean which has been removed of some phenotypes from seed10. Thus, the dataframe from now on corresponding to BaySha.em is always seed10.clean.

Mapping the QTLs

Here start the main differences with R/qtl. One of the major problematic steps for genome-wide expression QTL analysis, is to read all the LOD curves and sytematically define the QTLs. Because of the amount of results, it is not feasible to read the LOD curves by hand. R/eqtl allows to detect several QTL by chromosome with drop LOD support interval and a genetic exclusionnary window.

Use define.peak function to define QTL with drop LOD support interval from the scanone results, here the interval mapping results BaySha.em.

1
2
3
BaySha.peak <- define.peak(BaySha.em,locdolumn='all');
class(BaySha.peak);
	

The parameter lodcolumn='all' specifies to analyze all LOD columns (all the traits) of the scanone object BaySha.em. By using lodcolumn='CATrck', it specifies to analyze only the the scanone LOD column CATrck, which is supposed to be the interval mapping result of the trait CATrck.

We call peak object, the results of the define.peak function. The peak object is used to store the QTL definition. The QTL are defined by several features decribed in the peak objects attributes. At this step, a QTL is only defined by its LOD score, location and the subjective quality of the LOD peak. See define.peak function for details.

1
attributes(BaySha.peak);

Within the BaySha.peak attributes, you can see the "scanone" that it record from which the scanone object the QTL was defined.

Back to the define.peak parameters. graph=TRUE specifies to draw the LOD curves with LOD support interval. The curves showing a QTL detected will be drawn on different charts for each chromosome. Note that, no graphical setup has been defined and therefore all graphs generated will appear one above the others. You should specify the graphical parameter mfrow of the R function par() before running define.peak to draw all charts in the same window. You may not want to set the parameters graph=TRUE and lodcolumn='all' at the same time, depending on the amount of traits analyzed.

The following command lines is an example to define QTL and draw chart for a unique trait CATrck. Because A.~thaliana genome contains 5 chromosomes, 5 charts will be drawn for a unique trait.

1
2
3
4
5
6
png(filename='CATrck.png',width=800,height=600);
par(mfrow=c(1,5));
define.peak(BaySha.em, lodcolumn='CATrck', graph=TRUE, chr=c(1,5));
par(mfrow=c(1,1));
dev.off();
	

png() and dev.off() are classical R functions which indicates here to print the graph generated as a png file 'CATrck.png'. By using these functions, you can page set the graph as you would like it. By adding save.pict=TRUE, to define.peak function parameters, will systematically save all single LOD curves generated for each chromosome as png files. The files generated will be named with the names of the trait and the chromosomes where the QTLs are located. Pay attention to the amount of data you're analysing before setting the parameters save.pict=TRUE .

The way to access QTL results within peak object is quite simple:

1
2
3
BaySha.peak
BaySha.peak$CATrck
	

BaySha.peak will give you the define.peak results ordered by trait and chromosomes, respectively. BaySha.peak\$CATrck will give you the results for the trait 'CATrck' and so on for other trait names. If no QTL had been detected for a trait, the result will be the value NA. To avoid to save all charts, I first run define.peak for all traits ( lodcolumn='all') and save the results as a peak object. Then, when I need to check how look like the LOD curve of a specific trait, I run define.peak again on this trait by setting graph=TRUE without saving the peak object obtained.

Defining the QTLs

To complete the QTL analysis, use the functions calc.adef, localize.qtl and classify.qtl to compute, for each QTL previously detected in peak object, the additive effect, the estimated physical location and the estimated acting-type in case of eQTL, respectively. All of these functions will add peak features to the peak object.

1
2
3
4
5
6
7
8
BaySha.peak <- localize.qtl(cross=seed10.clean, peak=BaySha.peak,
	data.gmap=BSpgmap);
BaySha.peak <- calc.adef(cross=seed10.clean, scanone=BaySha.em,
	peak=BaySha.peak);
BaySha.peak <- classify.qtl(cross=seed10.clean, peak=BaySha.peak,
	etrait.coord=ATH.coord, data.gmap=BSpgmap);
attributes(BaySha.peak);
	

For each of these functions you have to specify the peak object. You also need to specify the related cross object and scanone results, the related genetic map physical data BSpgmap and the expression traits physical data ATH.coord. Note that, the expression trait physical data (here ATH.coord) may contain more traits than those studied. Conversely, all traits studied within the peak, the scanone or the cross objects must be described in ATH.coord. As you can see, the name of the cross object has been recorded in the attributes of the BaySha.peak object

Use calc.Rsq function to compute, from the peak object, the contribution of the individual QTLs to the phenotypic variation. At the same time this function tests and computes the contribution of significant epistatic interactions between QTLs. By default the significant threshold is set to th=0.001. In case you wanted to take all QTL interactions whatever the significance, you must set th=1.

1
2
3
4
BaySha.Rsq <- calc.Rsq(cross=seed10.clean,peak=BaySha.peak);
BaySha.Rsq;
plotRsq(rsq=BaySha.Rsq);
	

Manipulating the results

The function peak.2.array will format all QTL results in a simple array. The column names are the names of the peak features described in peak object. This array has the class peak.array. Rsq.2.array adds the R square column to the QTL array. Formatting the results as a simple array allows the use of all basic and complex R functions (statistical, summary, graphical, histograms...) which will allow us to customize the data in the simplest way. This format also allows to write the results in a file (like text or CSV) to save out the data.

1
2
3
BaySha.array <- peak.2.array(BaySha.peak);
BaySha.array <- Rsq.2.array(rsq=BaySha.Rsq,BaySha.array);
	

R/eqtl provides useful functions that give an overview of the QTLs results stored in peak.array. The peaksummary function gives a variety of summary information and an overview of peak distribution. Summary graphs are available by setting graph=TRUE. Like define.peak, no graphical parameters had been set and therefore all graphs generated will appear one above the others in the same R graph window. You may define mfrow before running peaksummary to draw all charts in the same R window.

Whole QTL summary with graphs:

1
2
3
4
5
6
par(mfrow=c(3,4));
BaySha.summary <- peaksummary(BaySha.array,seed10.cleaned,graph=TRUE);
par(mfrow=c(1,1));
names(BaySha.summary);
BaySha.summary;
	

QTL summary with graphs excluding QTL localized on the chromosome 3 between 5000 and 6000 bp:

1
2
3
4
5
6
7
par(mfrow=c(3,4));
BaySha.sum_exc <- peaksummary( BaySha.array, seed10.cleaned,
	exc=data.frame(inf=5000, sup=6000, chr=3), graph=TRUE);
par(mfrow=c(1,1));
names(BaySha.sum_exc);
BaySha.sum_exc;
	

The function genoplot provides basic information and an overview about genome-wide eQTL parameters.

1
2
3
4
genoplot(BaySha.array,seed10.clean, ATH.coord, BSpgmap,
	chr.size=c(30432457, 19704536, 23470536, 18584924, 26991304),
	save.pict=TRUE);
	

The parameter chr.size is the size of the chromosomes in base pair (here A. thaliana). These sizes are used to delimit the chromosomes for genome-wide graphs. For this function, the page setting has already been specified, save.pict=TRUE will save all graphs in different files within the current folder.

The Composite Interval Mapping

Use the function cim.peak to systematically perform composite interval mapping by running a single genome scan scanone with previously defined QTL as additives covariates. The additive covariates are defined from the peak object as the closest flanking marker of LOD peaks with the function map.peak. The cim.peak function returns an object of the scanone class and therefore can be analyzed by the define.peak function. The results can then be analyzed by calc.adef, localize.qtl, calc.Rsq, etc... Due to the model, the LOD curves present a high (artefactual) LOD peak at the additive covariates locations which will be incorrectly detected as a strong QTL by the define.peak function. To avoid this, use the wash.covar function which will set the LOD score at the covariates location to 0 LOD. This function takes care of a genetic window size which specifies the size of the region to “wash”.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
BaySha.cem <- cim.peak(seed10.clean,BaySha.peak);
covar <- map.peak(BaySha.peak) ;
covar;
BaySha.cem <- wash.covar(BaySha.cem, covar, window.size=20);
BaySha.cem.peak <- define.peak(BaySha.cem, lodcolumn='all');
BaySha.cem.peak <- calc.adef(cross=seed10.clean, scanone=BaySha.cem,
	peak=BaySha.cem.peak);
BaySha.cem.peak <- localize.qtl(cross=seed10.clean, peak=BaySha.cem.peak,
	data.gmap=BSpgmap);
BaySha.cem.peak <- classify.qtl(cross=seed10.clean, peak=BaySha.cem.peak,
	etrait.coord=ATH.coord, data.gmap=BSpgmap);
BaySha.cem.Rsq <- calc.Rsq(cross=seed10.clean, peak=BaySha.cem.peak);
plot.Rsq(BaySha.cem.Rsq);
BaySha.cem.array <- peak.2.array(BaySha.cem.peak);
BaySha.cem.array <- Rsq.2.array(BaySha.cem.Rsq,BaySha.cem.array);
	

You now have two peak.array. BaySha.array which contained the results from IM analysis and BaySha.cem.array which contained the results from CIM. You may want to merge these two peak.array in one to run the genoplot function using all QTLs from IM and CIM. Note that you may have to add manually the class peak.array to the merge array obtained.

1
2
3
BaySha.em.cem.array <- rbind(BaySha.em.array,BaySha.cem.array);
attributes(BaySha.em.cem.array)$class<-c("peak.array","data.frame");
	

Author(s)

Hamid A Khalili, hamid.khalili@gmail.com


eqtl documentation built on May 2, 2019, 5:42 a.m.