Matrix_eQTL_main: Main function for fast eQTL analysis in MatrixEQTL package

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/Matrix_eQTL_engine.R

Description

Matrix_eQTL_engine function tests association of every row of the snps dataset with every row of the gene dataset using a linear regression model defined by the useModel parameter (see below).

The testing procedure accounts for extra covariates specified by the cvrt parameter.

The errorCovariance parameter can be set to the error variance-covariance matrix to account for heteroskedastic and/or correlated errors.

Associations significant at pvOutputThreshold (pvOutputThreshold.cis) levels are saved to output_file_name (output_file_name.cis), with corresponding estimates of effect size (slope coefficient), test statistics, p-values, and q-values (false discovery rate).

Matrix eQTL can perform separate analysis for local (cis) and distant (trans) eQTLs. For such analysis one has to set the cis-analysis specific parameters pvOutputThreshold.cis > 0, cisDist, snpspos and genepos in the call of Matrix_eQTL_main function. A gene-SNP pair is considered local if the distance between them is less or equal to cisDist. The genomic location of genes and SNPs is defined by the data frames snpspos and genepos. Depending on p-value thresholds pvOutputThreshold and pvOutputThreshold.cis Matrix eQTL runs in one of three different modes:

Matrix_eQTL_engine is a wrapper for Matrix_eQTL_main for eQTL analysis without regard to gene/SNP location and provided for compatibility with the previous versions of the package.

The parameter pvalue.hist allows to record information sufficient to create a histogram or QQ-plot of all the p-values (see plot).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Matrix_eQTL_main(
        snps, 
        gene, 
        cvrt = SlicedData$new(), 
        output_file_name = "", 
        pvOutputThreshold = 1e-5,
        useModel = modelLINEAR, 
        errorCovariance = numeric(), 
        verbose = TRUE, 
        output_file_name.cis = "", 
        pvOutputThreshold.cis = 0,
        snpspos = NULL, 
        genepos = NULL,
        cisDist = 1e6,
        pvalue.hist = FALSE,
        min.pv.by.genesnp = FALSE,
        noFDRsaveMemory = FALSE)

Matrix_eQTL_engine(
        snps, 
        gene, 
        cvrt = SlicedData$new(), 
        output_file_name, 
        pvOutputThreshold = 1e-5, 
        useModel = modelLINEAR, 
        errorCovariance = numeric(), 
        verbose = TRUE,
        pvalue.hist = FALSE,
        min.pv.by.genesnp = FALSE,
        noFDRsaveMemory = FALSE)

Arguments

snps

SlicedData object with genotype information. Can be real-valued for linear models and must take at most 3 distinct values for ANOVA unless the number of ANOVA categories is set to a higher number (see useModel parameter).

gene

SlicedData object with gene expression information. Must have columns matching those of snps.

cvrt

SlicedData object with additional covariates. Can be an empty SlicedData object in case of no covariates. The constant is always included in the model and would cause an error if included in cvrt. The order of columns must match those in snps and gene.

output_file_name

character, connection, or NULL. If not NULL, significant associations are saved to this file (all significant associations if pvOutputThreshold=0 or only distant if pvOutputThreshold>0). If the file with this name exists, it is overwritten.

output_file_name.cis

character, connection, or NULL. If not NULL, significant local associations are saved to this file. If the file with this name exists, it is overwritten.

pvOutputThreshold

numeric. Significance threshold for all/distant tests.

pvOutputThreshold.cis

numeric. Same as pvOutputThreshold, but for local eQTLs.

useModel

integer. Eigher modelLINEAR, modelANOVA, or modelLINEAR_CROSS.

  1. Set useModel = modelLINEAR to model the effect of the genotype as additive linear and test for its significance using t-statistic.

  2. Set useModel = modelANOVA to treat genotype as a categorical variables and use ANOVA model and test for its significance using F-test. The default number of ANOVA categories is 3. Set otherwise like this: options(MatrixEQTL.ANOVA.categories=4).

  3. Set useModel = modelLINEAR_CROSS to add a new term to the model equal to the product of genotype and the last covariate; the significance of this term is then tested using t-statistic.

errorCovariance

numeric. The error covariance matrix. Use numeric() for homoskedastic independent errors.

verbose

logical. Set to TRUE to display more detailed report about the progress.

snpspos

data.frame object with information about SNP locations, must have 3 columns - SNP name, chromosome, and position, like this:

snpid chr pos
Snp_01 1 721289
Snp_02 1 752565
... ... ...
genepos

data.frame with information about transcript locations, must have 4 columns - the name, chromosome, and positions of the left and right ends, like this:

geneid chr left right
Gene_01 1 721289 731289
Gene_02 1 752565 762565
... ... ... ...
cisDist

numeric. SNP-gene pairs within this distance are considered local. The distance is measured from the nearest end of the gene. SNPs within a gene are always considered local.

pvalue.hist

logical, numerical, or "qqplot". Defines whether and how the distribution of p-values is recorded in the returned object. If pvalue.hist = FALSE, the information is not recorded and the analysis is performed faster. Alternatively, set pvalue.hist = "qqplot" to record information sufficient to create a QQ-plot of the p-values (use plot on the returned object to create the plot). To record information for a histogram set pvalue.hist to the desired number of bins of equal size. Finally, pvalue.hist can also be set to a custom set of bin edges.

min.pv.by.genesnp

logical. Set min.pv.by.genesnp = TRUE to record the minimum p-value for each SNP and each gene in the returned object. The minimum p-values are recorded even if if they are above the corresponding thresholds of pvOutputThreshold and pvOutputThreshold.cis. The analysis runs faster when the parameter is set to FALSE.

noFDRsaveMemory

logical. Set noFDRsaveMemory = TRUE to save significant gene-SNP pairs directly to the output files, reduce memory footprint and skip FDR calculation. The eQTLs are not recorded in the returned object if noFDRsaveMemory = TRUE.

Details

Note that the columns of gene, snps, and cvrt must match. If they do not match in the input files, use ColumnSubsample method to subset and/or reorder them.

Value

The detected eQTLs are saved in output_file_name and/or output_file_name.cis if they are not NULL. The method also returns a list with a summary of the performed analysis.

param

Keeps all input parameters and also records the number of degrees of freedom for the full model.

time.in.sec

Time difference between the start and the end of the analysis (in seconds).

all

Information about all detected eQTLs.

cis

Information about detected local eQTLs.

trans

Information about detected distant eQTLs.

The elements all, cis, and trans may contain the following components

ntests

Total number of tests performed. This is used for FDR calculation.

eqtls

Data frame with recorded significant associations. Not available if noFDRsaveMemory=FALSE

neqtls

Number of significant associations recorded.

hist.bins

Histogram bins used for recording p-value distribution. See pvalue.hist parameter.

hist.counts

Number of p-value that fell in each histogram bin. See pvalue.hist parameter.

min.pv.snps

Vector with the best p-value for each SNP. See min.pv.by.genesnp parameter.

min.pv.gene

Vector with the best p-value for each gene. See min.pv.by.genesnp parameter.

Author(s)

Andrey A Shabalin andrey.shabalin@gmail.com

References

The package website: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

See Also

The code below is the sample code for eQTL analysis NOT using gene/SNP locations.

See MatrixEQTL_cis_code for sample code for eQTL analysis that separates local and distant tests.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
# Matrix eQTL by Andrey A. Shabalin
# http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/
# 
# Be sure to use an up to date version of R and Matrix eQTL.

# source("Matrix_eQTL_R/Matrix_eQTL_engine.r");
library(MatrixEQTL)

## Location of the package with the data files.
base.dir = find.package('MatrixEQTL');

## Settings

# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name
SNP_file_name = paste0(base.dir, "/data/SNP.txt");

# Gene expression file name
expression_file_name = paste0(base.dir, "/data/GE.txt");

# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste0(base.dir, "/data/Covariates.txt");

# Output file name
output_file_name = tempfile();

# Only associations significant at this level will be saved
pvOutputThreshold = 1e-2;

# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric();
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");


## Load genotype data

snps = SlicedData$new();
snps$fileDelimiter = "\t";      # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);

## Load gene expression data

gene = SlicedData$new();
gene$fileDelimiter = "\t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);

## Load covariates

cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
cvrt$fileSkipColumns = 1;       # one column of row labels
if(length(covariates_file_name)>0){
    cvrt$LoadFile(covariates_file_name);
}

## Run the analysis

me = Matrix_eQTL_engine(
    snps = snps,
    gene = gene,
    cvrt = cvrt,
    output_file_name = output_file_name,
    pvOutputThreshold = pvOutputThreshold,
    useModel = useModel, 
    errorCovariance = errorCovariance, 
    verbose = TRUE,
    pvalue.hist = TRUE,
    min.pv.by.genesnp = FALSE,
    noFDRsaveMemory = FALSE);
        
unlink(output_file_name);

## Results:

cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n');
cat('Detected eQTLs:', '\n');
show(me$all$eqtls)

## Plot the histogram of all p-values

plot(me)

Example output

Rows read: 15 done.
Rows read: 10 done.
Rows read: 2 done.
Processing covariates
Task finished in 0.00700000000000001 seconds
Processing gene expression data (imputation, residualization)
Task finished in 0.00600000000000001 seconds
Creating output file(s)
Task finished in 0.00699999999999995 seconds
Performing eQTL analysis
100.00% done, 5 eQTLs
Task finished in 0.01 seconds

Analysis done in:  0.024  seconds 
Detected eQTLs: 
    snps    gene statistic       pvalue          FDR       beta
1 Snp_05 Gene_03 38.812160 5.515519e-14 8.273279e-12  0.4101317
2 Snp_13 Gene_09 -3.914403 2.055817e-03 1.541863e-01 -0.2978847
3 Snp_11 Gene_06 -3.221962 7.327756e-03 2.853368e-01 -0.2332470
4 Snp_04 Gene_10  3.201666 7.608981e-03 2.853368e-01  0.2321123
5 Snp_14 Gene_01  3.070005 9.716705e-03 2.915011e-01  0.2147077

MatrixEQTL documentation built on Dec. 22, 2019, 5:06 p.m.