knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(daeqtlr)
library(daeqtlr)
The goal of {daeqtlr}
is to provide a minimum set of routines to perform
DAEQTL mapping.
In this vignette we'll show you how to perform DAEQTL mapping using an example data set. But before that let's recap some key concepts underlying this analysis.
DAE SNP: SNP on which the allelic expression (AE) ratio is measured. AE is the quantitative trait used as one of the variables in the statistical association test when performing DAEQTL mapping. AE ratios are only sensible for DAE SNPs that are heterozygous.
Candidate SNP: SNP candidate for DAE quantitative trait loci (DAEQTL). These are the SNPs whose zygosity level is used as the other variable in the statistical association test.
Zygosity: degree to which both copies of a locus in two homologous chromosomes have the same genetic sequence or not. If both copies are the same, then the locus is homozygous, if they are different, the loci are heterozygous.
{daeqtlr}
is bundled with an example data set that consists of three files:
snp_pairs.csv
: A table of SNP pairs to be tested for association. Check out
vignette('snp-pairs')
if you need to assemble this data from SNPs and their
genomic annotation.zygosity.csv
: A table providing the zygosity levels (homozygous or
heterozygous) for each SNP/biological sample combination. Check out
vignette('genotypes-to-zygosity')
if you have genotypes and need to convert
them to zygosity levels.ae.csv
: Allelic expression (AE) ratios for each DAE SNP / biological sample
combination.To easily read the bundled files you can use the function daeqtlr_example()
to
retrieve the path to each file, e.g. the path to snp_pairs.csv
is:
daeqtlr_example("snp_pairs.csv")
To import the data into R we provide a set of read_*
functions:
snp_pairs <- read_snp_pairs(file = daeqtlr_example("snp_pairs.csv")) zygosity <- read_snp_zygosity(file = daeqtlr_example("zygosity.csv")) ae <- read_ae_ratios(file = daeqtlr_example("ae.csv"))
The SNP pairs table indicates the pairs of SNPs (DAE SNP and candidate SNP) that
will be tested for statistical association. Each row is for a pair. If this is
not your starting point and you need to assemble this set of pairs first from
a list of SNP, namely by looking for neighboring SNPs by genomic window, please
read vignette("snp-pairs")
.
The function read_snp_pairs()
expects a path to a CSV file containing five
columns:
dae_snp
: The id of the DAE SNP.candidate_snp
: The id of the candidate DAEQTL SNP, or simply candidate SNP.chromosome
: The chromosome name.dae_snp_position
: The genomic position of the DAE SNP.candidate_snp_position
: The genomic position of the candidate SNP.These columns are expected in this order. The actual column names in the header
of the file are ignored and imported into R as indicated above.
read_snp_pairs()
will read the file with data.table::fread()
and return a
data table object.
In this example data set, all SNPs have dummy identifiers as should be clear
from the non-valid rs identifiers: note the inclusion of a character "X"
between the "rs"
prefix and the SNP number.
# First 10 pairs snp_pairs[1:10, ] # Total number of pairs nrow(snp_pairs) # Chromosomes unique(snp_pairs$chromosome)
There are r nrow(snp_pairs)
pairs, scattered across
r length(unique(snp_pairs$chromosome))
chromosomes:
r knitr::combine_words(unique(snp_pairs$chromosome))
.
This means that the DAEQTL mapping will consist (potentially) of
r nrow(snp_pairs)
statistical tests.
The zygosity
data table consists of the zygosity levels for the SNP / sample
combination. Samples are also dummy and are therefore generically named "s01"
,
"s02"
, etc. The values "hom"
and "het"
stand for homozygous and
heterozygous, respectively.
# First 10 SNPs, first 15 samples (first column is the SNP identifier) zygosity[1:10, 1:16] # Number of genotyped (determined zygosity) SNPs nrow(zygosity)
The ae
data table contains the $\log_2$ of the ratio of the expression of one
of the alleles over that of the other. This metric is also known as the M-value
[@Du.BB.2010]. In the {daeqtlr}
documentation we typically refer to M-values
by log-ratios or AE ratios.
$$\text{M} = \log_2 \frac{x_{A}}{x_{B}}$$
where $x_{A}$ and $x_{B}$ are the allelic expression of the A and B alleles, respectively. These can be microarray intensities or sequencing counts or any other measure that bears a linear relationship with allelic expression.
An alternative metric is the Beta-value $(\beta)$, i.e. the relative expression of one allele in the total of the two alleles' expression:
$$\beta = \frac{x_{A}}{x_{A} + x_{B}}$$
In this example data set, the ae
data table comprises M-values because it has
been shown to have more desirable properties for statistical testing
[@Du.BB.2010]. However, if you find Beta-values more intuitive and hence
preferable for reporting results you may use the function m2b()
for conversion
(and b2m()
for the reverse operation).
M-values can vary between -Inf
, if only one of the alleles is expressed, and
Inf
, if only the other allele is expressed. An M-value of zero means balanced
allelic expression (equal expression of both alleles).
Note how sparse the ae
data table is, having many NA
s values. That is
because allelic ratios can't be assayed for homozygous samples, hence resulting
in NA
values. A few of those NA
s should actually be truly missing ratios for
heterozygous samples.
# Number of DAE SNPs, i.e. SNPs with measured allelic expression nrow(ae) # First 5 samples (first column is the SNP identifier) ae[, 1:6]
To perform DAEQTL mapping you use the function daeqtl_mapping()
. The P-value
associated with the statistical association will be appended as a new column
to the data table snp_pairs
. Check the vignette('parallel')
to learn how
to setup your environment to run daeqtl_mapping()
in parallel.
mapping_dt <- daeqtl_mapping(snp_pairs = snp_pairs, zygosity = zygosity, ae = ae) # Omiting here the columns `dae_snp_position` and `candidate_snp_position` # and showing only the first 5 pairs for brevity. mapping_dt[1:5, -c('dae_snp_position', 'candidate_snp_position')]
Now let us sort by ascending P value to check the pairs that show the most strong evidence of association.
mapping_dt %>% dplyr::select(-c('dae_snp_position', 'candidate_snp_position')) %>% dplyr::arrange(pvalue) %>% dplyr::slice_head(n = 10)
TODO: Explanation of case 3.
# DAEQTL plot for the pair rsX059/rsX057 daeqtl_plot( ae_hom = ae_hom('rsX059', 'rsX057', zygosity, ae), ae_het = ae_het('rsX059', 'rsX057', zygosity, ae) )
TODO: Explanation of case 4.
# DAEQTL plot for the pair rsX031/rsX062 daeqtl_plot( ae_hom = ae_hom('rsX031', 'rsX062', zygosity, ae), ae_het = ae_het('rsX031', 'rsX062', zygosity, ae) )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.