In maialab/daeqtlr: DAEQTL Mapping

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(daeqtlr)

library(daeqtlr)

The goal of {daeqtlr} is to provide a minimum set of routines to perform DAEQTL mapping.

In this vignette we'll show you how to perform DAEQTL mapping using an example data set. But before that let's recap some key concepts underlying this analysis.

Concepts

DAE SNP: SNP on which the allelic expression (AE) ratio is measured. AE is the quantitative trait used as one of the variables in the statistical association test when performing DAEQTL mapping. AE ratios are only sensible for DAE SNPs that are heterozygous.
Candidate SNP: SNP candidate for DAE quantitative trait loci (DAEQTL). These are the SNPs whose zygosity level is used as the other variable in the statistical association test.
Zygosity: degree to which both copies of a locus in two homologous chromosomes have the same genetic sequence or not. If both copies are the same, then the locus is homozygous, if they are different, the loci are heterozygous.

Data

{daeqtlr} is bundled with an example data set that consists of three files:

snp_pairs.csv: A table of SNP pairs to be tested for association. Check out vignette('snp-pairs') if you need to assemble this data from SNPs and their genomic annotation.
zygosity.csv: A table providing the zygosity levels (homozygous or heterozygous) for each SNP/biological sample combination. Check out vignette('genotypes-to-zygosity') if you have genotypes and need to convert them to zygosity levels.
ae.csv: Allelic expression (AE) ratios for each DAE SNP / biological sample combination.

To easily read the bundled files you can use the function daeqtlr_example() to retrieve the path to each file, e.g. the path to snp_pairs.csv is:

daeqtlr_example("snp_pairs.csv")

To import the data into R we provide a set of read_* functions:

snp_pairs <- read_snp_pairs(file = daeqtlr_example("snp_pairs.csv"))
zygosity <- read_snp_zygosity(file = daeqtlr_example("zygosity.csv"))
ae <- read_ae_ratios(file = daeqtlr_example("ae.csv"))

SNP pairs

The SNP pairs table indicates the pairs of SNPs (DAE SNP and candidate SNP) that will be tested for statistical association. Each row is for a pair. If this is not your starting point and you need to assemble this set of pairs first from a list of SNP, namely by looking for neighboring SNPs by genomic window, please read vignette("snp-pairs").

The function read_snp_pairs() expects a path to a CSV file containing five columns:

dae_snp: The id of the DAE SNP.
candidate_snp: The id of the candidate DAEQTL SNP, or simply candidate SNP.
chromosome: The chromosome name.
dae_snp_position: The genomic position of the DAE SNP.
candidate_snp_position: The genomic position of the candidate SNP.

These columns are expected in this order. The actual column names in the header of the file are ignored and imported into R as indicated above. read_snp_pairs() will read the file with data.table::fread() and return a data table object.

In this example data set, all SNPs have dummy identifiers as should be clear from the non-valid rs identifiers: note the inclusion of a character "X" between the "rs" prefix and the SNP number.

# First 10 pairs
snp_pairs[1:10, ]

# Total number of pairs
nrow(snp_pairs)

# Chromosomes
unique(snp_pairs$chromosome)

There are r nrow(snp_pairs) pairs, scattered across r length(unique(snp_pairs$chromosome)) chromosomes: r knitr::combine_words(unique(snp_pairs$chromosome)). This means that the DAEQTL mapping will consist (potentially) of r nrow(snp_pairs) statistical tests.

Zygosity

The zygosity data table consists of the zygosity levels for the SNP / sample combination. Samples are also dummy and are therefore generically named "s01", "s02", etc. The values "hom" and "het" stand for homozygous and heterozygous, respectively.

# First 10 SNPs, first 15 samples (first column is the SNP identifier)
zygosity[1:10, 1:16]

# Number of genotyped (determined zygosity) SNPs
nrow(zygosity)

Allelic expression ratios (M-values)

The ae data table contains the $\log_2$ of the ratio of the expression of one of the alleles over that of the other. This metric is also known as the M-value [@Du.BB.2010]. In the {daeqtlr} documentation we typically refer to M-values by log-ratios or AE ratios.

$$\text{M} = \log_2 \frac{x_{A}}{x_{B}}$$

where $x_{A}$ and $x_{B}$ are the allelic expression of the A and B alleles, respectively. These can be microarray intensities or sequencing counts or any other measure that bears a linear relationship with allelic expression.

An alternative metric is the Beta-value $(\beta)$, i.e. the relative expression of one allele in the total of the two alleles' expression:

$$\beta = \frac{x_{A}}{x_{A} + x_{B}}$$

In this example data set, the ae data table comprises M-values because it has been shown to have more desirable properties for statistical testing [@Du.BB.2010]. However, if you find Beta-values more intuitive and hence preferable for reporting results you may use the function m2b() for conversion (and b2m() for the reverse operation).

M-values can vary between -Inf, if only one of the alleles is expressed, and Inf, if only the other allele is expressed. An M-value of zero means balanced allelic expression (equal expression of both alleles).

Note how sparse the ae data table is, having many NAs values. That is because allelic ratios can't be assayed for homozygous samples, hence resulting in NA values. A few of those NAs should actually be truly missing ratios for heterozygous samples.

# Number of DAE SNPs, i.e. SNPs with measured allelic expression
nrow(ae)

# First 5 samples (first column is the SNP identifier)
ae[, 1:6]

DAEQTL mapping

To perform DAEQTL mapping you use the function daeqtl_mapping(). The P-value associated with the statistical association will be appended as a new column to the data table snp_pairs. Check the vignette('parallel') to learn how to setup your environment to run daeqtl_mapping() in parallel.

mapping_dt <- daeqtl_mapping(snp_pairs = snp_pairs,
                                  zygosity = zygosity,
                                  ae = ae)

# Omiting here the columns `dae_snp_position` and `candidate_snp_position`
# and showing only the first 5 pairs for brevity.
mapping_dt[1:5, -c('dae_snp_position', 'candidate_snp_position')]

Now let us sort by ascending P value to check the pairs that show the most strong evidence of association.

mapping_dt %>%
  dplyr::select(-c('dae_snp_position', 'candidate_snp_position')) %>%
  dplyr::arrange(pvalue) %>%
  dplyr::slice_head(n = 10)

rsX059 and rsX057

TODO: Explanation of case 3.

# DAEQTL plot for the pair rsX059/rsX057
daeqtl_plot(
  ae_hom = ae_hom('rsX059', 'rsX057', zygosity, ae),
  ae_het = ae_het('rsX059', 'rsX057', zygosity, ae)
)

rsX031 and rsX062

TODO: Explanation of case 4.

# DAEQTL plot for the pair rsX031/rsX062
daeqtl_plot(
  ae_hom = ae_hom('rsX031', 'rsX062', zygosity, ae),
  ae_het = ae_het('rsX031', 'rsX062', zygosity, ae)
)