This vignette describes the data analysis. All functions are included in the R package spliceQTL.

1. Data

get.snps.geuvadis

This function obtains the Geuvadis SNP data. It downloads missing genotype data from ArrayExpress, transforms variant call format to binary files, removes SNPs with a low minor allele frequency, labels SNPs in the format "chromosome:position", and changes sample identifiers.

get.snps.bbmri

This function obtains the BBMRI SNP data. It limits the analysis to specified biobanks, reads in genotype data in chunks, removes SNPs with missing values (multiple biobanks/technologies), removes SNPs with a low minor allele frequency, and fuses data from multiple biobanks/technologies.

get.exons.geuvadis

This function obtains the Geuvadis exon data. It retains exons on the autosomes, labels exons in the format "chromosome_start_end", and extracts the corresponding gene names.

get.exons.bbmri

This function obtains the BBMRI exon data. It loads quality controlled gene expression data, extracts sample identifiers, removes samples without SNP data, loads exon expression data, extracts sample identifiers, retains samples that passed quality control, and retains exons on the autosomes.

match.samples

This function removes duplicate samples from each matrix, only retains samples appearing in all matrices, and brings the samples into the same order.

2. Analysis

The $n \times q$ matrix $\boldsymbol{Y}$ represents the exons, and the $n \times p_{chr}$ matrices $\boldsymbol{X}_{chr}$ represents the SNPs, where $chr \in {1,\ldots,22}$. The row names contain the sample identifiers, and the column names indicate the genomic location of the variables.

adjust.samples

This function adjusts RNA-seq expression data for different library sizes. The $n \times q$ matrix $\boldsymbol{Y}$ contains the exon data. The library size are $\boldsymbol{s}=(s_1,\ldots,s_n)^T$, where $s_i=\sum_{j=1}^p Y_{ij}$ for all $i$. The mean library size is $\bar{s}=\sum_{i=1}^n s_i / n$. We use edgeR to compute the normalisation factors $\boldsymbol{\eta}=(\eta_1,\ldots,\eta_n)^T$. We then calculate the adjusted normalisation factors $\boldsymbol{\gamma}=(\gamma_1,\ldots,\gamma_n)^T$, where $\gamma_i=\eta_i*s_i / \bar{s}$ for all $i$. The adjusted value equals $Y_{ij}/\gamma_i$ for all samples $i$ and all covariates $j$.

adjust.variables

This function adjusts exon expression data for different exon lengths. We do this separately for each chromosome to decrease memory usage. For this adjustment, we temporarily transform matrices to vectors. An ${n \times p}$ matrix becomes a vector of length ${n \times p}$, with the first $p$ entries corresponding to covariate $1$ and samples $1$ to $n$, and the last $p$ entries corresponding to covariate $p$ and samples $1$ to $n$. Let the vector $\boldsymbol{y}=(Y_{11},\ldots,Y_{n1} \boldsymbol{,} \ldots \boldsymbol{,} Y_{1q},\ldots,Y_{nq})^T$ represent exon expression. Let $\boldsymbol{\gamma}=(\gamma_1,\ldots,\gamma_1 \boldsymbol{,} \ldots \boldsymbol{,} \gamma_q \ldots \gamma_q)^T$ represent exon lengths. And let $\boldsymbol{k}=(k_1,\ldots,k_1 \boldsymbol{,} \ldots \boldsymbol{,} k_q,\ldots,k_q)^T$ represent gene names. So, $\boldsymbol{\gamma}$ and $\boldsymbol{k}$ contain $q$ blocks of $n$ equal entries. We regress $\boldsymbol{y}$ (exon expression) on a fixed effect for $\gamma$ (exon length) and a random effet for $\boldsymbol{k}$ (gene name). The residuals from this mixed model become our adjusted exon data.

map.genes, map.exons, map.snps, drop.trivial

These functions select the variables for the spliceQTL test. First, we retrieve all protein-coding genes, excluding pseudogenes and other transcripts. Second, we attribute exons to genes, including exons within the gene. Third, we attribute SNPs to genes, including SNPs between (1) ${1\,000}$ base pairs before the start position of the gene, and (2) the end position of the gene. Although this might not occur in practice, exons or SNPs may be attributed to more than one gene. Finally, we exclude genes without any SNPs or with a single exon. It does not make sense to test whether these genes show alternative splicing.

test.multiple

We want to test for alternative splicing along the whole genome. We do not calculate $p$-values from an asymptotic distribution, but estimate them by permutation. If we tested a single gene, we could use a large number of permutations and obtain a precise estimate. We need at least $21$ permutations (including the identity) to reach the ${5\%}$ significance level. If one or two test statistics for the permuted data are larger than the one for the observed data, the estimated $p$-value equals $0.0476$ ($<0.05$) or $0.0952$ ($>0.05$), respectively. If we test multiple genes, we will need more permutations to reach Bonferroni-significance. Using a fixed number of permutations would be too computationally expensive. This is why we invest less in genes with large $p$-values and more in genes with small $p$-values. For each gene, we use between $100$ and $p/0.05+1$ permutations, where $p$ is the number of genes. From $100$ permutations onwards, we repeatedly check whether two or more test statistics for the permuted data are larger than the one for the observed data. If yes, we interrupt permutation for this gene. If one or two test statistics for the permuted data are larger than the one for the observed data, the Bonferroni-adjusted estimated $p$-value equals $0.05p/(p+0.05)$ ($<0.05$) or $0.1p/(p+0.05)$ ($>0.05$), respectively. These values converge to $0.05$ and $0.1$ when $p$ tends to infinity. Bonferroni-significance requires between ${8\,000}$ and ${60\,000}$ permutations on the chromosome level, depending on the number of genes, and about ${400\,000}$ permutations on the genome level. We therefore adjust for multiple testing for each chromosome, and not for the whole genome.



rauschenberger/spliceQTL documentation built on May 13, 2019, 3:02 a.m.