hap-eQTL
is a package containing two functions that calculates statistical associations between the level of expression at allele-specific expression (ASE) sites and the presence of cis sequence variants by running permutations of the input data and applying a generalized linear model (GLM) each time.
Only those columns that are required are shown. The sample, legend and hap files are in the IMPUTE format (https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html).
Both the legend and sample files have headers, but the hap does not. Your hap file should have the same number of rows as that of the legend (once the legend's header is accounted for) and should also have twice the number of columns as the sample has rows, because each individual will have two haplotypes. Below are examples of acceptable formats.
The ASE file must contain the positions of the ASE sites, including the chromosome they are on, the individuals in which they are found, and the alleles at the different haplotypes, including their respective read counts.
One column must contain the individual IDs, another, their sex and the third, the population to which they belong.
ID SEX POP
HG00096 male GBR
HG00097 female GBR
HG00099 female GBR
NA20827 male TSI
One column must contain the ID of the polymorphisms, another their position, and the final three the reference allele, alternative allele and type of polymorphism involved.
ID pos allele0 allele1
10:60515:C:T 60515 C T
rs148087467:60523:T:G 60523 T G
rs147855157:61372:A:C 61372 A C
Here, the 0 refers to the reference allele, and the 1, the alternative.
0 0 1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
One column for the chromosome, another for the position, another for the individual ID, two for the reference and alternative alleles, along with two more columns for their corresponding read counts.
chr start end ref alt Ind refCount altCount
1 135031 135032 G A HG00276 19 0
1 135031 135032 G A HG00282 12 0
1 135031 135032 G A NA11831 10 0
1 135031 135032 G A NA19093 12 0
Gen.input
is a function to generate an input for the model to run. Separating out into two functions saves memory and greatly speeds up parallelisation for what is a computationally demanding task in the second function. The input files are formatted, gene and transcript start site information added to the ASE file, and finally output in .RData format to be loaded into the second function.Run.Model
is a function that allows you to measure statistical association between nearby regulatory variants and the level of expression at a heterozygous coding polymorphism, controlling for factors such as sex and population, by utilising a generalized linear model and applying permutations to the data in order to provide a robust p-value. As the function supports parallelisation, a number of .txt files equal to the number of tasks, numTasks
, specified in the function will be outputted. library(devtools)
install_github("ac1990/hap-eQTL")
library("hap-eQTL")
download.file("https://www.dropbox.com/s/cosadgz59hmfoxx/sample_ASE.txt?dl=1","sample_ASE.txt")
download.file("https://www.dropbox.com/s/eycoqh5s6bh8lyq/sample_legend.leg?dl=1","sample_legend.leg")
download.file("https://www.dropbox.com/s/e8sfx3gbziserkh/sample_haplotypes.hap?dl=1","sample_haplotypes.hap")
download.file("https://www.dropbox.com/s/8su6scg0ojkc8pm/sample_Samples.txt?dl=1","sample_Samples.txt")
Gen.input
This function allows you to output .RData files to be used as input for the analysis run in the second function. This only needs to be done once for each chromosome which ensures the running of the model is not too computationally demanding. The processing of input files need not be done repeatedly, as they are constants for whichever parameters you wish to set in Run.Model.R
. Hence separating this processing out allows tweaks to be made to the parameters in the second function, without having to re-run the processing each time.
Chromosome = 22
or simply 22
. If looking at the x chromosome,
it would be Chromosome="x"
or "x"
.Gen.input(Chromosome=22, ASE_file="sample_ASE.txt", legend_file="sample_legend.leg", haplotypes_file="sample_haplotypes.hap", Samples_file="sample_Samples.txt",
output_path="RDataFiles/", Species = "hsapiens", EnsemblVersion = 78)
Gen.input(22, "ASE_file", legend_file", "haplotypes_file", "samples_file", "RDataFiles/")
Gen.input(1, "ASE_file", legend_file", "haplotypes_file", "samples_file", "RDataFiles/", species="mmusculus")
The RData files outputted from the function, Gen.input, can now be used to run the analysis in the second function, below.
Run.Model
Gen.input
.Run.Model("RDataFiles/Run.model.input_Chr22.RData", 9, numTasks=100, numPerms=100,
"ProgressFiles/",
Chromosome=22)
Run.Model("RDataFiles/Run.model.input_Chr22.RData", 10,
"path_to_progress_file",
Chromosome=22)
```
#### Run model with task set to 10, chromosome to 22, for 10,000 permutations, a transcript start site window of 500kb and a theoretical p-value of 0.0005
Run.Model("input_file.RData", 10,
"path_to_progress_file",
Chromosome=22, numPerms=10000, pval_threshold=10000)
```
Run.Model("RDataFiles/Run.model.input_Chr12.RData", 2,
"path_to_progress_file",
Chromosome=12, numPerms=10000,TSSwindow=100000, pval_threshold=10000)
```
NB: The smallest possible p-value attainable as a result of running permutations is 1/numPerms. Hence, there is no advantage to setting the minimum p-value threshold to below this number.
## Output format
The first four columns are the ID of the ASE site, its position, the individual, and the ID of the nearby variant. Then follows 7 columns representing the test statistics of the intercept, reads, sex, sub-populations and variant, with the next 7 columns representing their corresponding p-values. The next two columns are the transcript start site of the ASE site and the gene to which the nearby variant belongs. Where the nearby variant is not found in a gene, the ID of the variant itself is given. The final two columns are the number of permutations, and the number of permutations in which the test statistic is exceeded.
# STUFF TO DO LATER BELOW:
## Selecting significant sites
To calculate the permuted p-values, in R do the following to your results dataframe, `df`:
df2 <- df[which(as.numeric(df$numPerm) > 0),] df2$permuted_p <-as.numeric(df2$numPermExceed)/as.numeric(df2$numPerm)
results <- results
## Plotting function to be done later ##
df2$adjusted_p <- p.adjust(df2$permuted_p,"fdr",n=length(df2$permuted_p))
### Having difficulty installing some packages?
source("https://bioconductor.org/biocLite.R") biocLite("rtracklayer") biocLite("GenomicRanges") ```
Could even put these all in a function. They call the function and it installs them all.
speedglm works in 3.3.2, 3.4.3, 3.4.0
EOF
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.