BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({ library(longevityTools) library(ggplot2) })
This vignette is part of the NIA funded Longevity Genomics project. For more information on this project please visit its website here. The GitHub repository of the corresponding R package is available here and the most recent version of this vignette can be found here.
Based on muscle gene expression from 15 young (25 year old) and 15 old (65 year old) participants, Sood et al. identified a 150 probe set that accurately classified young and old individuals in external studies with gene expression data collected from tissues other than muscle [@Sood2015-pb]. A gene score based on the classifier was associated with better renal function, increased survival time over follow-up, and decreased Alzheimer's Disease prevalence.
Our goal is to perform Mendelian Randomization using the 150 gene set to determine whether SNPs associated with expression (eSNPs) of the 150 genes are associated with the aging phenotypes identified by Sood et al.
Hypothesis: eSNPs associated with younger expression profile are markers for younger biological age relative to chronological age, which would be associated with longevity.
File sood-2015-fileS1.xls reports the expression direction relative to age for each probe from the 150 probe set. See Top 150 worksheet. A gene marked as 'down' in column 'Ratio of Y:O muscle' means that the expression of the gene is lower in young vs old muscle.
The goal is to generate an eSNP risk score constructed from eSNP alleles associated with increasing gene expression for genes that more highly expressed in young tissue and from eSNP alleles associated with decreasing gene expression for genes that show decreased expression in young tissue. This eSNP risk score would represent an expression profile that more closely mimics the younger age profile rather than the older age profile.
For the 150 probes that map to genes, identify cis-eSNPs using GTEx preanalyzed results from whole blood. Preanalyzed GTEx results correct for multiple testing of SNPs for each gene. To identify independent cis-eSNPs for each gene, prune cis-eSNPs in LD using 1000G EUR population group genotypes. Once independent cis-eSNPs are identified, create a file listing the SNPs to be included in a risk score and which allele is associated with decreasing expression for genes that increase with age (upregulated gene) and which allele is associated with increasing expression for genes that decrease with age (downregulated gene). Upregulated and downregulated genes based on column 'Ratio of Y:O muscle' from Sood et al supplemental file.
Individual level eSNP risk scores will be constructed as below. The eSNP risk score represents the MR instrumental variable.
$$ eSNP Risk Score_i = \sum_{j=1}^{N} eSNP_j^{AlleleDown} + \sum_{k=1}^{N} eSNP_k^{AlleleUp} $$
The R software for running longevityTools
can be downloaded from CRAN. The longevityTools
package can be installed from the R console using the following biocLite
install command.
source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script biocLite("tgirke/longevityTools", build_vignettes=FALSE) # Installs package from GitHub
library("longevityTools") # Loads the package library(help="longevityTools") # Lists package info vignette("longevityTools") # Opens vignette
Preliminary functions that are still under developement, or not fully tested and documented
can be imported with the source
command from the inst/extdata
directory of the package. Imported is the function alleleCheck().
fctpath <- system.file("extdata", "longevityTools_eQTL_Fct.R", package="longevityTools") source(fctpath)
The following creates the directory structure expected by this workflow. Input data
will be stored in the data
directory and results will be written to the results
directory.
All paths are given relative to the present working directory of the user's R session.
dir.create("data"); dir.create("results")
The eQTL data used in this vignette can be downloaded from the GTEx site. Note, the following downloads a 782 MB file, which will take some time. Using patched V6 results.
download.file('http://www.gtexportal.org/static/datasets/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_eQTL.tar','./GTEx_Analysis_v6p_eQTL.tar') untar("GTEx_Analysis_v6p_eQTL.tar")
Will begin with whole blood eSNPs, as the Sood paper reported predictive blood gene expression pattern.
library(longevityTools) library(data.table) #use devel version for fwrite() library(stringr) dat_effect<-fread("~/bigdata/GTEx/GTEx_Analysis_v6p_eQTL/Whole_Blood_Analysis.v6p.egenes.txt",colClasses=c(gene_chr="character",chr="character")) sood<-fread("~/U24/MR/GTEx/data/sood150.csv") sood<-sood[!duplicated(Gene_Symbol),] sood <- sood[HU133_Probeset_ID!=Gene_Symbol , ] setnames(sood,"Y:0_muscle","young_old_muscle") query_genes<-sood[,Gene_Symbol] dat_effect[gene_name %in% query_genes,.N] #121 gene matches dat_effect<-merge(dat_effect,sood,by.x="gene_name",by.y="Gene_Symbol" ) sum(duplicated(dat_effect[,gene_name])) #some eSNPs are indels dat_effect[nchar(ref)==1 & nchar(alt)==1,.N] dat_effect[ , KGID:=paste(chr,snp_pos,sep=":") ] dat_effect[nchar(ref)>1 | nchar(alt)>1,.N] #make KGID for indels #dat_effect[nchar(ref)>1 | nchar(alt)>1 , KGID:=paste0(chr,":",snp_pos,"[b37]",ref,",",alt ) ] #remove indels since they don't merge with my 1KG data anyway dat_effect <- dat_effect[nchar(ref)==1 & nchar(alt)==1,] #count non-sig eSNPs dat_effect[qval>0.05 ,.N] #78 dat_effect[pval_nominal>pval_nominal_threshold,.N] #64 dat_effect[pval_perm>0.05 ,.N] #74 dat_effect[pval_perm>0.05 & pval_nominal<pval_nominal_threshold,.(pval_nominal,pval_nominal_threshold,pval_perm)] #convert up down to + and - setnames(dat_effect,c("alt","ref"),c("effect_allele","non_effect_allele")) dat_effect[young_old_muscle=="up",young_old_muscle:="1"] dat_effect[young_old_muscle=="down",young_old_muscle:="-1"] dat_effect[,young_old_muscle:=as.integer(young_old_muscle)] #orient effect alleles to direction of young expression dat_effect[ ,temp_effect_allele:=effect_allele ] dat_effect[ ,temp_non_effect_allele:=non_effect_allele ] #product is negative if slope and young expression direction are opposite dat_effect[(young_old_muscle*slope)<0 , effect_allele:=temp_non_effect_allele ] dat_effect[(young_old_muscle*slope)<0 , non_effect_allele:=temp_effect_allele ] dat_effect[ ,effect:=slope ] #dat_effect <- dat_effect[,.(KGID,gene_name,effect_allele,non_effect_allele,rs_id_dbSNP142_GRCh37p13,maf,slope,effect,pval_nominal,qval,pval_nominal_threshold,young_old_muscle )] dat_effect <- dat_effect[,.(KGID,effect_allele,non_effect_allele,effect ) ] #try merging SNPs with my 1KG data. If Indels don't merge, I'll deal with them later. dat2<-fread("~/U24/GWAScohorts/MrOS/1KGinfo/SNP_info_all.txt",colClasses=c(b37_C_rs.id="character") ) dat2<-dat2[SNPID %in% dat_effect[,KGID] , ] #109 eSNPs in my imputed data dat2<-dat2[ ,.(SNPID,chr,position,coded_all,noncoded_all,AF_coded_all,HWE_pval,callrate,imputed,oevar_imp,SNPorder ) ] dat2<-merge(dat2,dat_effect,by.x="SNPID",by.y="KGID") dat2<-dat2[oevar_imp >= 0.4 , ] #95 eSNPs after poorly imputed SNPs removed dat2<-dat2[order(chr,SNPorder) ,] alleleCheck(dat2) #go ahead and switch coded and noncoded alleles now dat2[ ,coded_all_temp:=coded_all ] dat2[ ,noncoded_all_temp:=noncoded_all ] dat2[allele_switch==1, coded_all:=noncoded_all_temp] dat2[allele_switch==1, noncoded_all:=coded_all_temp] dat2[allele_switch==1, AF_coded_all:=1-AF_coded_all ] #output dat2 for reference fwrite(dat2,file="~/U24/MR/GTEx/data/soodSNPs.csv",na="NA") #extract SNPs from dosage file my_chr<-unique(dat2[,chr] ) for(i in 1:length(my_chr)){ snp_annot<-dat2[chr==my_chr[i],] for(j in 1:snp_annot[,.N]) { mySNPorder<-snp_annot[j,SNPorder ] SNPline<-fread(paste0("~/bigdata/U24/GWAScohorts/MrOS/plink1KG/chr",my_chr[i],".plink.dose"),skip=mySNPorder-1,nrows=1,header=F) if(i==1 & j==1 ){ dose_out<-SNPline } else { dose_out<-rbind(dose_out,SNPline) } } } fwrite(dat2,file="~/U24/MR/GTEx/data/snp_info.csv",na="NA") fwrite(dose_out,file="~/U24/MR/GTEx/data/dose.txt",na="NA",sep="\t",col.names=F)
#read in dose file, flip dosage data as needed, multiply betas, sum across SNPs doseHeader<-scan("~/bigdata/U24/GWAScohorts/MrOS/vcf1KG/headerDosage.txt",sep="\t",what="character") doseHeader <- sapply(str_split(doseHeader,"_" ),function(x) x[1] ) dose<-fread("~/U24/MR/GTEx/data/dose.txt",header=F,col.names=doseHeader) dose<-as.data.frame(dose) map<-fread("~/U24/MR/GTEx/data/snp_info.csv") identical(map[,SNPID],dose[[1]]) #SNP map in same order as dose file #identify which rows contain SNPs that need to be flipped, then loop through those rows and flip dosage myrows<-which(map[,allele_switch]==1) for(i in myrows){ temp<-unlist(dose[i,4:4618]) temp<-2-temp dose[i,4:4618]<-temp } #multiply beta by dosage for each snp #Not for GTEx, negative and positive will cancel #for(i in 1:length(map$effect)){ # effect1<-map$effect[i] # dose[i,4:4618] <- effect1*unlist(dose[i,4:4618]) #} #sum allele dosage per person IV <- sapply(dose[4:4618],sum ) #4615 length vector IV <- unname(IV) ID <- names(dose)[4:4618] IV_predictor<-data.table(ID=ID,IV=IV) #now merge to phenotype and covariates pheno<-fread("~/U24/MR/HDL/data/MrOSsurvivalPercentiles.csv") pheno<-merge(pheno,IV_predictor,by="ID") pheno[ ,dead90:=-9L ] pheno[control60dead==1 ,dead90:=0L ] pheno[cases90==1 ,dead90:=1L ] pheno[dead90==-9 ,dead90:=NA ] L1<-glm(dead90 ~ IV + GIAGE1 + factor(SITE) + EV1 + EV2 + EV3 + EV4 ,data=pheno, family=binomial(link="logit"), maxit=50 , na.action=na.omit ) summary(L1) #survival library(survival) pheno[DADEAD=="M:Not Applicable" ,DADEAD:=NA] pheno[DADEAD=="A:Missing" , DADEAD:="0"] pheno[,DADEAD:=as.integer(DADEAD)] L1 <- coxph(Surv(FUCDTIME,DADEAD) ~ IV + GIAGE1 + factor(SITE) + EV1 + EV2 + EV3 + EV4, data=pheno) summary(L1)
The *egene datasets have selected the most significant eSNP for each gene. However, all potential eSNPs for a gene are provided in a separate file. These results could potentially be mined for secondary independent eSNP associations for each gene. GTEx V6 analysis results are based on genotypes imputed to 1000 Genomes (1KG) Phase I version 3. Thus, significant results could be LD-filtered using Phase I data. One could make use of the larger sample size in later projects, such as 1KG Phase 3 or the Haplotype Reference Panel, but the SNP identifiers might differ from GTEx results, creating difficulties.
#genotypes download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", "./ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz") untar("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz") #sample ped file download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v2.20130502.ALL.ped", "./integrated_call_samples_v2.20130502.ALL.ped") #sample super population file download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel", "./integrated_call_samples_v3.20130502.ALL.panel") #identify EUR unrelated samples from 1KG phase 3 ped2<-read.table("data/integrated_call_samples_v2.20130502.ALL.ped", stringsAsFactors = F, header = T, sep="\t") ped3<-read.table("data/integrated_call_samples_v3.20130502.ALL.panel", stringsAsFactors = F, header = T, sep="\t") samples1KG <- filter_1KGsamples("EUR",ped2,ped3) samples1KG_ID <- samples1KG[,"Individual.ID",drop=F] write.table(samples1KG_ID,file="data/samples1KG.txt",quote=F,row.names=F,col.names=F)
Create region file to use with bcftools for LD.
regions<-data.frame(chr="chr1",pos=0,pos_to=0,stringsAsFactors = F) regions$chr[1]<-gsub("chr","",result$snp_chrom[1]) #1KG genotype files do not have chr regions$pos[1]<-min(result$snp_pos) regions$pos_to[1]<-max(result$snp_pos) write.table(regions,file="data/regions.txt",quote=F,row.names=F,col.names=F,sep="\t")
Filter 1KG genotypes to only include EUR unrelated individuals and eQTL region.
>bcftools --version bcftools 1.3 Using htslib 1.3 Copyright (C) 2015 Genome Research Ltd. License Expat: The MIT/Expat license This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
bcftools view -Ov -o results/1KGgeno.vcf -S data/samples1KG.txt -R data/regions.txt -v snps data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz #without -v snps, multiple indels with same rsID were outputted, and plink would not read that in. #Quick solution, only include SNPs, not indels.
Run PLINK clump command using default settings, but might want to change with different nominal significance thresholds.
plink -vcf results/1KGgeno.vcf --clump data/eSNP.assoc
PLINK clump command identifies 8 independent eSNPs in the region.
Next step, extract independent eSNPs from individual level genotype data, build MR risk score, evaluate for association with survival time.
This project is funded by NIH grant U24AG051129 awarded by the National Intitute on Aging (NIA).
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.