BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({
    library(longevityTools) 
    library(ggplot2) }) 

Introduction

This vignette is part of the NIA funded Longevity Genomics project. For more information on this project please visit its website here. The GitHub repository of the corresponding R package is available here and the most recent version of this vignette can be found here.

Based on muscle gene expression from 15 young (25 year old) and 15 old (65 year old) participants, Sood et al. identified a 150 probe set that accurately classified young and old individuals in external studies with gene expression data collected from tissues other than muscle [@Sood2015-pb]. A gene score based on the classifier was associated with better renal function, increased survival time over follow-up, and decreased Alzheimer's Disease prevalence.

Our goal is to perform Mendelian Randomization using the 150 gene set to determine whether SNPs associated with expression (eSNPs) of the 150 genes are associated with the aging phenotypes identified by Sood et al.

Hypothesis: eSNPs associated with younger expression profile are markers for younger biological age relative to chronological age, which would be associated with longevity.

[Back to Table of Contents]()

General Analytical Approach

Construction of MR eSNP risk score

File sood-2015-fileS1.xls reports the expression direction relative to age for each probe from the 150 probe set. See Top 150 worksheet. A gene marked as 'down' in column 'Ratio of Y:O muscle' means that the expression of the gene is lower in young vs old muscle.

The goal is to generate an eSNP risk score constructed from eSNP alleles associated with increasing gene expression for genes that more highly expressed in young tissue and from eSNP alleles associated with decreasing gene expression for genes that show decreased expression in young tissue. This eSNP risk score would represent an expression profile that more closely mimics the younger age profile rather than the older age profile.

For the 150 probes that map to genes, identify cis-eSNPs using GTEx preanalyzed results from whole blood. Preanalyzed GTEx results correct for multiple testing of SNPs for each gene. To identify independent cis-eSNPs for each gene, prune cis-eSNPs in LD using 1000G EUR population group genotypes. Once independent cis-eSNPs are identified, create a file listing the SNPs to be included in a risk score and which allele is associated with decreasing expression for genes that increase with age (upregulated gene) and which allele is associated with increasing expression for genes that decrease with age (downregulated gene). Upregulated and downregulated genes based on column 'Ratio of Y:O muscle' from Sood et al supplemental file.

Individual level eSNP risk scores will be constructed as below. The eSNP risk score represents the MR instrumental variable.

$$ eSNP Risk Score_i = \sum_{j=1}^{N} eSNP_j^{AlleleDown} + \sum_{k=1}^{N} eSNP_k^{AlleleUp} $$

Getting Started

Installation

The R software for running longevityTools can be downloaded from CRAN. The longevityTools package can be installed from the R console using the following biocLite install command.

source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script 
biocLite("tgirke/longevityTools", build_vignettes=FALSE) # Installs package from GitHub
[Back to Table of Contents]()

Loading package and documentation

library("longevityTools") # Loads the package
library(help="longevityTools") # Lists package info
vignette("longevityTools") # Opens vignette
[Back to Table of Contents]()

Setup of working environment

Import custom functions

Preliminary functions that are still under developement, or not fully tested and documented can be imported with the source command from the inst/extdata directory of the package. Imported is the function alleleCheck().

fctpath <- system.file("extdata", "longevityTools_eQTL_Fct.R", package="longevityTools")
source(fctpath)
[Back to Table of Contents]()

Create expected directory structure

The following creates the directory structure expected by this workflow. Input data will be stored in the data directory and results will be written to the results directory. All paths are given relative to the present working directory of the user's R session.

dir.create("data"); dir.create("results") 
[Back to Table of Contents]()

Sample data from GTEx

The eQTL data used in this vignette can be downloaded from the GTEx site. Note, the following downloads a 782 MB file, which will take some time. Using patched V6 results.

download.file('http://www.gtexportal.org/static/datasets/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_eQTL.tar','./GTEx_Analysis_v6p_eQTL.tar')
untar("GTEx_Analysis_v6p_eQTL.tar")

Data import

Will begin with whole blood eSNPs, as the Sood paper reported predictive blood gene expression pattern.

library(longevityTools)
library(data.table) #use devel version for fwrite()
library(stringr)

dat_effect<-fread("~/bigdata/GTEx/GTEx_Analysis_v6p_eQTL/Whole_Blood_Analysis.v6p.egenes.txt",colClasses=c(gene_chr="character",chr="character"))

sood<-fread("~/U24/MR/GTEx/data/sood150.csv")
sood<-sood[!duplicated(Gene_Symbol),]
sood <- sood[HU133_Probeset_ID!=Gene_Symbol , ]
setnames(sood,"Y:0_muscle","young_old_muscle")
query_genes<-sood[,Gene_Symbol]

dat_effect[gene_name %in% query_genes,.N] #121 gene matches
dat_effect<-merge(dat_effect,sood,by.x="gene_name",by.y="Gene_Symbol" )
sum(duplicated(dat_effect[,gene_name])) 
#some eSNPs are indels
dat_effect[nchar(ref)==1 & nchar(alt)==1,.N]
dat_effect[ , KGID:=paste(chr,snp_pos,sep=":") ]
dat_effect[nchar(ref)>1 | nchar(alt)>1,.N]
#make KGID for indels
#dat_effect[nchar(ref)>1 | nchar(alt)>1 , KGID:=paste0(chr,":",snp_pos,"[b37]",ref,",",alt )  ]
#remove indels since they don't merge with my 1KG data anyway
dat_effect <- dat_effect[nchar(ref)==1 & nchar(alt)==1,]
#count non-sig eSNPs
dat_effect[qval>0.05 ,.N] #78
dat_effect[pval_nominal>pval_nominal_threshold,.N] #64
dat_effect[pval_perm>0.05 ,.N] #74
dat_effect[pval_perm>0.05 & pval_nominal<pval_nominal_threshold,.(pval_nominal,pval_nominal_threshold,pval_perm)]
#convert up down to + and -
setnames(dat_effect,c("alt","ref"),c("effect_allele","non_effect_allele"))
dat_effect[young_old_muscle=="up",young_old_muscle:="1"]
dat_effect[young_old_muscle=="down",young_old_muscle:="-1"]
dat_effect[,young_old_muscle:=as.integer(young_old_muscle)]
#orient effect alleles to direction of young expression
dat_effect[ ,temp_effect_allele:=effect_allele ]
dat_effect[ ,temp_non_effect_allele:=non_effect_allele ]
#product is negative if slope and young expression direction are opposite
dat_effect[(young_old_muscle*slope)<0 , effect_allele:=temp_non_effect_allele  ]
dat_effect[(young_old_muscle*slope)<0 , non_effect_allele:=temp_effect_allele  ]

dat_effect[ ,effect:=slope ]
#dat_effect <- dat_effect[,.(KGID,gene_name,effect_allele,non_effect_allele,rs_id_dbSNP142_GRCh37p13,maf,slope,effect,pval_nominal,qval,pval_nominal_threshold,young_old_muscle )]
dat_effect <- dat_effect[,.(KGID,effect_allele,non_effect_allele,effect ) ]

#try merging SNPs with my 1KG data. If Indels don't merge, I'll deal with them later.
dat2<-fread("~/U24/GWAScohorts/MrOS/1KGinfo/SNP_info_all.txt",colClasses=c(b37_C_rs.id="character") )

dat2<-dat2[SNPID %in% dat_effect[,KGID]  ,  ] #109 eSNPs in my imputed data
dat2<-dat2[ ,.(SNPID,chr,position,coded_all,noncoded_all,AF_coded_all,HWE_pval,callrate,imputed,oevar_imp,SNPorder ) ]
dat2<-merge(dat2,dat_effect,by.x="SNPID",by.y="KGID")
dat2<-dat2[oevar_imp >= 0.4 , ] #95 eSNPs after poorly imputed SNPs removed
dat2<-dat2[order(chr,SNPorder) ,]

alleleCheck(dat2)
#go ahead and switch coded and noncoded alleles now
dat2[ ,coded_all_temp:=coded_all ]
dat2[ ,noncoded_all_temp:=noncoded_all ]
dat2[allele_switch==1, coded_all:=noncoded_all_temp]
dat2[allele_switch==1, noncoded_all:=coded_all_temp]
dat2[allele_switch==1, AF_coded_all:=1-AF_coded_all ]

#output dat2 for reference
fwrite(dat2,file="~/U24/MR/GTEx/data/soodSNPs.csv",na="NA")

#extract SNPs from dosage file
my_chr<-unique(dat2[,chr] )
for(i in 1:length(my_chr)){
  snp_annot<-dat2[chr==my_chr[i],] 
  for(j in 1:snp_annot[,.N]) {
    mySNPorder<-snp_annot[j,SNPorder ]
    SNPline<-fread(paste0("~/bigdata/U24/GWAScohorts/MrOS/plink1KG/chr",my_chr[i],".plink.dose"),skip=mySNPorder-1,nrows=1,header=F)
    if(i==1 & j==1  ){
      dose_out<-SNPline
    } else {
      dose_out<-rbind(dose_out,SNPline)
    }    
  }
}

fwrite(dat2,file="~/U24/MR/GTEx/data/snp_info.csv",na="NA")
fwrite(dose_out,file="~/U24/MR/GTEx/data/dose.txt",na="NA",sep="\t",col.names=F)

Association analysis with final age and survival time

#read in dose file, flip dosage data as needed, multiply betas, sum across SNPs
doseHeader<-scan("~/bigdata/U24/GWAScohorts/MrOS/vcf1KG/headerDosage.txt",sep="\t",what="character")
doseHeader <- sapply(str_split(doseHeader,"_" ),function(x) x[1] ) 
dose<-fread("~/U24/MR/GTEx/data/dose.txt",header=F,col.names=doseHeader)
dose<-as.data.frame(dose)
map<-fread("~/U24/MR/GTEx/data/snp_info.csv")
identical(map[,SNPID],dose[[1]]) #SNP map in same order as dose file 
#identify which rows contain SNPs that need to be flipped, then loop through those rows and flip dosage
myrows<-which(map[,allele_switch]==1)
for(i in myrows){
  temp<-unlist(dose[i,4:4618])
  temp<-2-temp
  dose[i,4:4618]<-temp
}
#multiply beta by dosage for each snp
#Not for GTEx, negative and positive will cancel
#for(i in 1:length(map$effect)){
#  effect1<-map$effect[i]
#  dose[i,4:4618] <- effect1*unlist(dose[i,4:4618])
#}
#sum allele dosage per person
IV <- sapply(dose[4:4618],sum ) #4615 length vector
IV <- unname(IV)
ID <- names(dose)[4:4618]
IV_predictor<-data.table(ID=ID,IV=IV)

#now merge to phenotype and covariates
pheno<-fread("~/U24/MR/HDL/data/MrOSsurvivalPercentiles.csv")
pheno<-merge(pheno,IV_predictor,by="ID")
pheno[ ,dead90:=-9L ]
pheno[control60dead==1 ,dead90:=0L ]
pheno[cases90==1 ,dead90:=1L ]
pheno[dead90==-9 ,dead90:=NA ] 
L1<-glm(dead90 ~ IV + GIAGE1 + factor(SITE) + EV1 + EV2 + EV3 + EV4 ,data=pheno, family=binomial(link="logit"), maxit=50 , na.action=na.omit )
summary(L1)

#survival
library(survival)
pheno[DADEAD=="M:Not Applicable" ,DADEAD:=NA]
pheno[DADEAD=="A:Missing" , DADEAD:="0"]
pheno[,DADEAD:=as.integer(DADEAD)]
L1 <- coxph(Surv(FUCDTIME,DADEAD) ~ IV + GIAGE1 + factor(SITE) + EV1 + EV2 + EV3 + EV4, data=pheno)
summary(L1)

SNP clumping using PLINK

The *egene datasets have selected the most significant eSNP for each gene. However, all potential eSNPs for a gene are provided in a separate file. These results could potentially be mined for secondary independent eSNP associations for each gene. GTEx V6 analysis results are based on genotypes imputed to 1000 Genomes (1KG) Phase I version 3. Thus, significant results could be LD-filtered using Phase I data. One could make use of the larger sample size in later projects, such as 1KG Phase 3 or the Haplotype Reference Panel, but the SNP identifiers might differ from GTEx results, creating difficulties.

#genotypes
download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", "./ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
untar("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")

#sample ped file
download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v2.20130502.ALL.ped", "./integrated_call_samples_v2.20130502.ALL.ped")

#sample super population file
download.file("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel", "./integrated_call_samples_v3.20130502.ALL.panel")

#identify EUR unrelated samples from 1KG phase 3
ped2<-read.table("data/integrated_call_samples_v2.20130502.ALL.ped", stringsAsFactors = F, header = T, sep="\t")
ped3<-read.table("data/integrated_call_samples_v3.20130502.ALL.panel", stringsAsFactors = F, header = T, sep="\t")

samples1KG <- filter_1KGsamples("EUR",ped2,ped3)
samples1KG_ID <- samples1KG[,"Individual.ID",drop=F]
write.table(samples1KG_ID,file="data/samples1KG.txt",quote=F,row.names=F,col.names=F)

Create region file to use with bcftools for LD.

regions<-data.frame(chr="chr1",pos=0,pos_to=0,stringsAsFactors = F)
regions$chr[1]<-gsub("chr","",result$snp_chrom[1]) #1KG genotype files do not have chr
regions$pos[1]<-min(result$snp_pos)
regions$pos_to[1]<-max(result$snp_pos)
write.table(regions,file="data/regions.txt",quote=F,row.names=F,col.names=F,sep="\t")

Filter 1KG genotypes to only include EUR unrelated individuals and eQTL region.

>bcftools --version
bcftools 1.3
Using htslib 1.3
Copyright (C) 2015 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
bcftools view -Ov -o results/1KGgeno.vcf -S data/samples1KG.txt -R data/regions.txt -v snps data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
#without -v snps, multiple indels with same rsID were outputted, and plink would not read that in.
#Quick solution, only include SNPs, not indels.

Run PLINK clump command using default settings, but might want to change with different nominal significance thresholds.

plink -vcf results/1KGgeno.vcf --clump data/eSNP.assoc 

PLINK clump command identifies 8 independent eSNPs in the region.

Next step, extract independent eSNPs from individual level genotype data, build MR risk score, evaluate for association with survival time.

[Back to Table of Contents]()

Funding

This project is funded by NIH grant U24AG051129 awarded by the National Intitute on Aging (NIA).

[Back to Table of Contents]()

Version information

sessionInfo()
[Back to Table of Contents]()

References



tgirke/longevityTools documentation built on May 31, 2019, 9:07 a.m.