mktable: Selection of SNPs and Creation of A Standard Table for...
In GMRP: GWAS-based Mendelian Randomization and Path Analyses


mktable selects SNPs using the thresholds LG, Pv, Pc and Pd and creates a standard SNP beta table for Mendelian randomization and path analysis; see Details.

mktable(cdata, ddata, rt, varname, LG, Pv, Pc, Pd)

`cdata`	GWAS data or GWAS meta-analysis data for the causal variables, containing SNP ID, SNP position, chromosome, allele, allele frequency, beta value, `sd`, sample size, etc.
`ddata`	GWAS data or GWAS meta-analysis data for the disease, containing SNP ID, SNP position, chromosome, allele, allele frequency, beta value, `sd`, sample size, etc.
`rt`	a string that specifies the type of table to return. It has two options: `rt="beta"` returns a beta table and `rt="path"` returns a SNP direct path coefficient table. Default is `"beta"`.
`varname`	a required character vector that names the disease and the undefined causal variables for Mendelian randomization and path analyses. The first name is the disease name. For example, `varname <- c("CAD","LDL","HDL","TG","TC")`.
`LG`	a numeric parameter. `LG` is the minimum distance between SNPs and is used to select SNPs. Default `LG=1`.
`Pv`	a numeric parameter. `Pv` is the maximum p-value used to select SNPs. Default `Pv=5e-8`.
`Pc`	a numeric parameter. `Pc` is a given proportion of sample size to the maximum sample size in the causal variable data and is used to select SNPs. Default `Pc=0.979`.
`Pd`	a numeric parameter. `Pd` is a given proportion of sample size to the maximum sample size in the disease data and is used to select SNPs. Default `Pd=0.979`.

The standard GWAS cdata set should contain the following columns, in this order: chrn, posit, rsid, a1.x1, a1.x2, ..., a1.xn, freq.x1, freq.x2, ..., freq.xn, beta.x1, beta.x2, ..., beta.xn, sd.x1, sd.x2, ..., sd.xn, pvj, N.x1, N.x2, ..., N.xn, pcj, where x1, x2, ..., xn are the causal variables. The standard GWAS ddata set should contain hg.d, SNP.d, a1.d, freq.d, beta.d, N.case, N.ctr, freq.case. See Examples.

beta: a numeric vector, a column of beta values for regression of SNPs on variable vector X={x1, x2, ..., xn}.
freq: a numeric vector, a column of frequencies of allele 1 with respect to variable vector X={x1, x2, ..., xn}.
sd: a numeric vector, a column of standard deviations of variables x1, x2, ..., xn specific to each SNP. Note that sd here is not the standard deviation of beta. If sd is not specific to SNPs, then sd.xi has the same value for all SNPs in variable i.
d: denotes disease.
N: sample size.
freq.case: frequency of disease (in the example, the proportion of cases, N.case/(N.case+N.ctr)).
chrn: a numeric vector of chromosome numbers.
posit: a numeric vector of SNP positions on the chromosome. Sometimes chrn and posit are combined into a single string vector, hg19 or hg18.
pvj: the p-value for SNP j. pcj and pdj are the proportions of the sample size for SNP j to the maximum sample size in the causal variable data and in the disease data, respectively.
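
The column layout above can be illustrated with a small toy data set. This is only a sketch: the variable names x1 and x2 and all values are made up, pvj is taken as the smallest p-value across the causal variables, and pcj as the summed sample size relative to its maximum, as done in the Examples.

```r
# Toy cdata for two hypothetical causal variables, x1 and x2 (all values are made up).
toy <- data.frame(
  chrn    = c(1, 1, 2),
  posit   = c(1000, 250000, 5000),
  rsid    = c("rs1", "rs2", "rs3"),
  a1.x1   = c("a", "g", "t"),
  a1.x2   = c("a", "g", "t"),
  freq.x1 = c(0.20, 0.45, 0.10),
  freq.x2 = c(0.21, 0.44, 0.12),
  beta.x1 = c(0.05, -0.02, 0.11),
  beta.x2 = c(0.01, 0.03, -0.07),
  sd.x1   = rep(37.42, 3),   # sd not specific to SNPs: same value for every SNP
  sd.x2   = rep(14.87, 3),
  stringsAsFactors = FALSE
)

# Made-up per-variable p-values and sample sizes
p.x1 <- c(1e-9, 0.03, 4e-8)
p.x2 <- c(2e-4, 1e-10, 0.5)
N.x1 <- c(90000, 95000, 60000)
N.x2 <- c(92000, 95000, 61000)

# pvj: smallest p-value across the causal variables for each SNP
toy$pvj <- pmin(p.x1, p.x2)

# Sample sizes follow pvj in the column order
toy$N.x1 <- N.x1
toy$N.x2 <- N.x2

# pcj: summed sample size of each SNP relative to the largest summed sample size
total_n <- N.x1 + N.x2
toy$pcj <- total_n / max(total_n)

names(toy)
```

The resulting columns follow the order chrn, posit, rsid, a1.*, freq.*, beta.*, sd.*, pvj, N.*, pcj.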

Returns a standard SNP beta table or SNP path table containing the m SNPs selected with LG, Pv, Pc and Pd, together with the n causal variables and the disease, for Mendelian randomization and path analysis.
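
How thresholds of this kind might act on a SNP table can be sketched as follows. This is not the mktable implementation: the function name select_snps_sketch is hypothetical, and the sketch assumes that SNPs pass when pvj <= Pv and pcj >= Pc, and that LG is a minimum spacing, in position units, between consecutive retained SNPs on the same chromosome.

```r
# Hypothetical illustration only; mktable's internal selection rules may differ.
select_snps_sketch <- function(d, LG = 1, Pv = 5e-8, Pc = 0.979) {
  # Assumed threshold directions: small p-values and large sample-size proportions pass
  d <- d[d$pvj <= Pv & d$pcj >= Pc, , drop = FALSE]
  d <- d[order(d$chrn, d$posit), , drop = FALSE]

  # Walk along each chromosome and keep a SNP only if it lies at least
  # LG position units beyond the previously kept SNP
  keep <- logical(nrow(d))
  last_chr <- NA
  last_pos <- NA
  for (i in seq_len(nrow(d))) {
    new_chr <- is.na(last_chr) || d$chrn[i] != last_chr
    if (new_chr || d$posit[i] - last_pos >= LG) {
      keep[i] <- TRUE
      last_chr <- d$chrn[i]
      last_pos <- d$posit[i]
    }
  }
  d[keep, , drop = FALSE]
}

# Made-up SNPs for demonstration
snps <- data.frame(
  chrn  = c(1, 1, 1, 2),
  posit = c(100, 150, 900, 50),
  rsid  = c("rs1", "rs2", "rs3", "rs4"),
  pvj   = c(1e-9, 2e-9, 3e-8, 0.01),
  pcj   = c(0.99, 0.98, 0.95, 1.00),
  stringsAsFactors = FALSE
)

picked <- select_snps_sketch(snps, LG = 100, Pv = 5e-8, Pc = 0.979)
picked$rsid
```

In this toy input, rs3 fails the assumed Pc rule and rs4 fails the Pv rule; rs2 passes both but lies only 50 position units from rs1, so it is dropped when LG = 100.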

The column variables must be ordered as chrn, posit, rsid, a1.x1, ..., a1.xn, freq.x1, ..., freq.xn, beta.x1, ..., beta.xn, sd.x1, ..., sd.xn, ...; otherwise mktable will raise an error. See Examples.
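
Because the column order matters, it may help to check a cdata frame before calling mktable. The helpers below are hypothetical and not part of GMRP; they assume the columns are named following the pattern given in Details (a1.x, freq.x, beta.x, sd.x, N.x) and simply compare names, whereas mktable itself may rely on column positions.

```r
# Hypothetical helper: expected cdata column names for a set of causal variables
expected_cdata_cols <- function(vars) {
  c("chrn", "posit", "rsid",
    paste0("a1.", vars), paste0("freq.", vars),
    paste0("beta.", vars), paste0("sd.", vars),
    "pvj", paste0("N.", vars), "pcj")
}

# Hypothetical helper: TRUE if the data frame has exactly these columns in this order
check_cdata_order <- function(cdata, vars) {
  identical(names(cdata), expected_cdata_cols(vars))
}

# Empty toy data frame with correctly ordered columns for variables x1 and x2
good <- expected_cdata_cols(c("x1", "x2"))
df <- as.data.frame(
  setNames(replicate(length(good), numeric(0), simplify = FALSE), good)
)

check_cdata_order(df, c("x1", "x2"))                          # correct order
check_cdata_order(df[, rev(seq_along(good))], c("x1", "x2"))  # reversed order

# Expected layout for the four lipid variables used in the Examples
vars <- c("LDL", "HDL", "TG", "TC")
expected_cdata_cols(vars)
```

For four causal variables this layout has 25 columns, which agrees with the dimension of newdata reported in the Examples.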

Yuan-De Tan tanyuande@gmail.com

Do, R. et al. 2013. Common variants associated with plasma triglycerides and risk for coronary artery disease. Nat Genet 45: 1345-1352.
Sheehan, N.A. et al. 2008. Mendelian randomisation and causal inference in observational epidemiology. PLoS Med 5: e177.
Sheehan, N.A. et al. 2010. Mendelian randomisation: a tool for assessing causality in observational epidemiology. Methods Mol Biol 713: 153-166.
Willer, C.J., Schmidt, E.M., Sengupta, S., Peloso, G.M., Gustafsson, S., Kanoni, S., Ganna, A., Chen, J., Buchkovich, M.L., Mora, S. et al. 2013. Discovery and refinement of loci associated with lipid levels. Nat Genet 45: 1274-1283.

path

data(lpd.data)
#lpd<-DataFrame(lpd.data)
lpd<-lpd.data
data(cad.data)
#cad<-DataFrame(cad.data)
cad<-cad.data
# step 1: calculate pvj
pvalue.LDL<-lpd$P.value.LDL
pvalue.HDL<-lpd$P.value.HDL
pvalue.TG<-lpd$P.value.TG
pvalue.TC<-lpd$P.value.TC
pv<-cbind(pvalue.LDL,pvalue.HDL,pvalue.TG,pvalue.TC)
pvj<-apply(pv,1,min)

#step 2: construct beta table of undefined causal variables:
beta.LDL<-lpd$beta.LDL
beta.HDL<-lpd$beta.HDL
beta.TG<-lpd$beta.TG
beta.TC<-lpd$beta.TC
beta<-cbind(beta.LDL,beta.HDL,beta.TG,beta.TC)

#step 3: construct a matrix for allele 1 in each undefined causal variable:
a1.LDL<-lpd$A1.LDL
a1.HDL<-lpd$A1.HDL
a1.TG<-lpd$A1.TG
a1.TC<-lpd$A1.TC
alle1<-cbind(a1.LDL,a1.HDL,a1.TG,a1.TC)

#step 4: calculate sample sizes of causal variables and calculate pcj
N.LDL<-lpd$N.LDL
N.HDL<-lpd$N.HDL
N.TG<-lpd$N.TG
N.TC<-lpd$N.TC
ss<-cbind(N.LDL,N.HDL,N.TG,N.TC)
sm<-apply(ss,1,sum)
pcj<-sm/max(sm)

#step 5: construct a matrix for frequency of allele1 in each undefined causal variable in 1000G.EUR
freq.LDL<-lpd$Freq.A1.1000G.EUR.LDL
freq.HDL<-lpd$Freq.A1.1000G.EUR.HDL
freq.TG<-lpd$Freq.A1.1000G.EUR.TG
freq.TC<-lpd$Freq.A1.1000G.EUR.TC
freq<-cbind(freq.LDL,freq.HDL,freq.TG,freq.TC)

#step 6: construct matrix for sd of each causal variable (here sd is not specific to SNPj)
# the sd values were averaged over 63 studies see reference Willer et al(2013) 
sd.LDL<-rep(37.42,length(pvj))
sd.HDL<-rep(14.87,length(pvj))
sd.TG<-rep(92.73,length(pvj))
sd.TC<-rep(42.74,length(pvj))
sd<-cbind(sd.LDL,sd.HDL,sd.TG,sd.TC)

#step 7: retrieve SNP ID and position:
hg19<-lpd$SNP_hg19.HDL
rsid<-lpd$rsid.HDL

#step 8: invoke chrp to separate chromosome number and SNP position:
chr<-chrp(hg=hg19)

#step 9: get new data of causal variables:
newdata<-cbind(freq,beta,sd,pvj,ss,pcj)
newdata<-cbind(chr,rsid,alle1,as.data.frame(newdata))
dim(newdata)
#[1] 120165     25

#step 10: retrieve cad data from cad and calculate pdj and frequency of cad in population
hg18.d<-cad$chr_pos_b36
SNP.d<-cad$SNP #SNPID
a1.d<-tolower(cad$reference_allele)
freq.d<-cad$ref_allele_frequency
pvalue.d<-cad$pvalue
beta.d<-cad$log_odds
N.case<-cad$N_case
N.ctr<-cad$N_control
N.d<-N.case+N.ctr
freq.case<-N.case/N.d


#step 11: get new cad data:
newcad<-cbind(freq.d,beta.d,N.case,N.ctr,freq.case)
newcad<-cbind(hg18.d,SNP.d,a1.d,as.data.frame(newcad))
dim(newcad)

#step 12: give variable list
varname<-c("CAD","LDL","HDL","TG","TC")
#step 13: create beta table with function mktable
mybeta<-mktable(cdata=newdata,ddata=newcad,rt="beta",varname=varname,LG=1,Pv=5e-8,
Pc=0.979,Pd=0.979)

beta<-mybeta[,4:8] # save beta for path analysis
snp<-mybeta[,1:3] # save snp for annotation analysis
beta<-DataFrame(beta)