Package overview

This package allows users to estimate the science-wise false discovery rate of @JagerEtAl2013, using an EM approach to account for the rounding and censoring of reported p-values. It also allows users to estimate the proportion of true null hypotheses in the presence of covariates, using a regression framework, as in @BocaEtAl2015.

The package is loaded using:

library(swfdr)

Estimating the science-wise false discovery rate

The science-wise false discovery rate (swfdr) is defined in @JagerEtAl2013 as the rate at which published research results are false positives. It is based on p-values reported in biomedical journals, assuming that, among the p-values below 0.05, those corresponding to false positives are distributed as U$(0, 0.05)$, while those corresponding to true positives are distributed as tBeta$(\alpha, \beta; 0.05)$, where $\alpha$ and $\beta$ are unknown and tBeta is a Beta distribution truncated at $0.05$. The estimation of the swfdr is complicated by truncation (e.g. reporting p < 0.01) and rounding (e.g. p-values are often rounded to two significant digits). An EM algorithm is used to estimate $\alpha$ and $\beta$, as well as the swfdr.
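
To build intuition for this mixture, here is a minimal simulation sketch (not part of the package); the values chosen for pi0, alpha, and beta are arbitrary illustrative assumptions:

set.seed(1)
m <- 1000                                   # number of p-values below 0.05
pi0 <- 0.2                                  # assumed fraction of false positives
alpha <- 1; beta <- 50                      # assumed Beta parameters
null <- rbinom(m, 1, pi0) == 1
p <- numeric(m)
p[null] <- runif(sum(null), 0, 0.05)        # false positives: U(0, 0.05)
p[!null] <- qbeta(runif(sum(!null)) * pbeta(0.05, alpha, beta),
                  alpha, beta)              # true positives: tBeta(alpha, beta; 0.05)
hist(p, breaks = 50)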

Example: Estimate the swfdr based on p-values from biomedical journals

We include a dataset containing 15,653 p-values from articles in 5 biomedical journals (American Journal of Epidemiology, BMJ, JAMA, Lancet, New England Journal of Medicine), over 11 years (2000-2010). The p-values were obtained by web scraping, using the code at https://github.com/jtleek/swfdr/blob/master/getPvalues.R, and the dataset is already loaded in the package.

colnames(journals_pVals)

A description of the variables in journals_pVals can be found on its help page. In particular, pvalue gives the p-value, pvalueTruncated is a flag for whether it is truncated, year is the year of publication, and journal is the journal.
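
A quick look at the first few rows of the dataset:

head(journals_pVals)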

table(journals_pVals$year)
table(journals_pVals$journal)

The calculateSwfdr function

This function estimates the swfdr. Its main inputs are the vector of p-values, indicators of whether each p-value is truncated and whether it is rounded, and the number of EM iterations; the full set of parameters is described on its help page.
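
The full argument list, including default values, can be inspected directly (the exact output depends on the installed version of the package):

args(calculateSwfdr)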

Given that it runs an EM algorithm, it is somewhat computationally intensive. We show an example of applying it to the p-values from abstracts of articles published in the American Journal of Epidemiology in 2005. First, we subset journals_pVals, keeping only the p-values below $0.05$, as in @JagerEtAl2013:

journals_pVals1 <- dplyr::filter(journals_pVals,
                                 year == 2005,
                                 journal == "American Journal of Epidemiology",
                                 pvalue < 0.05)

dim(journals_pVals1)

Next, we define vectors corresponding to the truncation status and the rounding status (defined as rounding to 2 significant digits) and use these vectors, along with the vector of p-values and the number of EM iterations, as inputs to the calculateSwfdr function:

tt <- data.frame(journals_pVals1)[, 2]     # truncation status
rr <- rep(0, length(tt))                   # rounding status
## flag non-truncated p-values that are equal to their value
## rounded to 2 decimal places
rr[tt == 0] <- (data.frame(journals_pVals1)[tt == 0, 1] ==
                  round(data.frame(journals_pVals1)[tt == 0, 1], 2))
pVals <- data.frame(journals_pVals1)[, 1]  # vector of p-values
resSwfdr <- calculateSwfdr(pValues = pVals,
                           truncated = tt,
                           rounded = rr, numEmIterations = 100)
names(resSwfdr)

The function returns a list whose elements include pi0 (the estimated swfdr), along with the estimates of $\alpha$ and $\beta$ and related quantities; see the help page for a full description.

Results from example dataset

For the example dataset we considered, the results are as follows:

resSwfdr

Thus, the estimated swfdr for papers published in the American Journal of Epidemiology in 2005 is given by resSwfdr$pi0, i.e. the expected fraction of discoveries - defined as associations with p < 0.05 - that are false.
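
The estimate and the corresponding percentage can be extracted directly from the returned list:

round(resSwfdr$pi0, 3)        # estimated swfdr
round(resSwfdr$pi0 * 100, 1)  # as a percentage of discoveries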

Estimating the proportion of true null hypotheses in the presence of covariates

As in @BocaEtAl2015, we denote by $\pi_0(x)$ the proportion of true null hypotheses as a function of a covariate $x$. This is estimated based on a list of p-values $p_1,\ldots,p_m$ corresponding to a set of null hypotheses, $H_{01},\ldots,H_{0m}$, and a design matrix $X$. The design matrix encodes relevant meta-data, which may be valuable for improving estimation of the prior probability that a hypothesis is true or false.
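
The intuition, as in @BocaEtAl2015, is that null p-values fall above a threshold $\lambda$ with probability $1-\lambda$, so regressing the scaled indicators $\mathbf{1}(p_i > \lambda)/(1-\lambda)$ on $X$ estimates $\pi_0(x)$. Below is a minimal sketch of this single-$\lambda$ regression step on simulated data (lm_pi0 repeats it over a grid of $\lambda$ values and smooths the results):

set.seed(1)
m <- 500
x <- runif(m)                                # a single covariate
null <- rbinom(m, 1, 0.9 - 0.5 * x) == 1     # true pi0(x) = 0.9 - 0.5 * x
p <- ifelse(null, runif(m), rbeta(m, 1, 20)) # null: U(0, 1); alternative: Beta(1, 20)
lambda <- 0.8
y <- (p > lambda) / (1 - lambda)             # E[y | x] is approximately pi0(x)
fit <- lm(y ~ x)
head(pmin(pmax(fitted(fit), 0), 1))          # estimates of pi0(x), truncated to [0, 1]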

Example: Adjust for sample size and allele frequency in BMI GWAS meta-analysis

We consider an example from the meta-analysis of data from a genome-wide association study (GWAS) for body mass index (BMI) from @LockeEtAl2015. A subset of this data, corresponding to 50,000 single nucleotide polymorphisms (SNPs), is already loaded with the package.

head(BMI_GIANT_GWAS_sample)
dim(BMI_GIANT_GWAS_sample)

A description of the variables in BMI_GIANT_GWAS_sample can be found on its help page. In particular, p gives the p-values for the association between the SNPs and BMI; N gives the total sample size considered in the study of a particular SNP; and Freq_MAF_Hapmap gives the frequency of the minor (less frequent) allele for a particular SNP in HapMap. The column Freq_MAF_Int_Hapmap provides 3 approximately equal intervals for the HapMap MAFs:

table(BMI_GIANT_GWAS_sample$Freq_MAF_Int_Hapmap)

The lm_pi0 function

This function estimates $\pi_0(x)$. Its main inputs are the vector of p-values and the design matrix $X$; the full set of parameters is described on its help page.
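
As before, the argument list and its defaults can be inspected directly (the exact output depends on the installed version of the package):

args(lm_pi0)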

To apply it to the BMI GWAS dataset, we first create the design matrix, using natural cubic splines with 5 degrees of freedom to model N and 3 discrete categories for the MAFs:

X <- model.matrix(~ splines::ns(N,5) + Freq_MAF_Int_Hapmap, data = BMI_GIANT_GWAS_sample)[,-1]
head(X)
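
The intercept column is dropped via [,-1], leaving 5 natural cubic spline basis columns for N and 2 indicator columns for the 3 MAF categories:

dim(X)    # number of columns: 5 spline terms + 2 MAF category indicators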

We then run the lm_pi0 function:

pi0x <- lm_pi0(BMI_GIANT_GWAS_sample$p, X=X)
names(pi0x)

The function returns a list whose elements include pi0.lambda (the estimates of $\pi_0(x)$ at each value of the tuning parameter $\lambda$), lambda (the values of $\lambda$ considered), and pi0 (the final smoothed estimates); see the help page for a full description.

Results from BMI GWAS meta-analysis example

We first add the estimates of $\pi_0(x)$ for $\lambda=0.8$, $\lambda=0.9$, and the final smoothed value to the BMI_GIANT_GWAS_sample object:

BMI_GIANT_GWAS_sample$fitted0.8 <- pi0x$pi0.lambda[,round(pi0x$lambda,2)==0.8]
BMI_GIANT_GWAS_sample$fitted0.9 <- pi0x$pi0.lambda[,round(pi0x$lambda,2)==0.9]
BMI_GIANT_GWAS_sample$fitted.final.smooth <- pi0x$pi0

We next create a long data frame so that we can use the plotting tools in ggplot2:

library(reshape2)
ldf <- reshape2::melt(BMI_GIANT_GWAS_sample,
                      id.vars=colnames(BMI_GIANT_GWAS_sample)[-grep("fitted",
                                                                    colnames(BMI_GIANT_GWAS_sample))],
                      value.name = "pi0",variable.name = "lambda")
ldf$lambda <- as.character(ldf$lambda)
ldf$lambda[ldf$lambda=="fitted0.8"] <- "lambda=0.8"
ldf$lambda[ldf$lambda=="fitted0.9"] <- "lambda=0.9"
ldf$lambda[ldf$lambda=="fitted.final.smooth"] <- "final smoothed pi0(x)"

head(ldf)

The plot of the estimates of $\pi_0(x)$ against the sample size $N$, stratified by the MAF categories, can thus be obtained:

library(ggplot2)
ggplot(ldf, aes(x=N, y=pi0))+
  geom_line(aes(col=Freq_MAF_Int_Hapmap, linetype=lambda)) +
  ylab("Estimated proportion of nulls") +
  guides(color=guide_legend(title="MAF in HapMap CEU population"),
         linetype=guide_legend(title="Estimate"))

References

Boca SM, Leek JT (2018). A direct approach to estimating false discovery rates conditional on covariates. PeerJ, 6:e6035.

Jager LR, Leek JT (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1):1-12.

Locke AE, et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538):197-206.