In zrmacc/LFDR: Local False Discovery Rate Estimation

knitr::opts_chunk$set(echo=T, warning=F, message=F, cache=F, results='hold');
Eval = require(ggplot2)&require(reshape2);

Mixture Density
Null Density
Local FDR

Mixture Density

Density Estimation

Consider the following two-component normal mixture:

$$ \pi_{0}N(\mu_{0},\sigma_{0}^2) + (1-\pi_{0})N(\mu_{1},\sigma_{1}^{2}) $$ Below, $2\times 10^{3}$ observations are drawn from the two-component normal mixture with null proportion $\pi_{0}=0.95$, null parameters $(\mu_{0}=0,\sigma_{0}^{2}=1)$, and alternative parameters $(\mu_{1}=2,\sigma_{1}^{2}=1)$:

library(LFDR);
set.seed(100);
# Data generation
D = rNM(n=2e3,k=2,pi=c(0.95,0.05),m=c(0,2),v=c(1,1));
# True component
Comp = D[,1];
# Observation
z = D[,2];

An exponential family density of order $d=5$ is fit to the sample by using 1. a natural spline basis, with knots on the sample quantiles, and 2. an orthogonal polynomial basis.

# Fit density via natural splines
fit.ns = Fit.EF(z=z,b=100,d=5,method="ns");
# Fit density via orthogonal polynomials
fit.poly = Fit.EF(z=z,b=100,d=5,method="poly");

# Plotting fitted mixture density

# Observed data
D = data.frame("z"=z);
# Evaluation grid
x = seq(from=1.1*min(z),to=1.1*max(z),length.out=1001);
G = data.frame("x"=x,"NS"=fit.ns$f(x),"Poly"=fit.poly$f(x));
G = melt(G,id.vars="x"); 
colnames(G) = c("x","Method","y");
# Plot
Palette = c(rgb(230,190,0,max=255),rgb(0,120,200,max=255));
q = ggplot() + theme_bw();
q = q + geom_histogram(data=D,aes(x=z,y=..density..),fill="gray",alpha=0.8,bins=50);
q = q + geom_line(data=G,aes(x=x,y=y,color=Method));
q = q + scale_color_manual(values=Palette);
q = q + theme(legend.position=c(0.8,0.8));
q = q + labs("x"="Observation","y"="Density");
q = q + ggtitle("Fitted Mixture Densities");
print(q);

Density Sampling

A sample $(z_{i}^{*})$ of size $n=10^{4}$ is drawn from fitted exponential family using the accept-reject (AR) method. The proposal distribution is normal, centered on the mean of the $(z_{i})$, with twice the standard deviation of the $(z_{i})$.

# Sample fitted density
S = Sample.EF(z=z,f=fit.poly$f,m=mean(z),v=2*var(z),n=1e4,parallel=F);

The AR sample $(z_{i}^{*})$ allows for estimation of moments for the fitted exponential family. To match the generative model, the AR sample should have mean $2\pi_{0} = 0.1$ and variance $1+4\pi_{0}(1-\pi_{0}) \approx 1.2$:

# Mean of AR sample
round(mean(S),digits=2);
# Variance of AR sample
round(var(S),digits=2);

Empirical Null Distribution

Estimation

The empirical null distribution is taken as normal, however the location and scale are not required to have the theoretical values of $\mu_{0}=0$ and $\sigma_{0}^{2}=1$. Rather, these parameters, and the proportion of observations $\pi_{0}$ arising from the null density, are estimated by central matching CM or maximum likelihood ML.

Estimation requires specification of a null neighborhood $A_{0} = [L,U]$. Observations falling in the null neighborhood are assumed to have arisen from the null density. In central matching, $A_{0}$ is partitioned as $L = \xi_{0} < \cdots < \xi_{K} = U$. The log mixture density $\ln\hat{f}$ is evaluated over the partition points, and the evaluations $\ln\hat{f}{i}$ are regressed on a quadratic function of the input. In maximum likelihood, $(\mu{0},\sigma_{0}^{2})$ are estimated by fitting a truncated normal distribution to $A_{0}$. The null proportion is obtained from $(\mu_{0},\sigma_{0}^{2})$ and the proportion of $z$ scores falling in $A_{0}$.

# Estimation by central matching
M0 = eNull(z=z,L=-2,U=2,method="CM",f=fit.poly$f);
# Estimation by maximum likelihood
M1 = eNull(z=z,L=-2,U=2,method="ML");
# Tabulate estimates
nFit = round(rbind(as.numeric(M0),as.numeric(M1)),digits=2);
colnames(nFit) = c("Mean","Var","pi0");
print(nFit);

Local FDR

The local false discovery rate (lfdr) at $\zeta$ estimates the posterior probability that an observation with value $\zeta$ arose from the null component of the mixture:

$$ \text{lfdr}(\zeta) = \frac{\pi_{0}f_{0}(\zeta)}{\pi_{0}f_{0}(\zeta) + (1-\pi_{0})f_{1}(\zeta)} $$

The tail false discovery rate (tFDR) at $\zeta$ estimates the posterior probability that an observations with value as or more extreme than $\zeta$ arose from the null component of the mixture:

$$ \text{tFDR}(\zeta) = E[\text{fdr}(Z)|Z\geq \zeta] = \frac{\pi_{0}S_{0}(\zeta)}{\pi_{0}S_{0}(\zeta)+(1-\pi_{0})S_{1}(\zeta)} $$

The function FDR estimates the local and tail false discovery rate functions using the observations z, a fitted mixture density $f(z)=\pi_{0}f_{0}(z)+(1-\pi_{0})f_{1}(z)$, and the parameters $(\mu_{0},\sigma_{0}^{2},\pi_{0})$ of the empirical null. Two functions are returned, lfdr and tFDR for calculating the local and tail false discovery rates, respectively.

# Estimate FDR curves
L = FDR(z=z,f=fit.poly$f,b=80,m=M1$m0,v=M1$v0,p0=M1$p0);
# Local FDR
lfdr = L$lfdr;
# Tail FDR
tFDR = L$tFDR;
# Evaluations for the observed data
l0 = lfdr(z);
t0 = tFDR(z);

# Evaluation grid
x = seq(from=1.1*min(z),to=1.1*max(z),length.out=1001);
# FDRs
H = data.frame("x"=x,"lfdr"=lfdr(x),"tFDR"=tFDR(x));
H = melt(H,id.vars="x");
colnames(H) = c("x","FDR","y");

# Plot
Palette = c(rgb(230,190,0,max=255),rgb(0,120,200,max=255));
q = ggplot() + theme_bw();
q = q + geom_line(data=H,aes(x=x,y=y,color=FDR));
q = q + geom_hline(yintercept=0.2,color="gray",linetype="dashed");
q = q + scale_color_manual(values=Palette);
q = q + theme(legend.position=c(0.9,0.8));
q = q + labs("x"="Observation","y"="Rate");
q = q + ggtitle("Estimated Local and Tail FDRs");
print(q);