knitr::opts_chunk$set(echo=T, warning=F, message=F, cache=F, results='hold'); Eval = require(ggplot2)&require(reshape2);
Consider the following two-component normal mixture:
$$ \pi_{0}N(\mu_{0},\sigma_{0}^2) + (1-\pi_{0})N(\mu_{1},\sigma_{1}^{2}) $$ Below, $2\times 10^{3}$ observations are drawn from the two-component normal mixture with null proportion $\pi_{0}=0.95$, null parameters $(\mu_{0}=0,\sigma_{0}^{2}=1)$, and alternative parameters $(\mu_{1}=2,\sigma_{1}^{2}=1)$:
library(LFDR); set.seed(100); # Data generation D = rNM(n=2e3,k=2,pi=c(0.95,0.05),m=c(0,2),v=c(1,1)); # True component Comp = D[,1]; # Observation z = D[,2];
An exponential family density of order $d=5$ is fit to the sample by using 1. a natural spline basis, with knots on the sample quantiles, and 2. an orthogonal polynomial basis.
# Fit density via natural splines fit.ns = Fit.EF(z=z,b=100,d=5,method="ns"); # Fit density via orthogonal polynomials fit.poly = Fit.EF(z=z,b=100,d=5,method="poly");
# Plotting fitted mixture density # Observed data D = data.frame("z"=z); # Evaluation grid x = seq(from=1.1*min(z),to=1.1*max(z),length.out=1001); G = data.frame("x"=x,"NS"=fit.ns$f(x),"Poly"=fit.poly$f(x)); G = melt(G,id.vars="x"); colnames(G) = c("x","Method","y"); # Plot Palette = c(rgb(230,190,0,max=255),rgb(0,120,200,max=255)); q = ggplot() + theme_bw(); q = q + geom_histogram(data=D,aes(x=z,y=..density..),fill="gray",alpha=0.8,bins=50); q = q + geom_line(data=G,aes(x=x,y=y,color=Method)); q = q + scale_color_manual(values=Palette); q = q + theme(legend.position=c(0.8,0.8)); q = q + labs("x"="Observation","y"="Density"); q = q + ggtitle("Fitted Mixture Densities"); print(q);
A sample $(z_{i}^{*})$ of size $n=10^{4}$ is drawn from fitted exponential family using the accept-reject (AR) method. The proposal distribution is normal, centered on the mean of the $(z_{i})$, with twice the standard deviation of the $(z_{i})$.
# Sample fitted density S = Sample.EF(z=z,f=fit.poly$f,m=mean(z),v=2*var(z),n=1e4,parallel=F);
The AR sample $(z_{i}^{*})$ allows for estimation of moments for the fitted exponential family. To match the generative model, the AR sample should have mean $2\pi_{0} = 0.1$ and variance $1+4\pi_{0}(1-\pi_{0}) \approx 1.2$:
# Mean of AR sample round(mean(S),digits=2); # Variance of AR sample round(var(S),digits=2);
The empirical null distribution is taken as normal, however the location and scale are not required to have the theoretical values of $\mu_{0}=0$ and $\sigma_{0}^{2}=1$. Rather, these parameters, and the proportion of observations $\pi_{0}$ arising from the null density, are estimated by central matching CM
or maximum likelihood ML
.
Estimation requires specification of a null neighborhood $A_{0} = [L,U]$. Observations falling in the null neighborhood are assumed to have arisen from the null density. In central matching, $A_{0}$ is partitioned as $L = \xi_{0} < \cdots < \xi_{K} = U$. The log mixture density $\ln\hat{f}$ is evaluated over the partition points, and the evaluations $\ln\hat{f}{i}$ are regressed on a quadratic function of the input. In maximum likelihood, $(\mu{0},\sigma_{0}^{2})$ are estimated by fitting a truncated normal distribution to $A_{0}$. The null proportion is obtained from $(\mu_{0},\sigma_{0}^{2})$ and the proportion of $z$ scores falling in $A_{0}$.
# Estimation by central matching M0 = eNull(z=z,L=-2,U=2,method="CM",f=fit.poly$f); # Estimation by maximum likelihood M1 = eNull(z=z,L=-2,U=2,method="ML"); # Tabulate estimates nFit = round(rbind(as.numeric(M0),as.numeric(M1)),digits=2); colnames(nFit) = c("Mean","Var","pi0"); print(nFit);
The local false discovery rate (lfdr) at $\zeta$ estimates the posterior probability that an observation with value $\zeta$ arose from the null component of the mixture:
$$ \text{lfdr}(\zeta) = \frac{\pi_{0}f_{0}(\zeta)}{\pi_{0}f_{0}(\zeta) + (1-\pi_{0})f_{1}(\zeta)} $$
The tail false discovery rate (tFDR) at $\zeta$ estimates the posterior probability that an observations with value as or more extreme than $\zeta$ arose from the null component of the mixture:
$$ \text{tFDR}(\zeta) = E[\text{fdr}(Z)|Z\geq \zeta] = \frac{\pi_{0}S_{0}(\zeta)}{\pi_{0}S_{0}(\zeta)+(1-\pi_{0})S_{1}(\zeta)} $$
The function FDR
estimates the local and tail false discovery rate functions using the observations z
, a fitted mixture density $f(z)=\pi_{0}f_{0}(z)+(1-\pi_{0})f_{1}(z)$, and the parameters $(\mu_{0},\sigma_{0}^{2},\pi_{0})$ of the empirical null. Two functions are returned, lfdr
and tFDR
for calculating the local and tail false discovery rates, respectively.
# Estimate FDR curves L = FDR(z=z,f=fit.poly$f,b=80,m=M1$m0,v=M1$v0,p0=M1$p0); # Local FDR lfdr = L$lfdr; # Tail FDR tFDR = L$tFDR; # Evaluations for the observed data l0 = lfdr(z); t0 = tFDR(z);
# Evaluation grid x = seq(from=1.1*min(z),to=1.1*max(z),length.out=1001); # FDRs H = data.frame("x"=x,"lfdr"=lfdr(x),"tFDR"=tFDR(x)); H = melt(H,id.vars="x"); colnames(H) = c("x","FDR","y"); # Plot Palette = c(rgb(230,190,0,max=255),rgb(0,120,200,max=255)); q = ggplot() + theme_bw(); q = q + geom_line(data=H,aes(x=x,y=y,color=FDR)); q = q + geom_hline(yintercept=0.2,color="gray",linetype="dashed"); q = q + scale_color_manual(values=Palette); q = q + theme(legend.position=c(0.9,0.8)); q = q + labs("x"="Observation","y"="Rate"); q = q + ggtitle("Estimated Local and Tail FDRs"); print(q);
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.