title: "repfdr: Package for Replicability Analysis for Multiple Studies of High Dimension" author: "Barak Brill, Ruth Heller, Shay Yaacoby, Daniel Yekutieli" date: "2017 - 07 - 17" output: pdf_document toc: TRUE vignette: | %\VignetteIndexEntry{repfdr} %\VignetteEncoding{UTF-8}
The repfdr package implements the eBayes replicability methodology presented in "Replicability analysis for genome-wide association studies" (Heller and Yekutieli, 2014). This vignette describes the methodology and its outputs, the structure of the package and workflow (data preprocessing, model fitting and output structure for statistical inference), and provides working examples. The Vignette also describes how the methodology may be extended for replicability analysis for a large number of studies. An appendix presents the theoretical aspects for the use of repfdr in this extended setting.
repfdr may be applied in any analysis of the same $M$ set of hypotheses across $N$ studies. For concreteness, in this vignette following Heller and Yekutieli (2014) we consider the example that the multiple studies are $N$ GWAS studying association between the same set of $M$ SNPs to the same phenotype in $N$ different populations. The null hypothesis in each GWAS for each SNP is that the SNP is unassociated to the phenotype in this particular study population. The alternative hypothesis is that the SNP is associated to the phenotype. Heller and Yekutieli (2014) also consider the case where there are two directional alternative hypotheses in each GWAS for each SNP that the SNP is either positively or negatively associated to the phenotype.
Benjamini and Heller (2008) introduced replicability as an extension of multiple testing of the meta-analysis hypotheses. The hypotheses in replicability analysis are with regard to the association status for a single SNP across the $N$ study populations. An example of a replicability null hypothesis is SNP is unassociated with the phenotype in at least $N-1$ study populations. For $u = 0 \cdots N$, Benjamini and Heller (2008) define the $u$ partial conjunction null hypothesis "SNP is associated with the phenotype in at most $u$ populations". For multiple testing of $u$ replicability, Benjamini Heller (2008) suggest computing a p-value for testing the $u$ partial conjunction null hypothesis for each SNP and applying the Benjamini-Hochberg FDR controlling testing procedure to test the $M$ null hypotheses. Heller and Yekutieli (2014) introduced an eBayes FDR controlling methodology for replicability analysis that is shown to be considerably more powerful than the frequentist approach of Benjamini and Heller (2008), in large settings such as GWAS where the eBayes parameters can be well estimated.
The GWAS are indexed by $i = 1,...,N$ and the SNP are indexed by $j = 1,...,M$. $H_{i, j}$ is an indicator variable corresponding to the hypothesis regarding the association of SNP $j$ with the phenotype in GWAS $i$. $H_{i, j}$ may take values in either ${ 0, 1 }$ or ${ -1, 0, 1 }$. For the first option $H_{i,j} = 0$ if SNP $j$ is unassociated with the phenotype in Study $i$ and $H_{i,j} = 1$ if SNP $j$ is associated with the phenotype in Study $i$. The second option allows to further discriminate between the case that SNP $j$ is positively associated with the phenotype ($H_{i,j} = 1$) and the case that SNP $j$ is negatively associated with the phenotype ($H_{i,j} = -1$). The association status of SNP $j$ across the $N$ GWAS is specified by the $n$-vector $H_{j} = ( H_{1, j} \cdots H_{n, j})$. The $u$ partial conjunction null hypothesis is defined $H^0_u = { h : \sum_i I(h_i = 0 ) \le u }$. Let $\pi(h)$ denote the proportion of SNP with association status $h$: $$ \pi(h) := # { j : H_j = h } / M. $$ For $i = 1 \cdots N$, summing $\pi(h)$ over the other $N-1$ studies, yields $\pi_{h_i, i}$ the proportion of SNPs with $H_{i, j} = h_i$.
The input for repfdr is a matrix $Z_{i,j}$, where $Z_{i,j}$ is the Z-score testing the null hypothesis that SNP $j$ is unassociated with the phenotype in GWAS $i$. We assume that conditionally on $H_{i, j}$,$Z_{i,j}$ are independently distributed and that in each study the z-score distribution is the same for all SNPs with the same association status. We further assume that the distribution of $Z_{i,j} | H_{i, j} = 0$ is $N(0,1)$. We denote the distribution of $Z_{i,j} | H_{i, j} = 1$ by $f_{1,i}$ and denote the distribution of $Z_{i,j} | H_{i, j} = -1$ by $f_{-1, i}$. Thus the marginal distribution of the z-scores in Study $i$ is assumed to be drawn from the mixture distribution: $$Z_{i,j} \sim \pi_{0,i} \cdot N(0,1) + \pi_{1,i} \cdot f_{i,1} + \pi_{-1,i} \cdot f_{i,-1}$$
In the Heller and Yekutieli (2014) Bayesian framework for replicability the parameter is $H_{j}$. The prior distribution for $H_{j}$ is
$\pi(h) = \Pr( H_j = h)$. For $Z_j = (Z_{1, j} \cdots Z_{I, j})$, the likelihood is
$$ f( z_j | h) = \Pi_{i=1}^I f_{h_i, i} (z_{i, j}).$$
The posterior probability is given by
$$ \pi(h | z_j) = \Pr( H_j = h | Z_j = z_j ) = \frac{ f( z_j | h) \pi(h)}
{ \sum_h f( z_j | h) \pi(h) }. $$
The local Bayes FDR for the $u$ partial conjunction null hypothesis is defined
$$ fdr_{H^0_u} (z_j) := \Pr(H_{j} \in H^0_u | Z_j = z_j)
= \sum_{h \in H^0_u} \pi(h | z_j). $$
The Bayes FDR corresponding to a rejection region ${\cal Z}$ a subset of $R^I$,
such that observing $Z_j \in {\cal Z}$ implies rejection of the null hypothesis
$H_{i , j} \in H^0_u$, is defined
$$ Fdr_{H^0_u} ({\cal Z}) := \Pr( H_{j} \in H^0_u | Z_j \in {\cal Z}). $$
Where the relation between the local Bayes FDR and the FDR is
$$ Fdr_{H^0_u} ({\cal Z}) = E_{ Z_j \in {\cal Z}} \ fdr^{H^0_u} (z_j). $$
Heller and Yekutieli (2014) further show that the rejection region that yields maximal power of all rejection regions with the a nominal Bayes FDR level $q$ is given by
$$ {\cal Z}{Bayes, H^0_u} = { z_j : fdr{H^0_u} (z_j) \le t(q) }. $$
repfdr uses the empirical distribution of the z-scores in each study to estimate $f_{i,1}$ and $f_{i,-1}$ in the $N$ studies. Once the marginal z-score distributions are specified it is possible to express the complete likelihood for the entire z-score matrix as a function of the set of SNP association status proportions ${ \pi(h): \forall h }$. repfdr uses an EM algorithm to find the set of SNP association status proportions maximizing the complete likelihood.
Once the likelihood and prior are assessed, repfdr computes the local Bayes FDR for each SNP. The SNP are then sorted by their local FDR values, where ${\cal Z}j$ is the indices of the SNP with local FDR values smaller than that of $Z_j$. The estimated replicability FDR for SNP $j$ reported in repfdr is given by cumulative mean of the local FDR values $$ Fdr{H^0_u} (Z_j) := \frac{ \sum_{k \in {\cal Z}j} fdr{H^0_u} (z_k)} { | {\cal Z}_j | } $$
repfdr
package workflow is as follows:
Users are given diagnostic plots at this stage for examining the goodness of fit:
$\hat{\pi}_{0,i}$ - Estimator of fraction of null hypotheses in the ith study.
$\hat{f}{1,i}$ and $\hat{f}{-1,i}$ - Estimators non null densities of Z scores under positive and negative association, respectivly.
This step is performed with the function ztobins
or twosided.PValues.tobins
.
If the diagnostic plots show poor fit, repfdr
should not be called with the output from ztobins
or twosided.PValues.tobins
. This step is represented in the flow chart below, along with the parameters for the preprocessing functions.
Note that these functions can be bypassed when using the main repfdr
function, by providing as input to repfdr
the user's own estimates of the probability of falling in each bin with each association status.
knitr::include_graphics("repfdr_structure.png")
repfdr
. Input includes the estimated probability of each association status in each bin, as well as a matrix of bin membership for each SNP in each study (see the help page of ztobins
and twosided.PValues.tobins
for details on the outputs pdf.binned.z
and binned.z.mat
)The function computes the $\hat{\pi}_{h}$ matrix, along with the fdr and Fdr.
The function allows several methods of analysis:
Replicability - tests for the claim that a signal is present in at least two studies for a SNP ($H_{i,j} = 1 or -1,$ for at least two studies, with the same directionality).
Meta-Analysis - tests for the claim that a signal is present in at least one study ($H_{i,j} = 1 or -1,$ for at least one $i$)
user-defined - e.g., $u$-Replicability, which tests for the claim that a signal is present in at least $u+1$ studies.
See the help page for the function repfdr
, for details on how to select the type of inference required, along with other parameters for the algorithm.
repfdr
function in step 2 (see function help page for additional details) or using a specific call to the functon ldr
.The following is an example for running a complete procedure via repfdr. We first simulate out data, two studies with 5e4 SNPs each. the first 4000 SNPs are non null, with signal sampled from $N(0,3^2)$. Normal noise is added for the measured Z score in each non-null SNP.
library(repfdr) # Generate two studies,first 1000 SNPS are non null set.seed(1) my_zmat = matrix(rnorm(50000*2),ncol = 2) n1 = 4000 Signal = rnorm(n1,0,3) my_zmat[1:(n1),1] = Signal + rnorm(n1,0,0.5) my_zmat[1:(n1),2] = Signal + rnorm(n1,0,0.5)
We used ztobins
with the default settings take into account directionality of the signal.
The non-null probabilities $\hat{f}{i,-1}$ and $\hat{f}{i,1}$ are displayed in red and orange curves, respectively, in the diagnostic plot histograms.
A second plot for each study is a Q-Q plot of the observed quantiles versus the standard normal quantiles (the black curve). The red line is the quantiles of a normal with estimated moments versus the standard normal quantiles. The dashed black line is $y=x$. A single black dot marks the $(0,0)$ point on the Q-Q plot. A lack of fit between the red line (or data) to the black dashed line indicates data under the null (no signal) may not be standard normal.
If the null distribution is not standard normal, data has to be transformed in a way which make the test statistics standard normal under the null prior to input into ztobins
. The section for model misspecification shows an example of model misfit, that could be identified by the user with these diagnostic plots.
# perform discretization, visualize results res = ztobins(zmat = my_zmat, plot.diagnostics = T,df = 20,n.bins = 50,trim.z = F)
We perform replicability analysis via the repfdr
function.
# perform replicability analysis repfdr.res = repfdr(res$pdf.binned.z, res$binned.z.mat, non.null = 'replication', control = em.control(tol = 1e-4,verbose = F))
We take a look at the results:
#probability for each null/non null combination repfdr.res$Pi #fdr and Fdr for each SNP head(repfdr.res$mat) # computing local FDR ldr.res = ldr(res$pdf.binned.z, res$binned.z.mat,repfdr.res$Pi) #results for local ldr matrix dim(ldr.res) # ldr for first three SNPs: each row is a different configuration, SNPs 1,2, and 3 in columns 3,4 and 5, respectively. ldr.res[,(1:(3+2))] #how many SNPs are to be reported at q=0.05 length(which(repfdr.res$mat[,2] <= 0.05))
We now show how two sided P-Values can be used for input into repfdr
via the twosided.PValues.tobins
function. This function should be used only for non-directional inference (e.g., identifying that at least two studies contain signal, regardless of the association direction).
#### #DATA GENERATION #### # we generate a dataset with p=10000 pvalues for two studies, # p1=300 of which are non null: set.seed(1) p = 10000 p1 = 1000 z1 = (rnorm(p)) z2 = (rnorm(p)) temp = rnorm(p1, 0,2) z1[1:p1] = temp + rnorm(p1,0,0.2) z2[1:p1] = temp + rnorm(p1,0,0.2) zmat.example = cbind(z1,z2) #convert to 2-sided P-Values pmat.example = 1-(pnorm(abs(zmat.example)) - pnorm(-1*abs(zmat.example))) #### #Discretization for two sided P-values. #### twosided.pval.res = twosided.PValues.tobins(pmat.example, plot.diagnostics = T,trim.z = T) repfdr.res = repfdr(twosided.pval.res$pdf.binned.z, twosided.pval.res$binned.z.mat, non.null = 'replication', control = em.control(tol = 1e-4,verbose = F)) length(which(repfdr.res$mat[,2] <= 0.05))
We show two examples users should learn to avoid: Spline misfit (ztobins
unable to estiamte density of data) and non-normal null ditribution in data.
We generate a sample with a complex and non sparse non null signal. We show how ztobins
warns the user regarding failure in density estimation. A typical solution is to use more degrees of freedom in spline estimation.
set.seed(1) p = 10000 p1 = 1000 z1 = (rnorm(p)) z2 = (rnorm(p)) temp = 1*rchisq(p1, 2,2) z1[1:p1] = temp + rnorm(p1,0,0.2) z2[1:p1] = temp + rnorm(p1,0,0.2) res = ztobins(zmat = cbind(z1,z2), plot.diagnostics = T,n.bins = 50)
Warning messages are also found in the reported object
res$PlotWarnings
The problem can be solved by using a higher number of degrees of freedom for spline estimation:
res = ztobins(zmat = cbind(z1,z2), plot.diagnostics = F,df = 15, n.bins = 50)
We generate data where the null distribution is not standard normal, and show that this can be detected via the Q-Q plot.
set.seed(1) p = 10000 p1 = 50 z1 = (rnorm(p)) z2 = (rnorm(p)) z1 = z1 * (0.5+0.2*(abs(z1))^2); z1 = z1/ sd(z1) z2 = z2 * (0.5+0.2*(abs(z2))^2); z2 = z2/ sd(z2) temp = rnorm(p1, 0,2.5) z1[1:p1] = temp + rnorm(p1,0,0.2) z2[1:p1] = temp + rnorm(p1,0,0.2) res = ztobins(zmat = cbind(z1,z2), plot.diagnostics = T,n.bins = 50)
For $N$ independent studies, the computation can be made efficient if they are known to be clustered into $C$ independent clusters, i.e., where the probability of association configuration in one cluster is independent of the probability of an association configuration in another cluster. For example, in GWAS, if one can group the independent studies to groups such that pleiotropy occurs only within clusters, not across clusters.
The workflow of the package for analysis of clustered studies is similar to the single cluster (i.e. no clusters) case. The workflow is depicted in the diagram below. Initially, the density of test statistics in each study is estimated via ztobins
. Next, repfdr
is called, with the argument clusters
. clusters
is a vector, partitioning the studies into independent clusters (4 clusters, of size 2, in the example below). repfdr
is called for each cluster separately. The results for all clusters are aggregated via the algorithm described in the appendix (repfdr on clusters algorithm).
knitr::include_graphics("repfdr_clusters.png")
The following example shows how repfdr can be used for performing analysis on clusters. We show an example with 8 clusters, 2 studies in each cluster. 1000 SNPs are chosen to be non null in each cluster, independently of other clusters.
We generate the data:
#function generates data with 10^5 SNPS, 1000 are non null in each cluster. demo_data = function(seed=1,p=100000,p1=1000){ set.seed(seed) Signal = rep(F,p) tau = 2 m_studies = 16 X = matrix(rnorm(p*m_studies),nrow = p) for(i in 1:8){ current_signal = sample(1:p,p1,F) Signal[current_signal] = T temp1 = rnorm(p1,0,tau) X[current_signal , 2*i-1] = temp1 + rnorm(p1,0,0.5) X[current_signal , 2*i ] = temp1 + rnorm(p1,0,0.5) } pX = 2*pnorm(abs(X),lower.tail = F) ret = list() ret$X = X ret$pX = pX ret$Signal = Signal return(ret) } dat = demo_data() pX = dat$pX X = dat$X Signal = dat$Signal
We call ztobins
:
ztb = ztobins(X,df = 20) pdf.binned.z = ztb$pdf.binned.z binned.z.mat = ztb$binned.z.mat
We now run repfdr on clusters, where we input the cluster membership.
# we have 8 clusters, two observations in each cluster clusters = as.integer(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)) res = repfdr(pdf.binned.z,binned.z.mat,non.null = 'replication', clusters = clusters, clusters.ldr.report = 'ALL', clusters.verbose = T, control = em.control(verbose = F,tol = 1e-4)) #These are the local FDR for each SNP dim(res$mat) head(res$mat[which(Signal),]) # These are the local FDR for each non null combination and SNP dim(res$ldr) View(res$ldr) c# Pi matrix: dim(res$Pi) head(res$Pi) # we display the results #how many are significant after correcting to multiplicity #saving results for plot, due to runtime restrictions on CRAN #ind=which(-log10(res$mat[,2])>1) #Fdr = res$mat[ind,2] #Signal.plot = Signal[ind] #Manhattan_plot_data = data.frame(ind = ind,Fdr = Fdr,Signal = Signal.plot) #save(Manhattan_plot_data,file = '/home/barak/repfdr_Project/repfdr/vignettes/Man_Plot.Rdata')
res$ldr
provides the estimated posterior probability that the number of studies with association status $\left( -1, 0, 1 \right)$ is exactly $\left( x, y, z \right)$ if nr.association = 3
, or $\left( 0,1\right)$ is exactly $\left( x, y \right)$ if nr.association =2
. For example, $\hat{Pr}\left(\left( #h_i = -1, #h_i = 0, #h_i = 1\right) = \left(2,14,0\right)\right) = 4.2\cdot 10^{-7}$ for SNP1 in the table
#knitr::include_graphics("h_clusters.jpg") #save(plot_ldr,file = '/home/barak/repfdr_Project/repfdr/vignettes/plot_ldr.Rdata') load('plot_ldr.Rdata') knitr::kable(plot_ldr,format = "latex",digits = 9)
We draw a manhattan plot, in blue are the SNPs with non null signal. Red SNPs are false positives. Manhattan plot has been truncated at $-log10(Fdr) = 1$.
load('Man_Plot.Rdata') #Manhattan plots col_vec = rep(1,nrow(Manhattan_plot_data)) + 3*Manhattan_plot_data$Signal + 1*(!Manhattan_plot_data$Signal & Manhattan_plot_data$Fdr<=0.05) plot(Manhattan_plot_data$ind,-log10(Manhattan_plot_data$Fdr),ylab = '-log(Fdr)',xlab = 'SNP Index', col = col_vec, main = 'Manhattan plot ( truncated at -log10(Fdr) = 1) ',cex=0.3,pch=20) abline(h = -log10(0.05),col = 2) Reported = which(Manhattan_plot_data$Fdr <= 0.05) length(Reported) FP = which(Reported %in% which(!Manhattan_plot_data$Signal)) length(FP)
Where we have 1154 reported SNPs, 57 of which are false positive.
We describe the algorithm for the case nr.association=3. For the nr.association=2 case the implementation is similar.
Let $bin_{c,i} (j)$ be the bin number of the jth SNP, in the ith study of the cth cluster. The number of studies in the c cluster will be $N_c$. The number of clusters will be $C$. The total number of studies will be $N = \sum_{i=1}^{C} N_i$
Let $\hat{f}{-1,c,i} (b),\hat{f}{0,c,i} (b),\hat{f}_{1,c,i} (b)$ be the discretized density estimations, for the ith study in the cth cluster, for the null, positive association and negative association hypothesis. The argument b represents the density of the bth cell in the partition of the Z axis.
Both $bin_{c,i} (j)$ and $\hat{f}{-1,c,i} (b),\hat{f}{0,c,i} (b),\hat{f}_{1,c,i} (b)$ are given by the function 'ztobins'.
$\hat{f}{0,c,i} (bin{c,i} (j))$ is therefore the probability estimate for the bin of the jth snp in the appropriate study and cluster, under the null. Similarly we have $\hat{f}{1,c,i} (bin{c,i} (j))$ and $\hat{f}{-1,c,i} (bin{c,i} (j))$
The outcome of the RepFdr procedure inside cluster $c$, is a matrix of the form:
\begin{center} \emph{Table 1: An illustration of all possible hypotheses states for Nc studies, as well as their denoted probability. The sum of the probabilities is 1} \end{center}
| Study 1 | Study 2 | ... | Study $N_c$ | $\pi$ | |:------:|:-----:|:------:|:------:|:------:| | 0 | 0 | 0 over remaining in row | 0 | $\pi_{(0,0,…,0)}$ | | 1 | 0 | 0 over remaining cols in row | 0 | $\pi_{(1,0,…,0)}$ | | -1 | 0 | 0 over remaining in row | 0 | $\pi_{(-1,0,…,0)}$ | | ... | ... | ...| ... | ... | | -1 | -1 | -1 over remaining in row | 0 | $\pi_{(-1,0,…,0)}$ |
This table has $3^{N_{c}}$ rows (or $2^{N_{c}}$ for the nr.association=2 case)
We now perform an aggregation using the above table, for the computation of the Local FDR for the jth SNP. The aggregation is to a table of the form: \begin{center} \emph{Table 2: All possible hypotheses states combinations so that exactly k are -1, l are 0, and m are 1, for the $N_c$ studies, as well as their denoted local fdr for feature j.} \end{center}
| Study 1 | Study 2 | ... | Study N_c | |------|-----|------|------| | $N_c$ | $0$ | $0$ | $lfdr_{(N_c,0,0)}^{(c)} (j)$ | | $N_c-1$ | $1$ | $0$ | $lfdr_{N_c-1,1,0}^{(c)} (j)$ | | $N_c-2$ | $1$ | $1$ | $lfdr_{N_c-2,1,1}^{(c)} (j)$ | | ... | ... | ...| ... | | $0$ | $0$ | $N_c$ | $lfdr_{0,0,N_c}^{(c)} (j)$ |
With the rows being a coarser partition of the prior $\pi$ .The first three columns give the partition of $\pi$ to the total number of studies which are positive non null, null or negative non null. The last column gives the aggregated probability for the coarser partition (the local fdr, per cluster, for all hypotheses having this number of positive non nulls, negative non nulls and nulls, no matter which are which).
Let $G_{k,l,m}$ be the set of all vectors of size $N_c$ composed of exactly $k$ entries being -1,$l$ entries being 0, $m$ entries being 1. Let $u_i$ be the state for the ith study (i.e., $u_i \in {-1,0,1}$) of the vector $\hat{u}=(u_1,…,u_{N_{c}} )\in G_{k,l,m}$
The probability that feature $j$ has exactly $k$ hypotheses states -1, $l$ hypotheses states 0, and $m$ hypotheses states 1 among the $N_c$ studies in cluster $c$, given the binned z-scores $bin_{c,1} (j),…,bin_{c,N_c } (j)$, is estimated to be:
$$ lfdr^{c}{k,l,m}(j) = \sum{\vec{u} \in G_{k,l,m}}^{N_c}[\pi_{\vec{u}} \cdot \prod_{i=1}^{N_c}\hat{f}{u_i,c,i}(bin{c,i}(j))] $$
Table 2 has ${N_c + 2 \choose 2}$ rows (splitting $N_c$ indistinct balls into three bins). The equivalent table for the nr.association=2 case, has ${N_c + 1 \choose 1} = N_c + 1$ rows (splitting $N_c$ indistinct balls into two bins).
All the local fdrs in the last column of Table 2 can be computed by a single pass on the RepFdr $\pi$ matrix in Table 1 since each of the rows of Table 1 belongs to a single row in Table 2, and all the vectors belonging to a row in Table 2 are in Table 1: one allocates the matrix (for all k,l,m entries) for $lfdr^{(c)}$ , and aggregates the probabilities $\pi_{\vec{u}} \cdot \prod_{i=1}^{N_c}\hat{f}{u_i,c,i}(bin{c,i}(j))$ for the rows of Table 1 to their appropriate cells in the $lfdr^{(c)}$ column of Table 2. Table 2 for the jth feature in cluster c will be denoted by $LFDR^{(c)}(j)$.
The per cluster $LFDR^{(c)}(j)$ tables are combined, by an outer product, multiplying the lfdr probabilities of different clusters: The $LFDR$ table (no c superscript, this is for aggregation over clusters):
\begin{center} \emph{Table 3:All possible hypotheses states combinations so that exacktly k are -1, l are 0, and m are 1, for the N studies, as well as their denoted local fdr, for feature j.} \end{center}
| # Studies:-1 | # Studies:0 | # Studies:1 | $lfdr$ | |------|-----|------|------| | N | 0 | 0 | $lfdr_{N_c,0,0}^{(c)}$ | | N - 1 | 1 | 0 | $lfdr_{N_c-1,1,0}^{(c)}$ | | N - 2 | 1 | 1 | $lfdr_{N_c-2,1,1}^{(c)}$| | ... | ... | ...| ... | | 0 | 0 | N | $lfdr_{0,0,N_c}^{(c)}$|
The table is computed by an iterative algorithm, described as follows. For simplicity of notation, we define the operator $M(T^{(1)},T^{(2)})$ which combines two tables, e.g., $T^{(1)}=LFDR^{(c_1 )} (j)$ and $T^{(2)}=LFDR^{(c_2 )} (j)$. The definition of M is given in algorithm form, rather than equation.
We will denote by $T_{k_1,-1}^{(1)}$ the Table entry in the $k_1$ th row and first column, i.e., the number of studies with hypothesis state -1. Similarly, $T_{k_1,0}^{(1) }$ and $T_{k_1,1}^{(1)}$ are the entries in the second and third columns, respectively, of the $k_1$ th row.
The probability that feature $j$ has exactly $k=T_{k_1,-1}^{(1)}$ hypotheses states -1, $l=T_{k_1,0}^{(1)}$ hypotheses states 0, and $m=T_{k_1,1}^{(1)}$ hypotheses states 1 among the studies from which $T^{(1)}$ is composed, is denoted by $T_{k_1,lfdr}^{(1) }=lfdr_{k,l,m}^{(1)} (j)$.
\newpage
cat('\n\r)
\pagebreak
$\qquad$ Algorithm for computing $M(T^{(1)},T^{(2)})$
$(1)\qquad\qquad Generate\, new\, lfdr\, table\, with\, no\, rows\, , named\, T^{(3)}$.
$(2)\qquad\qquad For\, k_{1}=1\, to\, nr.rows(T^{(1)}):$
$(2.1)\qquad\qquad\qquad For\, k_{2}=1\, to\, nr.rows(T^{(2)}):$
$(2.1.1)\qquad\qquad\qquad\qquad c_{-1} \leftarrow T_{k_1,-1}^{(1)}+T_{k_2,-1}^{(2)}$
$(2.1.2)\qquad\qquad\qquad\qquad c_0 \leftarrow T_{k_1,0}^{(1)}+T_{k_2,0}^{(2)}$
$(2.1.3)\qquad\qquad\qquad\qquad c_1 \leftarrow T_{k_1,1}^{(1)}+T_{k_2,1}^{(2)}$
$(2.1.4)\qquad\qquad\qquad\qquad if \,T^{(3)} \,has\, a\, k_3\, such\, that\, (T_{k_3,-1}^{(3)},T_{k_3,0}^{(3)},T_{k_3,1}^{(3) })=(c_{-1},c_0,c_1)$
$(2.1.4.1)\qquad\qquad\qquad\qquad\qquad T_{{k_3},lfdr}^{(3)} \leftarrow T_{{k_3},lfdr}^{(3)}+ T_{k_1,lfdr}^{(1)}\cdot T_{k_2,lfdr}^{(2)}$
$(2.1.5)\qquad\qquad\qquad\qquad else$
$(2.1.5.1)\qquad\qquad\qquad\qquad\qquad k_{new} \leftarrow nr.rows(T^{(3)}) + 1$
$(2.1.5.2)\qquad\qquad\qquad\qquad\qquad T_{k_{new},-1}^{(3)},T_{k_{new},0}^{(3)},T_{k_{new},1}^{(3)} \leftarrow (c_{-1},c_0,c_1)$
$(2.1.5.3)\qquad\qquad\qquad\qquad\qquad T_{k_{new},lfdr}^{(3)} \leftarrow T_{k_1,lfdr}^{(1)} \cdot T_{k_2,lfdr}^{(2)}$
$(3)\qquad\qquad Return\, T^{(3)}$
In practice, it is best to implement lfdr tables with preallocated space and an indexing function from $(c_{-1},c_0,c_1)$ to a row number. Another option is the use of a "hashmap " (in our package, implementation is with a standard "hashmap" from STL). Using the operator $M(T^{(1)},T^{(2)})$, the procedure of computing LFDR is written as:
$\qquad$Algorithm for computing $LFDR(j)$ over all clusters
$(1)\qquad LFDR(j) \leftarrow M(LFDR^{(1)} (j),LFDR^{(2)} (j))$
$(2)\qquad for\; i=3\; to\; C:$
$(2.1) \qquad \qquad LFDR(j) \leftarrow M(LFDR(j),LFDR^{(i)} (j))$
This is the core of the iterative algorithm, breaking down the computation of the lfdr to iterative executions of the same operator.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.