"Using the *confSAM* R package" in confSAM: Estimates and Bounds for the False Discovery Proportion, by Permutation


Citing confSAM

If you use the \Rpackage{confSAM} package, please cite @hemerik2018false.

Introduction

The package \Rpackage{confSAM} is used for multiple hypothesis testing. It provides confidence bounds for the false discovery proportion in the context of SAM [@tusher2001significance].

The false discovery proportion

Suppose hypotheses $H_1,...,H_m$ are tested by calculating corresponding p-values $p_1,...,p_m$ and rejecting the hypotheses with small p-values. The number of false positive findings is then the number of hypotheses that are rejected even though they are true. The False Discovery Proportion (FDP) is this number divided by the total number of rejected hypotheses.
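
As a small illustration of this definition, the following sketch computes the FDP on simulated p-values for which the truth is known. The simulation settings (the number of hypotheses, the number of false hypotheses, the effect size and the cutoff 0.01) are arbitrary choices made for this example and do not come from the package. When nothing is rejected, the FDP is set to 0 here, which is a common convention and an assumption of this sketch.

```r
set.seed(1)
m <- 1000                                            # number of hypotheses (illustrative)
is.true <- rep(c(TRUE, FALSE), times = c(900, 100))  # TRUE = hypothesis is true

# z-scores: mean 0 under true hypotheses, mean 3 under false ones
z <- rnorm(m, mean = ifelse(is.true, 0, 3))
p <- 2 * pnorm(-abs(z))          # two-sided p-values

rejected <- p < 0.01             # reject the hypotheses with small p-values
R <- sum(rejected)               # total number of rejections
V <- sum(rejected & is.true)     # false positives: rejected although true
FDP <- if (R > 0) V / R else 0   # assumed convention: FDP = 0 when R = 0
c(rejections = R, false.positives = V, FDP = FDP)
```

With real data the vector `is.true` is unknown, so `V` and the FDP cannot be computed directly and have to be estimated.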

Instead of calculating a p-value for each hypothesis, it is also possible to calculate other test statistics, say $T_1,...,T_m$. One could then reject, for example, all hypotheses whose test statistic exceeds some constant, possibly with a different constant for each hypothesis.

In multiple testing it is often of interest to estimate how many of the rejected hypotheses are false findings. This is equivalent to estimating the FDP. The package \Rpackage{confSAM} allows estimation of this quantity.

As is usually the case when estimating a quantity, providing a point estimate is not enough. It is also important to provide a confidence interval, so that one has e.g. $95\%$ confidence that the quantity of interest lies in the interval. The package \Rpackage{confSAM} allows not only estimating the FDP, but also providing a confidence interval for it. More precisely, the package provides a confidence upper bound for the FDP, so that the user has e.g. $95\%$ confidence that the FDP is between zero and this bound.

The package \Rpackage{confSAM} incorporates different methods for providing estimates and upper bounds. The methods vary in complexity and computational intensity. In the following it is explained how these methods can be used with the function \Rfunction{confSAM}.

Use of permutations

The methods in this package can be used if the joint distribution of the part of the data corresponding to the true hypotheses is invariant under a group of permutations. For example, suppose that each test statistic $T_i$, $1\leq i \leq m$, depends on some $n$-dimensional vector of observations, and that such a vector contains $n$ gene expression level measurements: $n/2$ from cases and $n/2$ from controls. If the joint distribution of the gene expression levels corresponding to the true hypotheses is the same for cases and controls, then permuting the cases and controls does not change the joint distribution of the part of the data under the null. In that case the methods in this package can be used. For the precise formulation of this assumption, see @hemerik2018false, Assumption 1.
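
The case-control setting described above can be sketched as follows. The data are simulated without any group effect, so all hypotheses are true and the invariance assumption holds by construction. The sizes `n`, `m` and `w`, the helper `tstats` and the choice of a two-sample t statistic are assumptions made for this illustration only.

```r
set.seed(2)
n <- 20     # observations per gene: n/2 cases and n/2 controls (illustrative)
m <- 50     # number of genes, i.e. hypotheses (illustrative)
w <- 100    # number of permutations, including the original labelling
X <- matrix(rnorm(m * n), nrow = m, ncol = n)   # genes in rows, samples in columns
group <- rep(c("case", "control"), each = n / 2)

# hypothetical helper: two-sample t statistic for every gene, given a labelling
tstats <- function(X, group) {
  apply(X, 1, function(x) {
    t.test(x[group == "case"], x[group == "control"])$statistic
  })
}

Tmat <- matrix(NA_real_, nrow = w, ncol = m)
Tmat[1, ] <- tstats(X, group)             # row 1: original case/control labels
for (j in 2:w) {
  Tmat[j, ] <- tstats(X, sample(group))   # rows 2..w: randomly permuted labels
}
dim(Tmat)
```

Each row of `Tmat` holds the $m$ test statistics for one labelling of the samples, which is the row-per-permutation layout used in the rest of this vignette.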

Designs with more than two groups or other transformations than permutations are also possible. See @hemerik2018false for general theory.

Basic estimate and bound

For the function \Rfunction{confSAM}, essentially the only input required is a matrix of p-values (or other test statistics). Every row of the matrix should correspond to a (random) permutation of the data.
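
To make this input format concrete, the sketch below builds a stand-in permutation matrix of p-values, counts for every row the p-values below a cutoff, and summarizes these counts by their median and by an upper quantile. This is a simplified illustration in the spirit of the permutation-based estimate and bound discussed in @hemerik2018false, not the package's implementation; the matrix `PM`, the sizes, the cutoff and `alpha` are all assumptions of this example.

```r
set.seed(3)
w <- 200; m <- 500               # illustrative numbers of permutations and hypotheses
cutoff <- 0.01; alpha <- 0.1     # illustrative rejection cutoff and error level

# Stand-in for a permutation matrix: one row per permutation,
# with row 1 playing the role of the original (unpermuted) data
PM <- matrix(runif(w * m), nrow = w, ncol = m)

R.perm <- rowSums(PM < cutoff)   # number of rejections for each permutation
R <- R.perm[1]                   # number of rejections for the original data

# Simplified illustration (assumption, not the package's code):
V.est   <- min(median(R.perm), R)                           # estimate of the number of false positives
V.bound <- min(sort(R.perm)[ceiling((1 - alpha) * w)], R)   # (1-alpha) upper quantile, capped at R
c(rejections = R, estimate = V.est, bound = V.bound)
```

Dividing `V.est` and `V.bound` by `max(R, 1)` turns them into an estimate and an upper bound on the FDP scale. For actual analyses the function \Rfunction{confSAM} should be used rather than this sketch.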

Obtaining the matrix of test statistics

The \Rpackage{samr} package contains a function \Rfunction{samr} that allows computation of the test statistics as defined in the SAM paper [@tusher2001significance]. More precisely, the object \Robject{tt} that \Rfunction{samr} returns contains test statistics for the original data. Further, the object \Robject{ttstar0} contains a matrix of (unsorted) test statistics for the permuted versions of the data. These objects can be used as input for our function \Rfunction{confSAM} (\Robject{ttstar0} should first be transposed).
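
The transposition step can be sketched without running \Rfunction{samr} itself. The list `samr.obj` below is a hypothetical stand-in that only mimics the two components named above; that \Robject{ttstar0} has one row per hypothesis and one column per permutation is an assumption of this sketch, consistent with the need to transpose it.

```r
set.seed(4)
m <- 70; nperms <- 100   # illustrative numbers of hypotheses and permutations

# Hypothetical stand-in for the object returned by samr();
# only the components tt and ttstar0 are mimicked here
samr.obj <- list(
  tt      = rnorm(m),
  ttstar0 = matrix(rnorm(m * nperms), nrow = m, ncol = nperms)
)

tstat.orig <- samr.obj$tt           # test statistics for the original data
tstat.perm <- t(samr.obj$ttstar0)   # transposed: one row per permutation
dim(tstat.perm)
```

After transposing, every row of `tstat.perm` corresponds to one permutation, which matches the layout described in the previous section.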

Here we will not use \Rpackage{samr} to compute test statistics, but compute test statistics ourselves. As example data to work with, we consider the nki70 dataset from the \Rpackage{penalized} package.

library(penalized)
data(nki70)


This survival data set concerns 144 lymph node positive breast cancer patients. For each patient there is a time variable and an event indicator variable (metastasis-free survival), as well as 70 gene expression level measurements. Using confSAM we will test the hypotheses $H_1,...,H_{70}$ where $H_i$ is the hypothesis that the expression level of gene $i$ is not associated with the survival curve.

To be able to use \Rfunction{confSAM}, we now construct the required matrix of p-values. We will use random permutations, i.e. random reshufflings of the 144 vectors of gene expression levels. Hence we first set the seed.

library(survival)
set.seed(21983)
w <- 100 # number of random permutations
pvalues <- matrix(nrow=w, ncol=70)
survobj <- Surv(time=nki70$time, event=nki70$event)

# compute the 70 p-values for each random permutation
for (j in 1:w) {
  if (j == 1) {
    permdata <- nki70 # original data
  } else {
    permdata <- nki70[sample(nrow(nki70)), ] # randomly shuffle the rows
  }
  for (i in 1:70) {
    # the index i+7 skips the first 7 (non-gene) columns of nki70
    form <- as.formula(paste("survobj ~ ", names(nki70)[i+7]))
    coxobj <- coxph(form, data=permdata)
    sumcoxobj <- summary(coxobj)
    # assumed completion (the source is cut off here): store the Wald p-value
    pvalues[j, i] <- sumcoxobj$coefficients[, "Pr(>|z|)"]
  }
}

References


confSAM documentation built on May 2, 2019, 2:08 a.m.