inst/staticdocs/faq/FAQ3.r

# Prior to this question, please allow us to talk about the procedure of how to identify gene-active networks/subgraphs. <span style="font-weight:bold; color:#F87217; text-decoration:underline">The dnet package takes as input a list of genes with the significance information (p-values or FDR), and superposes these genes onto a gene interaction network.</span> From this network with imposed node information, dnet sets up the pipeline to search for a maximum-scoring subgraph. Given the threshold of tolerable p-value, it gives positive scores for nodes with p-values below the threshold (nodes of interest), and negative scores for nodes with threshold-above p-values (intolerable). After score transformation, the search for a maximum scoring subgraph is deduced to find the connected subgraph that is enriched with positive-score nodes, allowing for a few negative-score nodes as linkers. This objective is met through minimum spanning tree finding and post-processing. The solver is deterministic, only determined by the given tolerable p-value threshold. For identification of the subgraph with a desired number of nodes, an iterative procedure is also developed to fine-tune tolerable thresholds. This pipeline procedure is implemented in the function dNetPipeline.

# <span style="font-weight:bold; color:#F87217; text-decoration:underline">The calculation of gene significance depends on the nature of input data (especially samples under consideration).</span> It can be: 1) samples that are grouped, 2) samples collected as a time series, and 3) samples in a clinical setting (with survival data).

# For <span style="font-weight:bold; color:#F87217; text-decoration:underline">grouped samples</span>, the the p-value significance for genes can be calculated using the package 'limma' which is widely used for idnetification of differentially expressed genes. The example for this can be found in the <a href="http://supfam.org/dnet/demo-Hiratani.html">Demo for multilayer replication timing dataset</a> page.

# For <span style="font-weight:bold; color:#F87217; text-decoration:underline">time-course samples</span>, the dnet package has developed a singular value decomposition-based method to obtain gene significance (see dSVDsignif). The example for this can be found in the <a href="http://supfam.org/dnet/demo-Fang.html">Demo for time-course expression dataset</a> page.

# For <span style="font-weight:bold; color:#F87217; text-decoration:underline">survival samples</span>, survival analysis should be applied using Cox proportional hazards model, from which Cox hazard ratio and associated p-value for each gene can be calculated. This can be done via the 'survival' package, and is demonstrated in the <a href="http://supfam.org/dnet/demo-TCGA.html">Demo for TCGA mutation and survival dataset</a> page.

###########################################################################
# Below we will use human embryo dataset from Fang et al to show how to calculate the gene signficance. This involves six successive developmental stages (S9-S14) with three replicates (R1-R3) for each stage, including:
## Fang: an expression matrix of 5,441 genes/probesets X 18 samples;
## Fang.geneinfo: a matrix of 5,441 X 3 containing gene/probeset information;
## Fang.sampleinfo: a matrix of 18 X 3 containing sample information.
###########################################################################

## import data containing three variables ('Fang', 'Fang.geneinfo' and 'Fang.sampleinfo')
data(Fang)
data <- Fang
fdata <- as.data.frame(Fang.geneinfo[,2:3])
rownames(fdata) <- Fang.geneinfo[,1]
pdata <- as.data.frame(Fang.sampleinfo[,2:3])
rownames(pdata) <- Fang.sampleinfo[,1]

## for the ease of managing this dataset, we create the 'eset' object:
colmatch <- match(rownames(pdata),colnames(data))
rowmatch <- match(rownames(fdata),rownames(data))
data <- data[rowmatch,colmatch]
eset <- new("ExpressionSet", exprs=as.matrix(data), phenoData=as(pdata,"AnnotatedDataFrame"), featureData=as(fdata,"AnnotatedDataFrame"))

## Since this dataset is generated by Affymetrix, so data at the probeset level should be converted to data at the gene level. For this, we define a function 'prob2gene' to convert probeset-centric to entrezgene-centric expression levels:
prob2gene <- function(eset){
    fdat <- fData(eset)
    tmp <- as.matrix(unique(fdat[c("EntrezGene", "Symbol")]))
    forder <- tmp[order(as.numeric(tmp[,1])),]
    forder <- forder[!is.na(forder[,1]),]
    rownames(forder) <- forder[,2]
    system.time({
        dat <- exprs(eset)
        edat <- matrix(data=NA, nrow=nrow(forder), ncol=ncol(dat))
        for (i in 1:nrow(forder)){
            ind <- which(fdat$EntrezGene==as.numeric(forder[i,1]))
            if (length(ind) == 1){
                edat[i,] <- dat[ind,]
            }else{
                edat[i,] <- apply(dat[ind,],2,mean)
            }
        }
    })
    
    rownames(edat) <- rownames(forder) # as gene symbols
    colnames(edat) <- rownames(pData(eset))
    esetGene <- new('ExpressionSet',exprs=data.frame(edat),phenoData=as(pData(eset),"AnnotatedDataFrame"),featureData=as(data.frame(forder),"AnnotatedDataFrame"))
    return(esetGene)
}
## 'esetGene' stores the gene-level expression data
esetGene <- prob2gene(eset)
esetGene

## Next, we use 'limma' package to calcluate gene signficance according to two sample groups (S10 versus S9).
### define the design matrix in a order manner
all <- as.vector(pData(esetGene)$Stage)
level <- levels(factor(all))
index_level <- sapply(level, function(x) which(all==x)[1])
level_sorted <- all[sort(index_level, decreasing=F)]
design <- sapply(level_sorted, function(x) as.numeric(all==x)) # Convert a factor column to multiple boolean columns
### a linear model is fitted for every gene by the function lmFit
fit <- lmFit(exprs(esetGene), design)
### define a contrast matrix
contrasts <- dContrast(level_sorted, contrast.type="sequential")
contrast.matrix <- makeContrasts(contrasts=contrasts$each, levels=design)
colnames(contrast.matrix) <- contrasts$name
### computes moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage of the standard errors towards a common value
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
### for p-value
pvals <- as.matrix(fit2$p.value)
### for adjusted p-value
adjpvals <- sapply(1:ncol(pvals),function(x) {
    p.adjust(pvals[,x], method="BH")
})
colnames(adjpvals) <- colnames(pvals)
### num of differentially expressed genes
apply(adjpvals<1e-2, 2, sum)
### extract gene signficance according to two sample groups (S10 versus S9):
pvalue_S10_S9 <- pvals[,1]
hist(pvalue_S10_S9, 100)

## In addition to pair-wise comparisons, we also use this dataset for calculating gene signficance across the time course.
D <- as.matrix(exprs(esetGene))
D <- D - as.matrix(apply(D,1,mean),ncol=1)[,rep(1,ncol(D))]
fdr <- dSVDsignif(data=D, num.eigen=NULL, pval.eigen=1e-2, signif="fdr", orient.permutation="row", num.permutation=100, fdr.procedure="stepup", verbose=T)
hist(fdr, 100)
hfang-bristol/dnet documentation built on Feb. 23, 2020, 2:06 p.m.