Calculate Pathway Activation
Combine individual gene differential expresseion for each pathway
A QSarray object, as generated by makeComparison
A list of pathways to be compared. See description for more details.
The number of points at which to sample the convoluted t-distribution. See Details for more information on appropriate values for this parameter.
If false, print a "." after every fifth pathway, as a way to track progress.
This function convolutes individual gene t-distributions into a single PDF for each gene set.
The geneSets parameter can either be provided as a vector describing a single gene set, or a list of vectors representing a group of gene sets (such as the ones available from Broad's Molecular Signatures Database). Each pathway must be a character vector with entries matching the row names of eset. If a pathway does not contain any values matching the rownames of eset, a warning will be printed, and the function will return NAs for the values of that pathway.
By default the parameter n.points is set to 2^12, or 4096 points, which will give very accurate p-values in most cases. Sampling at more points will increase the accuracy of the resulting p-values, but will also linearly increase the amount of time needed to calculate the result. With larger sample sizes, as few as 1/4 this number of points can be used without seriously affecting the accuracy of the resulting p-values, however when there are a small number of samples (i.e. less than 8 samples total), the t-distribution must be sampled over a much wider range, and the number of points needed for sampling should be increased accordingly. It is recommended that when running
aggregateGeneSet with less than 8 samples the number of points be increased to at least 2^15 or 2^16. It may also be useful to test higher values of this parameter, as it can often result in a much more significant p-value with small sample sizes.
The PDF for each individual gene set is generated by using numerical convolution applied to the individual gene PDFs. Briefly, a Fast Fourier Transform (FFT) is calculated for each individual gene PDF to arrive at a k-component vector. The product of each component across all of the genes is then taken to arrive at a new k-component vector for the gene set. The real part of the resulting product is then transformed back to a PDF using a reverse FFT, and assured to be normalized and centered around zero. The mean of the combined PDF is simply the mean fold change of the individual genes. The range for sampling is determined by the lowest degrees of freedom of the individual genes, such that at most 10^-8 of the cumulative distribution at the tails are excluded (i.e., assumed to be 0). For example, when nu = 3, the range is (-480,480), and when nu = 120, the range is (-6,6).
Technically, the output of this step is the PDF of the sum of differences in expressions over all genes in the gene set under the assumption that the genes are independent. In order to estimate the mean differential expression PDF, this distribution is scaled by a factor of 1/N, where N is the number of genes in the gene set. The resulting PDFs of the input gene sets are stored as a matrix in path.PDF slot of the returned QSarray object. However, the x-coordinates for these PDFs are not stored in the QSarray object, and must be calculated using the getXcoords function.
A QSarray object.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
##create example data eset = matrix(rnorm(500*20),500,20, dimnames=list(1:500,1:20)) labels = c(rep("A",10),rep("B",10)) ##first 30 genes are differentially expressed eset[1:30, labels=="B"] = eset[1:30, labels=="B"] + 1 ##compare the two groups geneResults = makeComparison(eset, labels, "B-A") ##aggregate data for gene sets geneSets = list(set1=1:30, set2=31:60) set.results = aggregateGeneSet(geneResults, geneSets)
Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.