Differential expression analysis using impulse models

Description

Fits an impulse model to time course data and uses this model as a basis to detect differentially expressed (DE) genes. If a single time course data set is given, DE genes are detected over time, whereas if an additional control time course data set is present, DE genes are detected between both datasets.

Usage

impulse_DE(expression_table = NULL, annotation_table = NULL,
  colname_time = NULL, colname_condition = NULL,
  control_timecourse = FALSE, control_name = NULL, case_name = NULL,
  expr_type = "Array", plot_clusters = TRUE, n_iter = 100,
  n_randoms = 50000, n_process = 4, Q_value = 0.01, new_device = TRUE)

Arguments

expression_table

numeric matrix of expression values; genes should be in rows, samples in columns. Data should be properly normalized and log2-transformed as well as filtered for present or variable genes.
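As a minimal sketch of the preprocessing this argument expects, the following uses only base R on a made-up matrix. The pseudocount of 1, the variance-based filter, and the number of genes kept are illustrative assumptions, not requirements of impulse_DE; normalization itself is not shown.

```r
# Toy raw intensity matrix: 6 genes (rows) x 4 samples (columns); values are made up
set.seed(1)
raw <- matrix(rexp(24, rate = 1 / 100), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4)))

# log2-transform; the pseudocount of 1 is an assumption to avoid log2(0)
expr <- log2(raw + 1)

# Keep the most variable genes; keeping 4 is an arbitrary choice for illustration
gene_var <- apply(expr, 1, var)
keep <- names(sort(gene_var, decreasing = TRUE))[1:4]
expr_filtered <- expr[keep, , drop = FALSE]

# Genes stay in rows, samples in columns
dim(expr_filtered)
```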

annotation_table

table providing co-variables for the samples, including condition and time points. Time points must be numeric.

colname_time

character string specifying the column name of the co-variable "Time" in annotation_table

colname_condition

character string specifying the column name of the co-variable "Condition" in annotation_table

control_timecourse

logical indicating whether a control time course is part of the data set (TRUE) or not (FALSE). Default is FALSE.

control_name

character string specifying the name of the control condition in annotation_table.

case_name

character string specifying the name of the case condition in annotation_table. Should be set if more than two conditions are present in annotation_table.
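For a two-condition design, an annotation table might be laid out as in the sketch below. The design (four time points, two replicates), the condition labels "case" and "control", and the sample names are all hypothetical; the row names would need to match the sample (column) names of expression_table.

```r
# Hypothetical design: 2 conditions x 4 time points x 2 replicates = 16 samples
time_points <- c(0, 2, 4, 8)
annot <- data.frame(
  Time = rep(rep(time_points, each = 2), times = 2),
  Condition = rep(c("case", "control"), each = 8),
  stringsAsFactors = FALSE
)

# Made-up sample names, one per row
rownames(annot) <- paste0(annot$Condition, "_t", annot$Time,
                          "_r", rep(1:2, times = 8))

# The time column must be numeric
stopifnot(is.numeric(annot$Time))

# Cross-tabulate to confirm every condition/time combination has 2 replicates
table(annot$Condition, annot$Time)
```

With a table like this, a two-condition run would pass control_timecourse = TRUE, control_name = "control" and case_name = "case", following the argument names in the Usage section.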

expr_type

character string with allowed values "Array" or "Seq". Default is "Array".

plot_clusters

logical indicating whether to plot the clusters (TRUE) or not (FALSE). Default is TRUE.

n_iter

numeric value specifying the number of iterations performed to fit the impulse model to the clusters. Default is 100.

n_randoms

numeric value specifying the number of randomized background iterations generated for the differential expression analysis. Default is 50000; this value should not be decreased.

n_process

numeric value indicating the number of processes, which can be used on the machine to run calculations in parallel. Default is 4. The specified value is internally changed to min(detectCores() - 1, n_process) using the detectCores function from the package parallel to avoid overload.
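The capping rule described for n_process can be illustrated with a small sketch. The helper name effective_processes and its cores argument are hypothetical and exist only so the rule can be tried with fixed core counts; detectCores() is the real function from the parallel package.

```r
library(parallel)

# Hypothetical helper mirroring the documented rule min(detectCores() - 1, n_process)
effective_processes <- function(n_process, cores = detectCores()) {
  min(cores - 1, n_process)
}

effective_processes(4, cores = 8)  # request fits: 4 processes
effective_processes(4, cores = 2)  # request is capped: 1 process
```

Note that detectCores() may return NA on some platforms, in which case this rule yields NA as well.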

Q_value

numeric value specifying the cutoff to call genes significantly differentially expressed after FDR correction (adjusted p-value). Default is 0.01.
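To illustrate how such a cutoff acts on adjusted p-values, the sketch below applies base R's p.adjust with the Benjamini-Hochberg method to made-up p-values. The gene names and p-values are invented, and whether impulse_DE compares with <= or < is an assumption here.

```r
# Made-up raw p-values for five hypothetical genes
pvals <- c(geneA = 0.0001, geneB = 0.003, geneC = 0.03, geneD = 0.2, geneE = 0.8)

# Benjamini-Hochberg adjustment (FDR correction)
padj <- p.adjust(pvals, method = "BH")

# Apply the cutoff; "<=" is an assumption about the comparison used
Q_value <- 0.01
de_genes <- names(padj)[padj <= Q_value]
de_genes
```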

new_device

logical indicating whether each plot should be plotted into a new device (TRUE) or not (FALSE). Default is TRUE.

Details

ImpulseDE is based on the impulse model proposed by Chechik and Koller, which reflects a two-step behavior of genes within a cell responding to environmental changes (Chechik and Koller, 2009). To detect differentially expressed genes, a five-step workflow is followed:

  1. The genes are clustered into a limited number of groups using k-means clustering. If plot_clusters = TRUE, the clusters are plotted.

  2. The impulse model is fitted to the mean expression profiles of the clusters. The best parameter sets are then used for the next step.

  3. The impulse model is fitted to each gene separately using the parameter sets from step 2 as optimal start point guesses.

  4. The impulse model is fitted to a randomized dataset (bootstrap), which is essential to detect significantly differentially expressed genes (Storey et al., 2005).

  5. Differentially expressed genes are detected using the fits to the real and randomized data sets. FDR correction is performed to obtain adjusted p-values (Benjamini and Hochberg, 1995).
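The kind of fit performed in steps 2 to 4 can be sketched in base R. The function below writes the impulse model as a product of two sigmoids, following the form given by Chechik and Koller (2009); the parameter order and names, the toy data, and the use of a single optim run are assumptions for illustration and do not reproduce the package's own fitting procedure.

```r
# Impulse function as a product of two sigmoids.
# theta = c(beta, h0, h1, h2, t1, t2): slope, initial/peak/final level,
# onset and offset times (this ordering is an assumption of the sketch)
impulse <- function(theta, timepoints) {
  beta <- theta[1]; h0 <- theta[2]; h1 <- theta[3]
  h2 <- theta[4]; t1 <- theta[5]; t2 <- theta[6]
  s_on  <- 1 / (1 + exp(-beta * (timepoints - t1)))
  s_off <- 1 / (1 + exp( beta * (timepoints - t2)))
  unname((1 / h1) * (h0 + (h1 - h0) * s_on) * (h2 + (h1 - h2) * s_off))
}

# Simulate one noisy profile from known (made-up) parameters
theta_true <- c(beta = 2, h0 = 5, h1 = 9, h2 = 6, t1 = 10, t2 = 30)
times <- seq(0, 40, by = 4)
set.seed(42)
y <- impulse(theta_true, times) + rnorm(length(times), sd = 0.1)

# Sum of squared fitting errors, the quantity minimized here
sse <- function(theta, timepoints, y) sum((y - impulse(theta, timepoints))^2)

# One Nelder-Mead run from a rough starting guess
theta_start <- c(beta = 1, h0 = 4, h1 = 8, h2 = 5, t1 = 8, t2 = 28)
fit <- optim(theta_start, sse, timepoints = times, y = y)
fit$par
fit$value
```

In the workflow above, the starting guesses for each gene come from the cluster fits of step 2 rather than from a hand-picked vector such as theta_start.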

Value

List containing the following elements:

  • impulse_fit_results List containing fitted values and model parameters:

    • impulse_parameters_case Matrix of fitted impulse model parameters and sum of squared fitting errors for the case dataset. If a control time course is present, corresponding list entries will exist for the control and the combined dataset as well (named impulse_parameters_control and impulse_parameters_combined, respectively).

    • impulse_fits_case Matrix of impulse values calculated based on the analyzed time points and the fitted model parameters for the case dataset. If a control time course is present, corresponding list entries will exist for the control and the combined dataset as well (named impulse_fits_control and impulse_fits_combined, respectively).

  • DE_results List containing the results from the differential expression analysis:

    • DE_genes data.frame containing the names of the genes called differentially expressed according to the specified cutoff Q_value, together with their adjusted p-values.

    • pvals_and_flags data.frame containing all gene names together with the adjusted p-values and flags for differential expression according to additional tests.

  • clustering_results List containing the clustering results:

    • kmeans_clus_case Numeric vector of cluster IDs to which the genes were finally assigned.

    • cluster_means_case Matrix containing the mean expression values for each cluster (taken over all genes assigned to a cluster).

    • pre_clus_case Number of clusters determined after the first (preliminary) clustering step.

    • fine_clus_case Number of final clusters determined after the second clustering step.

    If a control time course is present, those four list entries will exist correspondingly for the control and the combined dataset as well (ending with _control and _combined instead of _case, respectively).
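How a vector of cluster IDs relates to a matrix of cluster means can be sketched with base R's kmeans on a toy matrix. The data, the choice of three clusters, and the single clustering pass are assumptions; the entries above describe a two-stage clustering that this sketch does not reproduce.

```r
set.seed(7)

# Toy expression matrix: 30 genes x 6 time points built from three made-up shapes
shapes <- rbind(up = c(0, 1, 2, 3, 4, 5),
                down = c(5, 4, 3, 2, 1, 0),
                flat = rep(2.5, 6))
expr <- shapes[rep(1:3, each = 10), ] + matrix(rnorm(180, sd = 0.2), nrow = 30)
rownames(expr) <- paste0("gene", 1:30)

# Single k-means pass; 3 clusters is an assumption for this toy data
km <- kmeans(expr, centers = 3, nstart = 10)

# One cluster ID per gene (analogous in shape to kmeans_clus_case)
cluster_ids <- km$cluster

# Mean profile per cluster over its assigned genes (analogous to cluster_means_case)
cluster_means <- t(sapply(sort(unique(cluster_ids)), function(k) {
  colMeans(expr[cluster_ids == k, , drop = FALSE])
}))

table(cluster_ids)
round(cluster_means, 2)
```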

Author(s)

Jil Sander

References

Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol., 57, 289-300.

Storey, J.D. et al. (2005) Significance analysis of time course microarray experiments. Proc. Natl. Acad. Sci. USA, 102, 12837-12841.

Rangel, C., Angus, J., Ghahramani, Z., Lioumi, M., Sotheran, E., Gaiba, A., Wild, D.L., Falciani, F. (2004) Modeling T-cell activation using gene expression profiling and state-space models. Bioinformatics, 20(9), 1361-72.

Chechik, G. and Koller, D. (2009) Timing of Gene Expression Responses to Environmental Changes. J. Comput. Biol., 16, 279-290.

Yosef, N. et al. (2013) Dynamic regulatory network controlling TH17 cell differentiation. Nature, 496, 461-468.

See Also

plot_impulse, calc_impulse.

Examples

# Load ImpulseDE and the longitudinal package (install them first if necessary)
library(ImpulseDE)
library(longitudinal)
# Attach the T-cell datasets
data(tcell)
# Check the dimensions of the data matrix of interest
dim(tcell.10)
# Generate a proper annotation table
annot <- as.data.frame(cbind("Time" =
   sort(rep(get.time.repeats(tcell.10)$time, 10)),
   "Condition" = "activated"), stringsAsFactors = FALSE)
# The time column must be numeric
annot$Time <- as.numeric(annot$Time)
# Row names of the annotation table must appear in the data table
rownames(annot) <- rownames(tcell.10)
# Apply ImpulseDE in single time course mode.
# Since genes must be in rows, transpose the data matrix using t().
# For the example, reduce iterations to 10, randomizations to 50, the number
# of genes to 20 and the number of used processes to 1:
impulse_results <- impulse_DE(t(tcell.10)[1:20, ], annot, "Time", "Condition",
   n_iter = 10, n_randoms = 50, n_process = 1)
