MSstats: Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments

Documented in groupComparison MSstatsGroupComparison MSstatsGroupComparisonOutput MSstatsGroupComparisonSingleProtein MSstatsPrepareForGroupComparison

#' Whole plot testing
#'
#' @param contrast.matrix comparison between conditions of interest.
#' @param data output of the dataProcess function.
#' @param save_fitted_models logical, if TRUE, fitted models will be added to
#' the output.
#' @param log_base base of the logarithm used in dataProcess.
#' @param numberOfCores Number of cores for parallel processing. When > 1, 
#' a logfile named `MSstats_groupComparison_log_progress.log` is created to 
#' track progress. Only works for Linux & Mac OS. Default is 1.
#' @inheritParams .documentFunction
#'
#' @details
#' contrast.matrix: comparison of interest. Based on the levels of conditions, specify 1 or -1 for the conditions of interest and 0 otherwise. The levels of conditions are sorted alphabetically. The command levels(QuantData$ProteinLevelData$GROUP) shows the actual order of the levels of conditions.
#' The underlying model fitting functions are lm and lmer for the fixed effects model and mixed effects model, respectively.
#' The input of this function is the quantitative data returned by the dataProcess function.
#'
#' @return A list with the following components:
#' \describe{
#'   \item{ComparisonResult}{A `data.frame` containing the results of the statistical testing for each protein. The columns include:
#'     \describe{
#'       \item{Protein}{The name of the protein for which the comparison is made.}
#'       \item{Label}{The label of the comparison, typically derived from the `contrast.matrix`.}
#'       \item{log2FC}{The log fold change between the conditions being compared. The column is named after the `log_base` parameter (`log2FC` for the default `log_base = 2`).
#'          \itemize{
#'              \item{`log2FC = Inf` or `-Inf`: This occurs when one condition has entirely missing measurements for a protein, resulting in an undefined ratio.}
#'              \item{`log2FC` is a numeric value but all other columns are `NA`: This occurs when there is only one sample per condition. Fold change can be estimated, but variance cannot be estimated, so no statistical testing is possible.}
#'          }
#'       }
#'       \item{SE}{The standard error of the log2 fold change estimate. May be `NA` when variance cannot be estimated (e.g., when there is only one sample per group).}
#'       \item{Tvalue}{The t-statistic value for the comparison. May be `NA` when variance cannot be estimated (e.g., when there is only one sample per group).}
#'       \item{DF}{The degrees of freedom associated with the t-statistic. A value of 0 indicates that, although variance could be estimated, the total number of observations is too small to support hypothesis testing.}
#'       \item{pvalue}{The p-value for the statistical test of the comparison. Applicable only if the degrees of freedom are greater than 0.}
#'       \item{adj.pvalue}{The p-value adjusted with the Benjamini-Hochberg method for controlling the false discovery rate, computed separately for each `Label`. Set to 0 when `issue` is "oneConditionMissing".}
#'       \item{issue}{Any issues encountered during the comparison. `NA` indicates no issues. "oneConditionMissing" occurs when data for one of the conditions being compared is entirely missing for a particular protein.}
#'       \item{MissingPercentage}{The percentage of missing features for a given protein across all runs. This column is kept in the output whether or not missing values were imputed.}
#'       \item{ImputationPercentage}{The percentage of features that were imputed for a given protein across all runs. This column is included only if missing values were imputed.}
#'     }
#'   }
#'   \item{ModelQC}{A `data.frame` containing quality control data used to fit models for group comparison. The columns include:
#'     \describe{
#'       \item{RUN}{Identifier for the specific MS run.}
#'       \item{Protein}{Identifier for the protein.}
#'       \item{ABUNDANCE}{Summarized intensity for the protein in a given run.}
#'       \item{originalRUN}{Original run identifier before any processing.}
#'       \item{GROUP}{Experimental group identifier.}
#'       \item{SUBJECT}{Subject identifier within the experimental group.}
#'       \item{TotalGroupMeasurements}{Total number of feature measurements for the protein in the given group.}
#'       \item{NumMeasuredFeatures}{Number of features measured for the protein in the given run.}
#'       \item{MissingPercentage}{Percentage of missing feature values for the protein in the given run.}
#'       \item{more50missing}{Logical indicator of whether more than 50 percent of the feature values are missing for the protein in the given run.}
#'       \item{NumImputedFeature}{Number of features for which values were imputed due to missing or censored data for the protein in the given run.}
#'       \item{residuals}{Differences between the observed values and the values predicted by the fitted model.}
#'       \item{fitted}{The value predicted by the model for the protein in the given run.}
#'     }
#'   }
#'   \item{FittedModel}{A list of fitted models for each protein. This is included only if `save_fitted_models` is set to TRUE. Each element of the list corresponds to a protein and contains the fitted model object.}
#' }
#' 
#' @export 
#' @import lme4
#' @import limma
#' @importFrom data.table rbindlist
#'
#' @examples
#' # Consider quantitative data (i.e. QuantData) from a yeast study with ten time points of interest,
#' # three biological replicates, and no technical replicates.
#' # It is a time-course experiment, and we compare differential abundance
#' # between time points 1 and 7 in a set of targeted proteins.
#' # In this label-based SRM experiment, MSstats uses the fitted model with expanded scope of
#' # biological replication.
#' QuantData <- dataProcess(SRMRawData, use_log_file = FALSE)
#' head(QuantData$FeatureLevelData)
#' levels(QuantData$ProteinLevelData$GROUP)
#' comparison <- matrix(c(-1,0,0,0,0,0,1,0,0,0),nrow=1)
#' row.names(comparison) <- "T7-T1"
#' groups = levels(QuantData$ProteinLevelData$GROUP)
#' colnames(comparison) <- groups[order(as.numeric(groups))]
#' # Tests for differentially abundant proteins with models:
#' # label-based SRM experiment with expanded scope of biological replication.
#' testResultOneComparison <- groupComparison(contrast.matrix=comparison, data=QuantData,
#'                                            use_log_file = FALSE)
#' # table for result
#' testResultOneComparison$ComparisonResult
#'
groupComparison = function(contrast.matrix, data, 
                           save_fitted_models = TRUE, log_base = 2,
                           use_log_file = TRUE, append = FALSE, 
                           verbose = TRUE, log_file_path = NULL, 
                           numberOfCores = 1
) {
    MSstatsConvert::MSstatsLogsSettings(use_log_file, append, verbose, 
                                        log_file_path, 
                                        "MSstats_groupComparison_log_")
    getOption("MSstatsLog")("INFO", "MSstats - groupComparison function")
    labeled = data.table::uniqueN(data$FeatureLevelData$LABEL) > 1
    split_summarized = MSstatsPrepareForGroupComparison(data)
    repeated = checkRepeatedDesign(data)
    samples_info = getSamplesInfo(data)
    groups = unique(data$ProteinLevelData$GROUP)
    contrast_matrix = MSstatsContrastMatrix(contrast.matrix, groups)
    getOption("MSstatsLog")("INFO",
                            "== Start to test and get inference in whole plot")
    getOption("MSstatsMsg")("INFO",
                            " == Start to test and get inference in whole plot ...")
    testing_results = MSstatsGroupComparison(split_summarized, contrast_matrix,
                                             save_fitted_models, repeated, samples_info, 
                                             numberOfCores)
    getOption("MSstatsLog")("INFO",
                            "== Comparisons for all proteins are done.")
    getOption("MSstatsMsg")("INFO",
                            " == Comparisons for all proteins are done.")
    MSstatsGroupComparisonOutput(testing_results, data, log_base)
}
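
# Illustrative sketch (not part of MSstats): one way to build a contrast
# matrix by condition name rather than by position, following the rule in the
# details of groupComparison (1 and -1 for the compared conditions, 0
# otherwise). The helper name `sketch_make_contrast` and the condition names
# in the usage line are hypothetical.
sketch_make_contrast = function(conditions, numerator, denominator) {
    stopifnot(numerator %in% conditions, denominator %in% conditions,
              numerator != denominator)
    # one row per comparison, one named column per condition
    contrast = matrix(0, nrow = 1, ncol = length(conditions),
                      dimnames = list(paste0(numerator, "-", denominator),
                                      conditions))
    contrast[1, numerator] = 1
    contrast[1, denominator] = -1
    contrast
}
# Usage, assuming conditions named "T1" to "T10":
# sketch_make_contrast(paste0("T", 1:10), "T7", "T1")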


#' Prepare output of dataProcess for group comparison
#' 
#' @param summarization_output output of dataProcess
#' 
#' @return list of run-level data for each protein in the input. 
#' This list has a "has_imputed" attribute that indicates if missing values
#' were imputed in the input dataset.
#' 
#' @export
#' 
#' @examples
#' QuantData <- dataProcess(SRMRawData, use_log_file = FALSE)
#' group_comparison_input = MSstatsPrepareForGroupComparison(QuantData)
#' length(group_comparison_input) # list of length equal to number of proteins
#' # in protein-level data of QuantData
#' head(group_comparison_input[[1]])
MSstatsPrepareForGroupComparison = function(summarization_output) {
    has_imputed = is.element("NumImputedFeature", colnames(summarization_output$ProteinLevelData))
    summarized = data.table::as.data.table(summarization_output$ProteinLevelData)
    summarized = .checkGroupComparisonInput(summarized)
    labeled = nlevels(summarization_output$FeatureLevelData$LABEL) > 1
    
    getOption("MSstatsLog")("INFO", paste0("labeled = ", labeled))
    getOption("MSstatsLog")("INFO", "scopeOfBioReplication = expanded")
    output = split(summarized, summarized$Protein)
    attr(output, "has_imputed") = has_imputed
    output
}


#' Group comparison
#' 
#' @param summarized_list output of MSstatsPrepareForGroupComparison
#' @param contrast_matrix contrast matrix
#' @param save_fitted_models if TRUE, fitted models will be included in the output
#' @param repeated logical, output of checkRepeatedDesign function
#' @param samples_info data.table, output of getSamplesInfo function
#' @param numberOfCores Number of cores for parallel processing. When > 1, 
#' a logfile named `MSstats_groupComparison_log_progress.log` is created to 
#' track progress. Only works for Linux & Mac OS.
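#'
#' @return list with one element per protein. Each element is a list of three
#' items: the data used to fit the model, the comparison result, and the fitted
#' model (NULL if `save_fitted_models` is FALSE).
#'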
#' @export
#' 
#' @examples
#' QuantData <- dataProcess(SRMRawData, use_log_file = FALSE)
#' group_comparison_input = MSstatsPrepareForGroupComparison(QuantData)
#' levels(QuantData$ProteinLevelData$GROUP)
#' comparison <- matrix(c(-1,0,0,0,0,0,1,0,0,0),nrow=1)
#' row.names(comparison) <- "T7-T1"
#' groups = levels(QuantData$ProteinLevelData$GROUP)
#' colnames(comparison) <- groups[order(as.numeric(groups))]
#' samples_info = getSamplesInfo(QuantData)
#' repeated = checkRepeatedDesign(QuantData)
#' group_comparison = MSstatsGroupComparison(group_comparison_input, comparison,
#'                                           FALSE, repeated, samples_info)
#' length(group_comparison) # list of length equal to number of proteins
#' group_comparison[[1]][[1]] # data used to fit linear model
#' group_comparison[[1]][[2]] # comparison result
#' group_comparison[[2]][[3]] # NULL, because we set save_fitted_models to FALSE
#' 
MSstatsGroupComparison = function(summarized_list, contrast_matrix,
                                  save_fitted_models, repeated, samples_info, 
                                  numberOfCores = 1) {
    if (numberOfCores > 1) {
        return(.groupComparisonWithMultipleCores(summarized_list, contrast_matrix, 
                                                 save_fitted_models, repeated, 
                                                 samples_info, numberOfCores))
    } else {
        return(.groupComparisonWithSingleCore(summarized_list, contrast_matrix, 
                                              save_fitted_models, repeated, 
                                              samples_info))
    }
}
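
# Illustrative sketch (an assumption, not the MSstats internals): the same
# single-core / multi-core dispatch written with base R tools. `lapply`
# handles the sequential case and `parallel::mclapply` the forked case.
# Forking is unavailable on Windows, which is consistent with the
# "Linux & Mac OS" note in the documentation of `numberOfCores`.
# `sketch_apply_per_protein` and `fit_one` are hypothetical names.
sketch_apply_per_protein = function(summarized_list, fit_one,
                                    numberOfCores = 1) {
    if (numberOfCores > 1 && .Platform$OS.type != "windows") {
        parallel::mclapply(summarized_list, fit_one, mc.cores = numberOfCores)
    } else {
        lapply(summarized_list, fit_one)
    }
}
# Usage with toy input: sketch_apply_per_protein(list(A = 1:3, B = 4:6), sum)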


#' Create output of group comparison based on results for individual proteins
#' 
#' @param input output of MSstatsGroupComparison function
#' @param summarization_output output of dataProcess function
#' @param log_base base of the logarithm used in fold-change calculation
#' 
#' @importFrom stats p.adjust
#' 
#' @export
#' 
#' @return list, same as the output of `groupComparison`
#' 
#' @examples 
#' QuantData <- dataProcess(SRMRawData, use_log_file = FALSE)
#' group_comparison_input = MSstatsPrepareForGroupComparison(QuantData)
#' levels(QuantData$ProteinLevelData$GROUP)
#' comparison <- matrix(c(-1,0,0,0,0,0,1,0,0,0),nrow=1)
#' row.names(comparison) <- "T7-T1"
#' groups = levels(QuantData$ProteinLevelData$GROUP)
#' colnames(comparison) <- groups[order(as.numeric(groups))]
#' samples_info = getSamplesInfo(QuantData)
#' repeated = checkRepeatedDesign(QuantData)
#' group_comparison = MSstatsGroupComparison(group_comparison_input, comparison,
#'                                           FALSE, repeated, samples_info)
#' group_comparison_final = MSstatsGroupComparisonOutput(group_comparison,
#'                                                       QuantData)
#' group_comparison_final[["ComparisonResult"]] 
#'                                                     
MSstatsGroupComparisonOutput = function(input, summarization_output, log_base = 2) {
    adj.pvalue = pvalue = issue = NULL
    
    has_imputed = is.element("NumImputedFeature", colnames(summarization_output$ProteinLevelData))
    model_qc_data = lapply(input, function(x) x[[1]])
    comparisons = lapply(input, function(x) x[[2]])
    fitted_models = lapply(input, function(x) x[[3]])
    comparisons = data.table::rbindlist(comparisons, fill = TRUE)
    comparisons[, adj.pvalue := p.adjust(pvalue, method = "BH"),
                by = "Label"]
    logFC_colname = paste0("log", log_base, "FC")
    comparisons[, adj.pvalue := ifelse(!is.na(issue) & 
                                           issue == "oneConditionMissing", 
                                       0, adj.pvalue)]
    data.table::setnames(comparisons, "logFC", logFC_colname)
    qc = rbindlist(model_qc_data, fill = TRUE)
    cols = c("Protein", "Label", logFC_colname, "SE", "Tvalue", "DF", 
             "pvalue", "adj.pvalue", "issue", "MissingPercentage",
             "ImputationPercentage")
    if (!has_imputed) {
        # without imputation there is no ImputationPercentage column to report
        cols = setdiff(cols, "ImputationPercentage")
    }
    getOption("MSstatsLog")("INFO", "The output for groupComparison is ready.")
    list(ComparisonResult = as.data.frame(comparisons)[, cols],
         ModelQC = as.data.frame(qc),
         FittedModel = fitted_models)   
}
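
# Illustrative sketch (toy values, base R only): the adjustment performed in
# MSstatsGroupComparisonOutput. P-values are adjusted with Benjamini-Hochberg
# separately within each comparison label, and rows flagged
# "oneConditionMissing" then get an adjusted p-value of 0.
# `sketch_adjust_by_label` is a hypothetical name.
sketch_adjust_by_label = function(comparisons) {
    comparisons$adj.pvalue = ave(
        comparisons$pvalue, comparisons$Label,
        FUN = function(p) stats::p.adjust(p, method = "BH")
    )
    one_missing = !is.na(comparisons$issue) &
        comparisons$issue == "oneConditionMissing"
    comparisons$adj.pvalue[one_missing] = 0
    comparisons
}
# Usage with toy input:
# sketch_adjust_by_label(data.frame(Label = c("c1", "c1"),
#                                   pvalue = c(0.01, 0.04),
#                                   issue = NA_character_))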


#' Group comparison for a single protein
#' 
#' @param single_protein data.table with summarized data for a single protein
#' @param contrast_matrix contrast matrix
#' @param repeated if TRUE, repeated measurements will be modeled
#' @param groups unique labels of experimental conditions
#' @param samples_info number of runs per group
#' @param save_fitted_models if TRUE, fitted model will be saved.
#' If not, it will be replaced with NULL
#' @param has_imputed TRUE if missing values have been imputed
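#'
#' @return list of three items: the data for the protein after preparation,
#' the comparison result, and the fitted model (NULL if the model was not saved
#' or could not be fitted).
#'
#' @export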
#' 
#' @examples 
#' QuantData <- dataProcess(SRMRawData, use_log_file = FALSE)
#' group_comparison_input <- MSstatsPrepareForGroupComparison(QuantData)
#' levels(QuantData$ProteinLevelData$GROUP)
#' comparison <- matrix(c(-1,0,0,0,0,0,1,0,0,0),nrow=1)
#' row.names(comparison) <- "T7-T1"
#' groups = levels(QuantData$ProteinLevelData$GROUP)
#' colnames(comparison) <- groups[order(as.numeric(groups))]
#' samples_info <- getSamplesInfo(QuantData)
#' repeated <- checkRepeatedDesign(QuantData)
#' single_output <- MSstatsGroupComparisonSingleProtein(
#'   group_comparison_input[[1]], comparison, repeated, groups, samples_info,
#'   FALSE, TRUE)
#' single_output # same as a single element of MSstatsGroupComparison output
#' 
MSstatsGroupComparisonSingleProtein = function(single_protein, contrast_matrix,
                                               repeated, groups, samples_info,
                                               save_fitted_models,
                                               has_imputed) {
    single_protein = .prepareSingleProteinForGC(single_protein)
    is_single_subject = .checkSingleSubject(single_protein)
    has_tech_reps = .checkTechReplicate(single_protein)
    
    fitted_model = try(.fitModelSingleProtein(single_protein, contrast_matrix,
                                              has_tech_reps, is_single_subject,
                                              repeated, groups, samples_info,
                                              save_fitted_models, has_imputed),
                       silent = TRUE)
    if (inherits(fitted_model, "try-error")) {
        result = list(list(Protein = unique(single_protein$Protein),
                           Label = row.names(contrast_matrix),
                           logFC = NA, SE = NA, Tvalue = NA,
                           DF = NA, pvalue = NA, issue = NA), NULL)
    } else {
        result = fitted_model
    }
    list(single_protein, result[[1]], result[[2]])
}
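
# Illustrative sketch (toy function, not part of MSstats): the error-handling
# pattern used in MSstatsGroupComparisonSingleProtein. The fit is wrapped in
# `try(..., silent = TRUE)`; if it fails, a placeholder of NA values is
# returned so that one failing protein does not stop the whole analysis.
# `sketch_safe_fit` and `fit_fun` are hypothetical names.
sketch_safe_fit = function(fit_fun, ...) {
    fitted_model = try(fit_fun(...), silent = TRUE)
    if (inherits(fitted_model, "try-error")) {
        list(logFC = NA, SE = NA, Tvalue = NA, DF = NA, pvalue = NA, issue = NA)
    } else {
        fitted_model
    }
}
# Usage: sketch_safe_fit(function(x) stop("model failed"), 1.5)
# returns the NA placeholder instead of raising an error.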

MeenaChoi/MSstats documentation built on June 9, 2025, 7:59 a.m.

R/groupComparison.R
In MeenaChoi/MSstats: Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments

Defines functions MSstatsGroupComparisonSingleProtein MSstatsGroupComparisonOutput MSstatsGroupComparison MSstatsPrepareForGroupComparison groupComparison
