proDA: Main function to fit the probabilistic dropout model

View source: R/proDA.R

proDAR Documentation

Main function to fit the probabilistic dropout model

Description

The function fits a linear probabilistic dropout model and infers the hyper-parameters for the location prior, the variance prior, and the dropout curves. In addition it infers for each protein the coefficients that best explain the observed data and the associated uncertainty.

Usage

proDA(
  data,
  design = ~1,
  col_data = NULL,
  reference_level = NULL,
  data_is_log_transformed = TRUE,
  moderate_location = TRUE,
  moderate_variance = TRUE,
  location_prior_df = 3,
  n_subsample = nrow(data),
  max_iter = 20,
  epsilon = 0.001,
  verbose = FALSE,
  ...
)

Arguments

data

a matrix like object (matrix(), SummarizedExperiment(), or anything that can be cast to SummarizedExperiment() (eg. 'MSnSet', 'eSet', ...)) with one column per sample and one row per protein. Missing values should be coded as NA.

design

a specification of the experimental design that is used to fit the linear model. It can be a model.matrix() with one row for each sample and one column for each coefficient. It can also be a formula with the entries referring to global objects, columns in the col_data argument or columns in the colData(data) if data is a SummarizedExperiment. Thirdly, it can be a vector that for each sample specifies the condition of that sample. Default: ~ 1, which means that all samples are treated as if they are in the same condition.

col_data

a data.frame with one row for each sample in data. Default: NULL

reference_level

a string that specifies which level in a factor coefficient is used for the intercept. Default: NULL

data_is_log_transformed

the raw intensities from mass spectrometry experiments have a linear mean-variance relation. This is undesirable and can be removed by working on the log scale. The easiest way to find out if the data is already log- transformed is to see if the intensities are in the range of '0' to '100' in which case they are transformed or if they rather are between '1e5' to '1e12', in which case they are not. Default: TRUE

moderate_location, moderate_variance

boolean values to indicate if the location and the variances are moderated. Default: TRUE

location_prior_df

the number of degrees of freedom used for the location prior. A large number (> 30) means that the prior is approximately Normal. Default: 3

n_subsample

the number of proteins that are used to estimate the hyper-parameter. Reducing this number can speed up the fitting, but also mean that the final estimate is less precise. By default all proteins are used. Default: nrow(data)

max_iter

the maximum of iterations proDA() tries to converge to the hyper-parameter estimates. Default: 20

epsilon

if the remaining error is smaller than epsilon the model has converged. Default: 1e-3

verbose

boolean that signals if the method prints messages during the fitting. Default: FALSE

...

additional parameters for the construction of the 'proDAFit' object

Details

By default, the method is moderating the locations and the variance of each protein estimate. The variance moderation is fairly standard in high-throughput experiments and can boost the power to detect differentially abundant proteins. The location moderation is important to handle the edge case where in one condition a protein is not observed in any sample. In addition it can help to get more precise estimates of the difference between conditions. Unlike 'DESeq2', which moderates the coefficient estimates (ie. the "betas") to be centered around zero, 'proDA' penalizes predicted intensities that strain far from the other observed intensities.

Value

An object of class 'proDAFit'. The object contains information on the hyper-parameters and feature parameters, the convergence, the experimental design etc. Internally, it is a sub-class of SummarizedExperiment which means the object is subsettable. The '$'-operator is overloaded for this object to make it easy to discover applicable functions.

Examples


# Quick start

# Import the proDA package if you haven't already done so
# library(proDA)
set.seed(1)
syn_data <- generate_synthetic_data(n_proteins = 10)
fit <- proDA(syn_data$Y, design = syn_data$groups)
fit
result_names(fit)
test_diff(fit, Condition_1 - Condition_2)

# SummarizedExperiment
se <- generate_synthetic_data(n_proteins = 10,
                     return_summarized_experiment = TRUE)
se
proDA(se, design = ~ group)

# Design using model.matrix()
data_mat <- matrix(rnorm(5 * 10), nrow=10)
colnames(data_mat) <- paste0("sample", 1:5)
annotation_df <- data.frame(names = paste0("sample", 1:5),
                     condition = c("A", "A", "A", "B", "B"),
                     age = rnorm(5, mean=40, sd=10))

design_mat <- model.matrix(~ condition + age,
                           data=annotation_df)
design_mat
proDA(data_mat, design_mat, col_data = annotation_df)



const-ae/proDA documentation built on Oct. 31, 2023, 9:39 p.m.