pd_lm: Fit a single linear probabilistic dropout model
In proDA: Differential Abundance Analysis of Label-Free Mass Spectrometry Data


The function works similarly to the classical lm, but with special handling of NAs. Whereas lm usually just ignores response values that are missing, pd_lm applies a probabilistic dropout model, which assumes that missing values occur because of the dropout curve. The dropout curve describes, for each intensity, the chance that a value is missed. A negative dropout_curve_scale means that the lower the intensity, the more likely the value is to be missed.
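
To make the shape of the dropout curve concrete, here is a minimal base-R sketch. It assumes the curve is a cumulative-normal (inverse probit) sigmoid centred at dropout_curve_position, with the sign of dropout_curve_scale setting its direction. That functional form and the helper name `dropout_prob` are assumptions for illustration; they are not taken from this page.

```r
# Hypothetical helper: probability that a value at intensity `x` is missing,
# ASSUMING a cumulative-normal sigmoid (this functional form is an assumption).
dropout_prob <- function(x, position, scale) {
  stopifnot(length(scale) == 1, scale != 0)
  # A negative scale flips the curve, so low intensities are more likely to drop out.
  pnorm(x, mean = position, sd = abs(scale), lower.tail = scale > 0)
}

intensities <- c(17, 19, 21)
p_missing <- dropout_prob(intensities, position = 19, scale = -1)
print(round(p_missing, 3))
```

Under this assumption the probability is exactly 0.5 at the position, and with a negative scale it decreases as the intensity increases.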

pd_lm(
  formula,
  data = NULL,
  subset = NULL,
  dropout_curve_position,
  dropout_curve_scale,
  location_prior_mean = NULL,
  location_prior_scale = NULL,
  variance_prior_scale = NULL,
  variance_prior_df = NULL,
  location_prior_df = 3,
  method = c("analytic_hessian", "analytic_grad", "numeric"),
  verbose = FALSE
)

`formula`	a formula that specifies a linear model
`data`	an optional data.frame whose columns can be used to specify the `formula`
`subset`	an optional selection vector used to subset `data`
`dropout_curve_position`	the value where the chance to observe a value is 50%. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional.
`dropout_curve_scale`	the width of the dropout curve. Smaller values mean that the sigmoidal curve is steeper. Can either be a single value that is repeated for each row or a vector with one element for each row. Not optional.
`location_prior_mean, location_prior_scale`	the optional mean and variance of the prior around which the predictions are supposed to scatter. If no value is provided, no location regularization is applied.
`variance_prior_scale, variance_prior_df`	the optional scale and degrees of freedom of the variance prior. If no value is provided, no variance regularization is applied.
`location_prior_df`	the degrees of freedom for the t-distribution of the location prior. If it is large (> 30), the prior is approximately Normal. Default: 3
`method`	one of 'analytic_hessian', 'analytic_grad', or 'numeric'. If 'analytic_hessian', the `nlminb` optimization routine is used with the hand-derived first and second derivatives. Otherwise, `optim` is used, either with ('analytic_grad') or without ('numeric') the first derivative.
`verbose`	boolean that signals if the method prints informative messages. Default: `FALSE`.
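
The remark that a large `location_prior_df` makes the location prior approximately Normal can be checked with a small base-R sketch. It compares the standard t density with the standard Normal density on a grid; the grid and the helper name `max_density_gap` are illustrative choices, not part of proDA.

```r
# Hypothetical helper: largest absolute difference between the standard
# t density with `df` degrees of freedom and the standard Normal density.
max_density_gap <- function(df, grid = seq(-6, 6, by = 0.01)) {
  max(abs(dt(grid, df = df) - dnorm(grid)))
}

gap_default <- max_density_gap(3)   # the default location_prior_df
gap_large   <- max_density_gap(30)  # the "large" threshold mentioned above

print(c(df_3 = gap_default, df_30 = gap_large))
```

The gap shrinks as the degrees of freedom grow, which is why a heavy-tailed prior (small df) tolerates outlying locations better than a near-Normal one.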

a list with the following entries

coefficients: a named vector with the fitted values
coef_variance_matrix: a p*p matrix with the variance associated with each coefficient estimate
n_approx: the estimated "size" of the data set (n_hat - variance_prior_df)
df: the estimated degrees of freedom (n_hat - p)
s2: the estimated unbiased variance
n_obs: the number of response values that were not 'NA'
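
As a sketch of how the returned entries might be combined, the following base-R code derives standard errors, t-statistics and two-sided p-values from `coefficients`, `coef_variance_matrix` and `df`. The `fit` list is a hand-made stand-in with made-up numbers, not real pd_lm output, and the helper `coef_table` is hypothetical. Treating the ratio as t-distributed on `df` degrees of freedom is an assumption of this sketch.

```r
# Stand-in for a pd_lm() result; the numbers are made up for illustration only.
fit <- list(
  coefficients = c("(Intercept)" = 20),
  coef_variance_matrix = matrix(0.25, nrow = 1, ncol = 1),
  df = 4
)

# Hypothetical helper: one row per coefficient with its standard error,
# t-statistic and two-sided p-value (ASSUMING a t reference distribution).
coef_table <- function(fit) {
  se <- sqrt(diag(fit$coef_variance_matrix))
  t_stat <- fit$coefficients / se
  data.frame(
    term = names(fit$coefficients),
    estimate = unname(fit$coefficients),
    std_error = unname(se),
    t_value = unname(t_stat),
    p_value = unname(2 * pt(-abs(t_stat), df = fit$df))
  )
}

res <- coef_table(fit)
print(res)
```

The same helper should work on any list that carries these three entries, so it could be applied to an actual fit in place of the stand-in.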

  library(proDA)

  # Without missing values
  y <- rnorm(5, mean=20)
  lm(y ~ 1)
  pd_lm(y ~ 1,
        dropout_curve_position = NA,
        dropout_curve_scale = NA)

  # With some missing values
  y <- c(23, 21.4, NA)
  lm(y ~ 1)
  pd_lm(y ~ 1,
        dropout_curve_position = 19,
        dropout_curve_scale = -1)


  # With only missing values
  y <- c(NA, NA, NA)
  # lm(y ~ 1)  # Fails
  pd_lm(y ~ 1,
        dropout_curve_position = 19,
        dropout_curve_scale = -1,
        location_prior_mean = 21,
        location_prior_scale = 3,
        variance_prior_scale = 0.1,
        variance_prior_df = 2)