linear_correct: Correct data for the effects of selected covariates.
In mirvie/mirmodels: Models Built by the Informatics Team at Mirvie


This function uses linear models to regress away the effects of selected covariates on selected columns of a data frame. One may optionally specify variables whose effects are considered real or of interest; the effects of these variables will not be regressed away (only effects orthogonal to them will be).

linear_correct(
  training_data,
  testing_data = NULL,
  correct_cols,
  correct_for_cols,
  keep_effect_cols = NULL,
  robust = TRUE
)

`training_data`	A data frame containing one sample per row. The correction is learned and applied on this data.
`testing_data`	A data frame. The testing counterpart to `training_data`. The correction that is learned on `training_data` is applied here. It's fine to pass this argument as `NULL`, in which case no testing data correction takes place.
`correct_cols`	A character vector. The names of the columns that are to be altered (corrected). These columns must all be numeric.
`correct_for_cols`	A character vector. The names of the columns that are to be corrected for. These columns must all be numeric, factor or logical.
`keep_effect_cols`	A character vector. The names of the columns specifying variables whose effects should not be regressed away. These columns must all be numeric, factor or logical. If there are no columns whose effects should not be regressed away, pass this argument as `NULL`.
`robust`	A flag. Use robust linear model `MASS::rlm()`? Can only be used with `type = 1`.

If keep_effect_cols is NULL, this function is just a wrapper around multi_lm() and multi_resids() with reset_mean_med = TRUE. That is, for each variable in correct_cols, a linear model is fit with the variables correct_for_cols as explanatory variables. Then the residuals from this model (reset about their original mean or median) are kept as the corrected values of those variables.
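The residual-and-recentre idea can be illustrated with a minimal sketch in base R. This is not the package's implementation: it uses ordinary `lm()` rather than `multi_lm()` or `MASS::rlm()`, resets about the mean only, and the data frame `toy` and its column `gene_a` are hypothetical names made up for illustration.

```r
# Hypothetical toy data: one "gene" column confounded by a covariate.
set.seed(1)
toy <- data.frame(log2_tot_counts = rnorm(50, mean = 20))
toy$gene_a <- 5 + 0.8 * toy$log2_tot_counts + rnorm(50, sd = 0.3)

# Fit the covariate model (ordinary least squares, an assumption of this sketch).
fit <- lm(gene_a ~ log2_tot_counts, data = toy)

# Keep the residuals, reset about the original mean of the column.
corrected <- residuals(fit) + mean(toy$gene_a)

# The corrected column keeps its original mean but is no longer
# linearly related to the covariate.
mean_unchanged <- isTRUE(all.equal(mean(corrected), mean(toy$gene_a)))
covariate_cor <- cor(corrected, toy$log2_tot_counts)
```

With a robust fit, the residuals need not average to zero, which is presumably why the description above mentions resetting about the mean or the median.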

If keep_effect_cols is not NULL, then first, for each variable in correct_cols, a linear model is fit with the variables keep_effect_cols as explanatory variables. The fitted values from these models are remembered as the effects of keep_effect_cols on correct_cols. The residuals from these models are the components of correct_cols that cannot be explained by keep_effect_cols. The effects of correct_for_cols are then regressed away from these residuals, and what remains is added onto the fitted values from the modelling of correct_cols with keep_effect_cols.
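The two-stage procedure described above can be sketched in base R as follows. Again, this is only an illustration under stated assumptions: it uses ordinary `lm()` in place of the package's internals, and the toy data frame and the column `gene_a` are hypothetical.

```r
# Hypothetical toy data: gene_a depends on a variable of interest
# (meta_collectionga) and on a nuisance covariate (log2_tot_counts).
set.seed(2)
n <- 60
toy <- data.frame(
  meta_collectionga = runif(n, 10, 40),
  log2_tot_counts = rnorm(n, mean = 20)
)
toy$gene_a <- 2 + 0.1 * toy$meta_collectionga +
  0.5 * toy$log2_tot_counts + rnorm(n, sd = 0.2)

# Step 1: model the column with the variable whose effect is to be kept.
keep_fit <- lm(gene_a ~ meta_collectionga, data = toy)
kept_effect <- fitted(keep_fit)   # effect of the variable of interest
leftover <- residuals(keep_fit)   # part not explained by that variable

# Step 2: regress the nuisance covariate out of the leftover part only.
nuisance_fit <- lm(leftover ~ log2_tot_counts, data = toy)
cleaned_leftover <- residuals(nuisance_fit)

# Step 3: add what remains back onto the kept effect.
corrected <- kept_effect + cleaned_leftover
```

Because the nuisance covariate is only ever regressed against the residuals from step 1, any variation that the variable of interest can explain is left in place, which matches the "only effects orthogonal to those" wording in the description.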

Columns in training_data that are not specified in correct_cols, correct_for_cols, or keep_effect_cols will be returned unchanged.

A list with elements named training_data and testing_data, containing the corrected data.

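if (rlang::is_installed("mirmisc")) {
  # Load log2-transformed counts (with total counts) for several cohorts.
  data <- get_combined_cohort_data(
    c("bw", "ga", "io", "kl", "pm", "pt", "rs"),
    cpm = FALSE, log2 = TRUE, tot_counts = TRUE,
    gene_predicate = ~ median(.) > 0
  )
  # Correct every gene column for log2_tot_counts while keeping the
  # effect of meta_collectionga. No testing data is supplied here.
  res <- linear_correct(
    data,
    correct_cols = mirmisc::get_df_gene_names(data),
    correct_for_cols = c("log2_tot_counts"),
    keep_effect_cols = "meta_collectionga"
  )
}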