regression: performs the specified regression on the data

regression    R Documentation

Description

The purpose of regression is to perform a regression on the data across the range of independent and dependent variables provided. If m

Usage

regression(
  input_dt,
  indep_list,
  base_file_name = "regression_output",
  clear_readme = TRUE,
  combined_group_name = "All",
  dep_vars = NULL,
  dep_var_families = NULL,
  event_clm = "OS_e",
  fdr_method = NULL,
  inclusion_list = list(),
  model_comparison_list = NULL,
  model_function = function(dep_var = "", indep_vars = "NULL") {
     paste0("glm(",
    dep_var, " ~ ", paste0("`", indep_vars, "`", collapse = " + "),
    ", data = model_dt)")
 },
  my_grouping = NULL,
  output_dir = ".",
  sample_clm = get_default_sample_key(),
  save_models = FALSE,
  time_clm = "OS_d",
  write_files = TRUE,
  include_dep_var_in_prediction_name = FALSE
)

Arguments

input_dt

A data.table that includes all columns needed for the analysis: the columns listed in indep_list, dep_vars (for glm), names(inclusion_list), event_clm and time_clm (for coxph), unique(unlist(model_comparison_list)), and my_grouping.
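
A quick pre-flight check can catch missing columns before a long run. The sketch below is illustrative only and is not part of binfotron: it uses a plain data.frame and made-up values, and it assumes the required columns are the values of indep_list and model_comparison_list plus the names of inclusion_list, as described for those arguments.

```r
# Hypothetical input; a data.frame is used so the sketch runs in base R.
input_dt <- data.frame(
  TRA_Chao1 = c(1.2, 3.4),
  TRB_Chao1 = c(2.1, 0.7),
  Age       = c(61, 47),
  Tissue    = c("lung", "skin")
)

indep_list <- list(
  TRA_Chao1 = c("TRA_Chao1"),
  TCR_Chao1 = c("TRA_Chao1", "TRB_Chao1")
)
dep_vars              <- c("Age", "SNV_Log2_Neoantigens")
inclusion_list        <- list()
model_comparison_list <- list(Tissue = c("Tissue"))
my_grouping           <- NULL

# Collect every column the glm case refers to.
# For coxph, event_clm and time_clm would be added here instead of dep_vars.
required_clms <- unique(c(
  unlist(indep_list, use.names = FALSE),
  dep_vars,
  names(inclusion_list),
  unlist(model_comparison_list, use.names = FALSE),
  my_grouping
))

# Columns that are referenced but absent from the input
missing_clms <- setdiff(required_clms, names(input_dt))
print(missing_clms)
```

Here the only missing column is SNV_Log2_Neoantigens, which is listed in dep_vars but absent from the toy input.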

indep_list

Required named list of column names to use as the independent variables (no default). The names of the list are used to name the output stats. Example:
my_indep_list = list(
TRA_Chao1 = c("TRA_Chao1"),
TRB_Chao1 = c("TRB_Chao1"),
TCR_Chao1 = c("TRA_Chao1", "TRB_Chao1")
)

base_file_name

Character string to prefix the names of the output files.

combined_group_name

Character string used to name the combined groups category.

dep_vars

Character vector containing the column names of the dependent variables. Example:
my_dep_vars = c("Age", "SNV_Log2_Neoantigens", "Indel_Log2_Neoantigens")

dep_var_families

Character vector containing the names of the families to add to the model. This isn't used for coxph, since the dependent variable for coxph is survival. Values should be of the form 'Gamma("identity")' or 'gaussian' and can include anything accepted by glm.
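
To check that a family string is something glm will accept, it can be parsed and evaluated by hand. This is a sketch under an assumption: how regression() turns these strings into a family internally is not documented here, so eval(parse()) is only one plausible route, and mtcars stands in for real data.

```r
# The two forms mentioned in the argument description
family_strings <- c('Gamma("identity")', "gaussian")

families <- lapply(family_strings, function(s) {
  fam <- eval(parse(text = s))
  # A bare name such as "gaussian" evaluates to the family function;
  # call it to obtain the family object.
  if (is.function(fam)) fam <- fam()
  fam
})

# Each element is now a family object that glm() accepts directly
fit <- glm(mpg ~ wt, family = families[[2]], data = mtcars)
print(families[[1]]$link)
```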

event_clm

For coxph. The name of the column from which to draw the event information. The column should contain only the integers 1 and 0. If specified, this column needs to be present in input_dt.

fdr_method

Deprecated. Multiple PValue columns made this overly complicated. Just use binfotron::calc_fdr separately. See stats::p.adjust.methods for the available adjustment methods.

inclusion_list

List to specify the samples that should be kept. For example, list(pathology_T_stage = c('T1', 'T2', 'T3'), is_asian = c(TRUE)) would drop samples in which the value of the column named 'pathology_T_stage' is not 'T1', 'T2', or 'T3'. Samples must also have 'is_asian' equal to TRUE. The names of this list should be column names in input_dt.
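
The filtering rule described above (a sample is kept only if it passes every entry of the list) can be sketched in base R as follows. This illustrates the semantics only and is not the package's implementation; the sample names and values are made up, and a data.frame is used so the snippet runs without extra packages.

```r
# Hypothetical input data
input_dt <- data.frame(
  sample            = c("s1", "s2", "s3", "s4"),
  pathology_T_stage = c("T1", "T4", "T2", "T3"),
  is_asian          = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors  = FALSE
)

inclusion_list <- list(
  pathology_T_stage = c("T1", "T2", "T3"),
  is_asian          = c(TRUE)
)

# Start by keeping every row, then require each column to hold an allowed value
keep <- rep(TRUE, nrow(input_dt))
for (clm in names(inclusion_list)) {
  keep <- keep & input_dt[[clm]] %in% inclusion_list[[clm]]
}

filtered_dt <- input_dt[keep, ]
print(filtered_dt$sample)
```

Sample s2 is dropped for its T stage and s3 for is_asian being FALSE, leaving s1 and s4.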

model_comparison_list

Optional named list of column names that should be used for a full and reduced model comparison. Every group of columns on this list will be run against each dep_vars and indep_list combination. The reduced model will only include the items on the list. The full model will include the independent variable(s) as well. The names of this list will be what the model comparison will be called. The values of this list should be column names in input_dt. Example:
model_comparison_list = list(
Age = c("Age"),
Tissue = c("Tissue"),
Combined = c("Age", "Tissue")
)
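
As an illustration of what a full versus reduced comparison looks like, the sketch below fits both models by hand and compares them with anova. Several things here are assumptions: mtcars stands in for real data, hp plays the role of a comparison column, wt plays the role of the independent variable, and the F test is just one common choice (this page does not say which test regression() uses).

```r
model_dt <- mtcars

# Reduced model: only the column(s) from the comparison list
reduced_model <- glm(mpg ~ hp, data = model_dt)

# Full model: comparison column(s) plus the independent variable(s)
full_model <- glm(mpg ~ hp + wt, data = model_dt)

# Nested model comparison; row 2 holds the test of full vs reduced
comparison <- anova(reduced_model, full_model, test = "F")
p_value <- comparison[["Pr(>F)"]][2]
print(p_value)
```

A small p-value suggests the independent variable explains variation beyond what the comparison columns already capture.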

model_function

A function that returns the model call as a character string. It is important to set data = model_dt in the function. Do not set the glm family; it will be added based on dep_var_families. See the examples below. glm example:
model_function = function(dep_var = "", indep_vars = "NULL"){
paste0("glm(", dep_var, " ~ ", paste0(indep_vars, collapse = " + "), ", data = model_dt)")
}

coxph example:
model_function = function(dep_var = "", indep_vars = "NULL"){
paste0("coxph(Surv(", time_clm,", ", event_clm, ") ~ ", paste0(indep_vars, collapse = " + "), ", data = model_dt)")
}
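
Both examples return a character string rather than a fitted model. The sketch below shows what the default model_function produces and one way such a string can be turned into a fit. It assumes the string is parsed and evaluated where an object named model_dt exists; how regression() does this internally is not documented here, and mtcars is used only as stand-in data.

```r
# Default model_function from the Usage section
model_function <- function(dep_var = "", indep_vars = "NULL") {
  paste0(
    "glm(", dep_var, " ~ ",
    paste0("`", indep_vars, "`", collapse = " + "),
    ", data = model_dt)"
  )
}

# Stand-in data; the string refers to this object by name
model_dt <- mtcars

model_string <- model_function(dep_var = "mpg", indep_vars = c("wt", "hp"))
print(model_string)

# Assumption: evaluate the string to obtain the fitted model
fit <- eval(parse(text = model_string))
print(coef(fit))
```

The backticks around each independent variable let column names that are not syntactic R names (for example, names containing spaces) be used in the formula.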

my_grouping

The name of the column used to split the data into groups. If specified, this column needs to be present in input_dt.

output_dir

Path to the output directory. The parent directory of the path must exist.

sample_clm

String to indicate the name of the column for sample names. Only used to output predictions.

save_models

Boolean indicating whether to save the models in individual rds files named <base_file_name>_<group_name>_<indep_list_name>.

time_clm

For coxph. The name of the column from which to draw the time information. If specified, this column needs to be present in input_dt.

write_files

Boolean indicating whether to write the output files.

fdr_by_columns

Deprecated. Multiple PValue columns made this overly complicated. Just use binfotron::calc_fdr separately.

fdr_by_columns_for_model_comp

Deprecated. Multiple PValue columns made this overly complicated. Just use binfotron::calc_fdr separately.
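
Since FDR correction is now a separate step, the adjustment is applied to the stats output after the fact. The signature of binfotron::calc_fdr is not shown on this page, so the sketch below uses base R's stats::p.adjust as a stand-in; the Indep_Var and FDR column names and the p-values are hypothetical.

```r
# Hypothetical stats output with one PValue column
stats_dt <- data.frame(
  Indep_Var = c("TRA_Chao1", "TRB_Chao1", "TCR_Chao1", "Age"),
  PValue    = c(0.01, 0.04, 0.03, 0.5)
)

# Benjamini-Hochberg adjustment; any entry of stats::p.adjust.methods works
stats_dt$FDR <- stats::p.adjust(stats_dt$PValue, method = "BH")
print(stats_dt)
```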

Details

This function uses either glm or coxph.

Value

List containing several outputs:

  1. stats - data.table with the results of the model output

  2. model_comp - data.table with the full vs reduced model comparisons if model_comparison_list is provided

  3. readme - An output of the comparisons made.

Writes

  • stats file

  • model_comp file if model_comparison_list is provided.

Todos

  • Support GAMLSS and ability to determine its own family??

  • Stats output sometimes has blank lines in it

Limitations

  • Haven't fixed to run with ordinals.

  • Haven't tried it with models that include interactions yet.


Benjamin-Vincent-Lab/binfotron documentation built on Oct. 1, 2024, 8:33 p.m.