create_design: a function to create a design for PLMM modeling

View source: R/create_design.R

create_designR Documentation

a function to create a design for PLMM modeling

Description

a function to create a design for PLMM modeling

Usage

create_design(data_file = NULL, rds_dir = NULL, X = NULL, y = NULL, ...)

Arguments

data_file

For filebacked data (data from process_plink() or process_delim()), this is the filepath to the processed data. Defaults to NULL (this argument does not apply for in-memory data).

rds_dir

For filebacked data, this is the filepath to the directory/folder where you want the design to be saved. Note: do not include/append the name you want for the to-be-created file – the name is the argument new_file, passed to create_design_filebacked(). Defaults to NULL (this argument does not apply for in-memory data).

X

For in-memory data (data in a matrix or data frame), this is the design matrix. Defaults to NULL (this argument does not apply for filebacked data).

y

For in-memory data, this is the numeric vector representing the outcome. Defaults to NULL (this argument does not apply for filebacked data). Note: it is the responsibility of the user to ensure that the rows in X and the corresponding elements of y have the same row order, i.e., observations must be in the same order in both the design matrix and in the outcome vector.

...

Additional arguments to pass to create_design_filebacked() or create_design_in_memory(). See the documentation for those helper functions for details.

Details

This function is a wrapper for the other create_design...() inner functions; all arguments included here are passed along to the create_design...() inner function that matches the type of the data being supplied. Note which arguments are optional and which ones are not.

Additional arguments for all filebacked data:

  • new_file User-specified filename (without .bk/.rds extension) for the to-be-created .rds/.bk files. Must be different from any existing .rds/.bk files in the same folder.

  • feature_id Optional: A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant/, etc.). No duplicates allowed. - for PLINK data: a string specifying an ID column of the PLINK .fam file. Options are "IID" (default) and "FID" - for all other filebacked data: a character vector of unique identifiers (IDs) for each row of the feature data (i.e., the data processed with process_delim()) - if left NULL (default), X is assumed to have the same row-order as add_outcome. Note: if this assumption is made in error, calculations downstream will be incorrect. Pay close attention here.

  • add_outcome A data frame or matrix with two columns: and ID column and a column with the outcome value (to be used as 'y' in the final design). IDs must be characters, outcome must be numeric.

  • outcome_id A string specifying the name of the ID column in 'add_outcome'

  • outcome_col A string specifying the name of the phenotype column in 'add_outcome'

  • na_outcome_vals Optional: a vector of numeric values used to code NA values in the outcome. Defaults to c(-9, NA_integer) (the -9 matches PLINK conventions).

  • overwrite Optional: logical - should existing .rds files be overwritten? Defaults to FALSE.

  • logfile Optional: name of the '.log' file to be written – Note: do not append a .log to the filename; this is done automatically.

  • quiet Optional: logical - should messages to be printed to the console be silenced? Defaults to FALSE

Additional arguments specific to PLINK data:

  • add_predictor Optional (for PLINK data only): a matrix or data frame to be used for adding additional unpenalized covariates/predictors/features from an external file (i.e., not a PLINK file). This matrix must have one column that is an ID column; all other columns aside the ID will be used as covariates in the design matrix. Columns must be named.

  • predictor_id Optional (for PLINK data only): A string specifying the name of the column in 'add_predictor' with sample IDs. Required if 'add_predictor' is supplied. The names will be used to subset and align this external covariate with the supplied PLINK data.

Additional arguments specific to delimited file data:

  • unpen Optional: an character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, your delimited file must have column names.

Additional arguments for in-memory data:

  • unpen Optional: an character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names.

Value

A filepath to an object of class plmm_design, which is a named list with the design matrix, outcome, penalty factor vector, and other details needed for fitting a model. This list is stored as an .rds file for filebacked data, so in the filebacked case a string with the path to that file is returned. For in-memory data, the list itself is returned.

Examples


## Example 1: matrix data in-memory ##
admix_design <- create_design(X = admix$X, y = admix$y, unpen = "Snp1")

## Example 2: delimited data ##
# process delimited data
temp_dir <- tempdir()
colon_dat <- process_delim(data_file = "colon2.txt",
 data_dir = find_example_data(parent = TRUE), overwrite = TRUE,
 rds_dir = temp_dir, rds_prefix = "processed_colon2", sep = "\t", header = TRUE)

# prepare outcome data
colon_outcome <- read.delim(find_example_data(path = "colon2_outcome.txt"))

# create a design
colon_design <- create_design(data_file = colon_dat, rds_dir = temp_dir, new_file = "std_colon2",
add_outcome = colon_outcome, outcome_id = "ID", outcome_col = "y", unpen = "sex",
overwrite = TRUE, logfile = "test.log")

# look at the results
colon_rds <- readRDS(colon_design)
str(colon_rds)

## Example 3: PLINK data ##

# process PLINK data
temp_dir <- tempdir()
unzip_example_data(outdir = temp_dir)

plink_data <- process_plink(data_dir = temp_dir,
  data_prefix = "penncath_lite",
  rds_dir = temp_dir,
  rds_prefix = "imputed_penncath_lite",
  # imputing the mode to address missing values
  impute_method = "mode",
  # overwrite existing files in temp_dir
  # (you can turn this feature off if you need to)
  overwrite = TRUE,
  # turning off parallelization - leaving this on causes problems knitting this vignette
  parallel = FALSE)

# get outcome data
penncath_pheno <- read.csv(find_example_data(path = 'penncath_clinical.csv'))

outcome <- data.frame(FamID = as.character(penncath_pheno$FamID),
                  CAD = penncath_pheno$CAD)

unpen_predictors <- data.frame(FamID = as.character(penncath_pheno$FamID),
                               sex = penncath_pheno$sex,
                               age = penncath_pheno$age)


# create design where sex and age are always included in the model
pen_design <- create_design(data_file = plink_data,
  feature_id = "FID",
  rds_dir = temp_dir,
  new_file = "std_penncath_lite",
  add_outcome = outcome,
  outcome_id = "FamID",
  outcome_col = "CAD",
  add_predictor = unpen_predictors,
  predictor_id = "FamID",
  logfile = "design",
  # again, overwrite if needed; use with caution
  overwrite = TRUE)

# examine the design - notice the components of this object
pen_design_rds <- readRDS(pen_design)





plmmr documentation built on April 4, 2025, 12:19 a.m.