prepare.training.validation.datasets: Prepare training and validation datasets

View source: R/prepare.training.validation.datasets.R

prepare.training.validation.datasets    R Documentation

Prepare training and validation datasets

Description

Computes per-patient, pathway-derived network impact scores for each input dataset independently.

Usage

prepare.training.validation.datasets(
  data.directory = ".",
  output.directory = ".",
  data.types = c("mRNA"),
  data.types.ordinal = c("cna"),
  min.ordinal.threshold = c(cna = 3),
  centre.data = "median",
  p.threshold = 0.5,
  feature.selection.datasets = NULL,
  datasets = NULL,
  truncate.survival = 100,
  networks.database = "default",
  write.normed.datasets = TRUE,
  subset = NULL
)

Arguments

data.directory

Path to the directory containing the datasets named in the datasets argument

output.directory

Path to the output folder where intermediate and results files will be saved

data.types

A vector of molecular datatypes to load. Defaults to c('mRNA')

data.types.ordinal

A vector of molecular datatypes to be treated as ordinal. Defaults to c('cna')

min.ordinal.threshold

A named vector specifying the minimum percent threshold for each ordinal data type, applied prior to estimating coefficients. Coefficients for features that do not satisfy the minimum threshold are not estimated and are set to 0. Defaults to a cna threshold of 3 percent

centre.data

A character string specifying the centre value to be used for scaling data. Valid values are 'median', 'mean', or a user-defined numeric threshold, e.g. '0.3' when modelling methylation beta values. This value is used both for scaling and for dichotomising data when estimating univariate betas from the Cox model. Defaults to 'median'

p.threshold

Cox P value threshold applied when selecting features (e.g. genes) that will contribute to patient risk score estimation. Defaults to 0.5

feature.selection.datasets

A vector containing names of datasets used for feature selection in function derive.network.features()

datasets

A vector containing names of all the datasets to be later used for training and validation purposes

truncate.survival

A numeric value specifying survival truncation in years. Defaults to 100 years, which effectively means no truncation

networks.database

Name of the pathway networks database. Defaults to "default", i.e. NCI PID/Reactome/Biocarta

write.normed.datasets

A toggle controlling whether processed mRNA and survival data are written to file. Defaults to TRUE

subset

A list with components Field and Entry, selecting the subset of patients whose annotation Field matches Entry. Defaults to NULL
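
As an illustration of the subset, truncate.survival and centre.data arguments described above, the following base R sketch applies the same ideas to toy data. It is not the package's internal code. The annotation field "Subtype", all values, and the variable names are hypothetical. It assumes that truncation means censoring follow-up at the cut-off, and that dichotomising means comparing each value against the chosen centre.

```r
# Toy inputs; the values and the "Subtype" annotation field are hypothetical
annotation <- data.frame(
  patient = paste0("P", 1:6),
  Subtype = c("A", "B", "A", "A", "B", "A"),
  survtime = c(1.2, 7.5, 4.8, 10.0, 3.3, 6.1),  # follow-up in years
  survstat = c(1, 1, 0, 0, 1, 1),               # 1 = event, 0 = censored
  stringsAsFactors = FALSE
);
expr <- c(P1 = 2.1, P2 = 3.4, P3 = 1.8, P4 = 5.0, P5 = 4.2, P6 = 2.9);

# subset: keep patients whose annotation Field equals Entry
subset.spec <- list(Field = "Subtype", Entry = "A");
keep <- annotation[[subset.spec$Field]] == subset.spec$Entry;
annotation <- annotation[keep, ];
expr <- expr[annotation$patient];

# truncate.survival: assumed here to censor follow-up at the cut-off
truncate.survival <- 5;
beyond <- annotation$survtime > truncate.survival;
annotation$survstat[beyond] <- 0;
annotation$survtime[beyond] <- truncate.survival;

# centre.data: resolve the centre value, then centre and dichotomise
centre.data <- "median";
centre.value <- switch(
  centre.data,
  median = median(expr),
  mean = mean(expr),
  as.numeric(centre.data)  # user-defined threshold such as "0.3"
);
centred <- expr - centre.value;
risk.group <- as.integer(expr > centre.value);
```

With the default truncate.survival of 100 years, the censoring step above would leave these toy survival times unchanged.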

Value

The output files are stored under output.directory/output/
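
A minimal sketch for inspecting that folder after a run, using base R only. The output.directory value below is a placeholder for whatever was passed to the function, and no specific file names are assumed.

```r
# Placeholder: use the output.directory passed to
# prepare.training.validation.datasets()
output.directory <- tempdir();
results.dir <- file.path(output.directory, "output");

# List whatever was written, without assuming specific file names
if (dir.exists(results.dir)) {
  output.files <- list.files(results.dir, recursive = TRUE);
} else {
  output.files <- character(0);
}
print(output.files);
```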

Author(s)

Syed Haider

Examples


# get data directory 
data.directory <- get.program.defaults()[["test.data.dir"]];

# initialise params
output.directory <- tempdir();
data.types <- c("mRNA");
feature.selection.datasets <- c("Breastdata1");
training.datasets <- c("Breastdata1");
validation.datasets <- c("Breastdata1", "Breastdata2");

# prepare training and validation datasets:
# normalisation and patient-wise subnetwork feature scores
prepare.training.validation.datasets(
  data.directory = data.directory,
  output.directory = output.directory,
  data.types = data.types,
  feature.selection.datasets = feature.selection.datasets,
  datasets = unique(c(training.datasets, validation.datasets)),
  networks.database = "test"
  );


SIMMS documentation built on April 24, 2022, 5:06 p.m.