dot-parse_initial_settings: Internal function for parsing settings required to parse the...
In familiar: End-to-End Automated Machine Learning and Model Evaluation

.parse_initial_settings

R Documentation

Internal function for parsing settings required to parse the input data and define the experiment

Description

This function parses settings required to parse the data set, e.g. determine which columns are identfier columns, what column contains outcome data, which type of outcome is it?

Usage

.parse_initial_settings(config = NULL, ...)

Arguments

config

A list of settings, e.g. from an xml file.

...

Arguments passed on to .parse_experiment_settings

batch_id_column

(recommended) Name of the column containing batch or cohort identifiers. This parameter is required if more than one dataset is provided, or if external validation is performed.

In familiar any row of data is organised by four identifiers:

The batch identifier batch_id_column: This denotes the group to which a set of samples belongs, e.g. patients from a single study, samples measured in a batch, etc. The batch identifier is used for batch normalisation, as well as selection of development and validation datasets.
The sample identifier sample_id_column: This denotes the sample level, e.g. data from a single individual. Subsets of data, e.g. bootstraps or cross-validation folds, are created at this level.
The series identifier series_id_column: Indicates measurements on a single sample that may not share the same outcome value, e.g. a time series, or the number of cells in a view.
The repetition identifier: Indicates repeated measurements in a single series where any feature values may differ, but the outcome does not. Repetition identifiers are always implicitly set when multiple entries for the same series of the same sample in the same batch that share the same outcome are encountered.

sample_id_column

(recommended) Name of the column containing sample or subject identifiers. See batch_id_column above for more details.

If unset, every row will be identified as a single sample.

series_id_column

(optional) Name of the column containing series identifiers, which distinguish between measurements that are part of a series for a single sample. See batch_id_column above for more details.

If unset, rows which share the same batch and sample identifiers but have a different outcome are assigned unique series identifiers.

development_batch_id

(optional) One or more batch or cohort identifiers to constitute data sets for development. Defaults to all, or all minus the identifiers in validation_batch_id for external validation. Required if external validation is performed and validation_batch_id is not provided.

validation_batch_id

(optional) One or more batch or cohort identifiers to constitute data sets for external validation. Defaults to all data sets except those in development_batch_id for external validation, or none if not. Required if development_batch_id is not provided.

outcome_name

(optional) Name of the modelled outcome. This name will be used in figures created by familiar.

If not set, the column name in outcome_column will be used for binomial, multinomial, count and continuous outcomes. For other outcomes (survival and competing_risk) no default is used.

outcome_column

(recommended) Name of the column containing the outcome of interest. May be identified from a formula, if a formula is provided as an argument. Otherwise an error is raised. Note that survival and competing_risk outcome type outcomes require two columns that indicate the time-to-event or the time of last follow-up and the event status.

outcome_type

(recommended) Type of outcome found in the outcome column. The outcome type determines many aspects of the overall process, e.g. the available feature selection methods and learners, but also the type of assessments that can be conducted to evaluate the resulting models. Implemented outcome types are:

binomial: categorical outcome with 2 levels.
multinomial: categorical outcome with 2 or more levels.
count: Poisson-distributed numeric outcomes.
continuous: general continuous numeric outcomes.
survival: survival outcome for time-to-event data.

If not provided, the algorithm will attempt to obtain outcome_type from contents of the outcome column. This may lead to unexpected results, and we therefore advise to provide this information manually.

Note that competing_risk survival analysis are not fully supported, and is currently not a valid choice for outcome_type.

class_levels

(optional) Class levels for binomial or multinomial outcomes. This argument can be used to specify the ordering of levels for categorical outcomes. These class levels must exactly match the levels present in the outcome column.

event_indicator

(recommended) Indicator for events in survival and competing_risk analyses. familiar will automatically recognise 1, true, t, y and yes as event indicators, including different capitalisations. If this parameter is set, it replaces the default values.

censoring_indicator

(recommended) Indicator for right-censoring in survival and competing_risk analyses. familiar will automatically recognise 0, false, f, n, no as censoring indicators, including different capitalisations. If this parameter is set, it replaces the default values.

competing_risk_indicator

(recommended) Indicator for competing risks in competing_risk analyses. There are no default values, and if unset, all values other than those specified by the event_indicator and censoring_indicator parameters are considered to indicate competing risks.

signature

(optional) One or more names of feature columns that are considered part of a specific signature. Features specified here will always be used for modelling. Ranking from feature selection has no effect for these features.

novelty_features

(optional) One or more names of feature columns that should be included for the purpose of novelty detection.

exclude_features

(optional) Feature columns that will be removed from the data set. Cannot overlap with features in signature, novelty_features or include_features.

include_features

(optional) Feature columns that are specifically included in the data set. By default all features are included. Cannot overlap with exclude_features, but may overlap signature. Features in signature and novelty_features are always included. If both exclude_features and include_features are provided, include_features takes precedence, provided that there is no overlap between the two.

reference_method

(optional) Method used to set reference levels for categorical features. There are several options:

auto (default): Categorical features that are not explicitly set by the user, i.e. columns containing boolean values or characters, use the most frequent level as reference. Categorical features that are explicitly set, i.e. as factors, are used as is.
always: Both automatically detected and user-specified categorical features have the reference level set to the most frequent level. Ordinal features are not altered, but are used as is.
never: User-specified categorical features are used as is. Automatically detected categorical features are simply sorted, and the first level is then used as the reference level. This was the behaviour prior to familiar version 1.3.0.

experimental_design

(required) Defines what the experiment looks like, e.g. cv(bt(fs,20)+mb,3,2)+ev for 2 times repeated 3-fold cross-validation with nested feature selection on 20 bootstraps and model-building, and external validation. The basic workflow components are:

fs: (required) feature selection step.
mb: (required) model building step.
ev: (optional) external validation. Note that internal validation due to subsampling will always be conducted if the subsampling methods create any validation data sets.

The different components are linked using +.

Different subsampling methods can be used in conjunction with the basic workflow components:

bs(x,n): (stratified) .632 bootstrap, with n the number of bootstraps. In contrast to bt, feature pre-processing parameters and hyperparameter optimisation are conducted on individual bootstraps.
bt(x,n): (stratified) .632 bootstrap, with n the number of bootstraps. Unlike bs and other subsampling methods, no separate pre-processing parameters or optimised hyperparameters will be determined for each bootstrap.
cv(x,n,p): (stratified) n-fold cross-validation, repeated p times. Pre-processing parameters are determined for each iteration.
lv(x): leave-one-out-cross-validation. Pre-processing parameters are determined for each iteration.
ip(x): imbalance partitioning for addressing class imbalances on the data set. Pre-processing parameters are determined for each partition. The number of partitions generated depends on the imbalance correction method (see the imbalance_correction_method parameter). Imbalance partitioning does not generate validation sets.

As shown in the example above, sampling algorithms can be nested.

The simplest valid experimental design is fs+mb, which corresponds to a TRIPOD type 1a analysis. Type 1b analyses are only possible using bootstraps, e.g. bt(fs+mb,100). Type 2a analyses can be conducted using cross-validation, e.g. cv(bt(fs,100)+mb,10,1). Depending on the origin of the external validation data, designs such as fs+mb+ev or cv(bt(fs,100)+mb,10,1)+ev constitute type 2b or type 3 analyses. Type 4 analyses can be done by obtaining one or more familiarModel objects from others and applying them to your own data set.

Alternatively, the experimental_design parameter may be used to provide a path to a file containing iterations, which is named ⁠####_iterations.RDS⁠ by convention. This path can be relative to the directory of the current experiment (experiment_dir), or an absolute path. The absolute path may thus also point to a file from a different experiment.

imbalance_correction_method

(optional) Type of method used to address class imbalances. Available options are:

full_undersampling (default): All data will be used in an ensemble fashion. The full minority class will appear in each partition, but majority classes are undersampled until all data have been used.
random_undersampling: Randomly undersamples majority classes. This is useful in cases where full undersampling would lead to the formation of many models due major overrepresentation of the largest class.

This parameter is only used in combination with imbalance partitioning in the experimental design, and ip should therefore appear in the string that defines the design.

imbalance_n_partitions

(optional) Number of times random undersampling should be repeated. 10 undersampled subsets with balanced classes are formed by default.

Details

Three variants of parameters exist:

required: this parameter is required and must be set by the user.
recommended: not setting this parameter might cause an error to be thrown, dependent on other input.
optional: these parameters have default values that may be altered if required.