summon_familiar: Perform end-to-end machine learning and data analysis
In familiar: End-to-End Automated Machine Learning and Model Evaluation

summon_familiar

R Documentation

Perform end-to-end machine learning and data analysis

Description

Perform end-to-end machine learning and data analysis

Usage

summon_familiar(
  formula = NULL,
  data = NULL,
  experiment_data = NULL,
  cl = NULL,
  config = NULL,
  config_id = 1L,
  verbose = TRUE,
  .stop_after = "evaluation",
  ...
)
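
As a rough orientation for the signature above, the sketch below assembles arguments for a small experiment on the built-in `iris` data set. The `learner` argument and the values `"mrmr"` and `"glm"` are assumptions that are not documented in this section, so check them against the package vignettes before use. The call is wrapped in a function and is not executed here.

```r
# Arguments for a hypothetical minimal experiment on the built-in iris data.
example_args <- list(
  data = iris,
  experiment_dir = file.path(tempdir(), "familiar_example"),
  outcome_type = "multinomial",
  outcome_column = "Species",
  experimental_design = "fs+mb",  # simplest valid design
  fs_method = "mrmr",             # assumed feature selection method name
  learner = "glm",                # assumed argument and value, not documented here
  parallel = FALSE,
  verbose = FALSE
)

# Wrapper that runs the experiment only when called, and only if the
# familiar package is installed.
run_example <- function(args = example_args) {
  if (!requireNamespace("familiar", quietly = TRUE)) {
    stop("The familiar package is required to run this example.")
  }
  do.call(familiar::summon_familiar, args)
}
```

Calling `run_example()` would start the experiment. Because `experiment_dir` points into `tempdir()` in this sketch, results would not persist beyond the R session.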

Arguments

`formula`	An R formula. The formula can only contain feature names and dot (`.`). The `*` and `+1` operators are not supported as these refer to columns that are not present in the data set. Use of the formula interface is optional.
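
The two supported formula forms can be inspected with base R before passing them on. The following is a minimal sketch with a hypothetical data set; the column names are illustrative only.

```r
# Hypothetical wide-format data set; column names are illustrative only.
example_data <- data.frame(
  outcome = c(0, 1, 1, 0),
  age = c(54, 61, 47, 70),
  stage = c("I", "II", "II", "III")
)

# Formula that names individual features.
f_named <- outcome ~ age + stage

# Formula that uses the dot to refer to all remaining columns.
f_all <- outcome ~ .

# Base R helpers show which columns each formula refers to.
named_columns <- all.vars(f_named)
response <- all.vars(f_all)[1]
dot_features <- attr(terms(f_all, data = example_data), "term.labels")
```

Both formulas refer only to columns that exist in `example_data`, which is the requirement stated above.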
`data`	A `data.table` object, a `data.frame` object, a list containing multiple `data.table` or `data.frame` objects, or paths to data files. `data` should be provided if no file paths are provided to the `data_files` argument. If both are provided, only `data` will be used. All data are expected to be in wide format, and ideally have a sample identifier (see `sample_id_column`), batch identifier (see `cohort_column`) and outcome columns (see `outcome_column`). If paths are provided, the data should be stored as `csv`, `rds` or `RData` files. See documentation for the `data_files` argument for more information.
`experiment_data`	Experimental data may be provided in the form of
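
The accepted input forms can be prepared with base R as sketched below. The data and column names are hypothetical, and passing the paths as a character vector is an assumption about how "paths to data files" is meant.

```r
# Hypothetical wide-format data: one row per sample, one column per variable.
cohort_a <- data.frame(
  sample_id = c("a1", "a2", "a3"),
  cohort = "A",
  feature_1 = c(0.2, 1.4, 0.9),
  outcome = c("no", "yes", "no")
)
cohort_b <- data.frame(
  sample_id = c("b1", "b2"),
  cohort = "B",
  feature_1 = c(1.1, 0.3),
  outcome = c("yes", "no")
)

# Option 1: a single data.frame.
data_single <- rbind(cohort_a, cohort_b)

# Option 2: a list of data.frame objects.
data_list <- list(cohort_a, cohort_b)

# Option 3: paths to files in a supported format (csv, rds or RData).
rds_path <- file.path(tempdir(), "cohort_a.rds")
csv_path <- file.path(tempdir(), "cohort_b.csv")
saveRDS(cohort_a, rds_path)
write.csv(cohort_b, csv_path, row.names = FALSE)
data_paths <- c(rds_path, csv_path)
```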
`cl`	Cluster created using the `parallel` package. This cluster is then used to speed up computation through parallelisation. When a cluster is not provided, parallelisation is performed by setting up a cluster on the local machine. This parameter has no effect if the `parallel` argument is set to `FALSE`.
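
A cluster for the `cl` argument can be created with the `parallel` package, for example as in this minimal sketch. The number of workers is arbitrary, and the call to `summon_familiar` is only indicated in a comment.

```r
library(parallel)

# Create a local cluster with two worker processes (PSOCK by default).
cl <- makeCluster(2L)

# Number of workers in the cluster.
n_workers <- length(cl)

# The cluster could then be passed on, e.g. summon_familiar(..., cl = cl).
# Hypothetical call, not executed here.

# Release the workers once the analysis has finished.
stopCluster(cl)
```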
`config`	List containing configuration parameters, or path to an `xml` file containing these parameters. An empty configuration file can be obtained using the `get_xml_config` function. All parameters can also be set programmatically. These supersede any arguments derived from the configuration list.
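
The precedence rule can be pictured with base R as follows. This sketch uses a flat, hypothetical list purely to illustrate that programmatic values win; it does not reproduce the actual structure of the configuration list or how `familiar` merges settings internally.

```r
# Hypothetical settings as they might be derived from a configuration.
config_settings <- list(
  outcome_type = "binomial",
  parallel = TRUE,
  parallel_nr_cores = 2L
)

# Hypothetical settings provided programmatically in the function call.
programmatic_settings <- list(parallel_nr_cores = 4L)

# Programmatic settings replace configuration values with the same name;
# all other configuration values are kept.
effective_settings <- modifyList(config_settings, programmatic_settings)
```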
`config_id`	Identifier for the configuration in case the list or `xml` table indicated by `config` contains more than one set of configurations.
`verbose`	Indicates verbosity of the results. Default is TRUE, and all messages and warnings are returned.
`.stop_after`	Variable for internal use.
`...`	Arguments passed on to `.parse_file_paths`, `.parse_experiment_settings`, `.parse_setup_settings`, `.parse_preprocessing_settings`, `.parse_feature_selection_settings`, `.parse_model_development_settings`, `.parse_hyperparameter_optimisation_settings`, `.parse_evaluation_settings`
`project_dir` (optional) Path to the project directory. `familiar` checks if the directory indicated by `experiment_dir` and data files in `data_file` are relative to the `project_dir`.
`experiment_dir` (recommended) Path to the directory where all intermediate and final results produced by `familiar` are written to. The `experiment_dir` can be a path relative to `project_dir` or an absolute path. In case no project directory is provided and the experiment directory is not on an absolute path, a directory will be created in the temporary R directory indicated by `tempdir()`. This directory is deleted after closing the R session or once data analysis has finished. All information will be lost afterwards. Hence, it is recommended to provide either `experiment_dir` as an absolute path, or provide both `project_dir` and `experiment_dir`.
`data_file` (optional) Path to files containing data that should be analysed. The paths can be relative to `project_dir` or absolute paths. An error will be raised if the file cannot be found. The following types of data are supported: `csv` files containing column headers on the first row, and samples per row. `csv` files are read using `data.table::fread`. `rds` files that contain a `data.table` or `data.frame` object. `rds` files are imported using `base::readRDS`. `RData` files that contain a single `data.table` or `data.frame` object. `RData` files are imported using `base::load`. All data are expected in wide format, with sample information organised row-wise. More than one data file can be provided. `familiar` will try to combine data files based on column names and identifier columns.
Alternatively, data can be provided using the `data` argument. These data are expected to be `data.frame` or `data.table` objects or paths to data files. The latter are handled in the same way as file paths provided to `data_file`.
`batch_id_column` (recommended) Name of the column containing batch or cohort identifiers. This parameter is required if more than one dataset is provided, or if external validation is performed. In `familiar`, any row of data is organised by four identifiers: The batch identifier `batch_id_column`: This denotes the group to which a set of samples belongs, e.g. patients from a single study, samples measured in a batch, etc. The batch identifier is used for batch normalisation, as well as selection of development and validation datasets. The sample identifier `sample_id_column`: This denotes the sample level, e.g. data from a single individual. Subsets of data, e.g. bootstraps or cross-validation folds, are created at this level. The series identifier `series_id_column`: Indicates measurements on a single sample that may not share the same outcome value, e.g. a time series, or the number of cells in a view. The repetition identifier: Indicates repeated measurements in a single series where any feature values may differ, but the outcome does not. Repetition identifiers are always implicitly set when multiple entries for the same series of the same sample in the same batch that share the same outcome are encountered.
`sample_id_column` (recommended) Name of the column containing sample or subject identifiers. See `batch_id_column` above for more details. If unset, every row will be identified as a single sample.
`series_id_column` (optional) Name of the column containing series identifiers, which distinguish between measurements that are part of a series for a single sample. See `batch_id_column` above for more details.
If unset, rows which share the same batch and sample identifiers but have a different outcome are assigned unique series identifiers.
`development_batch_id` (optional) One or more batch or cohort identifiers to constitute data sets for development. Defaults to all, or all minus the identifiers in `validation_batch_id` for external validation. Required if external validation is performed and `validation_batch_id` is not provided.
`validation_batch_id` (optional) One or more batch or cohort identifiers to constitute data sets for external validation. Defaults to all data sets except those in `development_batch_id` for external validation, or none if not. Required if `development_batch_id` is not provided.
`outcome_name` (optional) Name of the modelled outcome. This name will be used in figures created by `familiar`. If not set, the column name in `outcome_column` will be used for `binomial`, `multinomial`, `count` and `continuous` outcomes. For other outcomes (`survival` and `competing_risk`) no default is used.
`outcome_column` (recommended) Name of the column containing the outcome of interest. May be identified from a formula, if a formula is provided as an argument. Otherwise an error is raised. Note that outcomes of the `survival` and `competing_risk` types require two columns that indicate the time-to-event or the time of last follow-up and the event status.
`outcome_type` (recommended) Type of outcome found in the outcome column. The outcome type determines many aspects of the overall process, e.g. the available feature selection methods and learners, but also the type of assessments that can be conducted to evaluate the resulting models. Implemented outcome types are: `binomial`: categorical outcome with 2 levels. `multinomial`: categorical outcome with 2 or more levels. `count`: Poisson-distributed numeric outcomes. `continuous`: general continuous numeric outcomes. `survival`: survival outcome for time-to-event data.
If not provided, the algorithm will attempt to infer `outcome_type` from the contents of the outcome column. This may lead to unexpected results, and we therefore advise providing this information manually. Note that `competing_risk` survival analyses are not fully supported, and `competing_risk` is currently not a valid choice for `outcome_type`.
`class_levels` (optional) Class levels for `binomial` or `multinomial` outcomes. This argument can be used to specify the ordering of levels for categorical outcomes. These class levels must exactly match the levels present in the outcome column.
`event_indicator` (recommended) Indicator for events in `survival` and `competing_risk` analyses. `familiar` will automatically recognise `1`, `true`, `t`, `y` and `yes` as event indicators, including different capitalisations. If this parameter is set, it replaces the default values.
`censoring_indicator` (recommended) Indicator for right-censoring in `survival` and `competing_risk` analyses. `familiar` will automatically recognise `0`, `false`, `f`, `n`, `no` as censoring indicators, including different capitalisations. If this parameter is set, it replaces the default values.
`competing_risk_indicator` (recommended) Indicator for competing risks in `competing_risk` analyses. There are no default values, and if unset, all values other than those specified by the `event_indicator` and `censoring_indicator` parameters are considered to indicate competing risks.
`signature` (optional) One or more names of feature columns that are considered part of a specific signature. Features specified here will always be used for modelling. Ranking from feature selection has no effect for these features.
`novelty_features` (optional) One or more names of feature columns that should be included for the purpose of novelty detection.
`exclude_features` (optional) Feature columns that will be removed from the data set. Cannot overlap with features in `signature`, `novelty_features` or `include_features`.
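
The default recognition of event and censoring indicators described above can be approximated in base R. The helper `classify_status` below is hypothetical and is not part of `familiar`; it only mirrors the documented rules, including case-insensitive matching and treating any remaining value as a competing risk.

```r
# Default indicator values as listed in the documentation above.
default_event <- c("1", "true", "t", "y", "yes")
default_censoring <- c("0", "false", "f", "n", "no")

# Hypothetical helper: classify raw status values. User-provided indicators
# replace the defaults, as documented for both parameters.
classify_status <- function(status,
                            event = default_event,
                            censoring = default_censoring) {
  value <- tolower(trimws(as.character(status)))
  ifelse(
    value %in% tolower(event), "event",
    ifelse(value %in% tolower(censoring), "censored", "competing_risk")
  )
}

status_class <- classify_status(c("Yes", "0", "TRUE", "n", "death_other"))
```

Setting the indicators explicitly avoids surprises when a status column uses codes outside these defaults.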
`include_features` (optional) Feature columns that are specifically included in the data set. By default all features are included. Cannot overlap with `exclude_features`, but may overlap `signature`. Features in `signature` and `novelty_features` are always included. If both `exclude_features` and `include_features` are provided, `include_features` takes precedence, provided that there is no overlap between the two.
`reference_method` (optional) Method used to set reference levels for categorical features. There are several options: `auto` (default): Categorical features that are not explicitly set by the user, i.e. columns containing boolean values or characters, use the most frequent level as reference. Categorical features that are explicitly set, i.e. as factors, are used as is. `always`: Both automatically detected and user-specified categorical features have the reference level set to the most frequent level. Ordinal features are not altered, but are used as is. `never`: User-specified categorical features are used as is. Automatically detected categorical features are simply sorted, and the first level is then used as the reference level. This was the behaviour prior to familiar version 1.3.0.
`experimental_design` (required) Defines what the experiment looks like, e.g. `cv(bt(fs,20)+mb,3,2)+ev` for 2 times repeated 3-fold cross-validation with nested feature selection on 20 bootstraps and model-building, and external validation. The basic workflow components are: `fs`: (required) feature selection step. `mb`: (required) model building step. `ev`: (optional) external validation. Note that internal validation due to subsampling will always be conducted if the subsampling methods create any validation data sets. The different components are linked using `+`. Different subsampling methods can be used in conjunction with the basic workflow components: `bs(x,n)`: (stratified) .632 bootstrap, with `n` the number of bootstraps.
In contrast to `bt`, feature pre-processing parameters and hyperparameter optimisation are conducted on individual bootstraps. `bt(x,n)`: (stratified) .632 bootstrap, with `n` the number of bootstraps. Unlike `bs` and other subsampling methods, no separate pre-processing parameters or optimised hyperparameters will be determined for each bootstrap. `cv(x,n,p)`: (stratified) `n`-fold cross-validation, repeated `p` times. Pre-processing parameters are determined for each iteration. `lv(x)`: leave-one-out cross-validation. Pre-processing parameters are determined for each iteration. `ip(x)`: imbalance partitioning for addressing class imbalances in the data set. Pre-processing parameters are determined for each partition. The number of partitions generated depends on the imbalance correction method (see the `imbalance_correction_method` parameter). Imbalance partitioning does not generate validation sets. As shown in the example above, sampling algorithms can be nested. The simplest valid experimental design is `fs+mb`, which corresponds to a TRIPOD type 1a analysis. Type 1b analyses are only possible using bootstraps, e.g. `bt(fs+mb,100)`. Type 2a analyses can be conducted using cross-validation, e.g. `cv(bt(fs,100)+mb,10,1)`. Depending on the origin of the external validation data, designs such as `fs+mb+ev` or `cv(bt(fs,100)+mb,10,1)+ev` constitute type 2b or type 3 analyses. Type 4 analyses can be done by obtaining one or more `familiarModel` objects from others and applying them to your own data set. Alternatively, the `experimental_design` parameter may be used to provide a path to a file containing iterations, which is named `####_iterations.RDS` by convention. This path can be relative to the directory of the current experiment (`experiment_dir`), or an absolute path. The absolute path may thus also point to a file from a different experiment.
`imbalance_correction_method` (optional) Type of method used to address class imbalances.
Available options are: `full_undersampling` (default): All data will be used in an ensemble fashion. The full minority class will appear in each partition, but majority classes are undersampled until all data have been used. `random_undersampling`: Randomly undersamples majority classes. This is useful in cases where full undersampling would lead to the formation of many models due to major overrepresentation of the largest class. This parameter is only used in combination with imbalance partitioning in the experimental design, and `ip` should therefore appear in the string that defines the design.
`imbalance_n_partitions` (optional) Number of times random undersampling should be repeated. 10 undersampled subsets with balanced classes are formed by default.
`parallel` (optional) Enable parallel processing. Defaults to `TRUE`. When set to `FALSE`, this disables all parallel processing, regardless of specific parameters such as `parallel_preprocessing`. However, when `parallel` is `TRUE`, parallel processing of different parts of the workflow can be disabled by setting respective flags to `FALSE`.
`parallel_nr_cores` (optional) Number of cores available for parallelisation. Defaults to 2. This setting does nothing if parallelisation is disabled.
`restart_cluster` (optional) Restart nodes used for parallel computing to free up memory prior to starting a parallel process. Note that it does take time to set up the clusters. Therefore setting this argument to `TRUE` may impact processing speed. This argument is ignored if `parallel` is `FALSE` or the cluster was initialised outside of familiar. Default is `FALSE`, which causes the clusters to be initialised only once.
`cluster_type` (optional) Selection of the cluster type for parallel processing. Available types are the ones supported by the parallel package that is part of the base R distribution: `psock` (default), `fork`, `mpi`, `nws`, `sock`. In addition, `none` is available, which also disables parallel processing.
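
The idea behind `full_undersampling`, as described for `imbalance_correction_method` above, can be sketched in base R: every partition contains the full minority class, and the majority class is divided over the partitions until all samples have been used. This is an illustration under simplifying assumptions (two classes, hypothetical data) and not necessarily how `familiar` forms its partitions.

```r
set.seed(42)

# Hypothetical imbalanced outcome: 4 minority and 10 majority samples.
outcome <- c(rep("minority", 4), rep("majority", 10))

minority_index <- which(outcome == "minority")
majority_index <- sample(which(outcome == "majority"))  # shuffled order

# Number of partitions needed to use every majority sample once.
n_partitions <- ceiling(length(majority_index) / length(minority_index))

# Divide the shuffled majority samples over the partitions.
partition_id <- rep(seq_len(n_partitions), length.out = length(majority_index))
majority_chunks <- split(majority_index, partition_id)

# Each partition combines the full minority class with one majority chunk.
partitions <- lapply(
  majority_chunks,
  function(chunk) sort(c(minority_index, chunk))
)
```

With a strongly overrepresented majority class the number of partitions, and therefore models, grows quickly, which is the situation in which `random_undersampling` is suggested above.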
`backend_type` (optional) Selection of the backend for distributing copies of the data. This backend ensures that only a single master copy is kept in memory. This limits memory usage during parallel processing. Several backend options are available, notably `socket_server`, and `none` (default). `socket_server` is based on the callr package and R sockets, comes with `familiar` and is available for any OS. `none` uses the package environment of familiar to store data, and is available for any OS. However, `none` requires copying of data to any parallel process, and has a larger memory footprint.
`server_port` (optional) Integer indicating the port on which the socket server or RServe process should communicate. Defaults to port 6311. Note that ports 0 to 1024 and 49152 to 65535 cannot be used.
`feature_max_fraction_missing` (optional) Numeric value between `0.0` and `0.95` that determines the maximum fraction of missing values that still allows a feature to be included in the data set. All features with a missing value fraction over this threshold are not processed further. The default value is `0.30`.
`sample_max_fraction_missing` (optional) Numeric value between `0.0` and `0.95` that determines the maximum fraction of missing values that still allows a sample to be included in the data set. All samples with a missing value fraction over this threshold are excluded and not processed further. The default value is `0.30`.
`filter_method` (optional) One or more methods used to reduce dimensionality of the data set by removing irrelevant or poorly reproducible features. Several methods are available: `none` (default): None of the features will be filtered. `low_variance`: Features with a variance below the `low_var_minimum_variance_threshold` are filtered. This can be useful to filter, for example, genes that are not differentially expressed. `univariate_test`: Features undergo a univariate regression using an outcome-appropriate regression model.
The p-value of the model coefficient is collected. Features with coefficient p or q-value above the `univariate_test_threshold` are subsequently filtered. `robustness`: Features that are not sufficiently robust according to the intraclass correlation coefficient are filtered. Use of this method requires that repeated measurements are present in the data set, i.e. there should be entries for which the sample and cohort identifiers are the same. More than one method can be used simultaneously. Features with singular values are always filtered, as these do not contain information.
`univariate_test_threshold` (optional) Numeric value between `0.0` and `1.0` that determines which features are irrelevant and will be filtered by the `univariate_test`. The p or q-values are compared to this threshold. All features with values above the threshold are filtered. The default value is `0.20`.
`univariate_test_threshold_metric` (optional) Metric that is compared against the `univariate_test_threshold`. The following metrics can be chosen: `p_value` (default): The unadjusted p-value of each feature is used to filter features. `q_value`: The q-value (Storey, 2002) is used to filter features. Some data sets may have insufficient samples to compute the q-value. The `qvalue` package must be installed from Bioconductor to use this method.
`univariate_test_max_feature_set_size` (optional) Maximum size of the feature set after the univariate test. P or q values of features are compared against the threshold, but if the resulting data set would be larger than this setting, only the most relevant features up to the desired feature set size are selected. The default value is `NULL`, which causes features to be filtered based on their relevance only.
`low_var_minimum_variance_threshold` (required, if used) Numeric value that determines which features will be filtered by the `low_variance` method. The variance of each feature is computed and compared to the threshold.
If it is below the threshold, the feature is removed. This parameter has no default value and should be set if `low_variance` is used.
`low_var_max_feature_set_size` (optional) Maximum size of the feature set after filtering features with a low variance. All features are first compared against `low_var_minimum_variance_threshold`. If the resulting feature set would be larger than specified, only the most strongly varying features will be selected, up to the desired size of the feature set. The default value is `NULL`, which causes features to be filtered based on their variance only.
`robustness_icc_type` (optional) String indicating the type of intraclass correlation coefficient (`1`, `2` or `3`) that should be used to compute robustness for features in repeated measurements. These types correspond to the types in Shrout and Fleiss (1979). The default value is `1`.
`robustness_threshold_metric` (optional) String indicating which specific intraclass correlation coefficient (ICC) metric should be used to filter features. This should be one of: `icc`: The estimated ICC value itself. `icc_low` (default): The estimated lower limit of the 95% confidence interval of the ICC, as suggested by Koo and Li (2016). `icc_panel`: The estimated ICC value over the panel average, i.e. the ICC that would be obtained if all repeated measurements were averaged. `icc_panel_low`: The estimated lower limit of the 95% confidence interval of the panel ICC.
`robustness_threshold_value` (optional) The intraclass correlation coefficient value that is used as threshold. The default value is `0.70`.
`transformation_method` (optional) The transformation method used to change the distribution of the data to be more normal-like. The following methods are available: `none`: This disables transformation of features. `yeo_johnson`: Transformation using the location and scale invariant version of the Yeo-Johnson transformation (Yeo and Johnson, 2000; Zwanenburg and Löck, 2023).
`yeo_johnson_robust` (default): A robust version of `yeo_johnson`. This method is less sensitive to outliers. `yeo_johnson_conventional`: As `yeo_johnson`, but without optimisation of location and scale parameters. This method is equivalent to the original transformation proposed by Yeo and Johnson (2000). `box_cox`: Transformation using the location and scale invariant version of the Box-Cox transformation (Box and Cox, 1964; Zwanenburg and Löck, 2023). `box_cox_robust`: A robust version of `box_cox`. This method is less sensitive to outliers. `box_cox_conventional`: As `box_cox`, but without optimisation of location and scale parameters. This method is equivalent to the original transformation proposed by Box and Cox (1964). This method requires strictly positive feature values. Transformation requires the `power.transform` package. Only features that contain numerical data are transformed. Transformation parameters obtained in development data are stored within `featureInfo` objects for later use with validation data sets.
`transformation_optimisation_criterion` (optional) Transformation parameters are optimised using a criterion, conventionally maximum-likelihood-estimation. `power.transform` implements multiple optimisation criteria, of which the following are available: `mle` (default): Optimisation using maximum likelihood estimation. `cramer_von_mises`: Optimisation using the Cramér-von Mises criterion. Zwanenburg and Löck (2023) found that this criterion was relatively robust against outliers.
`transformation_gof_test_p_value` (optional) Not all transformations will lead to features that are roughly normally distributed. Zwanenburg and Löck (2023) established an empirical goodness-of-fit test for central normality. This parameter sets the significance for rejecting the null-hypothesis that a feature distribution is centrally normal. When the null-hypothesis is rejected, no transformation is performed.
The default value is `NULL`, which disables the test.
`normalisation_method` (optional) The normalisation method used to improve the comparability between numerical features that may have very different scales. The following normalisation methods can be chosen: `none`: This disables feature normalisation. `standardisation`: Features are normalised by subtraction of their mean values and division by their standard deviations. This causes every feature to have a center value of 0.0 and standard deviation of 1.0. `standardisation_trim`: As `standardisation`, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers. `standardisation_winsor`: As `standardisation`, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers. `standardisation_robust` (default): A robust version of `standardisation` that relies on computing Huber's M-estimators for location and scale. `normalisation`: Features are normalised by subtraction of their minimum values and division by their ranges. This maps all feature values to a `[0, 1]` interval. `normalisation_trim`: As `normalisation`, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers. `normalisation_winsor`: As `normalisation`, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers. `quantile`: Features are normalised by subtraction of their median values and division by their interquartile range. `mean_centering`: Features are centered by subtracting the mean, but do not undergo rescaling. Only features that contain numerical data are normalised. Normalisation parameters obtained in development data are stored within `featureInfo` objects for later use with validation data sets.
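
The arithmetic behind several of the normalisation methods above can be written out in base R. The sketch below uses a hypothetical feature with one outlier. The quantile-based rule used to discard the 5% lowest and highest values is an assumption and may differ from the rule that `familiar` applies.

```r
# Hypothetical feature with a single large outlier.
x <- c(1:19, 1000)

# standardisation: subtract the mean, divide by the standard deviation.
standardised <- (x - mean(x)) / sd(x)

# normalisation: subtract the minimum, divide by the range.
normalised <- (x - min(x)) / (max(x) - min(x))

# quantile: subtract the median, divide by the interquartile range.
quantile_scaled <- (x - median(x)) / IQR(x)

# mean_centering: subtract the mean without rescaling.
centered <- x - mean(x)

# standardisation_trim: estimate mean and standard deviation on values
# between the 5% and 95% quantiles (assumed trimming rule), then apply
# these parameters to all values.
bounds <- quantile(x, probs = c(0.05, 0.95))
kept <- x[x >= bounds[1] & x <= bounds[2]]
standardised_trim <- (x - mean(kept)) / sd(kept)
```

In this example the trimmed parameters are estimated without the outlier, so the bulk of the values is not compressed towards zero as it is under plain `standardisation`.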
`batch_normalisation_method` (optional) The method used for batch normalisation. Available methods are: `none` (default): This disables batch normalisation of features. `standardisation`: Features within each batch are normalised by subtraction of the mean value and division by the standard deviation in each batch. `standardisation_trim`: As `standardisation`, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers. `standardisation_winsor`: As `standardisation`, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers. `standardisation_robust`: A robust version of `standardisation` that relies on computing Huber's M-estimators for location and scale within each batch. `normalisation`: Features within each batch are normalised by subtraction of their minimum values and division by their range in each batch. This maps all feature values in each batch to a `[0, 1]` interval. `normalisation_trim`: As `normalisation`, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers. `normalisation_winsor`: As `normalisation`, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers. `quantile`: Features in each batch are normalised by subtraction of the median value and division by the interquartile range of each batch. `mean_centering`: Features in each batch are centered on 0.0 by subtracting the mean value in each batch, but are not rescaled. `combat_parametric`: Batch adjustments using parametric empirical Bayes (Johnson et al, 2007). `combat_p` leads to the same method. `combat_non_parametric`: Batch adjustments using non-parametric empirical Bayes (Johnson et al, 2007). `combat_np` and `combat` lead to the same method.
Note that we reduced complexity from O(`n^2`) to O(`n`) by only computing batch adjustment parameters for each feature on a subset of 50 randomly selected features, instead of all features. Only features that contain numerical data are normalised using batch normalisation. Batch normalisation parameters obtained in development data are stored within `featureInfo` objects for later use with validation data sets, in case the validation data is from the same batch. If validation data contains data from unknown batches, normalisation parameters are separately determined for these batches. Note that for both empirical Bayes methods, the batch effect is assumed to produce results across the features. This is often true for things such as gene expressions, but the assumption may not hold generally. When performing batch normalisation, it is moreover important to check that differences between batches or cohorts are not related to the studied endpoint.
`imputation_method` (optional) Method used for imputing missing feature values. Two methods are implemented: `simple`: Simple replacement of a missing value by the median value (for numeric features) or the modal value (for categorical features). `lasso`: Imputation of missing values by lasso regression (using `glmnet`) based on information contained in other features. `simple` imputation precedes `lasso` imputation to ensure that any missing values in predictors required for `lasso` regression are resolved. The `lasso` estimate is then used to replace the missing value. The default value depends on the number of features in the dataset. If the number is lower than 100, `lasso` is used by default, and `simple` otherwise. Only single imputation is performed. Imputation models and parameters are stored within `featureInfo` objects for later use with validation data sets.
`cluster_method` (optional) Clustering is performed to identify and replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011). The cluster method determines the algorithm used to form the clusters. The following cluster methods are implemented: `none`: No clustering is performed. `hclust` (default): Hierarchical agglomerative clustering. If the `fastcluster` package is installed, `fastcluster::hclust` is used (Muellner, 2013), otherwise `stats::hclust` is used. `agnes`: Hierarchical clustering using agglomerative nesting (Kaufman and Rousseeuw, 1990). This algorithm is similar to `hclust`, but uses the `cluster::agnes` implementation. `diana`: Divisive analysis hierarchical clustering. This method uses divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990). `cluster::diana` is used. `pam`: Partitioning around medoids. This partitions the data into $k$ clusters around medoids (Kaufman and Rousseeuw, 1990). $k$ is selected using the `silhouette` metric. `pam` is implemented using the `cluster::pam` function. Clusters and cluster information are stored within `featureInfo` objects for later use with validation data sets. This enables reproduction of the same clusters as formed in the development data set.
`cluster_linkage_method` (optional) Linkage method used for agglomerative clustering in `hclust` and `agnes`. The following linkage methods can be used: `average` (default): Average linkage. `single`: Single linkage. `complete`: Complete linkage. `weighted`: Weighted linkage, also known as McQuitty linkage. `ward`: Linkage using Ward's minimum variance method. `diana` and `pam` do not require a linkage method.
`cluster_cut_method` (optional) The method used to define the actual clusters. The following methods can be used: `silhouette`: Clusters are formed based on the silhouette score (Rousseeuw, 1987). The average silhouette score is computed from 2 to `n` clusters, with `n` the number of features.
Clusters are only formed if the average silhouette exceeds 0.50, which indicates reasonable evidence for structure. This procedure may be slow if the number of features is large (>100s). `fixed_cut`: Clusters are formed by cutting the hierarchical tree at the point indicated by the `cluster_similarity_threshold`, e.g. where features in a cluster have an average Spearman correlation of 0.90. `fixed_cut` is only available for `agnes`, `diana` and `hclust`. `dynamic_cut`: Dynamic cluster formation using the cutting algorithm in the `dynamicTreeCut` package. This package should be installed to select this option. `dynamic_cut` can only be used with `agnes` and `hclust`. The default options are `silhouette` for partitioning around medoids (`pam`) and `fixed_cut` otherwise.
`cluster_similarity_metric` (optional) Clusters are formed based on feature similarity. All features are compared in a pair-wise fashion to compute similarity, for example correlation. The resulting similarity grid is converted into a distance matrix that is subsequently used for clustering. The following metrics are supported to compute pairwise similarities: `mutual_information` (default): normalised mutual information. `mcfadden_r2`: McFadden's pseudo R-squared (McFadden, 1974). `cox_snell_r2`: Cox and Snell's pseudo R-squared (Cox and Snell, 1989). `nagelkerke_r2`: Nagelkerke's pseudo R-squared (Nagelkerke, 1991). `spearman`: Spearman's rank order correlation. `kendall`: Kendall rank correlation. `pearson`: Pearson product-moment correlation. The pseudo R-squared metrics can be used to assess similarity between mixed pairs of numeric and categorical features, as these are based on the log-likelihood of regression models. In `familiar`, the more informative feature is used as the predictor and the other feature as the response variable. In numeric-categorical pairs, the numeric feature is considered to be more informative and is thus used as the predictor.
In categorical-categorical pairs, the feature with most levels is used as the predictor. In case any of the classical correlation coefficients (`pearson`, `spearman` and `kendall`) are used with (mixed) categorical features, the categorical features are one-hot encoded and the mean correlation over all resulting pairs is used as similarity. `cluster_similarity_threshold` (optional) The threshold level for pair-wise similarity that is required to form clusters using `fixed_cut`. This should be a numerical value between 0.0 and 1.0. Note however, that a reasonable threshold value depends strongly on the similarity metric. The following are the default values used: `mcfadden_r2` and `mutual_information`: `0.30` `cox_snell_r2` and `nagelkerke_r2`: `0.75` `spearman`, `kendall` and `pearson`: `0.90` Alternatively, if the `fixed_cut` method is not used, this value determines whether any clustering should be performed, because the data may not contain highly similar features. The default values in this situation are: `mcfadden_r2` and `mutual_information`: `0.25` `cox_snell_r2` and `nagelkerke_r2`: `0.40` `spearman`, `kendall` and `pearson`: `0.70` The threshold value is converted to a distance (1-similarity) prior to cutting hierarchical trees. `cluster_representation_method` (optional) Method used to determine how the information of co-clustered features is summarised and used to represent the cluster. The following methods can be selected: `best_predictor` (default): The feature with the highest importance according to univariate regression with the outcome is used to represent the cluster. `medioid`: The feature closest to the cluster center, i.e. the feature that is most similar to the remaining features in the cluster, is used to represent the cluster. `mean`: A meta-feature is generated by averaging the feature values for all features in a cluster. This method aligns all features so that all features will be positively correlated prior to averaging. 
Should a cluster contain one or more categorical features, the `medioid` method will be used instead, as averaging is not possible. Note that if this method is chosen, the `normalisation_method` parameter should be one of `standardisation`, `standardisation_trim`, `standardisation_winsor` or `quantile`. If the `pam` cluster method is selected, only the `medioid` method can be used. In that case 1 medoid is used by default. `parallel_preprocessing` (optional) Enable parallel processing for the preprocessing workflow. Defaults to `TRUE`. When set to `FALSE`, this will disable the use of parallel processing while preprocessing, regardless of the settings of the `parallel` parameter. `parallel_preprocessing` is ignored if `parallel=FALSE`. `fs_method` (required) Feature selection method to be used for determining variable importance. `familiar` implements various feature selection methods. Please refer to the vignette on feature selection methods for more details. More than one feature selection method can be chosen. The experiment will then be repeated for each feature selection method. Feature selection methods determine the ranking of features. Actual selection of features is done by optimising the signature size model hyperparameter during the hyperparameter optimisation step. `fs_method_parameter` (optional) List of lists containing parameters for feature selection methods. Each sublist should have the name of the feature selection method it corresponds to. Most feature selection methods do not have parameters that can be set. Please refer to the vignette on feature selection methods for more details. Note that if the feature selection method is based on a learner (e.g. lasso regression), hyperparameter optimisation may be performed prior to assessing variable importance. `vimp_aggregation_method` (optional) The method used to aggregate variable importances over different data subsets, e.g. bootstraps. 
The following methods can be selected: `none`: Don't aggregate ranks, but rather aggregate the variable importance scores themselves. `mean`: Use the mean rank of a feature over the subsets to determine the aggregated feature rank. `median`: Use the median rank of a feature over the subsets to determine the aggregated feature rank. `best`: Use the best rank the feature obtained in any subset to determine the aggregated feature rank. `worst`: Use the worst rank the feature obtained in any subset to determine the aggregated feature rank. `stability`: Use the frequency of the feature being in the subset of highly ranked features as measure for the aggregated feature rank (Meinshausen and Buehlmann, 2010). `exponential`: Use a rank-weighted frequency of occurrence in the subset of highly ranked features as measure for the aggregated feature rank (Haury et al., 2011). `borda` (default): Use the borda count as measure for the aggregated feature rank (Wald et al., 2012). `enhanced_borda`: Use an occurrence frequency-weighted borda count as measure for the aggregated feature rank (Wald et al., 2012). `truncated_borda`: Use borda count computed only on features within the subset of highly ranked features. `enhanced_truncated_borda`: Apply both the enhanced borda method and the truncated borda method and use the resulting borda count as the aggregated feature rank. The feature selection methods vignette provides additional information. `vimp_aggregation_rank_threshold` (optional) The threshold used to define the subset of highly important features. If not set, this threshold is determined by maximising the variance in the occurrence value over all features over the subset size. This parameter is only relevant for `stability`, `exponential`, `enhanced_borda`, `truncated_borda` and `enhanced_truncated_borda` methods. `parallel_feature_selection` (optional) Enable parallel processing for the feature selection workflow. Defaults to `TRUE`. 
When set to `FALSE`, this will disable the use of parallel processing while performing feature selection, regardless of the settings of the `parallel` parameter. `parallel_feature_selection` is ignored if `parallel=FALSE`. `learner` (required) One or more algorithms used for model development. A sizeable number of learners are supported in `familiar`. Please see the vignette on learners for more information concerning the available learners. `hyperparameter` (optional) List of lists containing hyperparameters for learners. Each sublist should have the name of the learner method it corresponds to, with list elements being named after the intended hyperparameter, e.g. `"glm_logistic"=list("sign_size"=3)`. All learners have hyperparameters. Please refer to the vignette on learners for more details. If no parameters are provided, sequential model-based optimisation is used to determine optimal hyperparameters. Hyperparameters provided by the user are never optimised. However, if more than one value is provided for a single hyperparameter, optimisation will be conducted using these values. `novelty_detector` (optional) Specify the algorithm used for training a novelty detector. This detector can be used to identify out-of-distribution data prospectively. `detector_parameters` (optional) List of lists containing hyperparameters for novelty detectors. Currently not used. `parallel_model_development` (optional) Enable parallel processing for the model development workflow. Defaults to `TRUE`. When set to `FALSE`, this will disable the use of parallel processing while developing models, regardless of the settings of the `parallel` parameter. `parallel_model_development` is ignored if `parallel=FALSE`. `optimisation_bootstraps` (optional) Number of bootstraps that should be generated from the development data set. 
During the optimisation procedure one or more of these bootstraps (indicated by `smbo_step_bootstraps`) are used for model development using different combinations of hyperparameters. The effect of the hyperparameters is then assessed by comparing in-bag and out-of-bag model performance. The default number of bootstraps is `50`. Hyperparameter optimisation may finish before exhausting the set of bootstraps. `optimisation_determine_vimp` (optional) Logical value that indicates whether variable importance is determined separately for each of the bootstraps created during the optimisation process (`TRUE`) or the applicable results from the feature selection step are used (`FALSE`). Determining variable importance increases the initial computational overhead. However, it prevents positive biases for the out-of-bag data due to overlap of these data with the development data set used for the feature selection step. In this case, any hyperparameters of the variable importance method are not determined separately for each bootstrap, but those obtained during the feature selection step are used instead. In case multiple of such hyperparameter sets could be applicable, the set that will be used is randomly selected for each bootstrap. This parameter only affects hyperparameter optimisation of learners. The default is `TRUE`. `smbo_random_initialisation` (optional) String indicating the initialisation method for the hyperparameter space. Can be one of `fixed_subsample` (default), `fixed`, or `random`. `fixed` and `fixed_subsample` first create hyperparameter sets from a range of default values set by familiar. `fixed_subsample` then randomly draws up to `smbo_n_random_sets` from the grid. `random` does not rely upon a fixed grid, and randomly draws up to `smbo_n_random_sets` hyperparameter sets from the hyperparameter space. `smbo_n_random_sets` (optional) Number of random or subsampled hyperparameters drawn during the initialisation process. Default: `100`. 
Cannot be smaller than `10`. The parameter is not used when `smbo_random_initialisation` is `fixed`, as the entire pre-defined grid will be explored. `max_smbo_iterations` (optional) Maximum number of intensify iterations of the SMBO algorithm. During an intensify iteration a run-off occurs between the current best hyperparameter combination and either the 10 challenger combinations with the highest expected improvement or a set of 20 random combinations. Run-off with random combinations is used to force exploration of the hyperparameter space, and is performed every second intensify iteration, or if there is no expected improvement for any challenger combination. If a combination of hyperparameters leads to better performance on the same data than the incumbent best set of hyperparameters, it replaces the incumbent set at the end of the intensify iteration. The default number of intensify iterations is `20`. Iterations may be stopped early if the incumbent set of hyperparameters remains the same for `smbo_stop_convergent_iterations` iterations, or performance improvement is minimal. This behaviour is suppressed during the first 4 iterations to enable the algorithm to explore the hyperparameter space. `smbo_stop_convergent_iterations` (optional) The number of subsequent convergent SMBO iterations required to stop hyperparameter optimisation early. An iteration is convergent if the best parameter set has not changed or the optimisation score over the 4 most recent iterations has not changed beyond the tolerance level in `smbo_stop_tolerance`. The default value is `3`. `smbo_stop_tolerance` (optional) Tolerance for early stopping due to convergent optimisation score. The default value depends on the square root of the number of samples (at the series level), and is `0.01` for 100 samples. This value is computed as `0.1 * 1 / sqrt(n_samples)`. The default value does not decrease below `0.0001`, which is reached at 1M samples. 
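The default tolerance described for `smbo_stop_tolerance` can be sketched in base R. This is a minimal illustration and not `familiar`'s internal code: the helper name `default_smbo_stop_tolerance` is hypothetical, and the `0.0001` limit stated for 1M or more samples is assumed to act as a floor on the computed value.

```r
# Hypothetical helper (not part of familiar): default tolerance as a
# function of the number of samples, following the documented formula.
default_smbo_stop_tolerance <- function(n_samples) {
  stopifnot(is.numeric(n_samples), all(n_samples > 0))

  # Documented formula: 0.1 * 1 / sqrt(n_samples).
  tolerance <- 0.1 * 1 / sqrt(n_samples)

  # Assumption: the documented limit of 0.0001 is applied as a floor.
  pmax(tolerance, 1e-4)
}

# Tolerance for 100, 10 thousand, 1 million and 100 million samples.
tolerances <- default_smbo_stop_tolerance(c(100, 1e4, 1e6, 1e8))
print(tolerances)
```

Under these assumptions the tolerance shrinks with the square root of the sample size until it reaches the limit, so larger data sets require smaller score changes before an iteration counts as convergent.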
`smbo_time_limit` (optional) Time limit (in minutes) for the optimisation process. Optimisation is stopped after this limit is exceeded. Time taken to determine variable importance for the optimisation process (see the `optimisation_determine_vimp` parameter) does not count. The default is `NULL`, indicating that there is no time limit for the optimisation process. The time limit cannot be less than 1 minute. `smbo_initial_bootstraps` (optional) The number of bootstraps taken from the set of `optimisation_bootstraps` as the bootstraps assessed initially. The default value is `1`. The value cannot be larger than `optimisation_bootstraps`. `smbo_step_bootstraps` (optional) The number of bootstraps taken from the set of `optimisation_bootstraps` bootstraps as the bootstraps assessed during the steps of each intensify iteration. The default value is `3`. The value cannot be larger than `optimisation_bootstraps`. `smbo_intensify_steps` (optional) The number of steps in each SMBO intensify iteration. Each step a new set of `smbo_step_bootstraps` bootstraps is drawn and used in the run-off between the incumbent best hyperparameter combination and its challengers. The default value is `5`. Higher numbers allow for a more detailed comparison, but this comes with added computational cost. `optimisation_metric` (optional) One or more metrics used to compute performance scores. See the vignette on performance metrics for the available metrics. If unset, the following metrics are used by default: `auc_roc`: For `binomial` and `multinomial` models. `mse`: Mean squared error for `continuous` models. `msle`: Mean squared logarithmic error for `count` models. `concordance_index`: For `survival` models. Multiple optimisation metrics can be specified. Actual metric values are converted to an objective value by comparison with a baseline metric value that derives from a trivial model, i.e. 
majority class for binomial and multinomial outcomes, the median outcome for count and continuous outcomes and a fixed risk or time for survival outcomes. `optimisation_function` (optional) Type of optimisation function used to quantify the performance of a hyperparameter set. Model performance is assessed using the metric(s) specified by `optimisation_metric` on the in-bag (IB) and out-of-bag (OOB) samples of a bootstrap. These values are converted to objective scores with a standardised interval of `[-1.0, 1.0]`. Each pair of objective scores is subsequently used to compute an optimisation score. The optimisation score across different bootstraps is then aggregated to a summary score. This summary score is used to rank hyperparameter sets, and select the optimal set. The combination of optimisation score and summary score is determined by the optimisation function indicated by this parameter: `validation` or `max_validation` (default): seeks to maximise OOB score. `balanced`: seeks to balance IB and OOB score. `stronger_balance`: similar to `balanced`, but with stronger penalty for differences between IB and OOB scores. `validation_minus_sd`: seeks to optimise the average OOB score minus its standard deviation. `validation_25th_percentile`: seeks to optimise the 25th percentile of OOB scores, and is conceptually similar to `validation_minus_sd`. `model_estimate`: seeks to maximise the OOB score estimate predicted by the hyperparameter learner (not available for random search). `model_estimate_minus_sd`: seeks to maximise the OOB score estimate minus its estimated standard deviation, as predicted by the hyperparameter learner (not available for random search). `model_balanced_estimate`: seeks to maximise the estimate of the balanced IB and OOB score. This is similar to the `balanced` score, and in fact uses a hyperparameter learner to predict said score (not available for random search). 
`model_balanced_estimate_minus_sd`: seeks to maximise the estimate of the balanced IB and OOB score, minus its estimated standard deviation. This is similar to the `balanced` score, but takes into account its estimated spread. Additional details are provided in the Learning algorithms and hyperparameter optimisation vignette. `hyperparameter_learner` (optional) Any point in the hyperparameter space has a single, scalar, optimisation score value that is a priori unknown. During the optimisation process, the algorithm samples from the hyperparameter space by selecting hyperparameter sets and computing the optimisation score value for one or more bootstraps. For each hyperparameter set the resulting values are distributed around the actual value. The learner indicated by `hyperparameter_learner` is then used to infer optimisation score estimates for unsampled parts of the hyperparameter space. The following models are available: `bayesian_additive_regression_trees` or `bart`: Uses Bayesian Additive Regression Trees (Sparapani et al., 2021) for inference. Unlike standard random forests, BART allows for estimating posterior distributions directly and can extrapolate. `gaussian_process` (default): Creates a localised approximate Gaussian process for inference (Gramacy, 2016). This allows for better scaling than deterministic Gaussian Processes. `random_forest`: Creates a random forest for inference. Originally suggested by Hutter et al. (2011). A weakness of random forests is their lack of extrapolation beyond observed values, which limits their usefulness in exploiting promising areas of hyperparameter space. `random` or `random_search`: Forgoes the use of models to steer optimisation. Instead, a random search is performed. `acquisition_function` (optional) The acquisition function influences how new hyperparameter sets are selected. 
The algorithm uses the model learned by the learner indicated by `hyperparameter_learner` to search the hyperparameter space for hyperparameter sets that are either likely better than the best known set (exploitation) or where there is considerable uncertainty (exploration). The acquisition function quantifies this (Shahriari et al., 2016). The following acquisition functions are available, and are described in more detail in the learner algorithms vignette: `improvement_probability`: The probability of improvement quantifies the probability that the expected optimisation score for a set is better than the best observed optimisation score. `improvement_empirical_probability`: Similar to `improvement_probability`, but based directly on optimisation scores predicted by the individual decision trees. `expected_improvement` (default): Computes expected improvement. `upper_confidence_bound`: This acquisition function is based on the upper confidence bound of the distribution (Srinivas et al., 2012). `bayes_upper_confidence_bound`: This acquisition function is based on the upper confidence bound of the distribution (Kaufmann et al., 2012). `exploration_method` (optional) Method used to steer exploration in post-initialisation intensive searching steps. As stated earlier, each SMBO iteration compares suggested alternative parameter sets with an incumbent best set in a series of steps. The exploration method controls how the set of alternative parameter sets is pruned after each step in an iteration. Can be one of the following: `single_shot` (default): The set of alternative parameter sets is not pruned, and each intensification iteration contains only a single intensification step that only uses a single bootstrap. This is the fastest exploration method, but only superficially tests each parameter set. `successive_halving`: The set of alternative parameter sets is pruned by removing the worst performing half of the sets after each step (Jamieson and Talwalkar, 2016). 
`stochastic_reject`: The set of alternative parameter sets is pruned by comparing the performance of each parameter set with that of the incumbent best parameter set using a paired Wilcoxon test based on shared bootstraps. Parameter sets that perform significantly worse, at an alpha level indicated by `smbo_stochastic_reject_p_value`, are pruned. `none`: The set of alternative parameter sets is not pruned. `smbo_stochastic_reject_p_value` (optional) The p-value threshold used for the `stochastic_reject` exploration method. The default value is `0.05`. `parallel_hyperparameter_optimisation` (optional) Enable parallel processing for hyperparameter optimisation. Defaults to `TRUE`. When set to `FALSE`, this will disable the use of parallel processing while performing optimisation, regardless of the settings of the `parallel` parameter. The parameter moreover specifies whether parallelisation takes place within the optimisation algorithm (`inner`, default), or in an outer loop (`outer`) over learners, data subsamples, etc. `parallel_hyperparameter_optimisation` is ignored if `parallel=FALSE`. `evaluate_top_level_only` (optional) Flag that signals that only evaluation at the most global experiment level is required. Consider a cross-validation experiment with additional external validation. The global experiment level consists of data that are used for development, internal validation and external validation. The next lower experiment level consists of the individual cross-validation iterations. When the flag is `true`, evaluations take place on the global level only, and no results are generated for the next lower experiment levels. In our example, this means that results from individual cross-validation iterations are not computed and shown. When the flag is `false`, results are computed from both the global layer and the next lower level. Setting the flag to `true` saves computation time. 
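The hyperparameter optimisation arguments described above can be collected in a named list before being passed to `summon_familiar`. The sketch below only builds and checks such a list in base R. The argument names and values are taken from the descriptions above, whereas the object name `smbo_settings` and the placeholders `my_data` and `other_settings` in the commented call are hypothetical.

```r
# Hyperparameter optimisation settings, using argument names and
# documented default or example values from the descriptions above.
smbo_settings <- list(
  hyperparameter = list("glm_logistic" = list("sign_size" = 3)),
  optimisation_bootstraps = 50L,
  smbo_random_initialisation = "fixed_subsample",
  smbo_n_random_sets = 100L,
  max_smbo_iterations = 20L,
  smbo_stop_convergent_iterations = 3L,
  smbo_initial_bootstraps = 1L,
  smbo_step_bootstraps = 3L,
  smbo_intensify_steps = 5L,
  optimisation_function = "validation",
  hyperparameter_learner = "gaussian_process",
  acquisition_function = "expected_improvement",
  exploration_method = "stochastic_reject",
  smbo_stochastic_reject_p_value = 0.05,
  evaluate_top_level_only = TRUE
)

# Check the constraints stated in the argument descriptions.
stopifnot(
  smbo_settings$smbo_n_random_sets >= 10L,
  smbo_settings$smbo_initial_bootstraps <= smbo_settings$optimisation_bootstraps,
  smbo_settings$smbo_step_bootstraps <= smbo_settings$optimisation_bootstraps
)

# Not run: `my_data` and `other_settings` are hypothetical placeholders for
# the data set and the remaining arguments, such as `learner` and `fs_method`.
# args <- c(list(data = my_data), other_settings, smbo_settings)
# do.call(familiar::summon_familiar, args)
```

Keeping the settings in one list makes it easier to reuse the same optimisation set-up across experiments. A non-default `exploration_method` is used here only to show where `smbo_stochastic_reject_p_value` becomes relevant.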
`skip_evaluation_elements` (optional) Specifies which evaluation steps, if any, should be skipped as part of the evaluation process. Defaults to `none`, which means that all relevant evaluation steps are performed. It can have one or more of the following values: `none`, `false`: no steps are skipped. `all`, `true`: all steps are skipped. `auc_data`: data for assessing and plotting the area under the receiver operating characteristic curve are not computed. `calibration_data`: data for assessing and plotting model calibration are not computed. `calibration_info`: data required to assess calibration, such as baseline survival curves, are not collected. These data will still be present in the models. `confusion_matrix`: data for assessing and plotting a confusion matrix are not collected. `decision_curve_analyis`: data for performing a decision curve analysis are not computed. `feature_expressions`: data for assessing and plotting sample clustering are not computed. `feature_similarity`: data for assessing and plotting feature clusters are not computed. `fs_vimp`: data for assessing and plotting feature selection-based variable importance are not collected. `hyperparameters`: data for assessing model hyperparameters are not collected. These data will still be present in the models. `ice_data`: data for individual conditional expectation and partial dependence plots are not created. `model_performance`: data for assessing and visualising model performance are not created. `model_vimp`: data for assessing and plotting model-based variable importance are not collected. `permutation_vimp`: data for assessing and plotting model-agnostic permutation variable importance are not computed. `prediction_data`: predictions for each sample are not made and exported. `risk_stratification_data`: data for assessing and plotting Kaplan-Meier survival curves are not collected. `risk_stratification_info`: data for assessing stratification into risk groups are not computed. 
`univariate_analysis`: data for assessing and plotting univariate feature importance are not computed. `ensemble_method` (optional) Method for ensembling predictions from models for the same sample. Available methods are: `median` (default): Use the median of the predicted values as the ensemble value for a sample. `mean`: Use the mean of the predicted values as the ensemble value for a sample. This parameter is only used if `detail_level` is `ensemble`. `evaluation_metric` (optional) One or more metrics for assessing model performance. See the vignette on performance metrics for the available metrics. Confidence intervals (or rather credibility intervals) are computed for each metric during evaluation. This is done using bootstraps, the number of which depends on the value of `confidence_level` (Davison and Hinkley, 1997). If unset, the metric in the `optimisation_metric` variable is used. `sample_limit` (optional) Set the upper limit of the number of samples that are used during evaluation steps. Cannot be less than 20. This setting can be specified per data element by providing a parameter value in a named list with data elements, e.g. `list("sample_similarity"=100, "permutation_vimp"=1000)`. This parameter can be set for the following data elements: `sample_similarity` and `ice_data`. `detail_level` (optional) Sets the level at which results are computed and aggregated. `ensemble`: Results are computed at the ensemble level, i.e. over all models in the ensemble. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the model performance of the ensemble model for each bootstrap. `hybrid` (default): Results are computed at the level of models in an ensemble. This means that, for example, bias-corrected estimates of model performance are directly computed using the models in the ensemble. 
If there are at least 20 trained models in the ensemble, performance is computed for each model, in contrast to `ensemble` where performance is computed for the ensemble of models. If there are fewer than 20 trained models in the ensemble, bootstraps are created so that at least 20 point estimates can be made. `model`: Results are computed at the model level. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the performance of the model for each bootstrap. Note that each level of detail has a different interpretation for bootstrap confidence intervals. For `ensemble` and `model` these are the confidence intervals for the ensemble and an individual model, respectively. That is, the confidence interval describes the range where an estimate produced by a respective ensemble or model trained on a repeat of the experiment may be found with the probability of the confidence level. For `hybrid`, it represents the range where any single model trained on a repeat of the experiment may be found with the probability of the confidence level. By definition, confidence intervals obtained using `hybrid` are at least as wide as those for `ensemble`. `hybrid` offers the correct interpretation if the goal of the analysis is to assess the result of a single, unspecified, model. `hybrid` is generally computationally less expensive than `ensemble`, which in turn is somewhat less expensive than `model`. A non-default `detail_level` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"="ensemble", "model_performance"="hybrid")`. This parameter can be set for the following data elements: `auc_data`, `decision_curve_analyis`, `model_performance`, `permutation_vimp`, `ice_data`, `prediction_data` and `confusion_matrix`. `estimation_type` (optional) Sets the type of estimation that should be possible. 
This has the following options: `point`: Point estimates. `bias_correction` or `bc`: Bias-corrected estimates. A bias-corrected estimate is computed from (at least) 20 point estimates, and `familiar` may bootstrap the data to create them. `bootstrap_confidence_interval` or `bci` (default): Bias-corrected estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The number of point estimates required depends on the `confidence_level` parameter, and `familiar` may bootstrap the data to create them. As with `detail_level`, a non-default `estimation_type` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"="bci", "model_performance"="point")`. This parameter can be set for the following data elements: `auc_data`, `decision_curve_analyis`, `model_performance`, `permutation_vimp`, `ice_data`, and `prediction_data`. `aggregate_results` (optional) Flag that signifies whether results should be aggregated during evaluation. If `estimation_type` is `bias_correction` or `bc`, aggregation leads to a single bias-corrected estimate. If `estimation_type` is `bootstrap_confidence_interval` or `bci`, aggregation leads to a single bias-corrected estimate with lower and upper boundaries of the confidence interval. This has no effect if `estimation_type` is `point`. The default value is equal to `TRUE` except when assessing metrics to assess model performance, as the default violin plot requires underlying data. As with `detail_level` and `estimation_type`, a non-default `aggregate_results` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"=TRUE, "model_performance"=FALSE)`. This parameter exists for the same elements as `estimation_type`. `confidence_level` (optional) Numeric value for the level at which confidence intervals are determined. 
When bootstraps are used to determine the confidence intervals, `familiar` uses the rule of thumb `n = 20 / ci.level` to determine the number of required bootstraps. The default value is `0.95`. `bootstrap_ci_method` (optional) Method used to determine bootstrap confidence intervals (Efron and Hastie, 2016). The following methods are implemented: `percentile` (default): Confidence intervals obtained using the percentile method. `bc`: Bias-corrected confidence intervals. Note that the standard method is not implemented because this method is often not suitable due to non-normal distributions. The bias-corrected and accelerated (BCa) method is not implemented yet. `feature_cluster_method` (optional) Method used to perform clustering of features. The same methods as for the `cluster_method` configuration parameter are available: `none`, `hclust`, `agnes`, `diana` and `pam`. The value for the `cluster_method` configuration parameter is used by default. When generating clusters for the purpose of determining mutual correlation and ordering feature expressions, `none` is ignored and `hclust` is used instead. `feature_linkage_method` (optional) Method used for agglomerative clustering with `hclust` and `agnes`. Linkage determines how features are sequentially combined into clusters based on distance. The methods are shared with the `cluster_linkage_method` configuration parameter: `average`, `single`, `complete`, `weighted`, and `ward`. The value for the `cluster_linkage_method` configuration parameter is used by default. `feature_cluster_cut_method` (optional) Method used to divide features into separate clusters. The available methods are the same as for the `cluster_cut_method` configuration parameter: `silhouette`, `fixed_cut` and `dynamic_cut`. `silhouette` is available for all cluster methods, but `fixed_cut` only applies to methods that create hierarchical trees (`hclust`, `agnes` and `diana`). 
`dynamic_cut` requires the `dynamicTreeCut` package and can only be used with `agnes` and `hclust`. The value for the `cluster_cut_method` configuration parameter is used by default. `feature_similarity_metric` (optional) Metric to determine pairwise similarity between features. Similarity is computed in the same manner as for clustering, and `feature_similarity_metric` therefore has the same options as `cluster_similarity_metric`: `mcfadden_r2`, `cox_snell_r2`, `nagelkerke_r2`, `mutual_information`, `spearman`, `kendall` and `pearson`. The value for the `cluster_similarity_metric` configuration parameter is used by default. `feature_similarity_threshold` (optional) The threshold level for pair-wise similarity that is required to form feature clusters with the `fixed_cut` method. This threshold functions in the same manner as the one defined using the `cluster_similarity_threshold` parameter. By default, the value for the `cluster_similarity_threshold` configuration parameter is used. Unlike for `cluster_similarity_threshold`, more than one value can be supplied here. `sample_cluster_method` (optional) The method used to perform clustering based on distance between samples. These are the same methods as for the `cluster_method` configuration parameter: `hclust`, `agnes`, `diana` and `pam`. The value for the `cluster_method` configuration parameter is used by default. When generating clusters for the purpose of ordering samples in feature expressions, `none` is ignored and `hclust` is used instead. `sample_linkage_method` (optional) The method used for agglomerative clustering in `hclust` and `agnes`. These are the same methods as for the `cluster_linkage_method` configuration parameter: `average`, `single`, `complete`, `weighted`, and `ward`. The value for the `cluster_linkage_method` configuration parameter is used by default. `sample_similarity_metric` (optional) Metric to determine pairwise similarity between samples. 
Similarity is computed in the same manner as for clustering, but `sample_similarity_metric` has different options that are better suited to computing distance between samples instead of between features. The following metrics are available:
`gower` (default): Compute Gower's distance between samples. By default, Gower's distance is computed based on winsorised data to reduce the effect of outliers (see below).
`euclidean`: Compute the Euclidean distance between samples.
The underlying feature data for numerical features is scaled to the `[0,1]` range using the feature values across the samples. The required normalisation parameters can optionally be computed from feature data in which the outer 5% (on both sides) of feature values are trimmed or winsorised. To do so, append `_trim` (trimming) or `_winsor` (winsorising) to the metric name. This reduces the effect of outliers somewhat. Regardless of metric, all categorical features are handled as in Gower's distance: the distance is 0 if the values in a pair of samples match, and 1 if they do not.

`eval_aggregation_method` (optional) Method for aggregating variable importances for the purpose of evaluation. Variable importances are determined during feature selection steps and after training the model. Both types are evaluated, but feature selection variable importance is only evaluated at run-time. See the documentation for the `vimp_aggregation_method` argument for information concerning the different methods available.

`eval_aggregation_rank_threshold` (optional) The threshold used to define the subset of highly important features during evaluation. See the documentation for the `vimp_aggregation_rank_threshold` argument for more information.

`eval_icc_type` (optional) String indicating the type of intraclass correlation coefficient (`1`, `2` or `3`) that should be used to compute robustness for features in repeated measurements during the evaluation of univariate importance.
These types correspond to the types in Shrout and Fleiss (1979). The default value is `1`.

`stratification_method` (optional) Method for determining the stratification threshold for creating survival groups. The actual, model-dependent threshold value is obtained from the development data, and can afterwards be used to perform stratification on validation data. The following stratification methods are available:
`median` (default): The median predicted value in the development cohort is used to stratify the samples into two risk groups. For predicted outcome values that form a continuous spectrum, the two risk groups in the development cohort will be roughly equal in size.
`mean`: The mean predicted value in the development cohort is used to stratify the samples into two risk groups.
`mean_trim`: As `mean`, but based on the set of predicted values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers.
`mean_winsor`: As `mean`, but based on the set of predicted values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers.
`fixed`: Samples are stratified based on the sample quantiles of the predicted values. These quantiles are defined using the `stratification_threshold` parameter.
`optimised`: Use maximally selected rank statistics to determine the optimal threshold (Lausen and Schumacher, 1992; Hothorn and Lausen, 2003) to stratify samples into two optimally separated risk groups.
One or more stratification methods can be selected simultaneously. This parameter is only relevant for `survival` outcomes.

`stratification_threshold` (optional) Numeric value(s) signifying the sample quantiles for stratification using the `fixed` method. The number of risk groups will be the number of values plus one. The default value is `c(1/3, 2/3)`, which will yield two thresholds that divide samples into three equally sized groups.
If `fixed` is not among the selected stratification methods, this parameter is ignored. This parameter is only relevant for `survival` outcomes.

`time_max` (optional) Time point which is used as the benchmark for e.g. cumulative risks generated by random forest, or the cutoff for Uno's concordance index. If `time_max` is not provided, but `evaluation_times` is, the largest value of `evaluation_times` is used. If neither is provided, `time_max` is set to the 98th percentile of the distribution of survival times for samples with an event in the development data set. This parameter is only relevant for `survival` outcomes.

`evaluation_times` (optional) One or more time points that are used for assessing calibration in survival problems. Time points are required because expected and observed survival probabilities depend on time. If unset, `evaluation_times` will be equal to `time_max`. This parameter is only relevant for `survival` outcomes.

`dynamic_model_loading` (optional) Enables dynamic loading of models during the evaluation process, if `TRUE`. Defaults to `FALSE`. Dynamic loading of models may reduce the overall memory footprint, at the cost of increased disk or network I/O. Models can only be dynamically loaded if they are found at an accessible disk or network location. Setting this parameter to `TRUE` may help if parallel processing causes out-of-memory issues during evaluation.

`parallel_evaluation` (optional) Enable parallel processing for evaluation. Defaults to `TRUE`. When set to `FALSE`, this will disable the use of parallel processing while performing evaluation, regardless of the settings of the `parallel` parameter. The parameter moreover specifies whether parallelisation takes place within the evaluation process steps (`inner`, default), or in an outer loop (`outer`) over learners, data subsamples, etc. `parallel_evaluation` is ignored if `parallel=FALSE`.
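
The `percentile` and `bc` options of `bootstrap_ci_method` can be illustrated with a minimal base R sketch. The helper `bootstrap_ci`, its arguments and the choice of 400 bootstraps are assumptions made for this illustration and are not part of `familiar`. The bias correction follows the general description in Efron and Hastie (2016); the package's internal implementation may differ.

```r
# Hypothetical helper (not part of familiar): percentile and bias-corrected
# bootstrap confidence intervals for a point estimate.
bootstrap_ci <- function(point_estimate, bootstrap_values,
                         confidence_level = 0.95, method = "percentile") {
  alpha <- 1 - confidence_level
  probs <- c(alpha / 2, 1 - alpha / 2)

  if (method == "bc") {
    # z0 expresses, on the standard normal scale, how far the point estimate
    # lies from the median of the bootstrap distribution.
    z0 <- stats::qnorm(mean(bootstrap_values <= point_estimate))
    # Shift the percentiles that are read from the bootstrap distribution.
    probs <- stats::pnorm(2 * z0 + stats::qnorm(probs))
  } else if (method != "percentile") {
    stop("Unknown method: ", method)
  }

  stats::quantile(bootstrap_values, probs = probs, names = FALSE)
}

set.seed(42)
x <- stats::rexp(50)                 # skewed sample
point_estimate <- stats::median(x)
n_bootstrap <- 400L                  # chosen for illustration only
bootstrap_values <- replicate(
  n_bootstrap,
  stats::median(sample(x, replace = TRUE))
)

ci_percentile <- bootstrap_ci(point_estimate, bootstrap_values)
ci_bc <- bootstrap_ci(point_estimate, bootstrap_values, method = "bc")
```

When the point estimate coincides with the median of the bootstrap distribution, `z0` is zero and both methods return the same interval.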

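The `median`, `mean`, `mean_trim`, `mean_winsor` and `fixed` options of `stratification_method` can be sketched in base R as follows. The helpers `stratification_thresholds` and `assign_risk_group` are hypothetical and not part of `familiar`. The sketch assumes that a predicted value equal to a threshold falls into the higher group, which may not match the package's behaviour, and it does not cover the `optimised` method.

```r
# Hypothetical helper (not part of familiar): derive stratification thresholds
# from the predicted values of a development cohort.
stratification_thresholds <- function(predicted, method = "median",
                                      quantiles = c(1 / 3, 2 / 3),
                                      fraction = 0.05) {
  # Bounds that separate the lowest and highest 5% of predicted values.
  bounds <- stats::quantile(
    predicted,
    probs = c(fraction, 1 - fraction),
    names = FALSE
  )

  switch(
    method,
    median = stats::median(predicted),
    mean = mean(predicted),
    # Trimming discards the values outside the bounds.
    mean_trim = mean(predicted[predicted >= bounds[1] & predicted <= bounds[2]]),
    # Winsorising replaces the values outside the bounds by the bounds.
    mean_winsor = mean(pmin(pmax(predicted, bounds[1]), bounds[2])),
    # Fixed thresholds are sample quantiles of the predicted values.
    fixed = stats::quantile(predicted, probs = quantiles, names = FALSE),
    stop("Unknown stratification method: ", method)
  )
}

# Assign samples to risk groups: k thresholds yield k + 1 groups.
assign_risk_group <- function(predicted, thresholds) {
  findInterval(predicted, sort(thresholds)) + 1L
}

set.seed(1)
predicted <- c(stats::rexp(98), 50, 100)  # skewed predictions with two outliers

threshold_median <- stratification_thresholds(predicted, "median")
threshold_fixed <- stratification_thresholds(predicted, "fixed")
risk_group <- assign_risk_group(predicted, threshold_fixed)
table(risk_group)
```

Thresholds derived this way from development data can be passed unchanged to `assign_risk_group` for validation data, mirroring the description of `stratification_method` above.
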
Value

Nothing, in which case all output is written to the experiment directory. If the experiment directory is in a temporary location, a list with all `familiarModel`, `familiarEnsemble`, `familiarData` and `familiarCollection` objects is returned instead.

References

Storey, J. D. A direct approach to false discovery rates. J. R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).
Shrout, P. E. & Fleiss, J. L. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).
Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
Yeo, I. & Johnson, R. A. A new family of power transformations to improve normality or symmetry. Biometrika 87, 954–959 (2000).
Box, G. E. P. & Cox, D. R. An analysis of transformations. J. R. Stat. Soc. Series B Stat. Methodol. 26, 211–252 (1964).
Raymaekers, J. & Rousseeuw, P. J. Transforming variables to central normality. Mach. Learn. (2021).
Park, M. Y., Hastie, T. & Tibshirani, R. Averaged gene expressions for regression. Biostatistics 8, 212–227 (2007).
Tolosi, L. & Lengauer, T. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27, 1986–1994 (2011).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Kaufman, L. & Rousseeuw, P. J. Finding groups in data: an introduction to cluster analysis. (John Wiley & Sons, 2009).
Muellner, D. fastcluster: fast hierarchical, agglomerative clustering routines for R and Python. J. Stat. Softw. 53, 1–18 (2013).
Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719–720 (2008).
McFadden, D. Conditional logit analysis of qualitative choice behavior. in Frontiers in Econometrics (ed. Zarembka, P.) 105–142 (Academic Press, 1974).
Cox, D. R. & Snell, E. J. Analysis of binary data. (Chapman and Hall, 1989).
Nagelkerke, N. J. D. A note on a general definition of the coefficient of determination. Biometrika 78, 691–692 (1991).
Meinshausen, N. & Buehlmann, P. Stability selection. J. R. Stat. Soc. Series B Stat. Methodol. 72, 417–473 (2010).
Haury, A.-C., Gestraud, P. & Vert, J.-P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 6, e28210 (2011).
Wald, R., Khoshgoftaar, T. M., Dittman, D., Awada, W. & Napolitano, A. An extensive comparison of feature ranking aggregation techniques in bioinformatics. in 2012 IEEE 13th International Conference on Information Reuse Integration (IRI) 377–384 (2012).
Hutter, F., Hoos, H. H. & Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. in Learning and Intelligent Optimization (ed. Coello, C. A. C.) vol. 6683 507–523 (Springer Berlin Heidelberg, 2011).
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc. IEEE 104, 148–175 (2016).
Srinivas, N., Krause, A., Kakade, S. M. & Seeger, M. W. Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting. IEEE Trans. Inf. Theory 58, 3250–3265 (2012).
Kaufmann, E., Cappé, O. & Garivier, A. On Bayesian upper confidence bounds for bandit problems. in Artificial intelligence and statistics 592–600 (2012).
Jamieson, K. & Talwalkar, A. Non-stochastic Best Arm Identification and Hyperparameter Optimization. in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (eds. Gretton, A. & Robert, C. C.) vol. 51 240–248 (PMLR, 2016).
Gramacy, R. B. laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R. J. Stat. Softw. 72, 1–46 (2016).
Sparapani, R., Spanbauer, C. & McCulloch, R. Nonparametric Machine Learning and Efficient Computation with Bayesian Additive Regression Trees: The BART R Package. J. Stat. Softw. 97, 1–66 (2021).
Davison, A. C. & Hinkley, D. V. Bootstrap methods and their application. (Cambridge University Press, 1997).
Efron, B. & Hastie, T. Computer Age Statistical Inference. (Cambridge University Press, 2016).
Lausen, B. & Schumacher, M. Maximally Selected Rank Statistics. Biometrics 48, 73 (1992).
Hothorn, T. & Lausen, B. On the exact distribution of maximally selected rank statistics. Comput. Stat. Data Anal. 43, 121–137 (2003).