extract_data: Internal function to create a familiarData object.
In familiar: End-to-End Automated Machine Learning and Model Evaluation

extract_data

R Documentation

Description

Compute various data related to model performance and calibration from the provided dataset and familiarEnsemble object and store it as a familiarData object.

Usage

extract_data(
  object,
  data,
  data_element = waiver(),
  is_pre_processed = FALSE,
  cl = NULL,
  time_max = waiver(),
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  ensemble_method = waiver(),
  stratification_method = waiver(),
  evaluation_times = waiver(),
  metric = waiver(),
  feature_cluster_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_linkage_method = waiver(),
  feature_similarity_metric = waiver(),
  feature_similarity_threshold = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_similarity_metric = waiver(),
  sample_limit = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  icc_type = waiver(),
  dynamic_model_loading = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)

Arguments

`object`	A `familiarEnsemble` object, which is an ensemble of one or more `familiarModel` objects.
`data`	A `dataObject` object, `data.table` or `data.frame` that constitutes the data that are assessed.
`data_element`	String indicating which data elements are to be extracted. Default is `all`, but specific elements can be specified to speed up computations if not all elements are to be computed. This is an internal parameter that is set by, e.g. the `export_model_vimp` method.
`is_pre_processed`	Flag that indicates whether the data was already pre-processed externally, e.g. normalised and clustered. Only used if the `data` argument is a `data.table` or `data.frame`.
`cl`	Cluster created using the `parallel` package. This cluster is then used to speed up computation through parallelisation.
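As a minimal sketch of what a cluster created with the `parallel` package looks like, the following base R snippet starts two local workers, runs a toy computation on them and shuts them down again. The number of workers and the toy function are arbitrary choices for the example; in practice the `cl` object would be passed on instead of being used directly.

```r
library(parallel)

# Start a small local cluster with two worker processes.
cl <- makeCluster(2L)

# Toy computation distributed over the workers, only to show the cluster works.
result <- parLapply(cl, 1:4, function(i) i^2)

# Always release the workers when the cluster is no longer needed.
stopCluster(cl)
```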
`time_max`	Time point which is used as the benchmark for e.g. cumulative risks generated by random forest, or the cut-off value for Uno's concordance index. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects. Only used for `survival` outcomes.
`aggregation_method`	Method for aggregating variable importances for the purpose of evaluation. Variable importances are determined during feature selection steps and after training the model. Both types are evaluated, but feature selection variable importance is only evaluated at run-time. See the documentation for the `vimp_aggregation_method` argument in `summon_familiar` for information concerning the different available methods. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`rank_threshold`	The threshold used to define the subset of highly important features during evaluation. See the documentation for the `vimp_aggregation_rank_threshold` argument in `summon_familiar` for more information. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`ensemble_method`	Method for ensembling predictions from models for the same sample. Available methods are: `median` (default): Use the median of the predicted values as the ensemble value for a sample. `mean`: Use the mean of the predicted values as the ensemble value for a sample.
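The two ensembling methods can be illustrated with base R. The matrix `predictions` and its values are made up for this example, and the snippet is a sketch of the idea rather than the code path `familiar` uses internally.

```r
# Hypothetical predicted values: rows are samples, columns are models.
predictions <- matrix(
  c(0.10, 0.20, 0.90,
    0.40, 0.50, 0.60),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("sample_1", "sample_2"),
    c("model_1", "model_2", "model_3")
  )
)

# ensemble_method = "median": median over the models, per sample.
ensemble_median <- apply(predictions, 1, median)

# ensemble_method = "mean": mean over the models, per sample.
ensemble_mean <- rowMeans(predictions)
```

For `sample_1`, the single high prediction pulls the mean upwards, whereas the median is unaffected by it.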
`stratification_method`	(optional) Method for determining the stratification threshold for creating survival groups. The actual, model-dependent, threshold value is obtained from the development data, and can afterwards be used to perform stratification on validation data. The following stratification methods are available: `median` (default): The median predicted value in the development cohort is used to stratify the samples into two risk groups. For predicted outcome values that build a continuous spectrum, the two risk groups in the development cohort will be roughly equal in size. `mean`: The mean predicted value in the development cohort is used to stratify the samples into two risk groups. `mean_trim`: As `mean`, but based on the set of predicted values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers. `mean_winsor`: As `mean`, but based on the set of predicted values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers. `fixed`: Samples are stratified based on the sample quantiles of the predicted values. These quantiles are defined using the `stratification_threshold` parameter. `optimised`: Use maximally selected rank statistics to determine the optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to stratify samples into two optimally separated risk groups. One or more stratification methods can be selected simultaneously. This parameter is only relevant for `survival` outcomes.
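To make the difference between the `median`, `mean`, `mean_trim` and `mean_winsor` thresholds concrete, the sketch below computes each of them in base R for a made-up vector of predicted values that contains one outlier. The data, the variable names and the use of `quantile()` with its default settings are assumptions for this example; the exact computation in `familiar` may differ.

```r
# Hypothetical predicted values in the development cohort, with one outlier.
predicted <- c(seq(0.05, 0.95, by = 0.05), 5.0)

# Sample quantiles that bound the outer 5% on each side.
bounds <- quantile(predicted, probs = c(0.05, 0.95), names = FALSE)

thresholds <- c(
  median = median(predicted),
  mean = mean(predicted),
  # mean_trim: discard values outside the 5%-95% range, then average.
  mean_trim = mean(predicted[predicted >= bounds[1] & predicted <= bounds[2]]),
  # mean_winsor: clamp values to the 5%-95% range, then average.
  mean_winsor = mean(pmin(pmax(predicted, bounds[1]), bounds[2]))
)

# Stratify into two risk groups, here using the median threshold.
risk_group <- ifelse(predicted > thresholds[["median"]], "high", "low")
```

In this example the outlier raises the plain mean well above the trimmed and winsorised means, while the median threshold splits the cohort into two groups of equal size.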
`evaluation_times`	One or more time points that are used in the analysis of survival problems when data have to be assessed at a set time, e.g. for calibration. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects. Only used for `survival` outcomes.
`metric`	One or more metrics for assessing model performance. See the vignette on performance metrics for the available metrics. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`feature_cluster_method`	The method used to perform clustering. These are the same methods as for the `cluster_method` configuration parameter: `none`, `hclust`, `agnes`, `diana` and `pam`. `none` cannot be used when extracting data regarding mutual correlation or feature expressions. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`feature_cluster_cut_method`	The method used to divide features into separate clusters. The available methods are the same as for the `cluster_cut_method` configuration parameter: `silhouette`, `fixed_cut` and `dynamic_cut`. `silhouette` is available for all cluster methods, but `fixed_cut` only applies to methods that create hierarchical trees (`hclust`, `agnes` and `diana`). `dynamic_cut` requires the `dynamicTreeCut` package and can only be used with `agnes` and `hclust`. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`feature_linkage_method`	The method used for agglomerative clustering in `hclust` and `agnes`. These are the same methods as for the `cluster_linkage_method` configuration parameter: `average`, `single`, `complete`, `weighted`, and `ward`. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`feature_similarity_metric`	Metric to determine pairwise similarity between features. Similarity is computed in the same manner as for clustering, and `feature_similarity_metric` therefore has the same options as `cluster_similarity_metric`: `mcfadden_r2`, `cox_snell_r2`, `nagelkerke_r2`, `spearman`, `kendall` and `pearson`. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`feature_similarity_threshold`	The threshold level for pair-wise similarity that is required to form feature clusters with the `fixed_cut` method. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
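The feature clustering arguments above interact, which a small base R sketch may help to show: `spearman` similarity, `hclust` with `average` linkage, and a `fixed_cut` at a similarity threshold. The simulated data, the conversion from similarity to distance as `1 - |r|` and the threshold of 0.90 are assumptions for this illustration and are not taken from the `familiar` implementation.

```r
set.seed(42)
n <- 100
base <- rnorm(n)

# Hypothetical feature data: feature_b is nearly a copy of feature_a.
features <- data.frame(
  feature_a = base,
  feature_b = base + rnorm(n, sd = 0.05),
  feature_c = rnorm(n)
)

# Pairwise similarity between features (Spearman correlation).
similarity <- abs(cor(features, method = "spearman"))

# Agglomerative clustering on the corresponding distance.
tree <- hclust(as.dist(1 - similarity), method = "average")

# Fixed cut: features join a cluster if their similarity exceeds the threshold.
similarity_threshold <- 0.90
clusters <- cutree(tree, h = 1 - similarity_threshold)
```

Here `feature_a` and `feature_b` end up in the same cluster, and `feature_c` forms a cluster of its own.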
`sample_cluster_method`	The method used to perform clustering based on distance between samples. These are the same methods as for the `cluster_method` configuration parameter: `hclust`, `agnes`, `diana` and `pam`. `none` cannot be used when extracting data for feature expressions. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`sample_linkage_method`	The method used for agglomerative clustering in `hclust` and `agnes`. These are the same methods as for the `cluster_linkage_method` configuration parameter: `average`, `single`, `complete`, `weighted`, and `ward`. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`sample_similarity_metric`	Metric to determine pairwise similarity between samples. Similarity is computed in the same manner as for clustering, but `sample_similarity_metric` has different options that are better suited to computing distance between samples instead of between features: `gower`, `euclidean`. The underlying feature data is scaled to the `[0, 1]` range (for numerical features) using the feature values across the samples. The normalisation parameters required can optionally be computed from feature data with the outer 5% (on both sides) of feature values trimmed or winsorised. To do so, append `_trim` (trimming) or `_winsor` (winsorising) to the metric name. This reduces the effect of outliers somewhat. If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
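The effect of the `_trim` and `_winsor` suffixes on the scaling of a numerical feature can be sketched as follows. The helper `scale_feature()` is hypothetical and written for this example only. Clamping values that fall outside the reference range to `[0, 1]` is also an assumption, and the sketch does not handle constant features.

```r
# Hypothetical helper: scale a numeric feature to the [0, 1] range.
# `method` mimics the optional "_trim" / "_winsor" suffix of the metric name.
scale_feature <- function(x, method = c("none", "trim", "winsor")) {
  method <- match.arg(method)
  reference <- x
  if (method != "none") {
    bounds <- quantile(x, probs = c(0.05, 0.95), names = FALSE)
    reference <- if (method == "trim") {
      x[x >= bounds[1] & x <= bounds[2]]      # drop the outer 5% on each side
    } else {
      pmin(pmax(x, bounds[1]), bounds[2])     # clamp the outer 5% on each side
    }
  }
  lower <- min(reference)
  upper <- max(reference)
  # Assumption: values outside the reference range are clamped to [0, 1].
  pmin(pmax((x - lower) / (upper - lower), 0), 1)
}

# Hypothetical feature with one extreme value.
feature <- c(1:19, 1000)

scaled_plain <- scale_feature(feature)
scaled_winsor <- scale_feature(feature, "winsor")
scaled_trim <- scale_feature(feature, "trim")
```

Without trimming or winsorising, the extreme value compresses the other 19 values into the bottom 2% of the range. With either option they are spread over a much larger part of it, so that distances between ordinary samples remain informative.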
`sample_limit`	(optional) Set the upper limit of the number of samples that are used during evaluation steps. Cannot be less than 20. This setting can be specified per data element by providing a parameter value in a named list with data elements, e.g. `list("sample_similarity"=100, "permutation_vimp"=1000)`. This parameter can be set for the following data elements: `sample_similarity` and `ice_data`.
`detail_level`	(optional) Sets the level at which results are computed and aggregated. `ensemble`: Results are computed at the ensemble level, i.e. over all models in the ensemble. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the model performance of the ensemble model for each bootstrap. `hybrid` (default): Results are computed at the level of models in an ensemble. This means that, for example, bias-corrected estimates of model performance are directly computed using the models in the ensemble. If there are at least 20 trained models in the ensemble, performance is computed for each model, in contrast to `ensemble` where performance is computed for the ensemble of models. If there are fewer than 20 trained models in the ensemble, bootstraps are created so that at least 20 point estimates can be made. `model`: Results are computed at the model level. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the performance of the model for each bootstrap. Note that each level of detail has a different interpretation for bootstrap confidence intervals. For `ensemble` and `model` these are the confidence intervals for the ensemble and an individual model, respectively. That is, the confidence interval describes the range where an estimate produced by a respective ensemble or model trained on a repeat of the experiment may be found with the probability of the confidence level. For `hybrid`, it represents the range where any single model trained on a repeat of the experiment may be found with the probability of the confidence level. By definition, confidence intervals obtained using `hybrid` are at least as wide as those for `ensemble`. `hybrid` offers the correct interpretation if the goal of the analysis is to assess the result of a single, unspecified, model.
`hybrid` is generally computationally less expensive than `ensemble`, which in turn is somewhat less expensive than `model`. A non-default `detail_level` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"="ensemble", "model_performance"="hybrid")`. This parameter can be set for the following data elements: `auc_data`, `decision_curve_analyis`, `model_performance`, `permutation_vimp`, `ice_data`, `prediction_data` and `confusion_matrix`.
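The named-list convention for `detail_level` can be pictured with a small lookup helper. The function `resolve_setting()` is hypothetical and only illustrates the idea that elements named in the list receive their own value while all others fall back to a default; it is not part of `familiar`. The same convention is described for `estimation_type` and `aggregate_results`.

```r
# Hypothetical helper: return the value that applies to one data element.
resolve_setting <- function(setting, element, default) {
  if (is.list(setting)) {
    # Named list: use the element-specific value if present, else the default.
    if (element %in% names(setting)) {
      return(setting[[element]])
    }
    return(default)
  }
  # A single value applies to every data element.
  setting
}

detail_level <- list("auc_data" = "ensemble", "model_performance" = "hybrid")

level_auc <- resolve_setting(detail_level, "auc_data", default = "hybrid")
level_ice <- resolve_setting(detail_level, "ice_data", default = "hybrid")
level_all <- resolve_setting("model", "ice_data", default = "hybrid")
```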
`estimation_type`	(optional) Sets the type of estimation that should be possible. This has the following options: `point`: Point estimates. `bias_correction` or `bc`: Bias-corrected estimates. A bias-corrected estimate is computed from (at least) 20 point estimates, and `familiar` may bootstrap the data to create them. `bootstrap_confidence_interval` or `bci` (default): Bias-corrected estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The number of point estimates required depends on the `confidence_level` parameter, and `familiar` may bootstrap the data to create them. As with `detail_level`, a non-default `estimation_type` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"="bci", "model_performance"="point")`. This parameter can be set for the following data elements: `auc_data`, `decision_curve_analyis`, `model_performance`, `permutation_vimp`, `ice_data`, and `prediction_data`.
`aggregate_results`	(optional) Flag that signifies whether results should be aggregated during evaluation. If `estimation_type` is `bias_correction` or `bc`, aggregation leads to a single bias-corrected estimate. If `estimation_type` is `bootstrap_confidence_interval` or `bci`, aggregation leads to a single bias-corrected estimate with lower and upper boundaries of the confidence interval. This has no effect if `estimation_type` is `point`. The default value is `TRUE`, except when assessing model performance metrics, as the default violin plot requires the underlying data. As with `detail_level` and `estimation_type`, a non-default `aggregate_results` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"=TRUE, "model_performance"=FALSE)`. This parameter exists for the same elements as `estimation_type`.
`confidence_level`	(optional) Numeric value for the level at which confidence intervals are determined. If bootstraps are used to determine the confidence intervals, `familiar` uses the rule of thumb `n = 20 / ci.level` to determine the number of required bootstraps. The default value is `0.95`.
`bootstrap_ci_method`	(optional) Method used to determine bootstrap confidence intervals (Efron and Hastie, 2016). The following methods are implemented: `percentile` (default): Confidence intervals obtained using the percentile method. `bc`: Bias-corrected confidence intervals. Note that the standard method is not implemented because this method is often not suitable due to non-normal distributions. The bias-corrected and accelerated (BCa) method is not implemented yet.
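The difference between the `percentile` and `bc` methods can be sketched in base R using the textbook definitions. The simulated sample, the statistic (a mean) and the number of bootstraps (400) are arbitrary choices for this example, and the snippet is not taken from the `familiar` source code.

```r
set.seed(1)

# Hypothetical skewed sample and the statistic of interest.
x <- rexp(50)
point_estimate <- mean(x)

# Bootstrap replicates of the statistic.
boot <- replicate(400, mean(sample(x, replace = TRUE)))

confidence_level <- 0.95
alpha <- 1 - confidence_level

# percentile: quantiles of the bootstrap distribution.
ci_percentile <- quantile(
  boot, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE
)

# bc: shift the quantile levels by the bias-correction constant z0, the
# normal quantile of the fraction of replicates below the point estimate.
z0 <- qnorm(mean(boot < point_estimate))
probs_bc <- pnorm(2 * z0 + qnorm(c(alpha / 2, 1 - alpha / 2)))
ci_bc <- quantile(boot, probs = probs_bc, names = FALSE)
```

When the bootstrap distribution is centred on the point estimate, `z0` is close to zero and both intervals nearly coincide. The more the median of the bootstrap distribution deviates from the point estimate, the further the `bc` interval shifts.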
`icc_type`	String indicating the type of intraclass correlation coefficient (`1`, `2` or `3`) that should be used to compute robustness for features in repeated measurements during the evaluation of univariate importance. These types correspond to the types in Shrout and Fleiss (1979). If not provided explicitly, this parameter is read from settings used at creation of the underlying `familiarModel` objects.
`dynamic_model_loading`	(optional) Enables dynamic loading of models during the evaluation process, if `TRUE`. Defaults to `FALSE`. Dynamic loading of models may reduce the overall memory footprint, at the cost of increased disk or network IO. Models can only be dynamically loaded if they are found at an accessible disk or network location. Setting this parameter to `TRUE` may help if parallel processing causes out-of-memory issues during evaluation.
`message_indent`	Number of indentation steps for messages shown during computation and extraction of various data elements.
`verbose`	Flag to indicate whether feedback should be provided on the computation and extraction of various data elements.
`...`	Unused arguments.