.parse_preprocessing_settings | R Documentation |
Internal function for parsing settings related to preprocessing
.parse_preprocessing_settings(
config = NULL,
data,
parallel,
outcome_type,
feature_max_fraction_missing = waiver(),
sample_max_fraction_missing = waiver(),
filter_method = waiver(),
univariate_test_threshold = waiver(),
univariate_test_threshold_metric = waiver(),
univariate_test_max_feature_set_size = waiver(),
low_var_minimum_variance_threshold = waiver(),
low_var_max_feature_set_size = waiver(),
robustness_icc_type = waiver(),
robustness_threshold_metric = waiver(),
robustness_threshold_value = waiver(),
transformation_method = waiver(),
transformation_optimisation_criterion = waiver(),
transformation_gof_test_p_value = waiver(),
normalisation_method = waiver(),
batch_normalisation_method = waiver(),
imputation_method = waiver(),
cluster_method = waiver(),
cluster_linkage_method = waiver(),
cluster_cut_method = waiver(),
cluster_similarity_metric = waiver(),
cluster_similarity_threshold = waiver(),
cluster_representation_method = waiver(),
parallel_preprocessing = waiver(),
...
)
config |
A list of settings, e.g. from an xml file. |
data |
Data set as loaded using the |
parallel |
Logical value that indicates whether familiar uses parallelisation. If
|
outcome_type |
Type of outcome found in the data set. |
feature_max_fraction_missing |
(optional) Numeric value between |
sample_max_fraction_missing |
(optional) Numeric value between |
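As a rough illustration of the two missingness thresholds above, the fraction of missing values per feature and per sample can be computed in base R as sketched below. The data frame `toy_data` and the threshold values are hypothetical, and both the filtering order (features first, then samples) and the use of `<=` are assumptions rather than a description of what familiar does internally.

```r
# Hypothetical toy data; NA marks a missing value.
toy_data <- data.frame(
  feature_a = c(1.0, NA, 3.0, 4.0),
  feature_b = c(NA, NA, NA, 2.0),
  feature_c = c(0.5, 0.7, 0.9, 1.1)
)

# Illustrative thresholds (assumed values, not familiar's defaults).
feature_max_fraction_missing <- 0.30
sample_max_fraction_missing <- 0.30

# Fraction of missing values per feature (column).
feature_missing <- colMeans(is.na(toy_data))
kept_features <- names(feature_missing)[
  feature_missing <= feature_max_fraction_missing
]
filtered <- toy_data[, kept_features, drop = FALSE]

# Fraction of missing values per sample (row), using the remaining features.
sample_missing <- rowMeans(is.na(filtered))
filtered <- filtered[
  sample_missing <= sample_max_fraction_missing, , drop = FALSE
]
```

In this toy example `feature_b` is dropped because three of its four values are missing, after which the second sample is dropped because half of its remaining feature values are missing.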
filter_method |
(optional) One or more methods used to reduce dimensionality of the data set by removing irrelevant or poorly reproducible features. Several methods are available:
More than one method can be used simultaneously. Features with singular values are always filtered, as these do not contain information. |
univariate_test_threshold |
(optional) Numeric value between |
univariate_test_threshold_metric |
(optional) Metric used with the to
compare the
|
univariate_test_max_feature_set_size |
(optional) Maximum size of the feature set after the univariate test. P or q values of features are compared against the threshold, but if the resulting data set would be larger than this setting, only the most relevant features up to the desired feature set size are selected. The default value is |
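The interaction between the threshold and the maximum feature set size can be sketched as follows. The p-values, the threshold, and the size limit are hypothetical, and comparing with `<=` is an assumption; this is not familiar's internal code.

```r
# Hypothetical p-values from a univariate test, one per feature.
p_values <- c(f1 = 0.001, f2 = 0.20, f3 = 0.03,
              f4 = 0.04, f5 = 0.0005, f6 = 0.60)

# Illustrative settings (assumed values, not familiar's defaults).
threshold <- 0.05
max_feature_set_size <- 3L

# Step 1: keep features whose p-value does not exceed the threshold.
passing <- p_values[p_values <= threshold]

# Step 2: if too many features pass, keep only the most relevant ones,
# i.e. those with the smallest p-values.
n_selected <- min(length(passing), max_feature_set_size)
selected <- names(sort(passing))[seq_len(n_selected)]
```

Four features pass the threshold here, but only the three with the smallest p-values (`f5`, `f1`, `f3`) are retained.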
low_var_minimum_variance_threshold |
(required, if used) Numeric value
that determines which features will be filtered by the This parameter has no default value and should be set if |
low_var_max_feature_set_size |
(optional) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against The default value is |
robustness_icc_type |
(optional) String indicating the type of
intraclass correlation coefficient ( |
robustness_threshold_metric |
(optional) String indicating which specific intraclass correlation coefficient (ICC) metric should be used to filter features. This should be one of:
|
robustness_threshold_value |
(optional) The intraclass correlation
coefficient value that is used as threshold. The default value is |
transformation_method |
(optional) The transformation method used to change the distribution of the data to be more normal-like. The following methods are available:
Transformation requires the |
transformation_optimisation_criterion |
(optional) Transformation
parameters are optimised using a criterion, conventionally
maximum-likelihood-estimation.
|
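To make maximum-likelihood optimisation of a transformation parameter concrete, the sketch below fits the Yeo-Johnson parameter (Yeo and Johnson, 2000) by maximising the profile log-likelihood under a normal model. It assumes that Yeo-Johnson is among the available transformations, as the reference list suggests; the helper names `yeo_johnson` and `profile_loglik`, the search interval, and the simulated data are all hypothetical and are not part of familiar.

```r
# Yeo-Johnson power transformation for a single parameter lambda.
yeo_johnson <- function(x, lambda) {
  y <- numeric(length(x))
  pos <- x >= 0
  y[pos] <- if (abs(lambda) > 1e-8) {
    ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    log1p(x[pos])
  }
  y[!pos] <- if (abs(lambda - 2) > 1e-8) {
    -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    -log1p(-x[!pos])
  }
  y
}

# Profile log-likelihood (up to a constant) of lambda, assuming the
# transformed values are normally distributed.
profile_loglik <- function(lambda, x) {
  y <- yeo_johnson(x, lambda)
  n <- length(x)
  sigma2 <- sum((y - mean(y))^2) / n
  -n / 2 * log(sigma2) + (lambda - 1) * sum(sign(x) * log1p(abs(x)))
}

# Hypothetical right-skewed feature.
set.seed(7)
skewed <- rlnorm(200)

# Maximise the profile log-likelihood over an assumed search interval.
fit <- stats::optimise(profile_loglik, interval = c(-2, 4),
                       x = skewed, maximum = TRUE)
lambda_hat <- fit$maximum
transformed <- yeo_johnson(skewed, lambda_hat)
```

A value of `lambda = 1` leaves the data unchanged, so an estimate below 1 indicates that the right-skewed feature is being compressed towards a more symmetric shape.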
transformation_gof_test_p_value |
(optional) Not all transformations
will lead to features that are roughly normally distributed. Zwanenburg and
Löck (2023) established an empirical goodness-of-fit test for central
normality. This parameter sets the significance for rejecting the
null-hypothesis that a feature distribution is centrally normal. When the
null-hypothesis is rejected, no transformation is performed. The default
value is |
normalisation_method |
(optional) The normalisation method used to improve the comparability between numerical features that may have very different scales. The following normalisation methods can be chosen:
Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within |
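The idea of deriving normalisation parameters from development data and reusing them elsewhere can be sketched with plain standardisation. This is a minimal illustration under assumptions: the vectors, the `standardise` helper, and the choice of mean and standard deviation as parameters are hypothetical, and the sketch does not show how familiar stores these parameters.

```r
# Hypothetical values of one numerical feature.
development <- c(10, 12, 14, 16, 18)
validation <- c(11, 20)

# Parameters are determined on development data only.
norm_parameters <- list(
  shift = mean(development),
  scale = stats::sd(development)
)

# Apply stored parameters to any data set.
standardise <- function(x, parameters) {
  (x - parameters$shift) / parameters$scale
}

development_norm <- standardise(development, norm_parameters)

# Validation data reuse the development parameters, so no information
# from the validation set leaks into the normalisation.
validation_norm <- standardise(validation, norm_parameters)
```

Because the validation values are scaled with the development parameters, they generally do not have zero mean and unit variance themselves.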
batch_normalisation_method |
(optional) The method used for batch normalisation. Available methods are:
Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within If validation data contains data from unknown batches, normalisation parameters are separately determined for these batches. Note that for both empirical Bayes methods, the batch effect is assumed to produce results across the features. This is often true for things such as gene expressions, but the assumption may not hold generally. When performing batch normalisation, it is moreover important to check that differences between batches or cohorts are not related to the studied endpoint. |
imputation_method |
(optional) Method used for imputing missing feature values. Two methods are implemented:
The default value depends on the number of features in the dataset. If the
number is lower than 100, Only single imputation is performed. Imputation models and parameters are
stored within |
cluster_method |
(optional) Clustering is performed to identify and replace redundant features, for example those that are highly correlated. Such features do not carry much additional information and may be removed or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011). The cluster method determines the algorithm used to form the clusters. The following cluster methods are implemented:
Clusters and cluster information are stored within |
cluster_linkage_method |
(optional) Linkage method used for
agglomerative clustering in
|
cluster_cut_method |
(optional) The method used to define the actual clusters. The following methods can be used:
The default options are |
cluster_similarity_metric |
(optional) Clusters are formed based on feature similarity. All features are compared in a pair-wise fashion to compute similarity, for example correlation. The resulting similarity grid is converted into a distance matrix that is subsequently used for clustering. The following metrics are supported to compute pairwise similarities:
The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In In case any of the classical correlation coefficients ( |
cluster_similarity_threshold |
(optional) The threshold level for
pair-wise similarity that is required to form clusters using
Alternatively, if the
The threshold value is converted to a distance (1-similarity) prior to cutting hierarchical trees. |
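The conversion from a similarity threshold to a distance used for cutting a hierarchical tree can be sketched in base R. The simulated features, the Spearman metric, the use of absolute correlation as similarity, average linkage, and the fixed-height cut are all assumptions chosen for illustration; they are not necessarily familiar's defaults.

```r
# Hypothetical data: f2 is a noisy copy of f1, f3 is unrelated.
set.seed(42)
n <- 50
base_signal <- rnorm(n)
toy_data <- data.frame(
  f1 = base_signal,
  f2 = base_signal + rnorm(n, sd = 0.05),
  f3 = rnorm(n)
)

# Pairwise similarity (assumed here: absolute Spearman correlation).
similarity <- abs(stats::cor(toy_data, method = "spearman"))

# Convert similarity to distance (1 - similarity) and build the tree.
distance <- stats::as.dist(1 - similarity)
tree <- stats::hclust(distance, method = "average")

# Cut the tree at the height that corresponds to the similarity threshold.
similarity_threshold <- 0.90
clusters <- stats::cutree(tree, h = 1 - similarity_threshold)
```

Features that share a cluster label, `f1` and `f2` in this example, are at least as similar as the threshold and would be candidates for being represented by a single feature.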
cluster_representation_method |
(optional) Method used to determine how the information of co-clustered features is summarised and used to represent the cluster. The following methods can be selected:
If the |
parallel_preprocessing |
(optional) Enable parallel processing for the
preprocessing workflow. Defaults to |
... |
Unused arguments. |
List of parameters related to preprocessing.
Storey, J. D. A direct approach to false discovery rates. J. R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).
Shrout, P. E. & Fleiss, J. L. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).
Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
Yeo, I. & Johnson, R. A. A new family of power transformations to improve normality or symmetry. Biometrika 87, 954–959 (2000).
Box, G. E. P. & Cox, D. R. An analysis of transformations. J. R. Stat. Soc. Series B Stat. Methodol. 26, 211–252 (1964).
Raymaekers, J. & Rousseeuw, P. J. Transforming variables to central normality. Mach. Learn. (2021).
Park, M. Y., Hastie, T. & Tibshirani, R. Averaged gene expressions for regression. Biostatistics 8, 212–227 (2007).
Tolosi, L. & Lengauer, T. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27, 1986–1994 (2011).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Kaufman, L. & Rousseeuw, P. J. Finding groups in data: an introduction to cluster analysis. (John Wiley & Sons, 2009).
Muellner, D. fastcluster: fast hierarchical, agglomerative clustering routines for R and Python. J. Stat. Softw. 53, 1–18 (2013).
Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719–720 (2008).
McFadden, D. Conditional logit analysis of qualitative choice behavior. in Frontiers in Econometrics (ed. Zarembka, P.) 105–142 (Academic Press, 1974).
Cox, D. R. & Snell, E. J. Analysis of binary data. (Chapman and Hall, 1989).
Nagelkerke, N. J. D. A note on a general definition of the coefficient of determination. Biometrika 78, 691–692 (1991).