clean.dataset: Data Cleaning
In neurodata/slbR: Statistical Learning Benchmarks

A function for scrubbing a dataset for use with most standard algorithms. This can involve removing samples with invalid entries and one-hot-encoding columns that are probably categorical.

clean.dataset(dataset, clean.invalid = TRUE, clean.ohe = FALSE)

dataset

a list with at least the following key-worded elements:

X: [n, d] matrix containing n samples in d dimensions.
Y: [n, r] matrix or [n] vector containing regressors or class labels for the samples in X.

clean.invalid

whether to remove samples with invalid entries. Defaults to TRUE.

TRUE: Remove samples that have features with NaN or non-finite entries.
FALSE: Do not remove samples that have features with NaN or non-finite entries.
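The row filtering described for clean.invalid = TRUE can be sketched in base R as follows. This is only an illustration of the idea, not the slbR implementation: the helper name drop.invalid is hypothetical, and it assumes X is a numeric matrix and Y is either a vector or a matrix with one row per sample.

```r
# Hypothetical helper (not part of slbR): keep only the rows of X whose
# entries are all finite, and keep the matching entries of Y.
drop.invalid <- function(X, Y) {
  # is.finite() is FALSE for NaN, NA, Inf and -Inf
  keep <- which(rowSums(!is.finite(X)) == 0)
  Y.kept <- if (is.matrix(Y)) Y[keep, , drop = FALSE] else Y[keep]
  list(X = X[keep, , drop = FALSE], Y = Y.kept, samples = keep)
}

# Four samples in two dimensions; row 3 has a NaN and row 4 has an Inf.
X <- matrix(c(1, 2, NaN, 4,
              5, 6, 7, Inf), nrow = 4)
Y <- c(0, 1, 0, 1)

res <- drop.invalid(X, Y)
res$samples  # original row ids of the samples that were kept
```

Here only the first two rows survive, so m = 2 while n = 4.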

clean.ohe

options for whether to one-hot-encode columns. Defaults to FALSE.

clean.ohe < 1: Converts columns with fewer than thr*n unique values to a one-hot encoding, where thr = clean.ohe and n is the number of samples.
is.integer(clean.ohe): Converts columns with fewer than thr unique values to a one-hot encoding, where thr = clean.ohe.
FALSE: Do not one-hot-encode any columns.
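To make the three clean.ohe modes concrete, here is a minimal base R sketch of threshold-based one-hot encoding. It is not the slbR implementation. The helpers ohe.cutoff and encode.columns are hypothetical, the "V<j>.<value>" column names are invented for the example, and the sketch assumes an encoded column is replaced by its indicator columns. For simplicity it treats any clean.ohe >= 1 as an absolute count rather than checking is.integer().

```r
# Hypothetical helper: turn clean.ohe into a cutoff on the number of unique values.
ohe.cutoff <- function(clean.ohe, n) {
  if (identical(clean.ohe, FALSE)) return(NULL)  # no encoding requested
  if (clean.ohe < 1) clean.ohe * n else clean.ohe
}

# Hypothetical helper: one-hot-encode every column of X that has fewer
# unique values than the cutoff; leave the other columns unchanged.
encode.columns <- function(X, clean.ohe) {
  cutoff <- ohe.cutoff(clean.ohe, nrow(X))
  if (is.null(cutoff)) return(X)
  cols <- lapply(seq_len(ncol(X)), function(j) {
    lv <- sort(unique(X[, j]))
    if (length(lv) < cutoff) {
      # one 0/1 indicator column per unique value
      enc <- outer(X[, j], lv, "==") * 1
      colnames(enc) <- paste0("V", j, ".", lv)
      enc
    } else {
      col <- X[, j, drop = FALSE]
      colnames(col) <- paste0("V", j)
      col
    }
  })
  do.call(cbind, cols)
}

# Column 1 looks continuous (6 unique values); column 2 has 3 unique values.
X <- cbind(c(0.5, 1.2, 3.3, 4.1, 2.7, 0.9),
           c(1, 2, 1, 3, 2, 1))

enc.frac  <- encode.columns(X, 0.6)    # cutoff 0.6 * 6 = 3.6, so column 2 is encoded
enc.count <- encode.columns(X, 3)      # cutoff 3, so column 2 (3 unique values) is kept
enc.none  <- encode.columns(X, FALSE)  # X is returned unchanged
```

With the fractional threshold, column 2 expands into three indicator columns, so the result has four columns in total. With the absolute threshold of 3, nothing is encoded because the comparison is strict.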

A list containing at least the following key-worded elements:

X: [m, d+r] array with m samples in d+r dimensions, where r is the number of additional columns appended for encodings. m < n when there are non-finite or NaN entries. colnames(X) returns the column names of the cleaned columns.
Y: [m, r] matrix or [m] vector containing regressors or class labels for the samples in X. m < n when there are non-finite or NaN entries.
samples: the m sample ids that are included in the final array, where samples[i] is the original row id corresponding to X[i,]. If m < n, there were non-finite or NaN entries that were purged.
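The returned sample ids are useful for aligning other per-sample data with the cleaned array. The sketch below builds a stand-in result list by hand rather than calling clean.dataset, so its values and the original.ids vector are purely illustrative.

```r
# Stand-in for a cleaned result: rows 1 and 2 of a 4-sample dataset were kept.
cleaned <- list(
  X = matrix(c(1, 2, 5, 6), nrow = 2),
  Y = c(0, 1),
  samples = c(1L, 2L)
)

# Hypothetical per-sample labels from the original, uncleaned dataset.
original.ids <- c("s01", "s02", "s03", "s04")

# Labels of the samples that survived cleaning, in the row order of cleaned$X.
kept.ids <- original.ids[cleaned$samples]

# Labels of the purged samples. setdiff() is used instead of negative
# indexing because x[-integer(0)] returns an empty vector when nothing
# was purged.
dropped.ids <- original.ids[setdiff(seq_along(original.ids), cleaned$samples)]
```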