Application example"

library(knitr)
library(vtype)

Introduction

The purpose of the package is automatically detecting type of variables in not quality controlled data. The prediction is based on a pre-trained random forest model, trained on over 5000 medical variables with OOB accuracy of 99%. The accuracy depends heavily on the type and coding style of data. For example, often categorical variables are coded as integers 1 to x, if the number of categories is very large, there is no way to distinguish it from a continuous integer variable. Some types are per definition very sensitive to errors in data, like ID, missing or constant, where a single alternative non-missing value makes it not constant or not missing anymore. The data is assumed to be cross sectional, where ID is unique (no multiple entries per ID).

It can be used as a first step by data quality control to help sort the variables in advance and get some information about the possible formats.

Example data set

The data set 'sim_nqc_data' contains 100 observations and 14 artificial variables with some not well formatted or missing values. The data is complete artificial and was not used for training or validation of the random forest model.

knitr::kable(head(sim_nqc_data, 20), caption='Artificial not quality controlled data')

Application

An example on error afflicted data

The application is straightforward, it requires data in data.frame format. It is important that all unusual missing values in the data, e.g. the code 9999 for missing values are covered. Values as NA, NaN, Inf, NULL and spaces are automatic considered as invalid (missing) values. The second column type is the estimated type of the variable, and the column probability indicates how certain the type is. The format gives additional information about the possible format of the variable, especially useful for date variables. Class is just a translation of type into broader categories.

tab <- vtype(sim_nqc_data, miss_values='9999')
knitr::kable(tab, caption='Application example of vtype')

An example with very small sample size

Very small sample size can reduce the prediction performance significantly. The id variable is now detected as integer, age as categorical and decades as a date variable.

knitr::kable(vtype(sim_nqc_data[1:10,]), caption='Application example with small sample size')

An example on data without errors

knitr::kable(vtype(mtcars),  caption='Application example on data without errors')


Try the vtype package in your browser

Any scripts or data that you put into this service are public.

vtype documentation built on May 14, 2021, 5:07 p.m.