as.incadata: Identify data formats used by INCA and Rockan


View source: R/as.incadata.R

Description

Coerce data of any form to its relevant type, as identified either by column/vector names or by variable content, and convert all variable names to lower case.

Usage

as.incadata(x, ...)

is.incadata(x)

## S3 method for class 'data.frame'
as.incadata(x, decode = TRUE, id = TRUE, ask = TRUE, ...)

## Default S3 method:
as.incadata(x, n_i = NULL, ...)

Arguments

x

data

...

arguments passed to exceed_threshold (the most useful are probably "threshold" and "force"; see the "interactive use" section below)

decode

Should decode be applied to variables with identified variable names? (TRUE by default).

id

Should an id-column be added (see id)?

ask

ask for input if unsure how to coerce variables (see the "interactive use" section below)

n_i

used internally between methods (should not be set by the user)

Details

Vectors are coerced to identified formats in the following order:

Value

as.incadata.data.frame

object of class incadata based on the "tibble"-class used within the "tidyverse" with all variables possibly coerced as described above.

as.incadata.default

input vector coerced to relevant class

is.incadata

TRUE for objects of class incadata, otherwise FALSE

factors

Note that the incadata format does not include factors. Factors can be really useful for some applications, but our philosophy is that they should be explicitly stated as such when needed. Otherwise, factor levels are commonly created just from the responses present in a certain data set. These might or might not contain a complete list of possible alternatives from an INCA variable with a fixed value set.
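The point about levels can be illustrated with a small base R sketch. The responses and the value set ("A", "B", "C") are made up for the example and are not taken from any INCA variable.

```r
# Hypothetical responses; assume the full value set is "A", "B" and "C"
responses <- c("A", "B", "A")

# Levels inferred from the data omit the unused alternative "C"
implicit <- factor(responses)
levels(implicit)

# Stating the factor explicitly keeps the complete value set
explicit <- factor(responses, levels = c("A", "B", "C"))
levels(explicit)

# The unused level now shows up with a count of zero
table(explicit)
```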

interactive use

Some vectors can be recognized unambiguously according to the specifications above. However, a vector of an intended format might have been "contaminated" with data of some other form. This can happen, for example, when a numeric variable is technically stored as a character in INCA: a hospital unit code such as c(111, 123, "?") might occur if someone uses a question mark as a placeholder for an unknown code. The ordinary coercion rules of R would treat this vector as a character (see c), although it might be more correct to treat it as numeric with "?" set to NA.
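The coercion behaviour described here can be checked in base R alone. This sketch only shows what c and as.numeric do with the example vector; it does not show what as.incadata does internally.

```r
# A single "?" turns the whole vector into character
x <- c(111, 123, "?")
class(x)

# Explicit numeric coercion maps the placeholder to NA
# (as.numeric() warns about this, so the warning is suppressed here)
num <- suppressWarnings(as.numeric(x))
num
```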

The as.incadata function relies on exceed_threshold to ignore such contaminated values if they represent only a (preferably small) proportion of the values.

By default, if contaminated values exist but make up less than 10 percent of the values, the function will stop and ask the user how to handle the variable. If the proportion exceeds 10 percent, the ordinary coercion principles apply.

The 10 percent limit can be modified with the argument threshold. It is also possible to force vectors with contaminated values to the otherwise potential format, without the need for individual confirmation, by setting the argument force = TRUE (passed to exceed_threshold).
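The decision rule described in this section can be sketched in base R as follows. The helpers contaminated_share and decide are hypothetical and written only for illustration; they are not the package's exceed_threshold, whose implementation is not shown here. Expressing the threshold as a proportion (0.1) is an assumption, as is the handling of a share of exactly 10 percent, which the text leaves open.

```r
# Hypothetical helper: share of non-missing values that fail numeric coercion
contaminated_share <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(0)
  num <- suppressWarnings(as.numeric(x))
  mean(is.na(num))
}

# Hypothetical decision rule mirroring the description above
# (threshold as a proportion is an assumption, not the documented interface)
decide <- function(x, threshold = 0.1, force = FALSE) {
  share <- contaminated_share(x)
  if (share == 0) {
    "coerce"           # nothing contaminated
  } else if (share < threshold) {
    if (force) "coerce" else "ask"
  } else {
    "keep as is"       # ordinary coercion rules apply
  }
}

mostly_numeric <- c(as.character(1:19), "?")  # 1 of 20 values contaminated
decide(mostly_numeric)                # below the limit: ask the user
decide(mostly_numeric, force = TRUE)  # below the limit, forced: coerce
decide(c(111, 123, "?"))              # 1 of 3 contaminated: keep as is
```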


incadata documentation built on April 14, 2020, 6:08 p.m.