getDataCensus: Get Census KDD data set (+variation)


getDataCensus    R Documentation

Get Census KDD data set (+variation)

Description

This function downloads the Census KDD data set (OpenML ID: 4535), or loads it from the cache folder if available. If requested, the data set is modified with respect to the number of observations, the number of numerical and categorical features, the cardinality of the categorical features, and the task type (regression or classification).

Usage

getDataCensus(
  task.type = "classif",
  nobs = 50000,
  nfactors = "high",
  nnumericals = "high",
  cardinality = "high",
  data.seed = 1,
  cachedir = "oml.cache",
  target = NULL,
  cache.only = FALSE
)

Arguments

task.type

character, either "classif" or "regr".

nobs

integer, number of observations uniformly sampled from the full data set.

nfactors

character, controls the number of factors (categorical features) to use. Can be "low", "med", "high", or "full" (full corresponds to original data set).

nnumericals

character, controls the number of numerical features to use. Can be "low", "med", "high", or "full" (full corresponds to original data set).

cardinality

character, controls the number of factor levels (categories) for the categorical features. Can be "low", "med", "high" (high corresponds to original data set).

data.seed

integer, this will be used via set.seed() to make the random subsampling reproducible. Will not have an effect if all observations are used.

cachedir

character. The cache directory, e.g., "oml.cache". Default: "oml.cache".

target

character, either "age" or "income_class". If target = "age", the numerical variable age is converted to a factor: age <- as.factor(age < 40).

cache.only

logical. Only try to retrieve the object from the cache. Results in an error if the object is not found. Default is FALSE.
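
The effect of nobs, data.seed, and target = "age" can be illustrated on a toy data frame. This is only a sketch: the data frame toy and its values are made up, and the use of sample() for the uniform subsampling is an assumption about how such a step might look, not a copy of the internals of getDataCensus. Only set.seed() and the conversion age <- as.factor(age < 40) are taken from the argument descriptions above.

```r
# Hypothetical stand-in for the census data (not the real data set)
toy <- data.frame(age = c(23, 35, 40, 51, 67, 18, 44, 39))

# Reproducible uniform subsampling, in the spirit of nobs and data.seed
nobs <- 5
data.seed <- 1
set.seed(data.seed)
idx <- sample(nrow(toy), nobs)
sub <- toy[idx, , drop = FALSE]

# Repeating the draw with the same seed selects the same rows
set.seed(data.seed)
idx2 <- sample(nrow(toy), nobs)
stopifnot(identical(idx, idx2))

# Conversion described for target = "age":
# the numerical age becomes a two-level factor (age < 40 or not)
sub$age <- as.factor(sub$age < 40)
str(sub)
```

With a different data.seed, other rows would typically be drawn. If nobs equals the number of available rows, the seed no longer affects which observations are included, only their order in this sketch.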

Value

census data set

Examples


## Example downloads OpenML data, might take some time:
task.type <- "classif"
nobs <- 1e4 # max: 229285
data.seed <- 1
nfactors <- "full"
nnumericals <- "low"
cardinality <- "med"
censusData <- getDataCensus(
  task.type = task.type,
  nobs = nobs,
  nfactors = nfactors,
  nnumericals = nnumericals,
  cardinality = cardinality,
  data.seed = data.seed,
  cachedir = "oml.cache",
  target = "age"
)


SPOTMisc documentation built on Sept. 5, 2022, 5:06 p.m.