trainXgboost: Train an extreme gradient boosted tree model

View source: R/trainXgboost.R

Train an extreme gradient boosted tree model

Description

This function performs some preprocessing and calls xgboost to train gradient boosted trees.

Usage

trainXgboost(
  data,
  allowed.codes,
  testCases = NULL,
  returnPredictions = FALSE,
  coding_index = NULL,
  preprocessing = list(stopwords = character(0), stemming = "de", countWords = TRUE),
  tuning = list(eta = 0.5, lambda = 1e-04, alpha = 0, max.depth = 20, gamma = 0.6,
    min_child_weight = 0, max_delta_step = 1, subsample = 0.75, colsample_bytree = 1,
    colsample_bylevel = 1, nrounds = 40, early_stopping_rounds = 1,
    early.stopping.max.diff = sum(testCases)/100, early.stopping.precision.digits = 3,
    nthread = 1, verbose = 1)
)

Arguments

data

a data.table created with removeFaultyAndUncodableAnswers_And_PrepareForAnalysis

allowed.codes

a character vector containing all allowed classification codes (including codes that do not appear in the data).

testCases

If NULL (default) the model is trained as usual. If testCases is a logical vector with one element per row of data, the function splits data into a separate evaluation set (the rows with testCases == TRUE), prints the misclassification error on the training and evaluation sets, and, depending on returnPredictions, returns the predictions for the evaluation set. A sketch of how to construct such a vector follows below.
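
For example, a random evaluation set of 50 cases can be encoded like this (a minimal sketch; proc.occupations is the prepared data.table from the Examples below):

n.test <- 50
testCases <- seq_len(nrow(proc.occupations)) %in% sample(nrow(proc.occupations), n.test)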

returnPredictions

(only used if testCases is given.) If TRUE, a data.table with predictions for all test cases is returned. Otherwise the xgboost model is returned; it can be used for diagnostics but not inside predictXgboost, because the term-document matrix would be calculated in a different way.

coding_index

a data.table with columns

title

a preprocessed character vector with job titles from the coding index.

Code

a character vector with associated classification codes.
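
A minimal sketch of such a table (the titles and code assignments here are hypothetical, chosen only for illustration):

coding_index <- data.table::data.table(
  title = c("buerokaufmann", "sozialarbeiter"),  # preprocessed job titles (hypothetical)
  Code = c("71402", "83124")                     # associated classification codes (hypothetical)
)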

preprocessing

a list with elements

stopwords

a character vector, use tm::stopwords("de") for German stopwords.

stemming

NULL for no stemming and "de" for stemming using the German porter stemmer.

countWords

Set to TRUE if the predictor matrix should contain a column for answer length.
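
Putting these elements together, a typical configuration (mirroring the Examples below) looks like this:

preprocessing <- list(
  stopwords = tm::stopwords("de"),  # remove German stopwords
  stemming = "de",                  # German Porter stemmer
  countWords = FALSE                # no extra column for answer length
)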

tuning

a list of elements that will be passed to xgb.train, except for the following two parameters, which control early stopping:

early.stopping.max.diff

If the sum of the predicted probabilities of the most probable category exceeds the observed number of correctly predicted cases by more than early.stopping.max.diff, the evaluation metric returns infinity, a value that triggers early stopping. The idea behind this: once the sum of predicted probabilities becomes too large, it is unusual (practically impossible) for it to decrease again, i.e. the model is overfitting.

early.stopping.precision.digits

Logloss is rounded to early.stopping.precision.digits digits. This triggers early stopping when improvements in logloss are actually too small to matter. A conceptual sketch of both heuristics follows below.
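
To illustrate the two heuristics, here is a conceptual sketch of such an evaluation function (this is not the package's internal code; all names are made up):

evalSketch <- function(pred.prob.max, n.correct, logloss, max.diff, precision.digits) {
  # Heuristic 1 (early.stopping.max.diff): if the summed probability of the
  # most probable category overshoots the number of correctly predicted
  # cases by more than max.diff, return Inf so that training stops.
  if (sum(pred.prob.max) - n.correct > max.diff) return(Inf)
  # Heuristic 2 (early.stopping.precision.digits): round logloss so that
  # negligible improvements no longer count as progress and
  # early_stopping_rounds can take effect.
  round(logloss, digits = precision.digits)
}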

Details

See run_algorithms.R for some comments about tuning.

Value

If testCases = NULL (default), an xgboost model to be used with predictXgboost. Otherwise, see testCases and returnPredictions above.

See Also

predictXgboost, xgb.train

Examples

# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c(
  "Office clerks and secretaries (without specialisation)-skilled tasks",
  "Office clerks and secretaries (without specialisation)-complex tasks",
  "Gastronomy occupations (without specialisation)-skilled tasks",
  "Occupations in child care and child-rearing-skilled tasks",
  "Occupations in social work and social pedagogics-highly complex tasks",
  "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks",
  "Pedagogic specialists in social care work and special needs education-skilled tasks",
  "Supervisors in education and social work, and of pedagogic specialists in social care work",
  "Managers in education and social work, and of pedagogic specialists in social care work",
  "Not precise enough for coding",
  "Student assistants"
)
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)
##### Tune parameters with verbose=1 output. We split the data into a training and an evaluation set of size n.test = 50.
n.test <- 50
group <- sample(c(rep("test", n.test), rep("train", nrow(proc.occupations) - n.test)))

# output test dataset with returnPredictions = TRUE
eval.dataset <- trainXgboost(proc.occupations, allowed.codes = allowed.codes, testCases = group == "test", returnPredictions = TRUE,
                      preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
                      tuning = list(eta = 0.5, lambda = 1e-4, alpha = 0,
                                    max.depth = 20, gamma = 0.6,
                                    min_child_weight = 0, max_delta_step = 1,
                                    subsample = 0.75, colsample_bytree = 1, colsample_bylevel=1,
                                    nrounds= 3, early_stopping_rounds = 1,
                                    early.stopping.max.diff = n.test / 100, early.stopping.precision.digits = 3,
                                    nthread = 8, verbose=1)
                      )
eval.dataset[, .SD[which.max(pred.prob), list(ans, true.code = code, pred.code, acc = code == pred.code)], by = id][, mean(acc)]
produceResults(expandPredictionResults(eval.dataset, allowed.codes = allowed.codes, method.name = "xgboost"), k = 1, n = n.test, num.codes = length(allowed.codes))

# same as before but output the model
XGboostModel <- trainXgboost(proc.occupations, allowed.codes = allowed.codes, testCases = group == "test", returnPredictions = FALSE,
                      preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
                      tuning = list(eta = 0.5, lambda = 1e-4, alpha = 0,
                                    max.depth = 20, gamma = 0.6,
                                    min_child_weight = 0, max_delta_step = 1,
                                    subsample = 0.75, colsample_bytree = 1, colsample_bylevel=1,
                                    nrounds= 3, early_stopping_rounds = 1,
                                    early.stopping.max.diff = n.test / 100, early.stopping.precision.digits = 3,
                                    nthread = 8, verbose=1)
                      )

# same as before, but without test data and without early stopping (not recommended because results can be worse)
XGboostModel <- trainXgboost(proc.occupations, allowed.codes = allowed.codes, testCases = NULL, returnPredictions = FALSE,
                      preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
                      tuning = list(eta = 0.5, lambda = 1e-4, alpha = 0,
                                    max.depth = 20, gamma = 0.6,
                                    min_child_weight = 0, max_delta_step = 1,
                                    subsample = 0.75, colsample_bytree = 1, colsample_bylevel=1,
                                    nrounds= 3, early_stopping_rounds = NULL,
                                    early.stopping.max.diff = n.test / 100, early.stopping.precision.digits = 3,
                                    nthread = 8, verbose=0)
                      )

# same as before, now using the coding index
# point path_to_file to your local file
# path_to_file <- ".../Gesamtberufsliste_der_BA.xlsx"
# coding_index_excerpt <- prepare_German_coding_index_Gesamtberufsliste_der_BA(path_to_file, count.categories = FALSE)
XGboostModel <- trainXgboost(proc.occupations, allowed.codes = allowed.codes, testCases = NULL, returnPredictions = FALSE,
                      coding_index = coding_index_excerpt,
                      preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
                      tuning = list(eta = 0.5, lambda = 1e-4, alpha = 0,
                                    max.depth = 20, gamma = 0.6,
                                    min_child_weight = 0, max_delta_step = 1,
                                    subsample = 0.75, colsample_bytree = 1, colsample_bylevel=1,
                                    nrounds= 3, early_stopping_rounds = NULL,
                                    early.stopping.max.diff = n.test / 100, early.stopping.precision.digits = 3,
                                    nthread = 8, verbose=0)
                      )
