trainXgboost: Train an extreme gradient boosted tree model

View source: R/trainXgboost.R

Train an extreme gradient boosted tree model

Description

This function performs some preprocessing and calls xgboost to train gradient boosted trees.

Usage

trainXgboost(
  data,
  allowed.codes,
  testCases = NULL,
  returnPredictions = FALSE,
  coding_index = NULL,
  preprocessing = list(stopwords = character(0), stemming = "de", countWords = TRUE),
  tuning = list(eta = 0.5, lambda = 1e-04, alpha = 0, max.depth = 20, gamma = 0.6,
    min_child_weight = 0, max_delta_step = 1, subsample = 0.75, colsample_bytree = 1,
    colsample_bylevel = 1, nrounds = 40, early_stopping_rounds = 1,
    early.stopping.max.diff = sum(testCases)/100, early.stopping.precision.digits = 3,
    nthread = 1, verbose = 1)
)

Arguments

data

a data.table created with removeFaultyAndUncodableAnswers_And_PrepareForAnalysis

allowed.codes

a character vector containing all allowed classification codes (including codes that do not appear in the data).

testCases

If NULL (default) the model is trained as usual. If testCases is a logical vector with one element per row of data, the function splits data into a separate evaluation set (the rows with testCases == TRUE), prints the misclassification error on the training and evaluation sets, and, depending on returnPredictions, returns the predictions for the evaluation set. A sketch of how to construct such a vector follows below.
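
For example, a random evaluation set of 50 cases can be encoded like this (a minimal sketch; proc.occupations is the prepared data.table from the Examples below):

n.test <- 50
testCases <- seq_len(nrow(proc.occupations)) %in% sample(nrow(proc.occupations), n.test)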

returnPredictions

(only used if testCases is given.) If TRUE, a data.table with predictions for all test cases is returned. Otherwise the xgboost model is returned; it can be used for diagnostics but not inside predictXgboost, because the term-document matrix would be calculated in a different way.

coding_index

a data.table with columns

title

a preprocessed character vector with job titles from the coding index.

Code

a character vector with associated classification codes.
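
A minimal sketch of such a table (the titles and code assignments here are hypothetical, chosen only for illustration):

coding_index <- data.table::data.table(
  title = c("buerokaufmann", "sozialarbeiter"),  # preprocessed job titles (hypothetical)
  Code = c("71402", "83124")                     # associated classification codes (hypothetical)
)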

preprocessing

a list with elements

stopwords

a character vector, use tm::stopwords("de") for German stopwords.

stemming

NULL for no stemming and "de" for stemming using the German porter stemmer.

countWords

Set to TRUE if the predictor matrix should contain a column for answer length.
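
Putting these elements together, a typical configuration (mirroring the Examples below) looks like this:

preprocessing <- list(
  stopwords = tm::stopwords("de"),  # remove German stopwords
  stemming = "de",                  # German Porter stemmer
  countWords = FALSE                # no extra column for answer length
)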

tuning

a list of elements that will be passed to xgb.train, except for the following two parameters, which control early stopping:

early.stopping.max.diff

If the sum of the predicted probabilities of the most probable category exceeds the observed number of correctly predicted cases by more than early.stopping.max.diff, the evaluation metric returns infinity, a value that triggers early stopping. The idea behind this: once the sum of predicted probabilities becomes too large, it is unusual (practically impossible) for it to decrease again, i.e. the model is overfitting.

early.stopping.precision.digits

Logloss is rounded to early.stopping.precision.digits digits. This triggers early stopping when improvements in logloss are actually too small to matter. A conceptual sketch of both heuristics follows below.
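
To illustrate the two heuristics, here is a conceptual sketch of such an evaluation function (this is not the package's internal code; all names are made up):

evalSketch <- function(pred.prob.max, n.correct, logloss, max.diff, precision.digits) {
  # Heuristic 1 (early.stopping.max.diff): if the summed probability of the
  # most probable category overshoots the number of correctly predicted
  # cases by more than max.diff, return Inf so that training stops.
  if (sum(pred.prob.max) - n.correct > max.diff) return(Inf)
  # Heuristic 2 (early.stopping.precision.digits): round logloss so that
  # negligible improvements no longer count as progress and
  # early_stopping_rounds can take effect.
  round(logloss, digits = precision.digits)
}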

Details

See run_algorithms.R for some comments about tuning.

Value

If testCases = NULL (default), an xgboost model to be used with predictXgboost. Otherwise, see testCases and returnPredictions above.

See Also

predictXgboost, xgb.train

Examples

# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c(
  "Office clerks and secretaries (without specialisation)-skilled tasks",
  "Office clerks and secretaries (without specialisation)-complex tasks",
  "Gastronomy occupations (without specialisation)-skilled tasks",
  "Occupations in child care and child-rearing-skilled tasks",
  "Occupations in social work and social pedagogics-highly complex tasks",
  "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks",
  "Pedagogic specialists in social care work and special needs education-skilled tasks",
  "Supervisors in education and social work, and of pedagogic specialists in social care work",
  "Managers in education and social work, and of pedagogic specialists in social care work",
  "Not precise enough for coding",
  "Student assistants"
)
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)
##### Tune parameters with verbose=1 output. We split the data into a training and an evaluation set of size n.test = 50.
n.test <- 50
group <- sample(c(rep("test", n.test), rep("train", nrow(proc.occupations) - n.test)))

# output test dataset with returnPredictions = TRUE
eval.dataset <- trainXgboost(proc.occupations, allowed.codes = allowed.codes, testCases = group == "test", returnPredictions = TRUE,
                      preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
                      tuning = list(eta = 0.5, lambda = 1e-4, alpha = 0,
                                    max.depth = 20, gamma = 0.6,
                                    min_child_weight = 0, max_delta_step = 1,
                                    subsample = 0.75, colsample_bytree = 1, colsample_bylevel=1,
                                    nrounds= 3, early_stopping_rounds = 1,
                                    early.stopping.max.diff = n.test / 100, early.stopping.precision.digits = 3,
                                    nthread = 8, verbose=1)
                      )
eval.dataset[, .SD[which.max(pred.prob), list(ans, true.code = code, pred.code, acc = code == pred.code)], by = id][, mean(acc)]
produceResults(expandPredictionResults(eval.dataset, allowed.codes = allowed.codes, method.name = "xgboost"), k = 1, n = n.test, num.codes = length(allowed.codes))

# same as before but output the model
XGboostModel <- trainXgboost(proc.occupations, allowed.codes = allowed.codes, testCases = group == "test", returnPredictions = FALSE,
                      preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
                      tuning = list(eta = 0.5, lambda = 1e-4, alpha = 0,
                                    max.depth = 20, gamma = 0.6,
                                    min_child_weight = 0, max_delta_step = 1,
                                    subsample = 0.75, colsample_bytree = 1, colsample_bylevel=1,
                                    nrounds= 3, early_stopping_rounds = 1,
                                    early.stopping.max.diff = n.test / 100, early.stopping.precision.digits = 3,
                                    nthread = 8, verbose=1)
                      )

# same as before, but without test data and without early stopping (not recommended because results can be worse)
XGboostModel <- trainXgboost(proc.occupations, allowed.codes = allowed.codes, testCases = NULL, returnPredictions = FALSE,
                      preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
                      tuning = list(eta = 0.5, lambda = 1e-4, alpha = 0,
                                    max.depth = 20, gamma = 0.6,
                                    min_child_weight = 0, max_delta_step = 1,
                                    subsample = 0.75, colsample_bytree = 1, colsample_bylevel=1,
                                    nrounds= 3, early_stopping_rounds = NULL,
                                    early.stopping.max.diff = n.test / 100, early.stopping.precision.digits = 3,
                                    nthread = 8, verbose=0)
                      )

# same as before, now using the coding index
# point path_to_file to your local file
# path_to_file <- ".../Gesamtberufsliste_der_BA.xlsx"
# coding_index_excerpt <- prepare_German_coding_index_Gesamtberufsliste_der_BA(path_to_file, count.categories = FALSE)
XGboostModel <- trainXgboost(proc.occupations, allowed.codes = allowed.codes, testCases = NULL, returnPredictions = FALSE,
                      coding_index = coding_index_excerpt,
                      preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
                      tuning = list(eta = 0.5, lambda = 1e-4, alpha = 0,
                                    max.depth = 20, gamma = 0.6,
                                    min_child_weight = 0, max_delta_step = 1,
                                    subsample = 0.75, colsample_bytree = 1, colsample_bylevel=1,
                                    nrounds= 3, early_stopping_rounds = NULL,
                                    early.stopping.max.diff = n.test / 100, early.stopping.precision.digits = 3,
                                    nthread = 8, verbose=0)
                      )
