trainGweonsNearestNeighbor: Trains Gweon's Nearest Neighbor model
In malsch/occupationCoding: Supervised Learning for Occupation Coding

Description

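Preprocesses the text answers in data according to the preprocessing settings (stopword removal, stemming, string preprocessing, punctuation removal) and creates a document-term matrix to be used for the Nearest Neighbor model.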

Usage

trainGweonsNearestNeighbor(
  data,
  preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de",
    strPreprocessing = FALSE, removePunct = TRUE)
)

Arguments

data

a data.table created with removeFaultyAndUncodableAnswers_And_PrepareForAnalysis

preprocessing

a list with elements

stopwords: a character vector, use tm::stopwords("de") for German stopwords.
stemming: NULL for no stemming, or "de" for stemming with the German Porter stemmer.
strPreprocessing: TRUE if stringPreprocessing shall be used.
removePunct: TRUE if removePunctuation shall be used.
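
To make the stopwords and removePunct elements more concrete, here is a minimal base-R sketch of what such steps typically do to an answer. It is not the package's implementation: the helper name sketchPreprocess, the toy answers, and the two-word stopword list are hypothetical, and stemming and strPreprocessing are left out because their exact behavior is not described on this page.

```r
# Hypothetical helper for illustration only; not part of occupationCoding.
sketchPreprocess <- function(x, stopwords = character(0), removePunct = TRUE) {
  x <- tolower(x)
  if (removePunct) {
    # Replace punctuation with a space so that "a,b" becomes two tokens.
    x <- gsub("[[:punct:]]+", " ", x)
  }
  # Split each answer into whitespace-separated tokens.
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  # Drop stopwords and empty strings.
  lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% stopwords)])
}

# Made-up answers and a made-up stopword list.
answers <- c("Erzieherin in der Kita.", "Sozialarbeiter, Jugendamt")
tokens <- sketchPreprocess(answers, stopwords = c("in", "der"))
```

In real use, the stopword vector would come from tm::stopwords("de") as described above rather than from a hand-written list.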

Value

a document-term matrix with some additional attributes; it is passed as the model to predictGweonsNearestNeighbor
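
For readers unfamiliar with the term, the following base-R sketch shows the general shape of a document-term matrix: one row per answer, one column per term, and term counts in the cells. The toy tokens are made up, and the object returned by trainGweonsNearestNeighbor may use a different (for example sparse) representation and carries additional attributes not shown here.

```r
# Toy tokenized answers (made up for illustration).
docs <- list(
  c("erzieherin", "kita"),
  c("kellner", "restaurant"),
  c("erzieherin", "kindergarten")
)

# The vocabulary is the sorted set of all distinct terms.
vocab <- sort(unique(unlist(docs)))

# Count each vocabulary term per document; vapply returns terms x documents,
# so transpose to get one row per document and one column per term.
dtm <- t(vapply(
  docs,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab
```

A nearest-neighbor approach can then compare the row of a new answer with the rows of the training answers.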

Examples

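# load the package (and data.table, which the grid search below uses)
library(occupationCoding)
library(data.table)

# set up data
data(occupations)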
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131",
                   "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c(
  "Office clerks and secretaries (without specialisation)-skilled tasks",
  "Office clerks and secretaries (without specialisation)-complex tasks",
  "Gastronomy occupations (without specialisation)-skilled tasks",
  "Occupations in child care and child-rearing-skilled tasks",
  "Occupations in social work and social pedagogics-highly complex tasks",
  "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks",
  "Pedagogic specialists in social care work and special needs education-skilled tasks",
  "Supervisors in education and social work, and of pedagogic specialists in social care work",
  "Managers in education and social work, and of pedagogic specialists in social care work",
  "Not precise enough for coding",
  "Student assistants"
)
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(
  occupations,
  colNames = c("orig_answer", "orig_code"),
  allowed.codes,
  allowed.codes.titles
)

# Recommended configuration
dtmModel <- trainGweonsNearestNeighbor(proc.occupations,
                 preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", strPreprocessing = TRUE, removePunct = FALSE))
# Configuration used by Gweon et al. (2017)
dtmModel <- trainGweonsNearestNeighbor(proc.occupations,
                 preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", strPreprocessing = FALSE, removePunct = TRUE))
# Configuration used for most other approaches in this package
dtmModel <- trainGweonsNearestNeighbor(proc.occupations,
                 preprocessing = list(stopwords = character(0), stemming = NULL, strPreprocessing = TRUE, removePunct = FALSE))

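#######################################################
## RUN A GRID SEARCH (takes some time)

# split sample into training and test data
# (splitted.data is not defined elsewhere in this example; the random
# hold-out of 50 answers and the seed below are assumptions for illustration)
set.seed(3451345)
n.test <- 50
group <- sample(c(rep("test", n.test), rep("training", nrow(proc.occupations) - n.test)))
splitted.data <- split(proc.occupations, group)

# create a grid of all combinations to be tried
model.grid <- data.table(expand.grid(stopwords = c(TRUE, FALSE),
                                     stemming = c(FALSE, "de"),
                                     strPreprocessing = c(TRUE, FALSE),
                                     nearest.neighbors.multiplier = c(0.05, 0.1, 0.2)))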

# Do grid search
for (i in seq_len(nrow(model.grid))) {
  # train with the preprocessing settings of grid row i
  res.model <- trainGweonsNearestNeighbor(
    splitted.data$training,
    preprocessing = list(
      stopwords = if (model.grid[i, stopwords]) tm::stopwords("de") else character(0),
      stemming = if (model.grid[i, stemming == "de"]) "de" else NULL,
      strPreprocessing = model.grid[i, strPreprocessing],
      removePunct = !model.grid[i, strPreprocessing]
    )
  )

  # predict on the test data with the tuning parameter of grid row i
  res.proc <- predictGweonsNearestNeighbor(
    res.model, splitted.data$test,
    tuning = list(nearest.neighbors.multiplier = model.grid[i, nearest.neighbors.multiplier])
  )
  res.proc <- expandPredictionResults(res.proc, allowed.codes = allowed.codes, method.name = "NearestNeighbor_Gweon")

  ac <- accuracy(calcAccurateAmongTopK(res.proc, k = 1), n = nrow(splitted.data$test))
  ll <- logLoss(res.proc)
  sh <- sharpness(res.proc)

  model.grid[i, acc := ac[, acc]]
  model.grid[i, acc.se := ac[, se]]
  model.grid[i, acc.N := ac[, N]]
  model.grid[i, acc.prob0 := ac[, count.pred.prob0]]
  model.grid[i, loss.full := ll[1, logscore]]
  model.grid[i, loss.full.se := ll[1, se]]
  model.grid[i, loss.full.N := ll[1, N]]
  model.grid[i, loss.sub := ll[2, logscore]]
  model.grid[i, loss.sub.se := ll[2, se]]
  model.grid[i, loss.sub.N := ll[2, N]]
  model.grid[i, sharp := sh[, sharpness]]
  model.grid[i, sharp.se := sh[, se]]
  model.grid[i, sharp.N := sh[, N]]
}

# show the results for all combinations, sorted by grid parameters
model.grid[order(stopwords, stemming, strPreprocessing, nearest.neighbors.multiplier)]

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.