predictCreecysMemoryBasedReasoning: Predict codes with Creecys Memory-based reaoning model
In malsch/occupationCoding: Supervised Learning for Occupation Coding

View source: R/predictCreecysMemoryBasedReasoning.R

predictCreecysMemoryBasedReasoning

R Documentation

Predict codes with Creecys Memory-based reaoning model

Description

Function does the same preprocessing as in trainCreecysMemoryBasedReasoning and predicts codes with a modified k-nearest-neighbor approach.

Usage

predictCreecysMemoryBasedReasoning(
  model,
  newdata,
  tuning = list(k.neighbors = 12, metric = c("SUM", "ERROR", "MAX"))
)

Arguments

model

the output created from trainCreecysMemoryBasedReasoning

newdata

eiter a data.table created with removeFaultyAndUncodableAnswers_And_PrepareForAnalysis or a character vector

tuning

a list with elements

k.neighbors: Number of nearest neighbors to use.
metric: metric determines how to calculate 'nearness'. Setting metric == MAX is not recommended. See Creecy et al. for their reasoning and testing of different metrics.

Value

a data.table that provides a confidence score for the most likely category. Unlike other prediction functions in this package, no probabilities for all categories are provided, which makes post-processing a bit more difficult. See examples.

Examples

# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c("Office clerks and secretaries (without specialisation)-skilled tasks", "Office clerks and secretaries (without specialisation)-complex tasks", "Gastronomy occupations (without specialisation)-skilled tasks",
 "Occupations in child care and child-rearing-skilled tasks", "Occupations in social work and social pedagogics-highly complex tasks", "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks", "Pedagogic specialists in social care work and special needs education-skilled tasks", "Supervisors in education and social work, and of pedagogic specialists in social care work", "Managers in education and social work, and of pedagogic specialists in social care work",
 "Not precise enough for coding", "Student assistants")
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)

## split sample
set.seed(3451345)
n.test <- 50
group <- sample(c(rep("test", n.test), rep("training", nrow(proc.occupations) - n.test)))
splitted.data <- split(proc.occupations, group)

# train model and make predictions
memModel <- trainCreecysMemoryBasedReasoning(splitted.data$training,
                                             preprocessing = list(stopwords = character(0), stemming = NULL, strPreprocessing = TRUE, removePunct = FALSE))
testi <- predictCreecysMemoryBasedReasoning(memModel, c("test", "HIWI", "Hilfswissenschaftler"), tuning = list(k.neighbors = 12, metric = c("SUM")))
testi; testi

resMem <- predictCreecysMemoryBasedReasoning(memModel, splitted.data$test, tuning = list(k.neighbors = 12, metric = c("SUM")))

# Analyize predictions
accuracy(resMem[, list(pred.code, pred.prob = confidence.score, acc = pred.code == code, num.suggested.codes = 1, num.suggested = 1, general.among.top5 = pred.code == "-9999", method.name = "Creecy.Sum.12")], n = n.test)
plotTruePredictionsVsFalsePredictions(resMem[, list(pred.code, pred.prob = confidence.score, acc = pred.code == code, num.suggested.codes = 1, num.suggested = 1, general.among.top5 = pred.code == "-9999", method.name = "Creecy.Sum.12")])
plotAgreementRateVsProductionRate(resMem[, list(pred.code, pred.prob = confidence.score, acc = pred.code == code, num.suggested.codes = 1, num.suggested = 1, general.among.top5 = pred.code == "-9999", method.name = "Creecy.Sum.12")], n = n.test, yintercept = 0.85)

#' #######################################################
## RUN A GRID SEARCH (takes some time)

# create a grid of all tuning combinations to try
 model.grid <- data.table(expand.grid(stopwords = c(TRUE, FALSE), stemming = c(FALSE, "de"), metric = c("SUM", "ERROR"), k.neighbors = c(2, 10, 17)))

 # Do grid search
 for (i in 1:nrow(model.grid)) {

   res.model <- trainCreecysMemoryBasedReasoning(splitted.data$training, preprocessing = list(stopwords = if (model.grid[i, stopwords]) tm::stopwords("de") else character(0),
                                                                                              stemming = if (model.grid[i, stemming == "de"]) "de" else NULL,
                                                                                              strPreprocessing = TRUE,
                                                                                              removePunct = FALSE))

   res.proc <- predictCreecysMemoryBasedReasoning(res.model, splitted.data$test,
                                                  tuning = list(k.neighbors = model.grid[i, k.neighbors],
                                                                metric = model.grid[i, metric]))

   ac <- accuracy(res.proc[, list(pred.code, pred.prob = confidence.score, acc = pred.code == code, num.suggested.codes = 1, num.suggested = 1, general.among.top5 = pred.code == "-9999", method.name = "Creecy.Sum.12")], n = nrow(splitted.data$test))

   model.grid[i, acc := ac[, acc]]
   model.grid[i, acc.se := ac[, se]]
   model.grid[i, acc.N := ac[, N]]
   model.grid[i, acc.prob0 := ac[, count.pred.prob0]]
 }

model.grid[order(metric, k.neighbors, stemming)]

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.