predictGweonsNearestNeighbor: Predict codes with Gweon's Nearest Neighbor Method

View source: R/predictGweonsNearestNeighbor.R

predictGweonsNearestNeighbor    R Documentation

Predict codes with Gweon's Nearest Neighbor Method

Description

Applies the same preprocessing as trainGweonsNearestNeighbor and predicts codes with a modified 1-nearest-neighbor approach.

Usage

predictGweonsNearestNeighbor(
  model,
  newdata,
  tuning = list(nearest.neighbors.multiplier = 0.1)
)

Arguments

model

the output created from trainGweonsNearestNeighbor

newdata

either a data.table created with removeFaultyAndUncodableAnswers_And_PrepareForAnalysis or a character vector

tuning

a list with the element

nearest.neighbors.multiplier

defaults to 0.1. Gweon et al. (2017) show that 0.1 is a better choice than 0, but the exact value is somewhat arbitrary.
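To give a feel for what this tuning value does, here is a minimal sketch in base R. It assumes, following our reading of Gweon et al. (2017), that the score of a code is the similarity to the nearest neighbors, times the relative frequency of the code among them, times K / (K + multiplier), where K is the number of nearest neighbors. The helper modified_nn_score is hypothetical and not part of this package; the package's internal computation may differ.

```r
# Hypothetical helper for illustration only; not exported by occupationCoding.
# ASSUMPTION: score = similarity * relative code frequency * K / (K + multiplier)
modified_nn_score <- function(similarity, rel_freq, k, multiplier = 0.1) {
  similarity * rel_freq * k / (k + multiplier)
}

# Compare the two settings for a varying number of nearest neighbors K,
# holding similarity and relative frequency fixed at 1.
k <- c(1, 2, 5, 20)
with_penalty <- modified_nn_score(similarity = 1, rel_freq = 1, k = k, multiplier = 0.1)
without_penalty <- modified_nn_score(similarity = 1, rel_freq = 1, k = k, multiplier = 0)

print(data.frame(k, with_penalty, without_penalty))
```

Under this assumption, a multiplier of 0 gives every K the same score, whereas 0.1 slightly down-weights predictions that rest on very few nearest neighbors.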

Value

a data.table of class occupationalPredictions that contains predicted probabilities pred.prob for every combination of ans and pred.code. pred.code may not cover the full set of possible codes. If all predicted codes have probability 0, these predictions are removed and we instead insert pred.code := "-9999" with pred.prob = 1/num.allowed.codes.
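The fallback rule described above can be sketched in base R as follows. The helper apply_fallback and the toy data are hypothetical and only illustrate the rule for a single answer; the package itself works on data.table objects and its implementation may look different.

```r
# Hypothetical helper for illustration only; not the package's implementation.
# pred: data.frame with columns ans, pred.code, pred.prob for ONE answer.
apply_fallback <- function(pred, num.allowed.codes) {
  if (all(pred$pred.prob == 0)) {
    # No code received a positive probability: replace all rows by a single
    # placeholder row with code "-9999" and a uniform probability.
    return(data.frame(ans = pred$ans[1],
                      pred.code = "-9999",
                      pred.prob = 1 / num.allowed.codes,
                      stringsAsFactors = FALSE))
  }
  pred
}

# Toy input: an answer for which every predicted code has probability 0.
pred <- data.frame(ans = "xyz",
                   pred.code = c("71402", "71403"),
                   pred.prob = c(0, 0),
                   stringsAsFactors = FALSE)
fallback <- apply_fallback(pred, num.allowed.codes = 11)
print(fallback)
```

With 11 allowed codes, the placeholder row carries pred.prob = 1/11, which is the probability a uniform guess would assign to each code.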

See Also

trainGweonsNearestNeighbor

Gweon, H., Schonlau, M., Kaczmirek, L., Blohm, M., Steiner, S. (2017). Three Methods for Occupation Coding Based on Statistical Learning. Journal of Official Statistics 33(1), pp. 101–122.

This function is based on https://github.com/hgweon/occupation-coding/blob/master/Modified_NN.r. Considerable speed improvements were implemented.

Examples

# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c("Office clerks and secretaries (without specialisation)-skilled tasks", "Office clerks and secretaries (without specialisation)-complex tasks", "Gastronomy occupations (without specialisation)-skilled tasks",
 "Occupations in child care and child-rearing-skilled tasks", "Occupations in social work and social pedagogics-highly complex tasks", "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks", "Pedagogic specialists in social care work and special needs education-skilled tasks", "Supervisors in education and social work, and of pedagogic specialists in social care work", "Managers in education and social work, and of pedagogic specialists in social care work",
 "Not precise enough for coding", "Student assistants")
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)

# split sample into test and training data
set.seed(3451345)
n.test <- 50
group <- sample(c(rep("test", n.test), rep("training", nrow(proc.occupations) - n.test)))
splitted.data <- split(proc.occupations, group)

# train model and make predictions
# use the full element name "training" instead of relying on partial matching of `$`
model <- trainGweonsNearestNeighbor(splitted.data$training,
                                    preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", strPreprocessing = TRUE, removePunct = FALSE))
predictGweonsNearestNeighbor(model, c("test", "HIWI", "Hilfswissenschaftler"))
res <- predictGweonsNearestNeighbor(model, splitted.data$test)

# look at the most probable code for each id
res[, .SD[which.max(pred.prob), list(ans, true.code = code, pred.code, acc = code == pred.code)], by = id]
res[, .SD[which.max(pred.prob), list(ans, true.code = code, pred.code, acc = code == pred.code)], by = id][, mean(acc)] # calculate accuracy of predictions

# further analysis usually requires additional processing:
produceResults(expandPredictionResults(res, allowed.codes, method.name = "GweonsNearestNeighbor"), k = 1, n = n.test, num.codes = length(allowed.codes))

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.