plotAgreementRateVsProductionRate: Plot agreement rate vs. production rate
In malsch/occupationCoding: Supervised Learning for Occupation Coding

View source: R/plotAgreementRateVsProductionRate.R

plotAgreementRateVsProductionRate

R Documentation

Plot agreement rate vs. production rate

Description

Plot the proportion of answers coded correctly for any given production rate (proportion of n that are coded automatically.)

Usage

plotAgreementRateVsProductionRate(
  occupationalPredictionsAmongTopK,
  n,
  yintercept,
  filename = NULL
)

Arguments

`occupationalPredictionsAmongTopK`	a data table created with `calcAccurateAmongTopK`.
`n`	Number of unique persons in test data (may be larger than the number of persons in occupationalPredictionsAmongTopK if for some persons no codes were suggested)
`yintercept`	Where to draw a horizontal line?
`filename`	If a `filename` is specified the diagram will be saved at with this name.

Value

a ggplot

Examples

# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c("Office clerks and secretaries (without specialisation)-skilled tasks", "Office clerks and secretaries (without specialisation)-complex tasks", "Gastronomy occupations (without specialisation)-skilled tasks",
 "Occupations in child care and child-rearing-skilled tasks", "Occupations in social work and social pedagogics-highly complex tasks", "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks", "Pedagogic specialists in social care work and special needs education-skilled tasks", "Supervisors in education and social work, and of pedagogic specialists in social care work", "Managers in education and social work, and of pedagogic specialists in social care work",
 "Not precise enough for coding", "Student assistants")
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)

## split sample
set.seed(3451345)
n.test <- 50
group <- sample(c(rep("test", n.test), rep("training", nrow(proc.occupations) - n.test)))
splitted.data <- split(proc.occupations, group)

# train model and make predictions
model <- trainLogisticRegressionWithPenalization(splitted.data$train, preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE), tuning = list(alpha = 0.05, maxit = 50^5, nlambda = 100, thresh = 1e-5))
res <- predictLogisticRegressionWithPenalization(model, splitted.data$test)

# expand to contain more categories than the initial ones
res.proc1 <- expandPredictionResults(res, allowed.codes = c("12345", allowed.codes), method.name = "glmnet1")

# we can use different methods to create a combined dataset. This is how to run the subsequent analysis functions only once.
res.proc2 <- expandPredictionResults(res, allowed.codes = c("12345", allowed.codes), method.name = "glmnet2")
res.proc <- rbind(res.proc1, res.proc2); class(res.proc) <- c(class(res.proc), "occupationalPredictionsComplete")

calcAccurateAmongTopK(res.proc, k = 5)[,mean(acc), by = method.name]
plotAgreementRateVsProductionRate(calcAccurateAmongTopK(res.proc, k = 5), n = n.test, yintercept = 0.85)
plotAgreementRateVsProductionRate(calcAccurateAmongTopK(res.proc, k = 1), n = n.test, yintercept = 0.85, filename = "test.pdf")

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.