sharpness | R Documentation
Calculate sharpness (mean entropy, in bits)

sharpness = \frac{1}{N} \sum_n entropy_n

and its standard error

\sqrt{\frac{1}{N(N-1)} \sum_n (entropy_n - sharpness)^2}

with entropy_n = - \sum_k p_{nk} \log_2 p_{nk}.
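To see what these formulas compute, here is a minimal illustration in base R (the probability matrix p and all values are invented; this is not the package's internal implementation):

# toy matrix of predicted probabilities: N = 3 observations, K = 4 categories
p <- rbind(c(0.70, 0.10, 0.10, 0.10),
           c(0.25, 0.25, 0.25, 0.25),
           c(0.40, 0.30, 0.20, 0.10))

entropy <- apply(p, 1, function(row) -sum(row * log2(row)))  # entropy_n per observation

sharpness.value <- mean(entropy)                             # 1/N * sum_n entropy_n
N <- length(entropy)
std.error <- sqrt(sum((entropy - sharpness.value)^2) / (N * (N - 1)))
sharpness.value
std.error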
Usage

sharpness(occupationalPredictions)
Arguments

occupationalPredictions: a data.table of predicted probabilities, e.g. created with a prediction function and expandPredictionResults (see the examples below)
Details

Sharpness is zero (optimal) if a single category is predicted with probability 1. It is maximal if all categories have equal probability p = \frac{1}{K}.
Note: What should be done if a predicted probability is zero? 0 \times \log_2(0) is not defined but is needed to calculate sharpness. We set 0 \times \log_2(0) = 0. This also means that observations are excluded from the analysis if all their probabilities are predicted as zero. An alternative would be to set such zeros to \frac{1}{K}, which would lead to a very different sharpness.
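For a concrete feel for these bounds and the zero convention, consider this base R sketch (the helper entropy0 is made up for illustration and is not part of the package):

# entropy of one probability vector, with the convention 0 * log2(0) = 0
entropy0 <- function(row) -sum(ifelse(row > 0, row * log2(row), 0))

K <- 4
entropy0(c(1, 0, 0, 0))  # 0: a single category predicted with probability 1 (optimal)
entropy0(rep(1/K, K))    # 2 = log2(K): equal probabilities, maximal entropy

# replacing predicted zeros by 1/K instead would give a very different value
row <- c(1, 0, 0, 0)
entropy0(replace(row, row == 0, 1/K))  # 1.5 instead of 0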
Value

a data.table
See Also

plotReliabilityDiagram, logLoss
Examples

# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c("Office clerks and secretaries (without specialisation)-skilled tasks", "Office clerks and secretaries (without specialisation)-complex tasks", "Gastronomy occupations (without specialisation)-skilled tasks",
"Occupations in child care and child-rearing-skilled tasks", "Occupations in social work and social pedagogics-highly complex tasks", "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks", "Pedagogic specialists in social care work and special needs education-skilled tasks", "Supervisors in education and social work, and of pedagogic specialists in social care work", "Managers in education and social work, and of pedagogic specialists in social care work",
"Not precise enough for coding", "Student assistants")
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)
## split sample
set.seed(3451345)
n.test <- 50
group <- sample(c(rep("test", n.test), rep("training", nrow(proc.occupations) - n.test)))
splitted.data <- split(proc.occupations, group)
# train model and make predictions
model <- trainLogisticRegressionWithPenalization(splitted.data$training,
    preprocessing = list(stopwords = tm::stopwords("de"), stemming = "de", countWords = FALSE),
    tuning = list(alpha = 0.05, maxit = 50^5, nlambda = 100, thresh = 1e-5))
res <- predictLogisticRegressionWithPenalization(model, splitted.data$test)
# expand to contain more categories than the initial ones
res.proc1 <- expandPredictionResults(res, allowed.codes = c("12345", allowed.codes), method.name = "glmnet1")
# results from different methods can be combined into one dataset, so the subsequent analysis functions need to be run only once
res.proc2 <- expandPredictionResults(res, allowed.codes = c("12345", allowed.codes), method.name = "glmnet2")
res.proc <- rbind(res.proc1, res.proc2); class(res.proc) <- c(class(res.proc), "occupationalPredictionsComplete")
sharpness(res.proc)