trainSimilarityBasedReasoning2 | R Documentation |
The output of this function is the same as that of trainSimilarityBasedReasoning, but as input trainSimilarityBasedReasoning2 requires only aggregated (and thus anonymized) training data. This package provides such training data for coding German occupations into the German Classification of Occupations (KldB 2010); see surveyCountsSubstringSimilarity and surveyCountsWordwiseSimilarity. Parameter settings for this function should be the same as those used to anonymize the training data. The examples below show the recommended usage.
trainSimilarityBasedReasoning2(
anonymized_data,
num.allowed.codes = 1291,
coding_index_w_codes,
coding_index_without_codes = NULL,
preprocessing = list(stopwords = NULL, stemming = NULL, strPreprocessing = TRUE,
removePunct = FALSE),
dist.type = c("wordwise", "substring", "fulltext"),
dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
threshold = c(max = 3, use = 1),
simulation.control = list(n.draws = 250, check.normality = FALSE),
tmp_folder = NULL
)
anonymized_data |
|
num.allowed.codes |
the number of allowed codes in the target classification. There are 1286 categories in the KldB 2010 plus 5 special codes in both anonymized training data sets, so the default value is 1291. |
coding_index_w_codes |
a data.table with columns
|
coding_index_without_codes |
(not used, but automatically determined) Any words from |
preprocessing |
a list with elements
|
dist.type |
How to calculate similarity between entries from both coding_indices and verbal answers from the survey? Three options are currently supported. Since we use the
|
dist.control |
If |
threshold |
A numeric vector with two elements. If |
simulation.control |
a list with two components,
|
tmp_folder |
(not used) |
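The default dist.control names the "osa" method with unit weights for deletion, insertion, substitution and transposition. As a rough illustration of what such a distance measures, here is a minimal base-R sketch of the optimal string alignment distance. The function osa_distance, the toy index entries, and the reading of threshold["use"] as a simple cut-off are assumptions made for this sketch; it does not claim to reproduce the package's internal implementation.

```r
# Optimal string alignment (restricted Damerau-Levenshtein) distance.
# The weight names d/i/s/t mirror the dist.control default:
# deletion, insertion, substitution, transposition.
osa_distance <- function(a, b, weight = c(d = 1, i = 1, s = 1, t = 1)) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- length(x)
  m <- length(y)
  # D[i + 1, j + 1]: cost of turning the first i characters of a
  # into the first j characters of b
  D <- matrix(0, nrow = n + 1, ncol = m + 1)
  D[, 1] <- (0:n) * weight[["d"]]
  D[1, ] <- (0:m) * weight[["i"]]
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub_cost <- if (x[i] == y[j]) 0 else weight[["s"]]
      best <- min(D[i, j + 1] + weight[["d"]],  # delete x[i]
                  D[i + 1, j] + weight[["i"]],  # insert y[j]
                  D[i, j] + sub_cost)           # match or substitute
      # swap of two adjacent characters
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, D[i - 1, j - 1] + weight[["t"]])
      }
      D[i + 1, j + 1] <- best
    }
  }
  D[n + 1, m + 1]
}

# Hypothetical index entries and a misspelled answer (toy data)
entries <- c("backer", "koch", "kellner")
answer <- "bakcer"
d <- vapply(entries, osa_distance, numeric(1), b = answer)

# Assumed cut-off: treat entries within distance 1 as similar
similar <- names(d)[d <= 1]
```

With these toy strings only "backer" falls within distance 1 of the answer, because a single transposition turns one into the other.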
a list with components
Contains all entries from the coding index. dist = "official" if the entry stems from coding_index_w_codes and dist = "selfcreated" if the entry stems from coding_index_without_codes. string.prob is used for weighting (model averaging) if a new verbal answer is similar to multiple strings. unobserved.mean.theta gives the probability (usually very low) of any category that was not observed in the training data together with this string.
mean.theta is the probability of code given that an incoming verbal answer is similar to string. It is only available if this code was observed at least once with this string (use unobserved.mean.theta otherwise).
Number of categories in the classification.
The input parameter stored to replicate preprocessing with incoming data.
The input parameter stored to replicate distance calculations with incoming data.
The input parameter stored to replicate distance calculations with incoming data.
The input parameter stored to replicate distance calculations with incoming data.
The input parameters controlling the Monte Carlo simulation.
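To make the roles of string.prob, mean.theta and unobserved.mean.theta more concrete, the following base-R sketch averages code probabilities over two strings that a new answer is assumed to be similar to. All numbers and both strings are invented, and the weighted-average formula is an assumption about how model averaging could work; it is not taken from the package's prediction code.

```r
# Toy per-string quantities (invented for illustration only)
strings <- data.frame(
  string = c("KELLNER", "KELLNERIN"),
  string.prob = c(0.6, 0.4),
  unobserved.mean.theta = c(0.001, 0.002),
  stringsAsFactors = FALSE
)
# Toy (string, code) pairs that were observed together
observed <- data.frame(
  string = c("KELLNER", "KELLNER", "KELLNERIN"),
  code = c("63302", "71402", "63302"),
  mean.theta = c(0.90, 0.05, 0.95),
  stringsAsFactors = FALSE
)
codes <- c("63302", "71402")

# One row per (string, code) combination
grid <- expand.grid(string = strings$string, code = codes,
                    stringsAsFactors = FALSE)
key <- function(d) paste(d$string, d$code)

# Use mean.theta where the pair was observed ...
grid$theta <- observed$mean.theta[match(key(grid), key(observed))]
# ... and fall back to unobserved.mean.theta otherwise
s <- match(grid$string, strings$string)
unobserved <- is.na(grid$theta)
grid$theta[unobserved] <- strings$unobserved.mean.theta[s[unobserved]]

# Assumed averaging rule: weight each string by its normalized string.prob
w <- strings$string.prob[s] / sum(strings$string.prob)
pred <- tapply(w * grid$theta, grid$code, sum)
```

Here the pair ("KELLNERIN", "71402") was never observed, so its contribution comes from unobserved.mean.theta, which keeps the averaged probability for 71402 small but nonzero.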
See trainSimilarityBasedReasoning, which runs the same procedure on non-aggregated training data.
Schierholz, Malte (2019): New methods for job and occupation classification. Dissertation, Mannheim. https://madoc.bib.uni-mannheim.de/50617/, pp. 206-208 and p. 268, pp. 308-320
# set up test data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c("Office clerks and secretaries (without specialisation)-skilled tasks", "Office clerks and secretaries (without specialisation)-complex tasks", "Gastronomy occupations (without specialisation)-skilled tasks",
"Occupations in child care and child-rearing-skilled tasks", "Occupations in social work and social pedagogics-highly complex tasks", "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks", "Pedagogic specialists in social care work and special needs education-skilled tasks", "Supervisors in education and social work, and of pedagogic specialists in social care work", "Managers in education and social work, and of pedagogic specialists in social care work",
"Not precise enough for coding", "Student assistants")
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)
# set up dictionary (see help file for how to obtain the dictionary)
path_to_file <- "./Gesamtberufsliste_der_BA.xlsx" # change path
try({coding_index_w_codes <- prepare_German_coding_index_Gesamtberufsliste_der_BA(path_to_file, count.categories = FALSE)}, silent = TRUE)
# or, if the file does not exist at the given path, just use coding_index_excerpt
if (!exists("coding_index_w_codes")) coding_index_w_codes <- coding_index_excerpt
data(surveyCountsSubstringSimilarity)
simBasedModelSubstring <- trainSimilarityBasedReasoning2(anonymized_data = surveyCountsSubstringSimilarity,
num.allowed.codes = 1291,
coding_index_w_codes = coding_index_w_codes,
preprocessing = list(stopwords = NULL, stemming = NULL, strPreprocessing = TRUE, removePunct = FALSE),
dist.type = "substring",
dist.control = NA,
threshold = NA,
simulation.control = list(n.draws = 250, check.normality = FALSE)
)
res <- predictSimilarityBasedReasoning(simBasedModelSubstring, proc.occupations)
# Look at most probable answer from each id
res[, .SD[which.max(pred.prob), list(ans, true.code = code, pred.code, acc = code == pred.code)], by = id]
res[, .SD[which.max(pred.prob), list(ans, true.code = code, pred.code, acc = code == pred.code)], by = id][, mean(acc)] # calculate agreement rate
# Look at a single person and order predictions by their probability. According to the algorithm the code 81112 has the highest probability, but the code 71402 (which was selected by a coder) has second-highest probability
res[id == 11][order(pred.prob, decreasing = TRUE)]
data(surveyCountsWordwiseSimilarity)
simBasedModelWordwise <- trainSimilarityBasedReasoning2(anonymized_data = surveyCountsWordwiseSimilarity,
num.allowed.codes = 1291,
coding_index_w_codes = coding_index_w_codes,
preprocessing = list(stopwords = NULL, stemming = NULL, strPreprocessing = TRUE, removePunct = FALSE),
dist.type = "wordwise",
dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
threshold = c(max = NA, use = 1),
simulation.control = list(n.draws = 250, check.normality = FALSE)
)
res <- predictSimilarityBasedReasoning(simBasedModelWordwise, proc.occupations)
# Look at most probable answer from each id
res[, .SD[which.max(pred.prob), list(ans, true.code = code, pred.code, acc = code == pred.code)], by = id]
res[, .SD[which.max(pred.prob), list(ans, true.code = code, pred.code, acc = code == pred.code)], by = id][, mean(acc)] # calculate agreement rate
# Look at a single person and order predictions by their probability. Unlike the substring-based model above, this model predicts 71402, the correct code.
res[id == 11][order(pred.prob, decreasing = TRUE)]
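The agreement-rate lines above pick the most probable prediction per id and compare it with the coder's code. The same idea can be tried without any package data; the sketch below uses base R and an invented prediction table (res_toy, with hypothetical ids and probabilities) that only mimics the column names used in the examples.

```r
# Invented prediction table with two candidate codes per respondent
res_toy <- data.frame(
  id = c(1, 1, 2, 2),
  code = c("71402", "71402", "63302", "63302"),      # code chosen by a coder
  pred.code = c("71402", "83112", "71402", "63302"), # candidate predictions
  pred.prob = c(0.7, 0.3, 0.6, 0.4),
  stringsAsFactors = FALSE
)

# Keep the row with the highest predicted probability for each id
top <- do.call(rbind, lapply(split(res_toy, res_toy$id), function(d) {
  d[which.max(d$pred.prob), ]
}))

# Share of respondents whose top prediction equals the coder's code
agreement <- mean(top$code == top$pred.code)
```

In this toy table the top prediction matches the coder for id 1 but not for id 2.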