View source: R/trainSimilarityBasedReasoning.R
trainSimilarityBasedReasoning | R Documentation |
For each entry in the coding index, look up how answers similar to that entry were coded in the training data and calculate code probabilities from them.
trainSimilarityBasedReasoning(
data,
coding_index_w_codes,
coding_index_without_codes,
preprocessing = list(stopwords = tm::stopwords("de"), stemming = NULL,
strPreprocessing = TRUE, removePunct = FALSE),
dist.type = c("wordwise", "substring", "fulltext"),
dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
threshold = c(max = 3, use = 1),
simulation.control = list(n.draws = 250, check.normality = FALSE),
tmp_folder = "similarityTables"
)
data |
a data.table created with |
coding_index_w_codes |
a data.table with columns
|
coding_index_without_codes |
a preprocessed character vector, meant for |
preprocessing |
a list with elements
|
dist.type |
How to calculate similarity between entries from both coding_indices and verbal answers from the survey? Three options are currently supported. Since we use the
|
dist.control |
If |
threshold |
A numeric vector with two elements. If |
simulation.control |
a list with two components,
|
tmp_folder |
The name of a folder where the algorithm will store the results from similarity calculations. Use any folder you like. Links between verbal answers and coding indices are only stored if the distance is |
a list with components
Contains all entries from the coding index. dist = "official" if the entry stems from coding_index_w_codes and dist = "selfcreated" if the entry stems from coding_index_without_codes. string.prob is used for weighting purposes (model averaging) if a new verbal answer is similar to multiple strings. unobserved.mean.theta gives a probability (usually very low) for any category that was not observed in the training data together with this string.
mean.theta is the probability for code given that an incoming verbal answer is similar to string. Only available if this code was observed at least once with this string (use unobserved.mean.theta otherwise).
Number of categories in the classification.
The input parameter stored to replicate preprocessing with incoming data.
The input parameter stored to replicate distance calculations with incoming data.
The input parameter stored to replicate distance calculations with incoming data.
The input parameter stored to replicate distance calculations with incoming data.
The input parameter.
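The weighting role of string.prob and mean.theta described above can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the strings, codes and numbers are invented, the weights are assumed to be string.prob rescaled to sum to one, and every code is assumed to have been observed with each string (so unobserved.mean.theta is not needed). The averaging actually performed by predictSimilarityBasedReasoning may differ.

```r
# Hypothetical toy values, not taken from any trained model.
codes <- c("71402", "71403", "63302")

# Assumed string.prob for two index strings that are both similar
# to one incoming verbal answer.
string.prob <- c(BUEROKAUFFRAU = 0.03, SEKRETAERIN = 0.01)

# Assumed mean.theta: one row per string, one column per code,
# read as P(code | answer is similar to string).
mean.theta <- rbind(
  BUEROKAUFFRAU = c(0.80, 0.15, 0.05),
  SEKRETAERIN   = c(0.60, 0.35, 0.05)
)
colnames(mean.theta) <- codes

# Assumption: model-averaging weights are string.prob rescaled to sum to 1.
w <- string.prob / sum(string.prob)

# Weighted average over strings; w is recycled down the rows,
# so row i is multiplied by w[i] before the column sums are taken.
averaged <- colSums(mean.theta * w)
print(averaged)
```

With these invented numbers the string with the larger string.prob dominates the average, which is the intended effect of the weighting.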
See predictSimilarityBasedReasoning for more examples and recommended settings. See trainSimilarityBasedReasoning2 for the same functionality, but using aggregated (anonymized!) training data. German training data are available.
See createSimilarityTableWordwiseStringdist, createSimilarityTableSubstring, and createSimilarityTableStringdist for implementations of the different dist.type options. frequent_phrases is a character vector with frequent German answers.
Schierholz, Malte (2019): New methods for job and occupation classification. Dissertation, Mannheim. https://madoc.bib.uni-mannheim.de/50617/, pp. 206-208 and p. 268, pp. 308-320
# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c("Office clerks and secretaries (without specialisation)-skilled tasks", "Office clerks and secretaries (without specialisation)-complex tasks", "Gastronomy occupations (without specialisation)-skilled tasks",
"Occupations in child care and child-rearing-skilled tasks", "Occupations in social work and social pedagogics-highly complex tasks", "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks", "Pedagogic specialists in social care work and special needs education-skilled tasks", "Supervisors in education and social work, and of pedagogic specialists in social care work", "Managers in education and social work, and of pedagogic specialists in social care work",
"Not precise enough for coding", "Student assistants")
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)
# train model
simBasedModel <- trainSimilarityBasedReasoning(
  data = proc.occupations,
  coding_index_w_codes = coding_index_excerpt,
  coding_index_without_codes = frequent_phrases,
  preprocessing = list(stopwords = tm::stopwords("de"), stemming = NULL,
                       strPreprocessing = TRUE, removePunct = FALSE),
  dist.type = "wordwise",
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = c(max = 3, use = 1),
  simulation.control = list(n.draws = 50, check.normality = FALSE),
  tmp_folder = "similarityTables"
)
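The dist.control values used in the example have the same names as the method and weight arguments of stringdist::stringdist, and the helper createSimilarityTableStringdist suggests that package is used internally. The sketch below shows what such distances look like. It assumes the stringdist package is installed, compares whole strings only, and treats both threshold entries as upper bounds on the distance; that reading of threshold is an assumption, and the answers and index entries are invented for illustration.

```r
library(stringdist)

# Hypothetical preprocessed answers and coding-index entries.
answers <- c("KELNER", "KOHC", "ERZIEHERIN")
index_entries <- c("KELLNER", "KOCH", "ERZIEHER")

# Pairwise optimal string alignment (OSA) distances with unit costs for
# deletion (d), insertion (i), substitution (s) and transposition (t).
d <- stringdistmatrix(answers, index_entries, method = "osa",
                      weight = c(d = 1, i = 1, s = 1, t = 1))
dimnames(d) <- list(answers, index_entries)
print(d)

# Assumption: a link is kept if distance <= max and used if distance <= use.
threshold <- c(max = 3, use = 1)
stored <- d <= threshold[["max"]]
used <- d <= threshold[["use"]]

# Row/column positions of the answer-entry pairs that would be used.
print(which(used, arr.ind = TRUE))
```

Under OSA the swapped letters in "KOHC" count as a single transposition, so a typo of that kind stays within a strict distance limit of 1.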