trainSimilarityBasedReasoning: Train Similarity Based Probability Model

View source: R/trainSimilarityBasedReasoning.R

trainSimilarityBasedReasoning    R Documentation

Train Similarity Based Probability Model

Description

For each entry in the coding index, look up how training-data answers that are similar to that entry were coded, and calculate code probabilities from them.

Usage

trainSimilarityBasedReasoning(
  data,
  coding_index_w_codes,
  coding_index_without_codes,
  preprocessing = list(stopwords = tm::stopwords("de"), stemming = NULL,
    strPreprocessing = TRUE, removePunct = FALSE),
  dist.type = c("wordwise", "substring", "fulltext"),
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = c(max = 3, use = 1),
  simulation.control = list(n.draws = 250, check.normality = FALSE),
  tmp_folder = "similarityTables"
)

Arguments

data

a data.table created with removeFaultyAndUncodableAnswers_And_PrepareForAnalysis

coding_index_w_codes

a data.table with columns

bezMale

a character vector, contains masculine job titles from the coding index.

bezFemale

a character vector, contains feminine job titles from the coding index.

Code

a character vector with associated classification codes.

coding_index_without_codes

a preprocessed character vector of strings without associated codes, e.g. frequent_phrases.

preprocessing

a list with elements

stopwords

a character vector, use tm::stopwords("de") for German stopwords. Only used if dist.type = "wordwise".

stemming

NULL for no stemming and "de" for stemming with the German Porter stemmer. Do not use unless the job titles in coding_index_w_codes were stemmed as well.

strPreprocessing

TRUE if stringPreprocessing shall be used.

removePunct

TRUE if removePunctuation shall be used.
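The effect of the stopwords and removePunct elements can be pictured with a minimal base R sketch. The helper names (remove_stopwords, strip_punct) and the short stopword vector are hypothetical stand-ins; the package's own stringPreprocessing and removePunctuation functions may behave differently.

```r
# Hypothetical stand-in for tm::stopwords("de"); the real list is much longer.
toy_stopwords <- c("im", "in", "und", "der", "die", "das")

# Drop every word of the answer that appears in the stopword vector.
remove_stopwords <- function(answer, stopwords) {
  words <- strsplit(answer, "[[:space:]]+")[[1]]
  paste(words[!tolower(words) %in% stopwords], collapse = " ")
}

# Generic punctuation removal (an assumption about what removePunct = TRUE asks for).
strip_punct <- function(answer) gsub("[[:punct:]]", "", answer)

remove_stopwords("Koch im Restaurant", toy_stopwords)  # "Koch Restaurant"
strip_punct("Koch, im Restaurant!")                    # "Koch im Restaurant"
```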

dist.type

How should similarity between entries from both coding indices and verbal answers from the survey be calculated? Three options are currently supported. Because the procedure relies heavily on the stringdist function, it could easily be extended to other distance metrics.

dist.type = "fulltext"

Uses the stringdist function directly after preprocessing to calculate distances. This is the simplest approach but the least useful.

dist.type = "substring"

An entry from the coding index and a verbal answer are similar if the entry from the coding index is a substring of the verbal answer.

dist.type = "wordwise"

After preprocessing, split the verbal answer into words. Then calculate, for each word separately, the similarity with entries from the coding index, using stringdist. Instead of the complete verbal answer, only the words (0 or more) with the highest similarity are then used to determine similarity with entries from the coding index.
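To make the three options concrete, here is a minimal sketch on made-up strings. It uses base R's utils::adist (generalized Levenshtein distance) as a stand-in for stringdist, so results can differ from method = "osa" when transpositions are involved. The variable names and the toy data are assumptions and not part of the package.

```r
# Hypothetical toy data, not taken from the package.
index_entries <- c("koch", "lehrer", "verkaeufer")
answer <- "kooch im restaurant"   # "koch" with a typo

# "fulltext": distance between the complete answer and each index entry.
fulltext_dist <- setNames(drop(utils::adist(answer, index_entries)), index_entries)

# "substring": is the index entry contained verbatim in the answer?
substring_match <- vapply(
  index_entries,
  function(entry) grepl(entry, answer, fixed = TRUE),
  logical(1)
)

# "wordwise": split the answer into words and keep, per index entry,
# the distance of the closest word.
words <- strsplit(answer, " ", fixed = TRUE)[[1]]
word_dist <- utils::adist(words, index_entries)   # rows = words, cols = entries
wordwise_dist <- setNames(apply(word_dist, 2, min), index_entries)

fulltext_dist
substring_match
wordwise_dist
```

In this toy case the typo prevents a substring match and the full-text distance is large, while the word-wise distance to "koch" is 1, which would still count as similar under threshold = c(max = 3, use = 1).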

dist.control

If dist.type = "fulltext" or dist.type = "wordwise" the entries from this list will be passed to stringdist. Currently only two possible entries are supported (method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1) is recommended), but one could easily extend the functionality.

threshold

A numeric vector with two elements. If dist.type = "fulltext" or dist.type = "wordwise", the threshold determines up to which distance a verbal answer and an entry from the coding index are considered similar. The second number (use) is the one actually applied. The first number (max) is only used to speed up similarity calculations. It should be identical to or larger than the second number.
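The interplay of the two threshold values can be sketched as follows. The data frame of distances is hypothetical, and the two filtering steps only mirror the description above; they are not the package's internal code.

```r
threshold <- c(max = 3, use = 1)

# Hypothetical distances between one verbal answer and five index entries.
links <- data.frame(
  entry = c("koch", "koechin", "kochen", "kellner", "lehrer"),
  dist  = c(0, 1, 2, 3, 5)
)

# Links up to threshold[["max"]] are the ones worth storing.
stored <- links[links$dist <= threshold[["max"]], ]

# Only links up to threshold[["use"]] count as similar.
similar <- stored[stored$dist <= threshold[["use"]], ]

nrow(stored)   # 4
nrow(similar)  # 2
```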

simulation.control

a list with two components,

n.draws

Number of draws from the posterior distribution to determine posterior predictive probabilities. The larger, the more precise the results will be.

check.normality

Ideally, the hyperprior distribution is normal. Set check.normality to TRUE to run some diagnostics on this.
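Why more draws give more precise results can be illustrated with a generic Monte Carlo sketch. The Beta(2, 8) distribution below is an arbitrary stand-in chosen for illustration, not the posterior used by the package, and mc_estimate is a hypothetical helper.

```r
set.seed(1)

# Estimate a probability as the mean of n.draws simulated values.
mc_estimate <- function(n.draws) mean(rbeta(n.draws, shape1 = 2, shape2 = 8))

# Repeat the estimate 200 times for each setting and compare the spread.
spread <- sapply(
  c(`50` = 50, `250` = 250),
  function(n) sd(replicate(200, mc_estimate(n)))
)
spread
```

The spread of the estimates shrinks roughly with the square root of the number of draws, so n.draws = 250 is noticeably more stable than n.draws = 50, at the cost of more computation.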

tmp_folder

The name of a folder where the algorithm will store the results from similarity calculations. Use any folder you like. Links between verbal answers and coding indices are only stored if the distance is <= threshold[1].

Value

a list with components

prediction.datasets$modelProb

Contains all entries from the coding index. dist = "official" if the entry stems from coding_index_w_codes and dist = "selfcreated" if the entry stems from coding_index_without_codes. string.prob is used for weighting purposes (model averaging) if a new verbal answer is similar to multiple strings. unobserved.mean.theta gives a probability (usually very low) for any category that was not observed in the training data together with this string.

prediction.datasets$categoryProb

mean.theta is the probability for code given that an incoming verbal answer is similar to string. Only available if this code was observed at least once with this string (use unobserved.mean.theta otherwise).

num.allowed.codes

Number of categories in the classification.

preprocessing

The input parameter stored to replicate preprocessing with incoming data.

dist.type

The input parameter stored to replicate distance calculations with incoming data.

dist.control

The input parameter stored to replicate distance calculations with incoming data.

threshold

The input parameter stored to replicate distance calculations with incoming data.

simulation.control

The input parameter.
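One plausible reading of how prediction.datasets$modelProb and prediction.datasets$categoryProb fit together is sketched below with made-up numbers. The column names string and code, the normalization of string.prob into weights, and all values are assumptions; the actual computation is done by predictSimilarityBasedReasoning and may differ.

```r
# Hypothetical rows for two index strings that a new answer is similar to.
model_prob <- data.frame(
  string = c("koch", "kuechenhilfe"),
  string.prob = c(0.6, 0.2),
  unobserved.mean.theta = c(0.001, 0.002)
)
category_prob <- data.frame(
  string = c("koch", "koch", "kuechenhilfe"),
  code = c("A", "B", "B"),
  mean.theta = c(0.90, 0.05, 0.80)
)
code_of_interest <- "A"

# Per string: use mean.theta if the code was observed with that string,
# otherwise fall back to unobserved.mean.theta.
observed <- category_prob[category_prob$code == code_of_interest,
                          c("string", "mean.theta")]
merged <- merge(model_prob, observed, by = "string", all.x = TRUE)
theta <- ifelse(is.na(merged$mean.theta),
                merged$unobserved.mean.theta,
                merged$mean.theta)

# Assumption: weights are string.prob rescaled to sum to one.
weights <- merged$string.prob / sum(merged$string.prob)
p_code <- sum(weights * theta)
p_code
```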

See Also

See predictSimilarityBasedReasoning for more examples and recommended settings. See trainSimilarityBasedReasoning2 for the same functionality, but using aggregated (anonymized!) training data. German training data are available.

createSimilarityTableWordwiseStringdist, createSimilarityTableSubstring, createSimilarityTableStringdist for implementations of the different dist.type options. frequent_phrases is a character vector with frequent German answers.

Schierholz, Malte (2019): New methods for job and occupation classification. Dissertation, Mannheim. https://madoc.bib.uni-mannheim.de/50617/, pp. 206-208 and p. 268, pp. 308-320

Examples

# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c("Office clerks and secretaries (without specialisation)-skilled tasks", "Office clerks and secretaries (without specialisation)-complex tasks", "Gastronomy occupations (without specialisation)-skilled tasks",
 "Occupations in child care and child-rearing-skilled tasks", "Occupations in social work and social pedagogics-highly complex tasks", "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks", "Pedagogic specialists in social care work and special needs education-skilled tasks", "Supervisors in education and social work, and of pedagogic specialists in social care work", "Managers in education and social work, and of pedagogic specialists in social care work",
 "Not precise enough for coding", "Student assistants")
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)

# train model
simBasedModel <- trainSimilarityBasedReasoning(
  data = proc.occupations,
  coding_index_w_codes = coding_index_excerpt,
  coding_index_without_codes = frequent_phrases,
  preprocessing = list(stopwords = tm::stopwords("de"), stemming = NULL,
                       strPreprocessing = TRUE, removePunct = FALSE),
  dist.type = "wordwise",
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = c(max = 3, use = 1),
  simulation.control = list(n.draws = 50, check.normality = FALSE),
  tmp_folder = "similarityTables"
)

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.