createSimilarityTableWordwiseStringdist: Wordwise Similarity Table with Coding index
In malsch/occupationCoding: Supervised Learning for Occupation Coding


createSimilarityTableWordwiseStringdist

R Documentation

Wordwise Similarity Table with Coding index

Description

Calculates word-by-word string distances between each element of `unique.string` and the titles in `coding_index_w_codes` and `coding_index_without_codes`.

Usage

createSimilarityTableWordwiseStringdist(
  unique.string,
  coding_index_w_codes,
  coding_index_without_codes,
  preprocessing,
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = 1
)

Arguments

`unique.string`	a character vector (usually unique(answer))
`coding_index_w_codes`	a data.table with columns "title" and "Code".
`coding_index_without_codes`	a character vector of additional titles
`preprocessing`	a list with four elements: `stopwords`, a character vector of stopwords (use `tm::stopwords("de")` for German stopwords); `stemming`, `NULL` for no stemming or `"de"` for stemming with the German Porter stemmer; `strPreprocessing`, `TRUE` if `stringPreprocessing` shall be used; `removePunct`, `TRUE` if `removePunctuation` shall be used.
`dist.control`	a list that will be passed to `stringdistmatrix`. Currently only two elements are implemented: `method`, the method for distance calculation, and `weight`, the edit penalties used when `method` is `"osa"` or `"dl"`.
`threshold`	All entries with a distance above this threshold are removed from the result.
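
To get a feel for the scale on which `threshold` operates, the following sketch calls `stringdist::stringdist` directly with the default `dist.control` settings from the Usage section. It only illustrates the distance measure. The helper `osa()` and the example strings are assumptions made for illustration and are not part of occupationCoding.

```r
library(stringdist)

# Default settings as shown in the Usage section
dist.control <- list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1))

# Hypothetical convenience wrapper (illustration only)
osa <- function(a, b) {
  stringdist(a, b, method = dist.control$method, weight = dist.control$weight)
}

d_insert    <- osa("FSJ", "FSJ2")                  # one inserted character
d_transpose <- osa("ABGEORDNETER", "ABGEORDENTER") # two adjacent letters swapped
d_far       <- osa("FSJ", "INDUSTRIEMECHANIKER")   # unrelated strings

threshold <- 1
# Which of the three pairs would survive the threshold filter?
c(d_insert, d_transpose, d_far) <= threshold
```

With all weights set to 1, a single insertion, deletion, substitution, or adjacent transposition costs exactly 1, so `threshold = 1` tolerates one such edit per compared pair.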

Details

Special function for similarity-based reasoning: it creates word-wise distance data. With the osa method and weights c(d = 1, i = 1, s = 1, t = 1), the default threshold of 1 allows one letter to be corrected in a single word, and this word is then matched against the dictionary. Specifically, each element of unique.string is split into words, and the word with the lowest osa distance is used for matching (all such words in case of a tie).

Example: "KUESTER and HAUSMEISTER" has distance 0 to both dictString.title HAUSMEISTER and dictString.title KUESTER. Because the word HAUSMEISTER already attains the minimal distance, another dictString.title HAUMEISTER, which has dist = 1, is not included.
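
The word-wise matching rule can be sketched with `stringdist::stringdistmatrix` as follows. This is a minimal illustration of the behaviour described in this section, not the package's actual implementation: the helper `wordwise_matches()` and the three-title dictionary are hypothetical, and the sketch skips stopword removal, stemming, and the split into tables with and without codes.

```r
library(stringdist)

# Hypothetical helper for illustration only; not the occupationCoding code.
wordwise_matches <- function(answer, titles, threshold = 1) {
  # Split the answer into single words
  words <- strsplit(answer, " ", fixed = TRUE)[[1]]
  # Distance matrix: one row per word, one column per dictionary title
  d <- stringdistmatrix(words, titles, method = "osa",
                        weight = c(d = 1, i = 1, s = 1, t = 1))
  # Keep only the pairs that attain the overall minimum (all of them in a tie),
  # provided that minimum does not exceed the threshold
  hits <- which(d == min(d) & d <= threshold, arr.ind = TRUE)
  data.frame(intString = rep(answer, nrow(hits)),
             dictString.title = titles[hits[, "col"]],
             dist = d[hits],
             stringsAsFactors = FALSE)
}

titles <- c("HAUSMEISTER", "KUESTER", "HAUMEISTER")
res <- wordwise_matches("KUESTER and HAUSMEISTER", titles)
print(res)
```

In this sketch HAUSMEISTER and KUESTER are both returned with distance 0, while HAUMEISTER is dropped even though its distance of 1 would pass the threshold, because a smaller distance exists for the same answer.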

Value

a list with elements

dist_table_w_code: a data.table with columns intString, dictString.title, dictString.Code, dist
dist_table_without_code: NULL or a data.table with columns intString, dictString, dist
vect_vocab: see asDocumentTermMatrix

Examples

## Prepare coding index
library(occupationCoding)
library(data.table)
# stack the female titles beneath the male titles
coding_index <- rbind(coding_index_excerpt[, list(title = bezMale, Code)],
                      coding_index_excerpt[, list(title = bezFemale, Code)])
# standardize titles from the coding index
coding_index <- coding_index[, title := stringPreprocessing(title)]
# Drop duplicate titles. This may be suboptimal: each title is kept only once,
# together with a single code, so other possibly relevant codes of a
# duplicated title are lost.
coding_index <- coding_index[!duplicated(title)]

(x <- c("Abgeordneter", "Abgeordneter", "Abgeordnete", "abgeordnet",
        "abgeordnet zu xxx", "FSJ", "FSJ2", "Industriemechaniker",
        "Dipl.-Ing. - Agrarwirtschaft (Landwirtschaft)"))
createSimilarityTableWordwiseStringdist(
  unique.string = stringPreprocessing(x),
  coding_index_w_codes = coding_index,
  coding_index_without_codes = frequent_phrases,
  preprocessing = list(stopwords = tm::stopwords("de"), stemming = NULL,
                       strPreprocessing = TRUE, removePunct = FALSE),
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = 1
)

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.