createSimilarityTableWordwiseStringdist: Wordwise Similarity Table with Coding index


createSimilarityTableWordwiseStringdist    R Documentation

Wordwise Similarity Table with Coding index

Description

Calculate word-wise string similarity between unique.string and the titles in coding_index_w_codes and coding_index_without_codes.

Usage

createSimilarityTableWordwiseStringdist(
  unique.string,
  coding_index_w_codes,
  coding_index_without_codes,
  preprocessing,
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = 1
)

Arguments

unique.string

a character vector (usually unique(answer))

coding_index_w_codes

a data.table with columns "title" and "Code".

coding_index_without_codes

a character vector of additional titles

preprocessing

a list with elements

stopwords

a character vector, use tm::stopwords("de") for German stopwords.

stemming

NULL for no stemming, or "de" for stemming with the German Porter stemmer.

strPreprocessing

TRUE if stringPreprocessing shall be used.

removePunct

TRUE if removePunctuation shall be used.

dist.control

a list that will be passed to stringdistmatrix. Currently only two elements are implemented:

method

Method for distance calculation.

weight

For method = 'osa' or 'dl': the penalties for deletion, insertion, substitution and transposition, in that order.

threshold

All entries with a distance above this threshold are removed from the result.
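To illustrate how dist.control and threshold interact, here is a minimal sketch that calls the stringdist package directly. The input strings are hypothetical and chosen only for illustration; the snippet does not reproduce this function's internals.

```r
library(stringdist)

# Hypothetical inputs, chosen only for illustration
answer <- "HAUMSEISTER"   # "S" and "M" are swapped
titles <- c("HAUSMEISTER", "HAUMEISTER", "KUESTER")

# OSA counts a swap of two adjacent characters as a single edit
d_osa <- stringdist(answer, titles, method = "osa",
                    weight = c(d = 1, i = 1, s = 1, t = 1))

# Plain Levenshtein has no transposition, so the same swap costs two edits
d_lv <- stringdist(answer, titles, method = "lv")

# Keep only titles whose distance does not exceed the threshold
threshold <- 1
titles[d_osa <= threshold]
```

With method = "osa" and a threshold of 1, a single swapped, missing, extra or wrong letter is still tolerated, which is presumably why these are the defaults.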

Details

Special function for similarity-based reasoning: it creates word-wise distance data using the osa method with weights c(d = 1, i = 1, s = 1, t = 1). This allows one letter in a single word to be corrected before that word is matched with the dictionary.

In detail, unique.string is split into words, and the word with the lowest osa distance is used for matching (all of them in case of a tie). Example: "KUESTER and HAUSMEISTER" has distance 0 to both dictString.title entries HAUSMEISTER and KUESTER. Because the word HAUSMEISTER already attains the minimal distance, another dictString.title, HAUMEISTER, which has dist = 1, is not included.
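The word-wise matching described above can be sketched roughly as follows. The helper wordwise_min_dist is hypothetical and is not part of the package. It assumes that words are separated by spaces and that only the dictionary titles attaining the overall minimal distance are kept, which is one reading of the description rather than the actual implementation.

```r
library(stringdist)

# Hypothetical helper, for illustration only (not the package's code)
wordwise_min_dist <- function(answer, titles, threshold = 1) {
  # Split the answer into single words (assumption: words separated by spaces)
  words <- strsplit(answer, " +")[[1]]

  # Distance matrix: one row per word, one column per dictionary title
  d <- stringdistmatrix(words, titles, method = "osa",
                        weight = c(d = 1, i = 1, s = 1, t = 1))

  # Nothing is returned if even the best match exceeds the threshold
  min_d <- min(d)
  if (min_d > threshold) {
    return(data.frame(intString = character(0),
                      dictString.title = character(0),
                      dist = numeric(0)))
  }

  # Keep every title that attains the minimal distance (ties are all kept)
  hits <- which(d == min_d, arr.ind = TRUE)
  data.frame(intString = answer,
             dictString.title = titles[hits[, "col"]],
             dist = min_d,
             stringsAsFactors = FALSE)
}

res <- wordwise_min_dist("KUESTER and HAUSMEISTER",
                         c("HAUSMEISTER", "KUESTER", "HAUMEISTER"))
res
```

In this sketch, HAUSMEISTER and KUESTER are returned with distance 0, while HAUMEISTER (distance 1) is dropped because a better match exists.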

Value

a list with elements

dist_table_w_code

a data.table with columns intString, dictString.title, dictString.Code, dist

dist_table_without_code

NULL or a data.table with columns intString, dictString, dist

vect_vocab

see asDocumentTermMatrix

See Also

trainSimilarityBasedReasoning, createSimilarityTableStringdist, createSimilarityTableSubstring

Examples

## Prepare coding index
# write female titles beneath the male title
coding_index <- rbind(coding_index_excerpt[, list(title = bezMale, Code)],
                      coding_index_excerpt[, list(title = bezFemale, Code)])
# standardize titles from the coding index
coding_index <- coding_index[, title := stringPreprocessing(title)]
# drop duplicate lines; this might be suboptimal because each title is kept
# only once with a single code, so other, possibly relevant codes for the
# same title are deleted
coding_index <- coding_index[!duplicated(title)]

(x <- c("Abgeordneter", "Abgeordneter", "Abgeordnete", "abgeordnet",
        "abgeordnet zu xxx", "FSJ", "FSJ2", "Industriemechaniker",
        "Dipl.-Ing. - Agrarwirtschaft (Landwirtschaft)"))
createSimilarityTableWordwiseStringdist(
  unique.string = stringPreprocessing(x),
  coding_index_w_codes = coding_index,
  coding_index_without_codes = frequent_phrases,
  preprocessing = list(stopwords = tm::stopwords("de"), stemming = NULL,
                       strPreprocessing = TRUE, removePunct = FALSE),
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = 1
)

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.