createSimilarityTableStringdist: Similarity Table with Coding index


createSimilarityTableStringdist    R Documentation

Similarity Table with Coding index

Description

Calculates the string distance between each entry of unique.string and the titles in coding_index_w_codes and coding_index_without_codes.

Usage

createSimilarityTableStringdist(
  unique.string,
  coding_index_w_codes,
  coding_index_without_codes,
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = 3
)

Arguments

unique.string

a character vector (usually unique(answer))

coding_index_w_codes

a data.table with columns "title" and "Code".

coding_index_without_codes

a character vector of additional titles

dist.control

a list that will be passed to stringdistmatrix. Currently only two elements are implemented:

method

Method for distance calculation.

weight

Penalties for deletion (d), insertion (i), substitution (s) and transposition (t). Only used for method='osa' or 'dl'.

threshold

All entries with a distance above this threshold are removed from the result.
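The interplay of dist.control and threshold can be illustrated with a minimal sketch. It assumes that the control list is forwarded unchanged to stringdist::stringdistmatrix and that pairs with dist <= threshold are kept; the input strings and the names unique_string, titles and dist_table are hypothetical and not taken from the package.

```r
library(stringdist)

# Hypothetical inputs (not from the package data)
unique_string <- c("abgeordneter", "fsj")
titles <- c("abgeordnete", "abgeordneter", "industriemechaniker")

dist.control <- list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1))
threshold <- 3

# Forward the control list to stringdistmatrix():
# rows correspond to unique_string, columns to titles
dist_mat <- do.call(stringdistmatrix,
                    c(list(a = unique_string, b = titles), dist.control))

# Reshape to long format (the matrix is unrolled column by column)
dist_table <- data.frame(
  intString  = rep(unique_string, times = length(titles)),
  dictString = rep(titles, each = length(unique_string)),
  dist       = as.vector(dist_mat),
  stringsAsFactors = FALSE
)

# Assumption: "above this threshold" means dist > threshold is dropped
dist_table <- dist_table[dist_table$dist <= threshold, ]
print(dist_table)
```

With these inputs only the two pairs involving "abgeordneter" and the similar titles remain; "fsj" is too far from every title and drops out.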

Details

Special function for similarity-based reasoning: creates distance data, by default with the osa method and weights c(d = 1, i = 1, s = 1, t = 1). dist == 0 means that the strings in the dictionary and in the data are identical.
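As a small illustration of what the osa distance with unit weights measures, the following sketch calls stringdist::stringdist directly. The strings are hypothetical; the comparison with method = "lv" is only included to show the effect of the transposition weight t.

```r
library(stringdist)

# Identical strings have distance 0
d_identical <- stringdist("abgeordneter", "abgeordneter", method = "osa")

# One deleted character costs d = 1
d_deletion <- stringdist("abgeordneter", "abgeordnete", method = "osa")

# Swapping two adjacent characters is a single transposition under osa (t = 1),
# whereas plain Levenshtein ("lv") needs two substitutions
d_swap_osa <- stringdist("fsj", "fjs", method = "osa")
d_swap_lv  <- stringdist("fsj", "fjs", method = "lv")

print(c(identical = d_identical, deletion = d_deletion,
        swap_osa = d_swap_osa, swap_lv = d_swap_lv))
```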

Value

a list with elements

dist_table_w_code

a data.table with columns intString, dictString.title, dictString.Code, dist

dist_table_without_code

NULL or a data.table with columns intString, dictString, dist

vect_vocab

see asDocumentTermMatrix

See Also

trainSimilarityBasedReasoning, createSimilarityTableWordwiseStringdist, createSimilarityTableSubstring

Examples

## Prepare coding index
# write female titles beneath the male title
coding_index <- rbind(coding_index_excerpt[, list(title = bezMale, Code)],
                      coding_index_excerpt[, list(title = bezFemale, Code)])
# standardize titles from the coding index
coding_index <- coding_index[,title := stringPreprocessing(title)]
# drop duplicate titles; this may be suboptimal because each title is kept
# only once with a single associated code, so other, possibly relevant codes
# for the same title are deleted
coding_index <- coding_index[!duplicated(title)]

(x <- c("Abgeordneter", "Abgeordneter", "Abgeordnete", "abgeordnet", "FSJ", "FSJ2", "Industriemechaniker", "Dipl.-Ing. - Agrarwirtschaft (Landwirtschaft)"))
createSimilarityTableStringdist(unique.string = stringPreprocessing(x),
                                coding_index_w_codes = coding_index,
                                coding_index_without_codes = frequent_phrases,
                                dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
                                threshold = 3)

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.