View source: R/createSimilarityTableStringdist.R
createSimilarityTableStringdist | R Documentation |
Calculate string similarity between unique.string and (coding_index_w_codes, coding_index_without_codes).
createSimilarityTableStringdist(
  unique.string,
  coding_index_w_codes,
  coding_index_without_codes,
  dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = 3
)
unique.string |
a character vector (usually unique(answer)) |
coding_index_w_codes |
a data.table with columns "title" and "Code". |
coding_index_without_codes |
a character vector of additional titles |
dist.control |
a list that will be passed to
|
threshold |
All entries with distance above this threshold will be removed from the result |
Special function for similarity-based reasoning: creates distance data using the "osa" method with weights c(d = 1, i = 1, s = 1, t = 1). dist == 0 means that the string in the dictionary and the string in the data are identical.
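To make the distance and the threshold concrete, here is a minimal base-R sketch of an optimal string alignment (OSA) distance in which deletion, insertion, substitution and adjacent transposition each cost 1, followed by the threshold filter. It is an illustration only and not the package's implementation: osa_distance, answers, titles and sim_table are hypothetical names, and the column names intString and dictString are borrowed from the return value documented for this function.

```r
# Hypothetical helper: OSA distance with unit weights (d = i = s = t = 1).
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0 else 1
      best <- min(d[i, j + 1] + 1,  # deletion
                  d[i + 1, j] + 1,  # insertion
                  d[i, j] + cost)   # substitution or match
      # adjacent transposition, e.g. "ab" -> "ba"
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, d[i - 1, j - 1] + 1)
      }
      d[i + 1, j + 1] <- best
    }
  }
  d[n + 1, m + 1]
}

# Illustrative inputs (already lower-cased), not taken from the package data.
answers <- c("abgeordneter", "fsj2")
titles  <- c("abgeordnete", "abgeordneter", "fsj", "industriemechaniker")

# Compare every answer with every title.
pairs <- expand.grid(intString = answers, dictString = titles,
                     stringsAsFactors = FALSE)
pairs$dist <- mapply(osa_distance, pairs$intString, pairs$dictString,
                     USE.NAMES = FALSE)

# Entries with a distance above the threshold are removed.
threshold <- 3
sim_table <- pairs[pairs$dist <= threshold, ]
sim_table
```

With these inputs only the close pairs survive the filter, and the row with dist == 0 is the exact match between answer and dictionary title.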
a list with elements
a data.table with columns intString, dictString.title, dictString.Code, dist
NULL or a data.table with columns intString, dictString, dist
see asDocumentTermMatrix
trainSimilarityBasedReasoning, createSimilarityTableWordwiseStringdist, createSimilarityTableSubstring
## Prepare coding index
# write female titles beneath the male title
coding_index <- rbind(coding_index_excerpt[, list(title = bezMale, Code)],
                      coding_index_excerpt[, list(title = bezFemale, Code)])
# standardize titles from the coding index
coding_index <- coding_index[, title := stringPreprocessing(title)]
# drop duplicate lines; this might be suboptimal because each title is kept
# only once with a single associated code, so duplicates and their other,
# possibly relevant codes are deleted
coding_index <- coding_index[!duplicated(title)]

(x <- c("Abgeordneter", "Abgeordneter", "Abgeordnete", "abgeordnet",
        "FSJ", "FSJ2", "Industriemechaniker",
        "Dipl.-Ing. - Agrarwirtschaft (Landwirtschaft)"))

createSimilarityTableStringdist(unique.string = stringPreprocessing(x),
                                coding_index_w_codes = coding_index,
                                coding_index_without_codes = frequent_phrases,
                                dist.control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
                                threshold = 3)