correct_categories: Match strings with a pre-defined set of strings


View source: R/correct_categories.R

Description

Correct strings to a pre-defined set of strings. This is a wrapper for stringdist approximate string matching with certain parameters preset, so that it can be used easily in a tidyverse pipe. By default it uses cosine matching, which disregards word order.
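To illustrate why cosine matching disregards word order, the following minimal sketch calls stringdist::stringdist() directly rather than correct_categories(). The variable names (a, b, d_cosine, d_osa) are illustrative only, and the sketch assumes stringdist's default of single-character profiles (q = 1).

```r
library(stringdist)

# Two strings containing the same words in a different order
a <- "publication type"
b <- "type publication"

# Cosine distance compares character-count profiles (q = 1 by default),
# so reordering the words leaves the profiles, and the distance, unchanged
d_cosine <- stringdist(a, b, method = "cosine")

# An edit-based distance such as "osa" is sensitive to the order of characters
d_osa <- stringdist(a, b, method = "osa")

print(c(cosine = d_cosine, osa = d_osa))
```

Under these assumptions the cosine distance is numerically zero while the edit-based distance is positive, which is why cosine is a reasonable default for category labels whose words may be swapped.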

Usage

correct_categories(to_be_corrected = NULL, correct_terms = NULL,
                   max_dist = 2,
                   method = c("cosine", "osa", "lv", "dl", "hamming",
                              "lcs", "qgram", "jaccard", "jw", "soundex"),
                   ...)

Arguments

to_be_corrected

vector containing the strings to be corrected

correct_terms

string vector containing the correct terms

max_dist

maximum distance allowed for a match, passed down to stringdist::amatch(); defaults to 2

method

string distance method, passed down to stringdist::amatch(); "cosine" is listed first in Usage and is the default

...

further parameters to be passed down to stringdist::amatch()

Value

A corrected string vector that can only contain the correct terms
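The sketch below shows one way such a wrapper could produce a vector restricted to the correct terms. It is a hypothetical re-implementation for illustration (correct_categories_sketch is not a function in the package and may differ from the actual source). It relies only on stringdist::amatch(), which returns the index of the closest entry in its table within maxDist, or NA when nothing qualifies.

```r
library(stringdist)

# Hypothetical stand-in for correct_categories(); not the package's source
correct_categories_sketch <- function(to_be_corrected, correct_terms,
                                      max_dist = 2, method = "cosine", ...) {
  # Index of the closest correct term for each input (NA if none is found)
  idx <- stringdist::amatch(to_be_corrected, correct_terms,
                            maxDist = max_dist, method = method, ...)
  # Indexing by position means the result can only hold correct terms or NA
  correct_terms[idx]
}

reasons <- c("sample characteristics", "publication type",
             "manipulation", "other")

corrected <- correct_categories_sketch(
  c("publicaton type", "manuplation", NA),
  reasons
)
print(corrected)
```

Because the result is built by indexing into correct_terms, every non-missing element is guaranteed to be one of the supplied terms. In this sketch, missing inputs stay missing.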

Examples

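library(dplyr)
# Assumption: metamanager provides correct_categories() and the
# workaholism_pubmed example dataset used below
library(metamanager)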
reasons <- c("sample characteristics",
             "publication type",
             "manipulation",
             "other")

# Create category names with typos
reasons_with_typo <- c("simple characteristisc",
                       "publication t",
                       "manuplation",
                       "o",
                       "publicaton type")

# Create a dataset with random correct and incorrect categories in the "reason" column
df_with_typos <-
  workaholism_pubmed %>%
  mutate(decision = sample(c(0, 1), size = nrow(.), replace = TRUE),
         reason = if_else(decision == 0,
                          NA_character_,
                          # Mix correct and incorrect categories
                          sample(c(reasons, reasons_with_typo),
                                 size = nrow(.),
                                 replace = TRUE)))

# The typos are corrected in a new column
mutate(df_with_typos, corrected_reason = correct_categories(reason, reasons))

nthun/metamanager documentation built on Aug. 9, 2019, 1:37 p.m.