lma_dict: English Function Word Category and Special Character Lists
In miserman/lingmatch: Linguistic Matching and Accommodation

lma_dict

R Documentation

Description

Returns a list of function words organized under the category names of the Linguistic Inquiry and Word Count (LIWC) 2015 dictionary (only the category names are borrowed; the words were selected independently), or a list of special characters and patterns.

Usage

lma_dict(..., as.regex = TRUE, as.function = FALSE)

Arguments

`...`	Numbers or letters corresponding to category names: ppron, ipron, article, adverb, conj, prep, auxverb, negate, quant, interrog, number, interjection, or special.
`as.regex`	Logical: if `FALSE`, lists are returned without regular expressions.
`as.function`	Logical or a function: if specified and `as.regex` is `TRUE`, the selected dictionary will be collapsed to a regex string (terms separated by `|`), and a function for matching characters to that string will be returned. The regex string is passed to the matching function (`grepl` by default) as a 'pattern' argument, with the first argument of the returned function being passed as an 'x' argument. See examples.
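
The collapsing behavior described for `as.function` can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the `terms` vector and the `make_matcher` helper are hypothetical, and the terms are stand-ins rather than entries from the actual dictionary.

```r
# Hypothetical stand-in for one dictionary category (not the package's terms).
terms <- c("^i$", "^you$", "^we$")

# Collapse the terms into a single alternation pattern.
pattern <- paste(terms, collapse = "|")

# Hypothetical helper: returns a function whose first argument is passed on
# as `x`, with the collapsed string passed as `pattern`; any further
# arguments go to the matching function (grepl by default).
make_matcher <- function(pattern, fun = grepl) {
  function(terms, ...) fun(pattern = pattern, x = terms, ...)
}

is_pronoun <- make_matcher(pattern)
result <- is_pronoun(c("i", "am", "you", "were"))
print(result)
```

Extra arguments given to the returned function, such as `ignore.case = TRUE`, are forwarded to `grepl` in this sketch.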

Value

A list with a vector of terms for each category, or (when as.function is TRUE or a function) a function that accepts an initial "terms" argument (a character vector), plus any additional arguments accepted by the function passed as as.function (grepl by default).

Note

The special category is not returned unless specifically requested. It is a list of regular expression strings attempting to capture special things like ellipses and emojis, or sets of special characters (those outside of the Basic Latin range; [^\u0020-\u007F]), which can be used for character conversions. If special is part of the returned list, as.regex is set to TRUE.
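
As a rough illustration of the character class mentioned above, the following base R sketch flags strings that contain characters outside the U+0020 to U+007F range and converts one of them. The example strings and the single replacement are assumptions chosen for illustration; they are not taken from the special list itself.

```r
# Example strings (assumed for illustration only).
x <- c("plain text", "na\u00EFve", "en\u2013dash")

# Flag strings containing any character outside U+0020 to U+007F.
# Note that this class also matches control characters such as tabs.
has_special <- grepl("[^\u0020-\u007F]", x)
print(has_special)

# Convert one such character (an en dash) to an ASCII hyphen.
cleaned <- gsub("\u2013", "-", x, fixed = TRUE)
print(cleaned[3])
```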

The special list is always used by both lma_dtm and lma_termcat. When creating a dtm, special is used to clean the original input, so that, by default, the punctuation in ellipses and emojis is treated as ellipses and emojis rather than as separate periods, parentheses, and colons. When categorizing a dtm, the input dictionary is passed through the special lists to ensure that the terms in the dtm match those in the dictionary (for example, ": (" would be replaced with "repfrown" in both the text and the dictionary).
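
The reason for applying the same replacement to both the text and the dictionary can be sketched in base R. This is a simplified illustration using only the ": (" to "repfrown" pair mentioned above; the `apply_special` helper, the example text, and the literal (non-regex) matching are assumptions, and the package's own patterns may differ.

```r
# Single replacement pair standing in for the special lists.
frown <- ": ("
placeholder <- "repfrown"

# Hypothetical helper: replace the literal pattern with its placeholder.
apply_special <- function(x) gsub(frown, placeholder, x, fixed = TRUE)

text <- "that did not go well : ("
dict_terms <- c("sad", ": (")

# Apply the same cleaning to the text and to the dictionary terms.
clean_text <- apply_special(text)
clean_terms <- apply_special(dict_terms)

# The placeholder is now an ordinary token that can be matched directly.
tokens <- strsplit(clean_text, " ", fixed = TRUE)[[1]]
matched <- tokens %in% clean_terms
print(tokens[matched])
```

If only the text were cleaned, the dictionary entry ": (" would no longer correspond to any token, which is why both sides go through the same step.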

Examples

# return the full dictionary (excluding special)
lma_dict()

# return the standard 7 LSM categories
lma_dict(1:7)

# return just a few categories without regular expression
lma_dict(neg, ppron, aux, as.regex = FALSE)

# return special specifically
lma_dict(special)

# returning a function
is.ppron <- lma_dict(ppron, as.function = TRUE)
is.ppron(c("i", "am", "you", "were"))

in.lsmcat <- lma_dict(1:7, as.function = TRUE)
in.lsmcat(c("a", "frog", "for", "me"))

## use as a stopword filter
is.stopword <- lma_dict(as.function = TRUE)
dtm <- lma_dtm("Most of these words might not be all that relevant.")
dtm[, !is.stopword(colnames(dtm))]

## use to replace special characters
clean <- lma_dict(special, as.function = gsub)
clean(c(
  "\u201Ccurly quotes\u201D", "na\u00EFve", "typographer\u2019s apostrophe",
  "en\u2013dash", "em\u2014dash"
))

miserman/lingmatch documentation built on Feb. 21, 2025, 3 p.m.