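Releases will be numbered with the following semantic versioning format:
major.minor.patch
And constructed with the following guidelines:
Breaking backward compatibility bumps the major (and resets the minor and patch).
New additions without breaking backward compatibility bump the minor (and reset the patch).
Bug fixes and miscellaneous changes bump the patch.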
match_tokens added to find all the tokens that match one or more regexes
within a given text vector. This is useful when combined with the
Fixed versions of
keep_element added to allow for dropping
elements specified by a known vector rather than a regex.
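Selecting elements by a known vector rather than a regex amounts to exact matching. The following is a minimal base R sketch of that idea, not the package's implementation; the helper names keep_known and drop_known are hypothetical.

```r
# Hypothetical helpers (not the textclean API): exact-match filtering in base R.
keep_known <- function(x, known) x[x %in% known]
drop_known <- function(x, known) x[!x %in% known]

x <- c("apple", "a.b", "axb", "pear")

keep_known(x, c("a.b", "pear"))  # keeps only the exact strings "a.b" and "pear"
drop_known(x, c("a.b", "pear"))  # keeps everything else

# For contrast, a regex treats "." as a wildcard, so "axb" matches as well.
grep("a.b", x, value = TRUE)
```

Exact matching avoids having to escape metacharacters when the elements to keep or drop are already known.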
glue functions from the glue package are reexported
for easy string manipulation.
replace_names drops the replacement of
c('An', 'To', 'Oh', 'So', 'Do', 'He', 'Ha', 'In', 'Pa', 'Un') which are
likely words and not names.
replace_html picks up some additional symbol replacements including:
c("™", "“", "”", "‘", "’", "•", "·",
"⋅", "–", "—", "≠", "½", "¼", "¾",
"°", "←", "→", "…").
replace_kern added to replace a form of informal emphasis in which the
writer takes words >2 letters long, capitalizes the entire word, and places
spaces in between each letter. This was contributed by Stack Overflow's
replace_internet_slang added to replace Internet acronyms and abbreviations
with machine friendly word equivalents.
replace_word_elongation added to replace word elongations (a.k.a. "word
lengthening") with the most likely normalized word form. See
http://www.aclweb.org/anthology/D11-105 for details.
fgsub added for the ability to match, extract, apply a function to the
extracted strings, and replace the original matches with the function's output.
This performs similar functionality to
gsubfn::gsubfn but is less powerful.
For more powerful needs see the gsubfn package.
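The match, extract, transform, and replace cycle can be sketched in base R with gregexpr and regmatches. This is only an illustration of the idea, not how fgsub is implemented; the helper name apply_to_matches is hypothetical.

```r
# Hypothetical helper, not textclean's fgsub: match, transform, and replace.
apply_to_matches <- function(x, pattern, fun) {
  m <- gregexpr(pattern, x, perl = TRUE)             # locate every match
  regmatches(x, m) <- lapply(regmatches(x, m), fun)  # transform and write back
  x
}

x <- c("I ate 2 eggs and 10 nuts", "none here")

# Double every number found in the text; strings without a match are unchanged.
doubled <- apply_to_matches(
  x, "[0-9]+",
  function(s) as.character(as.numeric(s) * 2)
)
doubled
```

The function passed in must return a character vector of the same length as the matches it receives.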
replace_grade did not use fixed = TRUE for its call to mgsub. This could result in the plus signs being interpreted as meta-characters. This has been corrected.
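The effect of a plus sign being read as a meta-character can be reproduced with base R's gsub. This sketch uses made-up grade strings and is not the package's code.

```r
x <- c("A+", "A", "AAA")

# As a regex, "A+" means "one or more A", so the literal plus sign is left
# behind and plain runs of "A" are replaced too.
as_regex <- gsub("A+", "excellent", x)

# With fixed = TRUE the pattern is the literal two-character string "A+".
as_fixed <- gsub("A+", "excellent", x, fixed = TRUE)

as_regex
as_fixed
```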
replace_names added to remove/replace common first and last names from text
make_plural added to make a vector of singular noun forms plural.
replace_emoji_identifier added for replacing emojis with
text or an identifier token for use in the sentimentr package.
mgsub_fixed to provide wrappers for
mgsub that make
their use apparent without setting the
replace_curly_quote added to replace curly quotes with straight versions.
replace_non_ascii now uses
stringi::stri_trans_general to coerce more
non-ASCII characters to ASCII format.
check_text now checks for HTML characters/tags. Thanks to @Peter Gensler
for suggesting this (see issue #15).
filter_ functions deprecated in favor of
keep_ versions of filter functions. This change was made to address the opposite meaning that dplyr's
filter has, which retains rows matching a pattern by default.
replace_tokens added to complement
mgsub for times when the user wants to replace fixed tokens with a single value or remove them entirely. This yields an optimized solution that is much faster than
mgsub no longer uses
trim = TRUE by default.
check_text reported to use
add_missing_endmark when endmark is missing.
replace_rating functions have
been moved from the sentimentr package to textclean as these are
cleaning functions. This makes the functions more modular and generalizable
to all types of text cleaning. These functions are still imported and
exported by sentimentr.
replace_html added to remove HTML tags and replace symbols with appropriate
add_missing_endmarks added to detect missing endmarks and replace with the
replace_number now uses the english package making it faster and more maintainable. In addition, the function now handles decimal places as well.
NA as non-ASCII. This has been fixed.
check_text added to report on potential problems in a text vector.
replace_ordinal added to replace ordinal numbers (e.g., 1st) with word
representation (e.g., first).
swap added to swap two patterns simultaneously.
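Swapping two patterns simultaneously is not the same as two sequential replacements, because the second pass would undo the first. One common way to do it is to park one pattern in a placeholder, sketched below in base R. This is an illustration rather than the package's implementation; the helper name swap_fixed and the placeholder token are hypothetical, and the placeholder is assumed not to occur in the text.

```r
# Hypothetical helper, not textclean's swap: exchange two fixed strings.
swap_fixed <- function(x, a, b, placeholder = "<<SWAP>>") {
  x <- gsub(a, placeholder, x, fixed = TRUE)  # park every a
  x <- gsub(b, a, x, fixed = TRUE)            # b becomes a
  gsub(placeholder, b, x, fixed = TRUE)       # parked a becomes b
}

x <- "it is the cat and the dog"
swapped <- swap_fixed(x, "cat", "dog")

# Naive sequential replacement collapses both words into one.
naive <- gsub("dog", "cat", gsub("cat", "dog", x, fixed = TRUE), fixed = TRUE)

swapped
naive
```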
filter_element added to exclude matching elements from a vector.
This package is a collection of tools to clean and process text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster.