desc <- suppressWarnings(readLines("DESCRIPTION")) regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)" loc <- grep(regex, desc) ver <- gsub(regex, "\\2", desc[loc]) verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver) verbadge <- '' pacman::p_load(textstem) pacman::p_load_current_gh('trinker/numform') nr <- numform::f_comma(length(presidential_debates_2012$dialogue)) nw <- numform::f_comma(sum(stringi::stri_count_words(presidential_debates_2012$dialogue), na.rm = TRUE)) ```` [![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/0.1.0/active.svg)](https://www.repostatus.org/#active) [![Build Status](https://travis-ci.org/trinker/textstem.svg?branch=master)](https://travis-ci.org/trinker/textstem) [![Coverage Status](https://coveralls.io/repos/trinker/textstem/badge.svg?branch=master)](https://coveralls.io/r/trinker/textstem?branch=master) [![](https://cranlogs.r-pkg.org/badges/textstem)](https://cran.r-project.org/package=textstem) `r verbadge` ![](tools/textstem_logo/r_textstem.png) **textstem** is a tool-set for stemming and lemmatizing words. Stemming is a process that removes affixes. Lemmatization is the process of grouping inflected forms together as a single base form. # Functions The main functions, task category, & descriptions are summarized in the table below: | Function | Task | Description | |-------------------------------|-------------|--------------------------------------------| | `stem_words` | stemming | Stem words | | `stem_strings` | stemming | Stem strings | | `lemmatize_words` | lemmatizing | Lemmatize words | | `lemmatize_strings` | lemmatizing | Lemmatize strings | | `make_lemma_dictionary_words` | lemmatizing | Generate a dictionary of lemmas for a text | # Installation To download the development version of **textstem**: Download the [zip ball](https://github.com/trinker/textstem/zipball/master) or [tar ball](https://github.com/trinker/textstem/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version: ```r if (!require("pacman")) install.packages("pacman") pacman::p_load_gh("trinker/textstem")
You are welcome to:
- submit suggestions and bug-reports at: https://github.com/trinker/textstem/issues
- send a pull request on: https://github.com/trinker/textstem/
- compose a friendly e-mail to: tyler.rinker@gmail.com
The following examples demonstrate some of the functionality of textstem.
if (!require("pacman")) install.packages("pacman") pacman::p_load(textstem, dplyr) data(presidential_debates_2012)
Before moving into the meat these two examples let's highlight the difference between stemming and lemmatizing.
dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving') stem_words(dw) lemmatize_words(dw)
bw <- c('are', 'am', 'being', 'been', 'be') stem_words(bw) lemmatize_words(bw)
Stemming is the act of removing inflections from a word not necessarily "identical to the morphological root of the word" (wikipedia). Below I show stemming of several small strings.
y <- c( 'the dirtier dog has eaten the pies', 'that shameful pooch is tricky and sneaky', "He opened and then reopened the food bag", 'There are skies of blue and red roses too!', NA, "The doggies, well they aren't joyfully running.", "The daddies are coming over...", "This is 34.546 above" ) stem_strings(y)
Lemmatizing is the "grouping together the inflected forms of a word so they can be analysed as a single item" (wikipedia). In the example below I reduce the strings to their lemma form. lemmatize_strings
uses a lookup dictionary. The default uses Mechura's (2016) English lemmatization list available from the lexicon package. The make_lemma_dictionary
function contains two additional engines for generating a lemma lookup table for use in lemmatize_strings
.
y <- c( 'the dirtier dog has eaten the pies', 'that shameful pooch is tricky and sneaky', "He opened and then reopened the food bag", 'There are skies of blue and red roses too!', NA, "The doggies, well they aren't joyfully running.", "The daddies are coming over...", "This is 34.546 above" ) lemmatize_strings(y)
This lemmatization uses the hunspell package to generate lemmas.
lemma_dictionary_hs <- make_lemma_dictionary(y, engine = 'hunspell') lemmatize_strings(y, dictionary = lemma_dictionary_hs)
This lemmatization uses the koRpus package and the TreeTagger program to generate lemmas. You'll have to get TreeTagger set up, preferably in your machine's root directory.
lemma_dictionary_tt <- make_lemma_dictionary(y, engine = 'treetagger') lemmatize_strings(y, lemma_dictionary_tt)
It's pretty fast too. Observe:
tic <- Sys.time() presidential_debates_2012$dialogue %>% lemmatize_strings() %>% head() (toc <- Sys.time() - tic)
That's r nr
rows of text, or r nw
words, in r round(as.numeric(toc), 2)
seconds.
This example shows how stemming/lemmatizing might be complemented by other text tools such as replace_contraction
from the textclean package.
library(textclean) 'aren\'t' %>% lemmatize_strings() 'aren\'t' %>% textclean::replace_contraction() %>% lemmatize_strings()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.