title: "
stanza: An R Interface to the Stanford NLP Toolkit" date: "2025-05-16" output: github_documentThe stanza package provides an R interface to the Stanford NLP Group's Stanza Python library, a collection of tools for natural language processing in many human languages. With stanza, you can:
First, install the stanza R package from CRAN:
install.packages("stanza")
You can install the Python package using either virtualenv (recommended):
library("stanza")
virtualenv_install_stanza()
Or using conda if you prefer:
library("stanza")
conda_install_stanza()
Make sure that pip is installed along with the Python version you choose.
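If you are unsure which Python will be picked up, reticulate (which stanza builds on) can report its configuration; a quick check:

``` r
# Inspect which Python interpreter and environment reticulate has discovered
reticulate::py_config()
```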
To use a specific Python for the virtualenv, set the environment variable `RETICULATE_PYTHON`. For example, when testing on Windows, I set `RETICULATE_PYTHON` to `"C:/apps/Python/python.exe"` during the installation:

``` r
python_path <- normalizePath("C:/apps/Python/python.exe")
Sys.setenv(RETICULATE_PYTHON = python_path)
library("stanza")
virtualenv_install_stanza()
```
However, after the installation,

``` r
library("stanza")
stanza_initialize(virtualenv = "stanza")
stanza_options()
stanza_download("en")
```

is sufficient, since `"~\\.virtualenvs\\stanza"` is then detected automatically. But if `RETICULATE_PYTHON` is still set to `"C:/apps/Python/python.exe"`, the correct environment is not found and stanza cannot be loaded.
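In that case, one remedy (a minimal sketch using base R's `Sys.unsetenv()`) is to clear the override for the current session:

``` r
# Remove the RETICULATE_PYTHON override so the "stanza" virtualenv is detected
Sys.unsetenv("RETICULATE_PYTHON")
```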
library("stanza")
stanza_initialize(virtualenv = "stanza")
Before processing text, you need to download language models. Stanza supports over 70 languages; the language codes and the performance of the models can be found on the Stanza homepage.
To download the English model:
stanza_download("en")
Similarly, for German:
stanza_download("de")
A natural language processing pipeline can be created by specifying the language and desired processors as a comma-separated string:
``` r
processors <- 'tokenize,ner,lemma,pos,mwt'
p <- stanza_pipeline(language = "en", processors = processors)
```
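Only the processors you list are loaded, so a lighter pipeline can be built the same way; for instance (a sketch, assuming tokenization alone is sufficient for your task):

``` r
# A minimal pipeline that only splits text into sentences and words
p_tok <- stanza_pipeline(language = "en", processors = "tokenize")
```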
The Stanza documentation provides detailed information on all available processors:

- `tokenize`: Split text into sentences and words
- `mwt`: Expand multi-word tokens
- `pos`: Part-of-speech tagging
- `lemma`: Lemmatization
- `ner`: Named entity recognition
- `depparse`: Dependency parsing

To select specific models for each processor, use a named list:
``` r
processors_specific <- list(tokenize = 'gsd', pos = 'hdt', ner = 'conll03', lemma = 'default')
p_specific <- stanza_pipeline(language = "en", processors = processors_specific)
```
The `stanza_pipeline()` function returns a pipeline function that transforms text into annotated document objects:
``` r
doc <- p('R is a collaborative project with many contributors.')
doc
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9

# Using the pipeline with specific processor models
doc_specific <- p_specific('R is a collaborative project with many contributors.')
doc_specific
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
```
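Since the pipeline is an ordinary R function, it can also be applied to several texts at once; a minimal sketch using base R's `lapply()`:

``` r
# Each element of docs is an annotated stanza document
texts <- c("R is a collaborative project.", "Stanza supports many languages.")
docs <- lapply(texts, p)
```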
Stanza provides several helper functions to extract different types of information from the processed documents:
``` r
sents(doc)
#> [[1]]
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors"
#> [9] "."

words(doc)
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors"
#> [9] "."

tokens(doc)
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors"
#> [9] "."

entities(doc)
#> list()
```
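The example sentence contains no named entities, so `entities()` returns an empty list. With a sentence that mentions, say, an organization or a location, the `ner` processor fills it in (a sketch; output omitted):

``` r
# Named entity spans are extracted by the ner processor
doc_ner <- p("The R Foundation is based in Vienna, Austria.")
entities(doc_ner)
```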
``` r
multi_word_token(doc)
#>   tid wid         token          word
#> 1   1   1             R             R
#> 2   2   2            is            is
#> 3   3   3             a             a
#> 4   4   4 collaborative collaborative
#> 5   5   5       project       project
#> 6   6   6          with          with
#> 7   7   7          many          many
#> 8   8   8  contributors  contributors
#> 9   9   9             .             .
```
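In this English example every token maps to exactly one word; for languages with contractions (e.g., German "zum" expanding to "zu" + "dem"), `multi_word_token()` shows where a single token splits into several words. The accessors return plain R structures, so they compose with base R; for instance, counting the tokens per sentence from the `sents()` output above:

``` r
# Tokens per sentence; the example document has one sentence with 9 tokens
vapply(sents(doc), length, integer(1L))
#> [1] 9
```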