title: "
stanza: An R Interface to the Stanford NLP Toolkit" date: "2025-05-16" output: github_documentThe stanza package provides an R interface to the Stanford NLP Group's Stanza Python library, a collection of tools for natural language processing in many human languages. With stanza, you can:
First, install the stanza R package from CRAN:
install.packages("stanza")
You can install the Python package using either virtualenv (recommended):
library("stanza")
virtualenv_install_stanza()
Or using conda if you prefer:
library("stanza")
conda_install_stanza()
Make sure that pip is installed along with the Python version you choose.
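If you are unsure which Python will be picked up, reticulate (which stanza builds on) can report its configuration; a quick check:

``` r
# Inspect which Python interpreter and environment reticulate has discovered
reticulate::py_config()
```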
To use a specific Python for the virtualenv, set the environment variable `RETICULATE_PYTHON`. For example, when testing on Windows, I set `RETICULATE_PYTHON` to `"C:/apps/Python/python.exe"` during the installation:

``` r
python_path <- normalizePath("C:/apps/Python/python.exe")
Sys.setenv(RETICULATE_PYTHON = python_path)
library("stanza")
virtualenv_install_stanza()
```
However, after the installation,

``` r
library("stanza")
stanza_initialize(virtualenv = "stanza")
stanza_options()
stanza_download("en")
```

is sufficient, since `"~\\.virtualenvs\\stanza"` is then detected automatically. But if `RETICULATE_PYTHON` is still set to `"C:/apps/Python/python.exe"`, the correct environment is not found and stanza cannot be loaded.
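In that case, one remedy (a minimal sketch using base R's `Sys.unsetenv()`) is to clear the override for the current session:

``` r
# Remove the RETICULATE_PYTHON override so the "stanza" virtualenv is detected
Sys.unsetenv("RETICULATE_PYTHON")
```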
library("stanza")
stanza_initialize(virtualenv = "stanza")
Before processing text, you need to download language models. Stanza supports over 70 languages; the language codes and the performance of the models can be found on the Stanza homepage.
To download the English model:
stanza_download("en")
Similarly, for German:
stanza_download("de")
A natural language processing pipeline can be created by specifying the language and desired processors as a comma-separated string:
``` r
processors <- 'tokenize,ner,lemma,pos,mwt'
p <- stanza_pipeline(language = "en", processors = processors)
```
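Only the processors you list are loaded, so a lighter pipeline can be built the same way; for instance (a sketch, assuming tokenization alone is sufficient for your task):

``` r
# A minimal pipeline that only splits text into sentences and words
p_tok <- stanza_pipeline(language = "en", processors = "tokenize")
```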
The Stanza documentation provides detailed information on all available processors:

- `tokenize`: Split text into sentences and words
- `mwt`: Expand multi-word tokens
- `pos`: Part-of-speech tagging
- `lemma`: Lemmatization
- `ner`: Named entity recognition
- `depparse`: Dependency parsing

To select specific models for each processor, use a named list:
``` r
processors_specific <- list(tokenize = 'gsd', pos = 'hdt', ner = 'conll03', lemma = 'default')
p_specific <- stanza_pipeline(language = "en", processors = processors_specific)
```
The `stanza_pipeline()` function returns a pipeline function that transforms text into annotated document objects:
``` r
doc <- p('R is a collaborative project with many contributors.')
doc
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9

# Using the pipeline with specific processor models
doc_specific <- p_specific('R is a collaborative project with many contributors.')
doc_specific
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
```
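Since the pipeline is an ordinary R function, it can also be applied to several texts at once; a minimal sketch using base R's `lapply()`:

``` r
# Each element of docs is an annotated stanza document
texts <- c("R is a collaborative project.", "Stanza supports many languages.")
docs <- lapply(texts, p)
```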
Stanza provides several helper functions to extract different types of information from the processed documents:
``` r
sents(doc)
#> [[1]]
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors"
#> [9] "."

words(doc)
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors"
#> [9] "."

tokens(doc)
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors"
#> [9] "."

entities(doc)
#> list()
```
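The example sentence contains no named entities, so `entities()` returns an empty list. With a sentence that mentions, say, an organization or a location, the `ner` processor fills it in (a sketch; output omitted):

``` r
# Named entity spans are extracted by the ner processor
doc_ner <- p("The R Foundation is based in Vienna, Austria.")
entities(doc_ner)
```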
``` r
multi_word_token(doc)
#>   tid wid         token          word
#> 1   1   1             R             R
#> 2   2   2            is            is
#> 3   3   3             a             a
#> 4   4   4 collaborative collaborative
#> 5   5   5       project       project
#> 6   6   6          with          with
#> 7   7   7          many          many
#> 8   8   8  contributors  contributors
#> 9   9   9             .             .
```
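In this English example every token maps to exactly one word; for languages with contractions (e.g., German "zum" expanding to "zu" + "dem"), `multi_word_token()` shows where a single token splits into several words. The accessors return plain R structures, so they compose with base R; for instance, counting the tokens per sentence from the `sents()` output above:

``` r
# Tokens per sentence; the example document has one sentence with 9 tokens
vapply(sents(doc), length, integer(1L))
#> [1] 9
```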