mallet.import: Import text documents into Mallet format

mallet.import    R Documentation

Import text documents into Mallet format

Description

This function takes a vector of document IDs and a character vector of document texts (one document per element) and converts them into a Mallet instance list.

Usage

mallet.import(
  id.array = NULL,
  text.array,
  stoplist = "",
  preserve.case = FALSE,
  token.regexp = "[\\p{L}]+"
)

Arguments

id.array

A vector of document IDs, one per element of text.array. Defaults to the indices of text.array.

text.array

A character vector with each element containing a document.

stoplist

The name of a file containing stop words (words to ignore), one per line, or a character vector containing stop words. If the file is not in the current working directory, you may need to include a full path. Default is no stoplist.

preserve.case

Logical. If TRUE, the original case of the input text is kept. By default (FALSE), the input text is converted to all lowercase.

token.regexp

A quoted string representing a regular expression that defines a token. The default is one or more Unicode letters: "[\\p{L}]+". Note that each backslash in the pattern must be doubled when it is written as an R string.
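To make the effect of token.regexp more concrete, the following sketch applies the default pattern and the pattern used in the Examples section to a short made-up sentence. It uses base R's PCRE engine (perl = TRUE) rather than Mallet's Java regular expressions, so treat it as an approximation of what Mallet does; extract_tokens is a hypothetical helper, not part of the mallet package.

```r
# Approximate two token patterns with base R (PCRE engine).
# Mallet itself uses Java regular expressions; this is only an illustration.
text <- "Don't stop-me now"

# Hypothetical helper: return every match of `pattern` in `x`.
extract_tokens <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Mimic preserve.case = FALSE by lowercasing before tokenizing.
lowered <- tolower(text)

# Default pattern: runs of one or more Unicode letters.
default_tokens <- extract_tokens(lowered, "[\\p{L}]+")

# Pattern from the Examples section: a letter, then letters or
# punctuation, then a final letter.
punct_tokens <- extract_tokens(lowered, "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

print(default_tokens)
print(punct_tokens)

# The doubled backslash is a single character in the actual string.
pattern_length <- nchar("\\p{L}")
print(pattern_length)
```

With the default pattern, apostrophes and hyphens split words into separate tokens, while the second pattern keeps word-internal punctuation. The second pattern also needs at least three characters to match, so one- and two-letter words never become tokens under it.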

Value

A cc/mallet/types/InstanceList object.

See Also

mallet.word.freqs returns term and document frequencies, which may be useful in selecting stopwords.

Examples

## Not run: 
# Load the package and read in the sotu example data
library(mallet)
data(sotu)
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = mallet_stoplist_file_path("en"),
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

## End(Not run)
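As described under Arguments, stoplist accepts either a character vector of stop words or the name of a file with one stop word per line. The sketch below prepares both forms from made-up inputs; the IDs, texts and stop words are illustrative assumptions. The mallet.import calls are left commented out because they need the mallet package and a working Java installation.

```r
# Toy inputs; the IDs, texts and stop words are made up for illustration.
ids <- c("doc1", "doc2")
texts <- c("The cat sat on the mat.", "A dog chased the cat.")

# Option 1: stop words as a character vector.
stop_words <- c("the", "a", "on")

# Option 2: the same stop words in a file, one per line.
stop_file <- tempfile(fileext = ".txt")
writeLines(stop_words, stop_file)

# Either form could then be passed as `stoplist` (needs mallet and Java):
# instances <- mallet.import(id.array = ids, text.array = texts,
#                            stoplist = stop_words)
# instances <- mallet.import(id.array = ids, text.array = texts,
#                            stoplist = stop_file)
```

Using tempfile() yields a full path, which avoids the working-directory issue mentioned in the stoplist description.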


mallet documentation built on July 20, 2022, 5:08 p.m.