Import text documents into Mallet format

Share:

Description

This function takes an array of document IDs and text files (as character strings) and converts them into a Mallet instance list.

Usage

1
mallet.import(id.array, text.array, stoplist.file, preserve.case, token.regexp)

Arguments

id.array

An array of document IDs.

text.array

An array of text strings to use as documents. The type of the array must be character.

stoplist.file

The name of a file containing stopwords (words to ignore), one per line. If the file is not in the current working directory, you may need to include a full path.

preserve.case

By default, the input text is converted to all lowercase.

token.regexp

A quoted string representing a regular expression that defines a token. The default is one or more unicode letter: "[\\p{L}]+". Note that special characters must have double backslashes.

See Also

mallet.word.freqs returns term and document frequencies, which may be useful in selecting stopwords.

Examples

1
2
3
4
5
## Not run: 
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt",
		    		token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

## End(Not run)