mallet.import: Import text documents into Mallet format


Description

This function takes an array of document IDs and an array of document texts (as character strings) and converts them into a Mallet instance list.

Usage

mallet.import(id.array, text.array, stoplist.file, preserve.case, token.regexp)

Arguments

id.array

An array of document IDs.

text.array

An array of text strings to use as documents. The type of the array must be character.
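A minimal sketch of preparing the two input arrays, assuming the documents sit in a data frame whose string columns were read in as factors. The `documents` data frame and its contents are hypothetical and only illustrate the coercion to character.

```r
# Hypothetical two-document corpus; stringsAsFactors = TRUE mimics
# the case where string columns arrive as factors.
documents <- data.frame(
  id = c("doc1", "doc2"),
  text = c("Topic models find themes.", "Mallet is written in Java."),
  stringsAsFactors = TRUE
)

# Coerce both columns explicitly so they are character vectors.
id.array <- as.character(documents$id)
text.array <- as.character(documents$text)

# Check the type before handing the vectors to mallet.import.
is.character(text.array)
```

The resulting `id.array` and `text.array` can then be passed as the first two arguments.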

stoplist.file

The name of a file containing stopwords (words to ignore), one per line. If the file is not in the current working directory, you may need to include a full path.
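As a rough sketch, a stoplist file in this one-word-per-line layout can be produced with base R. The word list and the temporary file path below are illustrative assumptions, not a recommended stoplist.

```r
# Hypothetical stopword list; a real list would be much longer.
stopwords <- c("the", "and", "of", "to", "a")

# Write one word per line to a temporary file.
stoplist.path <- tempfile(fileext = ".txt")
writeLines(stopwords, stoplist.path)

# Read the file back to confirm the layout: one stopword per line.
stoplist.check <- readLines(stoplist.path)
length(stoplist.check)
```

Because `tempfile()` returns a full path, the value of `stoplist.path` can be used as the stoplist.file argument regardless of the working directory.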

preserve.case

A logical value; set it to TRUE to keep the original case. By default, the input text is converted to all lowercase.

token.regexp

A quoted string representing a regular expression that defines a token. The default is one or more Unicode letters: "[\\p{L}]+". Note that backslashes must be doubled inside an R string.
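To get a feel for what a token pattern matches, it can be tried out in base R first. This is only an approximation: the sketch uses R's PCRE engine (perl = TRUE), whereas Mallet applies the pattern with Java's regular expression engine, so edge cases may differ. The helper `extract.tokens` and the sample sentence are hypothetical and not part of the mallet package.

```r
sample.text <- "Topic-modeling isn't hard, e.g. with MALLET 2.0!"

# Hypothetical helper: return every match of `pattern` in `text`.
extract.tokens <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Default pattern: runs of letters only, so punctuation splits words.
default.tokens <- extract.tokens(sample.text, "[\\p{L}]+")

# Pattern from the Examples section: tokens start and end with a letter
# and may contain punctuation in between.
punct.tokens <- extract.tokens(sample.text, "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

print(default.tokens)
print(punct.tokens)
```

With this sample, the default pattern splits "isn't" at the apostrophe, while the second pattern keeps it as one token. The second pattern also requires at least three characters, so one- and two-letter words are dropped.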

See Also

mallet.word.freqs returns term and document frequencies, which may be useful in selecting stopwords.

Examples

## Not run: 
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

## End(Not run)

