corpus2dtm: From ISC corpus to a Document-Term Matrix


Description

corpus2dtm transforms a corpus of decisions from the Italian Supreme Court (ISC) into a document-term matrix.

Usage

corpus2dtm(corpus, stopwords)

Arguments

corpus

a corpus of decisions from the Italian Supreme Court.

stopwords

a character vector of stopwords.

Value

dtm

a base document-term matrix that keeps only terms at least 3 characters long that appear in at least 5 documents.

Note

The base document-term matrix is built by applying basic text-cleaning steps so that only the terms (columns) belonging to a suitable vocabulary are kept. Typically, this involves converting tokens to lower case, removing punctuation and numbers, stemming, removing stop words, and keeping only terms that reach a minimum length and occur in at least a minimum number of documents. Requires package tm, version 0.6 or later.
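The cleaning steps listed in this note can be sketched directly with tm. This is only an illustration under stated assumptions, not the actual body of corpus2dtm: the toy texts, the short stop-word vector and the names texts, stop_words and docs are hypothetical, and stemming is left out so the sketch does not depend on the SnowballC package.

```r
library(tm)

# Hypothetical toy documents standing in for court decisions.
texts <- c(
  "La Corte rigetta il ricorso n. 101.",
  "La Corte accoglie il ricorso n. 202!",
  "Il ricorso alla Corte viene dichiarato inammissibile.",
  "La Corte dichiara il ricorso fondato.",
  "Ricorso respinto dalla Corte, 2015.",
  "La Corte esamina il ricorso del 2016."
)

# Hypothetical stop-word list; the package ships italianStopWords instead.
stop_words <- c("il", "la", "alla", "dalla", "del", "viene")

docs <- VCorpus(VectorSource(texts))
docs <- tm_map(docs, content_transformer(tolower))  # lower case
docs <- tm_map(docs, removePunctuation)             # drop punctuation
docs <- tm_map(docs, removeNumbers)                 # drop numbers
docs <- tm_map(docs, removeWords, stop_words)       # drop stop words
docs <- tm_map(docs, stripWhitespace)               # tidy leftover spaces

# Keep terms with at least 3 characters that occur in at least 5 documents.
dtm <- DocumentTermMatrix(
  docs,
  control = list(
    wordLengths = c(3, Inf),
    bounds = list(global = c(5, Inf))
  )
)

Terms(dtm)
```

In this toy corpus only "corte" and "ricorso" occur in five or more documents, so every other term is dropped by the document-frequency bound.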

Examples

## Not run: 
library(Supreme)
data("corpus")
data("italianStopWords")  # Italian stop words to remove
dtm <- corpus2dtm(corpus, italianStopWords)

## End(Not run)

paolofantini/Supreme documentation built on May 24, 2019, 6:14 p.m.