reduce_dtm: Reducing the number of columns (terms) of a document-term...
In paolofantini/Supreme: Make it easier applying LDA topic models to a corpus of Italian Supreme Court decisions


reduce_dtm reduces the number of columns (terms) of a document-term matrix.


reduce_dtm(dtm, method = c("tfidf", "lognet", "lognet_cv"), q = list(inf =
  0.25, sup = 0.75), classes = NULL, lambda = c("lambda.min", "lambda.1se"),
  SEED = NULL, c_normalize = TRUE, parallel = TRUE, export = FALSE)

`dtm`	a document-term matrix in term frequency format.
`method`	the method used to select the columns.
`q`	a list with the `inf` and `sup` quantiles of the distribution of tf-idf scores. Defaults are the first and third quartiles. Used only by the `tfidf` method.
`classes`	factor. The labeling variable. Used only by the `lognet` methods.
`lambda`	a string giving the selection rule for the optimal fit. Used only by the `lognet` methods.
`SEED`	integer, the random seed used to select the train and test sets. Used only by the `lognet` methods.
`c_normalize`	logical. If `TRUE`, the `dtm` entries are (cosine) normalized when using the `lognet` methods. Default is `TRUE`.
`parallel`	logical. If `TRUE`, parallel cross-validation is performed. Default is `TRUE`. Used only by the `lognet_cv` method.
`export`	logical. If `TRUE`, exports the discarded terms, the vocabulary and the returned object to the built-in directory `data/dtm`. Default is `FALSE`.
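
To make the `c_normalize` argument more concrete, the following is a minimal base R sketch of cosine normalization, assuming it means dividing each document (row) by its Euclidean norm. The toy matrix `dtm_toy` and the term names are hypothetical, and the package's internal implementation may differ.

```r
# Hypothetical toy document-term matrix in term frequency format
# (rows = documents, columns = terms).
dtm_toy <- matrix(
  c(3, 4, 0,
    0, 0, 2,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("corte", "ricorso", "reato"))
)

# Euclidean (L2) norm of each document vector.
row_norms <- sqrt(rowSums(dtm_toy^2))

# Dividing a matrix by a vector of length nrow() recycles down the columns,
# so every entry of row i is divided by row_norms[i].
dtm_norm <- dtm_toy / row_norms

# After normalization every document has unit length.
sqrt(rowSums(dtm_norm^2))
```

A document with no terms at all has norm zero and would produce `NaN` entries, so empty rows should be dropped before normalizing.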

This function is a wrapper for applying three different methods for dimensionality reduction of a document-term matrix:

tfidf: It calls the reduce_dtm_tfidf function to select suitable columns of an unlabeled document-term matrix by deleting terms whose tf-idf score falls outside a user-defined range.
lognet: It calls the reduce_dtm_lognet function to apply lognet, a logistic classification method from the glmnet package, to a labeled document-term matrix.
lognet_cv: It calls the reduce_dtm_lognet_cv function to apply the lognet method above via (parallel) cross-validation.

The returned value depends on the method:

tfidf: A list as in reduce_dtm_tfidf.
lognet: A list as in reduce_dtm_lognet.
lognet_cv: A list as in reduce_dtm_lognet_cv.

From Wikipedia: tfidf, short for term frequency inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tfidf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
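
The idea of filtering terms by tf-idf quantiles can be sketched in base R. This is an illustration only: it assumes one score per term, computed as the mean within-document term frequency times a log2 inverse document frequency, and an inclusive range between the `inf` and `sup` quantiles. The exact formula and boundary handling in reduce_dtm_tfidf may differ, and `dtm_toy` is a hypothetical toy matrix.

```r
# Hypothetical toy document-term matrix in term frequency format.
dtm_toy <- matrix(
  c(2, 0, 1, 0,
    1, 1, 0, 0,
    3, 0, 0, 1,
    1, 2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("corte", "ricorso", "appello", "reato"))
)

n_docs   <- nrow(dtm_toy)
doc_freq <- colSums(dtm_toy > 0)        # number of documents containing each term

# Term frequency relative to document length.
tf <- dtm_toy / rowSums(dtm_toy)

# Assumed score: mean tf over the documents containing the term, times idf.
mean_tf <- colSums(tf) / doc_freq
idf     <- log2(n_docs / doc_freq)
tfidf   <- mean_tf * idf

# Keep only terms whose score lies between the inf and sup quantiles.
q      <- list(inf = 0.25, sup = 0.75)
bounds <- quantile(tfidf, probs = c(q$inf, q$sup))
keep   <- tfidf >= bounds[1] & tfidf <= bounds[2]

dtm_reduced <- dtm_toy[, keep, drop = FALSE]
colnames(dtm_reduced)
```

In this toy corpus the term "corte" occurs in every document, so its inverse document frequency is zero and it falls below the lower quantile.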

In the optimal fit of the lognet method, the tuning parameters alpha and lambda are set, respectively, to 1 (the default) and to either lambda.min or lambda.1se. The former follows from the "minimum training error rule" and the latter from the more conservative "one standard error rule". Full details are given in "The Elements of Statistical Learning" (T. Hastie, R. Tibshirani, J. Friedman), 2nd edition, p. 61. Dimensionality reduction is performed by keeping only the columns (terms) whose beta coefficients are nonzero in the optimal fit.
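
The selection of terms with nonzero coefficients can be sketched directly with glmnet. This is not the code of reduce_dtm_lognet or reduce_dtm_lognet_cv. It is a minimal illustration on simulated data, assuming a two-class label and the glmnet package being installed; the names `x`, `y` and `keep` are hypothetical.

```r
library(glmnet)

set.seed(123)

# Simulated "document-term matrix": 100 documents, 20 terms.
n <- 100
p <- 20
x <- matrix(rpois(n * p, lambda = 2), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("term", seq_len(p))))

# Two-class label driven mainly by term1 and term2.
y <- factor(ifelse(x[, "term1"] - x[, "term2"] + rnorm(n) > 0, "a", "b"))

# Lasso-penalized logistic regression (alpha = 1) with cross-validation.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# lambda.min:  lambda with the minimum mean cross-validated error.
# lambda.1se:  largest lambda whose error is within one standard error
#              of that minimum (a sparser, more conservative fit).
beta <- as.matrix(coef(cvfit, s = "lambda.1se"))

# Keep the terms with nonzero coefficients, ignoring the intercept.
keep <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
x_reduced <- x[, keep, drop = FALSE]

dim(x_reduced)
```

Passing `s = "lambda.min"` instead typically retains more terms, because the penalty is weaker.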

When `export = TRUE`, the files discardedTerms.txt and vocabulary.txt contain, respectively, the rejected terms and the vocabulary (i.e. the columns) of the reduced dtm.

## Not run: 

### tfidf method
library(Supreme)
data("dtm")
dtm.tfidf <- reduce_dtm(dtm, method = "tfidf")

### lognet method
library(Supreme)
data("dtm")
data("classes")
dtm.lognet <- reduce_dtm(dtm, method = "lognet", classes = classes, SEED = 123)

### lognet_cv method
library(Supreme)
data("dtm")
data("classes")
dtm.lognet.cv <- reduce_dtm(dtm, method = "lognet_cv", classes = classes, lambda = "lambda.1se", SEED = 123)


## End(Not run)

paolofantini/Supreme documentation built on May 24, 2019, 6:14 p.m.
