reduce_dtm: Reducing the number of columns (terms) of a document-term...


Description

reduce_dtm reduces the number of columns (terms) of a document-term matrix.

Usage

reduce_dtm(dtm, method = c("tfidf", "lognet", "lognet_cv"), q = list(inf =
  0.25, sup = 0.75), classes = NULL, lambda = c("lambda.min", "lambda.1se"),
  SEED = NULL, c_normalize = TRUE, parallel = TRUE, export = FALSE)

Arguments

dtm

a document-term matrix in term frequency format.

method

the method for selecting the columns.

q

a list with the inf and sup quantiles of the distribution of tf-idf scores. Defaults are the first and third quartiles. Used only by the tfidf method.

classes

factor. The labeling variable. Used only by the lognet methods.

lambda

a string naming the selection rule for the optimal fit. Used only by the lognet methods.

SEED

integer, the random seed for selecting the train and test sets. Used only by the lognet methods.

c_normalize

a Boolean value indicating whether the dtm entries should be (cosine) normalized when using the lognet methods. Default is TRUE.

parallel

logical. If TRUE, cross-validation is performed in parallel. Default is TRUE. Used only by the lognet_cv method.

export

logical. If TRUE exports the discarded terms, the vocabulary and the returned object to the built-in directory data/dtm. Default is FALSE.
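
As an illustration of what the c_normalize argument refers to, the following is a minimal base R sketch of cosine (L2) normalization of the rows of a document-term matrix. The toy matrix and the names row_norms and dtm_norm are made up for this example; whether the package normalizes in exactly this way is an assumption.

```r
# Toy document-term matrix in term frequency format (rows = documents).
dtm <- matrix(
  c(3, 4, 0,
    0, 0, 5,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("court", "appeal", "law"))
)

# Euclidean length of each document vector.
row_norms <- sqrt(rowSums(dtm^2))

# Guard against empty documents: dividing by 1 leaves an all-zero row unchanged.
row_norms[row_norms == 0] <- 1

# Each row is divided by its own length (the vector is recycled down the rows),
# so every non-empty document ends up with unit length.
dtm_norm <- dtm / row_norms
```

After this step documents of different lengths are on a comparable scale, which is the usual motivation for normalizing before fitting a penalized classifier.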

Details

This function is a wrapper for applying three different methods for dimensionality reduction of a document-term matrix:

tfidf

It calls the reduce_dtm_tfidf function to select suitable columns of an unlabeled document-term matrix by deleting terms whose tf-idf score falls outside a user-defined range.

lognet

It calls the reduce_dtm_lognet function to apply lognet, a logistic classification method from package glmnet, to a labeled document-term matrix.

lognet_cv

It calls the reduce_dtm_lognet_cv function to apply the lognet method described above via (parallel) cross-validation.
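
The tfidf selection can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy matrix is invented, and the per-term score used here (mean relative term frequency times log2 inverse document frequency) is one common choice, not necessarily the one implemented in reduce_dtm_tfidf.

```r
# Toy document-term matrix in term frequency format (rows = documents).
dtm <- matrix(
  c(2, 0, 1, 0,
    1, 1, 0, 0,
    3, 0, 0, 1,
    1, 2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("the", "model", "lasso", "corpus"))
)

# Term frequency relative to document length.
tf <- dtm / rowSums(dtm)

# Number of documents containing each term, and the inverse document frequency.
doc_freq <- colSums(dtm > 0)
idf <- log2(nrow(dtm) / doc_freq)

# One score per term: mean relative frequency over the documents
# that contain the term, weighted by idf (assumed scoring rule).
score <- (colSums(tf) / doc_freq) * idf

# Keep the terms whose score lies between the inf and sup quantiles.
q <- list(inf = 0.25, sup = 0.75)
bounds <- unname(quantile(score, probs = c(q$inf, q$sup)))
keep <- score >= bounds[1] & score <= bounds[2]

reduced   <- dtm[, keep, drop = FALSE]   # reduced document-term matrix
discarded <- colnames(dtm)[!keep]        # rejected terms
```

Terms that occur in every document receive a score of zero and fall below the lower bound, while very rare terms tend to fall above the upper bound, so both tails are trimmed.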

Value

tfidf

A list as in reduce_dtm_tfidf.

lognet

A list as in reduce_dtm_lognet.

lognet_cv

A list as in reduce_dtm_lognet_cv.

Note

From Wikipedia: tfidf, short for term frequency inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tfidf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.

In the optimal fit of the lognet method, the tuning parameters alpha and lambda are set to 1 (default) and to one of lambda.min or lambda.1se, respectively. lambda.min follows from the "minimum training error rule" and lambda.1se from the more conservative "one standard error rule". Full details are given in "The Elements of Statistical Learning" (T. Hastie, R. Tibshirani, J. Friedman), 2nd edition, p. 61. Dimensionality reduction is performed by selecting only the columns (terms) corresponding to nonzero beta coefficients in the optimal fit.
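
The two selection rules and the final column selection can be illustrated with a small base R sketch. All numbers below are invented for the example, and the variable names are hypothetical; in practice the error curve and the coefficients would come from the glmnet fit.

```r
# Hypothetical error curve along a decreasing lambda path (made-up numbers).
lambda <- c(0.50, 0.20, 0.10, 0.05, 0.01)
err    <- c(0.30, 0.22, 0.18, 0.17, 0.21)  # mean misclassification error
err_se <- c(0.03, 0.02, 0.02, 0.02, 0.03)  # standard error of that mean

# Minimum error rule: the lambda with the smallest error.
i_min <- which.min(err)
lambda_min <- lambda[i_min]

# One standard error rule: the largest (most regularizing) lambda whose
# error is within one standard error of the minimum.
threshold  <- err[i_min] + err_se[i_min]
lambda_1se <- max(lambda[err <= threshold])

# Hypothetical coefficients of the fit at the chosen lambda:
# only terms with a nonzero coefficient are kept as columns.
beta <- c(price = 0.8, the = 0, court = -1.2, and = 0, appeal = 0.3)
selected_terms <- names(beta)[beta != 0]
```

Because lambda_1se is never smaller than lambda_min, it applies a stronger penalty and typically retains fewer terms.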

When export = TRUE, the files discardedTerms.txt and vocabulary.txt contain, respectively, the rejected terms and the vocabulary (i.e. the columns) of the reduced dtm.

Examples

## Not run: 

### tfidf method
library(Supreme)
data("dtm")
dtm.tfidf <- reduce_dtm(dtm, method = "tfidf")

### lognet method
library(Supreme)
data("dtm")
data("classes")
dtm.lognet <- reduce_dtm(dtm, method = "lognet", classes = classes, SEED = 123)

### lognet_cv method
library(Supreme)
data("dtm")
data("classes")
dtm.lognet.cv <- reduce_dtm(dtm, method = "lognet_cv", classes = classes, lambda = "lambda.1se", SEED = 123)


## End(Not run)

paolofantini/Supreme documentation built on May 24, 2019, 6:14 p.m.