generateDictionary: Generates dictionary of decisive terms

generateDictionary    R Documentation

Generates dictionary of decisive terms

Description

This routine applies a dictionary-generation method (LASSO, ridge regularization, elastic net, ordinary least squares, generalized linear model or spike-and-slab regression) to the document-term matrix in order to extract decisive terms that have a statistically significant impact on the response variable.

Usage

generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'Corpus'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'character'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'data.frame'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'TermDocumentMatrix'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'DocumentTermMatrix'
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

Arguments

x

A character vector, a data.frame, or an object of type Corpus, TermDocumentMatrix or DocumentTermMatrix.

response

Response variable including the given gold standard.

language

Language used for preprocessing operations (default: English).

modelType

A string denoting the estimation method. Allowed values are lasso, ridge, enet, lm, glm or spikeslab.

filterTerms

Optional vector of strings (default: NULL) to filter terms that are used for dictionary generation.

control

(optional) A list of parameters defining the model used for dictionary generation.

If modelType=lasso is selected, individual parameters are as follows:

  • "s" Value of the parameter lambda at which the LASSO is evaluated. Default is s="lambda.1se", which selects the largest value of lambda whose cross-validated error lies within one standard error of the minimum, in order to avoid overfitting. This often results in better performance than using the minimizing value itself, given by s="lambda.min".

  • "family" Distribution for response variable. Default is family="gaussian". For non-negative counts, use family="poisson". For binary variables, use family="binomial". See glmnet for further details.

  • "grouped" Determines whether grouped LASSO is used (with default FALSE).

If modelType=ridge is selected, individual parameters are as follows:

  • "s" Value of the parameter lambda at which the ridge is evaluated. Default is s="lambda.1se", which selects the largest value of lambda whose cross-validated error lies within one standard error of the minimum, in order to avoid overfitting. This often results in better performance than using the minimizing value itself, given by s="lambda.min".

  • "family" Distribution for response variable. Default is family="gaussian". For non-negative counts, use family="poisson". For binary variables, use family="binomial". See glmnet for further details.

  • "grouped" Determines whether grouped function is used (with default FALSE).

If modelType=enet is selected, individual parameters are as follows:

  • "alpha" Mixing parameter for switching between LASSO (with alpha=1) and ridge regression (alpha=0). Default is alpha=0.5. The recommended approach is to test different values between 0 and 1.

  • "s" Value of the parameter lambda at which the elastic net is evaluated. Default is s="lambda.1se", which selects the largest value of lambda whose cross-validated error lies within one standard error of the minimum, in order to avoid overfitting. This often results in better performance than using the minimizing value itself, given by s="lambda.min".

  • "family" Distribution for response variable. Default is family="gaussian". For non-negative counts, use family="poisson". For binary variables, use family="binomial". See glmnet for further details.

  • "grouped" Determines whether grouped function is used (with default FALSE).

If modelType=lm is selected, no parameters are passed on.

If modelType=glm is selected, individual parameters are as follows:

  • "family" Distribution for response variable. Default is family="gaussian". For non-negative counts, use family="poisson". For binary variables, use family="binomial". See glm for further details.

If modelType=spikeslab is selected, individual parameters are as follows:

  • "n.iter1" Number of burn-in Gibbs sampled values (i.e., discarded values). Default is 500.

  • "n.iter2" Number of Gibbs sampled values, following burn-in. Default is 500.

minWordLength

Removes words shorter than the given minimum length (default: 3). This preprocessing is applied when the input is a character vector or a corpus and the document-term matrix is generated inside the routine.

sparsity

A numeric for removing sparse terms in the document-term matrix. The argument sparsity specifies the maximal allowed sparsity. Default is sparsity=0.9; however, this is only applied when the document-term matrix is calculated inside the routine.

weighting

Weights a document-term matrix by e.g. term frequency - inverse document frequency (default). Other variants can be used from DocumentTermMatrix.

...

Additional parameters passed on to the underlying functions, e.g. for preprocessing or to glmnet.

Value

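Result is a dictionary object of class SentimentDictionaryWeighted that contains the selected terms together with their estimated weights. Passing it to predict yields sentiment values for each document.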

Source

doi: 10.1371/journal.pone.0209323

References

Pröllochs and Feuerriegel (2018). Statistical inferences for polarity identification in natural language, PLoS ONE 13(12).

See Also

analyzeSentiment, predict.SentimentDictionaryWeighted, plot.SentimentDictionaryWeighted and compareToResponse for advanced evaluations

Examples

# Create a vector of strings
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Generate dictionary with LASSO regularization
dictionary <- generateDictionary(documents, response)

# Show dictionary
dictionary
summary(dictionary)
plot(dictionary)

# Compute in-sample performance
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)
plotSentimentResponse(sentiment, response)

# Generate new dictionary with spike-and-slab regression instead of LASSO regularization
library(spikeslab)
dictionary <- generateDictionary(documents, response, modelType="spikeslab")

# Generate new dictionary with tf weighting instead of tf-idf
library(tm)
dictionary <- generateDictionary(documents, response, weighting=weightTf)
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)

# Use instead lambda.min from the LASSO estimation
dictionary <- generateDictionary(documents, response, control=list(s="lambda.min"))
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)
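
# Sketch: elastic net instead of LASSO, reusing documents and response from
# above. The control entries follow the parameters documented for
# modelType="enet"; alpha=0.3 is an arbitrary illustrative value rather than a
# recommendation, and results on five documents are not meaningful.
dictionary <- generateDictionary(documents, response, modelType="enet",
                                 control=list(alpha=0.3, s="lambda.min"))
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)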

# Use instead OLS as estimation method
dictionary <- generateDictionary(documents, response, modelType="lm")
sentiment <- predict(dictionary, documents)
sentiment

dictionary <- generateDictionary(documents, response, modelType="lm", 
                                 filterTerms = c("good", "bad"))
sentiment <- predict(dictionary, documents)
sentiment

dictionary <- generateDictionary(documents, response, modelType="lm", 
                                 filterTerms = extractWords(loadDictionaryGI()))
sentiment <- predict(dictionary, documents)
sentiment
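
# Sketch: generalized linear model as estimation method, restricted to two
# terms as in the OLS example above. family="gaussian" is the documented
# default and is spelled out only to show where it is set.
dictionary <- generateDictionary(documents, response, modelType="glm",
                                 control=list(family="gaussian"),
                                 filterTerms = c("good", "bad"))
sentiment <- predict(dictionary, documents)
sentiment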

# Generate dictionary without LASSO intercept
dictionary <- generateDictionary(documents, response, intercept=FALSE)
dictionary$intercept
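
# Sketch: pass a precomputed document-term matrix instead of raw text.
# This assumes toDocumentTermMatrix() from this package with its default
# preprocessing; minWordLength and sparsity then have to be set in that call,
# since generateDictionary applies them only when it builds the matrix itself.
dtm <- toDocumentTermMatrix(documents)
dictionary <- generateDictionary(dtm, response)
dictionary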
 
## Not run: 
imdb <- loadImdb()

# Generate Dictionary
dictionary_imdb <- generateDictionary(imdb$Corpus, imdb$Rating, family="poisson")
summary(dictionary_imdb)

compareDictionaries(dictionary_imdb,
                    loadDictionaryGI())
                    
# Show estimated coefficients with Kernel Density Estimation (KDE)
plot(dictionary_imdb)
library(ggplot2)  # provides xlim() used below
plot(dictionary_imdb) + xlim(c(-0.1, 0.1))

# Compute in-sample performance
pred_sentiment <- predict(dictionary_imdb, imdb$Corpus)
compareToResponse(pred_sentiment, imdb$Rating)

# Test a different sparsity parameter
dictionary_imdb <- generateDictionary(imdb$Corpus, imdb$Rating, family="poisson", sparsity=0.99)
summary(dictionary_imdb)
pred_sentiment <- predict(dictionary_imdb, imdb$Corpus)
compareToResponse(pred_sentiment, imdb$Rating)

## End(Not run)

SentimentAnalysis documentation built on Aug. 24, 2023, 1:07 a.m.