sento_model: Optimized and automated sentiment-based sparse regression
In sentometrics: An Integrated Framework for Textual Sentiment Time Series Aggregation and Prediction

sento_model

R Documentation

Description

Linear or nonlinear penalized regression of any dependent variable on a large number of sentiment measures and, potentially, other explanatory variables. The function either performs a single regression with the provided variables, or computes regressions sequentially for a given sample size over a longer time horizon, with associated prediction performance metrics.

Usage

sento_model(sento_measures, y, x = NULL, ctr)

Arguments

`sento_measures`	a `sento_measures` object created using `sento_measures`.
`y`	a one-column `data.frame` or a `numeric` vector capturing the dependent (response) variable. In case of a logistic regression, the response variable is either a `factor` or a `matrix` with the factors represented by the columns as binary indicators, with the second factor level or column as the reference class in case of a binomial regression. No `NA` values are allowed.
`x`	a named `data.table`, `data.frame` or `matrix` with other explanatory variables as `numeric`, by default set to `NULL`.
`ctr`	output from a `ctr_model` call.
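
For a logistic regression, the two accepted forms of `y` described above can be illustrated with base R. This is a minimal sketch with made-up labels (`"below"` and `"above"` are illustrative names, not part of the package); it only shows how a `factor` maps to a matrix of binary indicator columns.

```r
# Hypothetical two-class response, for illustration only
labels <- c("below", "above", "above", "below", "above")
yFactor <- factor(labels, levels = c("below", "above"))

# Equivalent matrix form: one binary indicator column per factor level
yMatrix <- sapply(levels(yFactor), function(lev) as.integer(yFactor == lev))

# Each row has exactly one indicator switched on
stopifnot(all(rowSums(yMatrix) == 1))

# Column order follows the factor levels, so the second column
# corresponds to the second factor level
colnames(yMatrix)
```

Setting the factor levels explicitly, as done here, makes it unambiguous which class ends up as the second level or column.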

Details

Models are computed using elastic net regularization as implemented in the glmnet package, to account for the multidimensionality of the sentiment measures. Independent variables are normalized in the regression process, but coefficients are returned in their original space. For a helpful introduction to glmnet, we refer to its vignette. The optimal elastic net parameters lambda and alpha are calibrated either through an information criterion to be specified by the user, or through cross-validation (based on the "rolling forecasting origin" principle, using the train function from the caret package). In the latter case, the training metric is automatically set to "RMSE" for a linear model and to "Accuracy" for a logistic model. Many of the details that can be supplied to the glmnet and train functions we rely on are suppressed for the sake of user-friendliness.
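
The "rolling forecasting origin" principle can be pictured as a sequence of training windows, each followed by a test window that lies strictly later in time. The base R sketch below generates such index pairs for a fixed-length training window. The helper `rolling_slices()` is hypothetical and only mimics the idea; it is not the routine that the package or the train function uses internally. The argument names mirror the `trainWindow` and `testWindow` control parameters used in the Examples section.

```r
# Hypothetical helper: index sets for a rolling forecasting origin
rolling_slices <- function(n, trainWindow, testWindow) {
  # number of windows that fit fully inside the n observations
  nSlices <- n - trainWindow - testWindow + 1
  if (nSlices < 1) stop("not enough observations for the requested windows")
  lapply(seq_len(nSlices), function(i) {
    trainIdx <- i:(i + trainWindow - 1)
    testIdx <- (i + trainWindow):(i + trainWindow + testWindow - 1)
    list(train = trainIdx, test = testIdx)
  })
}

slices <- rolling_slices(n = 100, trainWindow = 70, testWindow = 10)
length(slices)      # number of train/test splits
slices[[1]]$test    # test indices of the first split
```

Because every test window starts right after its training window ends, no future observation leaks into the calibration of lambda and alpha, which is the reason this scheme is preferred over random folds for time series.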

Value

If `ctr$do.iter = FALSE`, a `sento_model` object, which is a list containing:

`reg`	optimized regression, i.e., a model-specific glmnet object, including for example the estimated coefficients.
`model`	the input argument `ctr$model`, to indicate the type of model estimated.
`alpha`	calibrated alpha.
`lambda`	calibrated lambda.
`trained`	output from the `train` call (if `ctr$type = "cv"`). There is no such output if the control parameters `alphas` and `lambdas` both specify one value.
`ic`	a `list` composed of two elements: under `"criterion"`, the type of information criterion used in the calibration, and under `"matrix"`, a `matrix` of all information criterion values for `alphas` as rows and the respective lambda values as columns (if `ctr$type != "cv"`). Any `NA` value in the latter element means the specific information criterion could not be computed.
`dates`	sample reference dates as a two-element `character` vector, being the earliest and most recent date from the `sento_measures` object accounted for in the estimation window.
`nVar`	a vector of length two containing, respectively, the number of sentiment measures and the number of other explanatory variables supplied.
`discarded`	a named `logical` vector of length equal to the number of sentiment measures, in which `TRUE` indicates that the particular sentiment measure has not been considered in the regression process. A sentiment measure is not considered when it is a duplicate of another, or when at least 50% of the observations are equal to zero.
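
The two discarding rules described for `discarded` can be sketched in base R as follows. The helper `flag_discarded()` and the toy data are hypothetical; this is an illustration of the stated rules, not the package's internal code, and it assumes duplicates are flagged from the second occurrence onwards.

```r
# Hypothetical helper applying the two discarding rules described above
flag_discarded <- function(measures) {
  m <- as.matrix(measures)
  # rule 1: a column that duplicates an earlier column
  isDuplicate <- duplicated(t(m))
  # rule 2: at least 50% of the observations are equal to zero
  mostlyZero <- colMeans(m == 0) >= 0.5
  discarded <- isDuplicate | mostlyZero
  names(discarded) <- colnames(m)
  discarded
}

# Toy sentiment measures (made-up values)
toy <- data.frame(
  a = c(0.1, 0.3, -0.2, 0.4),
  b = c(0.1, 0.3, -0.2, 0.4),  # duplicate of a
  c = c(0, 0, 0.5, 0),         # 75% zeros
  d = c(0.2, 0, -0.1, 0.3)     # 25% zeros
)
flag_discarded(toy)
```

In this toy case, columns `b` and `c` are flagged, while `a` and `d` are kept.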

If `ctr$do.iter = TRUE`, a `sento_modelIter` object, which is a list containing:

`models`	all sparse regressions, i.e., separate `sento_model` objects as above, as a `list` named by the dates, from the perspective of the sentiment measures, at which the out-of-sample predictions are carried out.
`alphas`	calibrated alphas.
`lambdas`	calibrated lambdas.
`performance`	a `data.frame` with performance-related measures, being "`RMSFE`" (root mean squared forecasting error), "`MAD`" (mean absolute deviation), "`MDA`" (mean directional accuracy, in whose calculation zero is treated as positive; in p.p.), "`accuracy`" (proportion of correctly predicted classes in case of a logistic regression; in p.p.), and the respective individual values of each measure in the sample. Directional accuracy is measured by comparing the change in the realized response with the change in the prediction between two consecutive time points (omitting the very first prediction as `NA`). Only the relevant performance statistics are given depending on the type of regression. Dates are as in the `"models"` output element, i.e., from the perspective of the sentiment measures.
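
To make the performance measures concrete, the base R sketch below computes them on made-up numbers. It assumes that "`MAD`" is the mean of the absolute forecast errors and that directional accuracy compares the signs of consecutive changes; the exact internal formulas of the package may differ in detail.

```r
# Toy realized values and predictions (made-up numbers)
realized  <- c(1.0, 1.5, 1.2, 1.2, 2.0)
predicted <- c(0.8, 1.4, 1.6, 1.7, 1.5)

errors <- predicted - realized
RMSFE <- sqrt(mean(errors^2))   # root mean squared forecasting error
MAD   <- mean(abs(errors))      # mean absolute deviation of the errors (assumed definition)

# Directional accuracy: compare the direction of consecutive changes,
# counting a zero change as positive; the first point has no change (NA)
up <- function(v) diff(v) >= 0
hit <- c(NA, up(realized) == up(predicted))
MDA <- mean(hit, na.rm = TRUE) * 100   # in percentage points

c(RMSFE = RMSFE, MAD = MAD, MDA = MDA)
```

The third realized change is exactly zero and therefore counts as an upward move, which is the convention stated in the description of "`MDA`".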

Author(s)

Samuel Borms, Keven Bluteau

Examples

## Not run: 
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
data("epu", package = "sentometrics")

set.seed(505)

# construct a sento_measures object to start with
corpusAll <- sento_corpus(corpusdf = usnews)
corpus <- quanteda::corpus_subset(corpusAll, date >= "2004-01-01")
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
               howTime = c("equal_weight", "linear"),
               by = "month", lag = 3)
sento_measures <- sento_measures(corpus, l, ctr)

# prepare y and other x variables
y <- epu[epu$date %in% get_dates(sento_measures), "index"]
length(y) == nobs(sento_measures) # TRUE
x <- data.frame(runif(length(y)), rnorm(length(y))) # two other (random) x variables
colnames(x) <- c("x1", "x2")

# a linear model based on the Akaike information criterion
ctrIC <- ctr_model(model = "gaussian", type = "AIC", do.iter = FALSE, h = 4,
                   do.difference = TRUE)
out1 <- sento_model(sento_measures, y, x = x, ctr = ctrIC)

# attribution and prediction as post-analysis
attributions1 <- attributions(out1, sento_measures,
                              refDates = get_dates(sento_measures)[20:25])
plot(attributions1, "features")

nx <- nmeasures(sento_measures) + ncol(x)
newx <- runif(nx) * cbind(data.table::as.data.table(sento_measures)[, -1], x)[30:40, ]
preds <- predict(out1, newx = as.matrix(newx), type = "link")

# an iterative out-of-sample analysis, parallelized
ctrIter <- ctr_model(model = "gaussian", type = "BIC", do.iter = TRUE, h = 3,
                     oos = 2, alphas = c(0.25, 0.75), nSample = 75, nCore = 2)
out2 <- sento_model(sento_measures, y, x = x, ctr = ctrIter)
summary(out2)

# plot predicted vs. realized values
p <- plot(out2)
p

# a cross-validation based model, parallelized
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
ctrCV <- ctr_model(model = "gaussian", type = "cv", do.iter = FALSE,
                   h = 0, alphas = c(0.10, 0.50, 0.90), trainWindow = 70,
                   testWindow = 10, oos = 0, do.progress = TRUE)
out3 <- sento_model(sento_measures, y, x = x, ctr = ctrCV)
parallel::stopCluster(cl)
foreach::registerDoSEQ()
summary(out3)

# a cross-validation based model for a binomial target
yb <- epu[epu$date %in% get_dates(sento_measures), "above"]
ctrCVb <- ctr_model(model = "binomial", type = "cv", do.iter = FALSE,
                    h = 0, alphas = c(0.10, 0.50, 0.90), trainWindow = 70,
                    testWindow = 10, oos = 0, do.progress = TRUE)
out4 <- sento_model(sento_measures, yb, x = x, ctr = ctrCVb)
summary(out4)
## End(Not run)

sentometrics documentation built on April 3, 2025, 6:15 p.m.