language_model: Create Language Model

Description Usage Arguments Details Value References Examples

View source: R/language_model.R

Description

This function creates a regression model using input text as predictors, and a specified variable as the outcome.

Usage

language_model(
  input,
  outcome,
  outcomeType,
  text,
  ngrams = "1",
  dfmWeightScheme = "count",
  lossMeasure = "deviance",
  lambda = "lambda.min",
  parallelCores = NULL,
  permutePValue = FALSE,
  permutationK = 1000,
  permuteByGroup = NULL,
  progressBar = TRUE
)

Arguments

input

A dataframe containing a column with text data (character strings) and an outcome variable (numeric or two-level factor)

outcome

A string giving the column name of the outcome variable in input

outcomeType

A string specifying the type of outcome variable: either "binary" or "continuous"

text

A string giving the column name of the text data in input

ngrams

A string defining the ngrams to serve as predictors in the model. Defaults to "1". For more information, see the tokens_ngrams function in the quanteda package

dfmWeightScheme

A string defining the weight scheme you wish to use for constructing a document-frequency matrix. Default is "count". For more information, see the dfm_weight function in the quanteda package

lossMeasure

A string defining the loss measure to use. Must be one of the options given by cv.glmnet. Default is "deviance".

lambda

A string defining the lambda value to be used. Default is "lambda.min". For more information, see the cv.glmnet function in the glmnet package

parallelCores

An integer defining the number of cores to use in parallel processing for model creation. Defaults to NULL (no parallel processing).

permutePValue

If TRUE, a permutation test is run to estimate a p-value for the model (i.e. whether the language provided significantly predicts the outcome variable); see the Examples section for an illustrative call. Defaults to FALSE. Warning: this can take a while depending on the size of the dataset and the number of permutations!

permutationK

The number of permutations to run in a permutation test. Only used if permutePValue = TRUE. Defaults to 1000.

permuteByGroup

A string consisting of the column name of a grouping variable in the dataset (often a participant ID). When supplied, permutations shuffle outcomes at the group level rather than the trial level. Default is NULL (no group variable considered).

progressBar

Show a progress bar. Defaults to TRUE.

Details

This is the core function of the languagePredictR package. It largely follows the analysis laid out in Dobbins & Kantner (2019; see References).

In the broadest terms, this serves as a wrapper for the quanteda (text analysis) and glmnet (modeling) packages.
The input text is converted into a document-frequency matrix (sometimes called a document-feature matrix) where each row represents a string of text, and each column represents a word that appears in the entire text corpus.
Each cell is populated by a value defined by the dfmWeightScheme. For example, the default, "count", means that each word column contains a value representing the number of times that word appears in the given text string.
This matrix is then used to train a regression algorithm appropriate to the outcome variable (standard linear regression for continuous variables, logistic regression for binary variables).
See the documentation for the cv.glmnet function in the glmnet package for more information.
10-fold cross validation is currently implemented to reduce overfitting to the data.
Additionally, a LASSO constraint is used (following Tibshirani, 1996; see References) to eliminate weakly predictive variables. This reduces the number of predictors (i.e. word ngrams) to a sparse, interpretable set.
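
As an illustrative sketch only (not the package's internal code), a similar pipeline can be assembled directly from quanteda and glmnet; the data frame and column names below are assumed for the example:

library(quanteda)
library(glmnet)

# Build a document-feature matrix from the cleaned text (count weighting)
toks <- tokens(my_data$cleanText)            # 'my_data' and 'cleanText' are assumed names
dfmat <- dfm(toks)                           # each cell = number of times a word appears

# Fit a cross-validated LASSO regression (cv.glmnet uses 10 folds by default)
fit <- cv.glmnet(x = as(dfmat, "dgCMatrix"), # sparse matrix of word counts as predictors
                 y = my_data$rating,         # hypothetical continuous outcome
                 family = "gaussian",        # use "binomial" for a binary outcome
                 alpha = 1,                  # alpha = 1 applies the LASSO penalty
                 type.measure = "deviance")

coef(fit, s = "lambda.min")                  # words with nonzero coefficients are retained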

Value

An object of class "langModel"

References

Dobbins, I. G., & Kantner, J. (2019). The language of accurate recognition memory. *Cognition, 192*, 103988.
Tibshirani, R. (1996). Regression Shrinkage and Selection Via the Lasso. *Journal of the Royal Statistical Society: Series B (Methodological), 58*(1), 267-288.

Examples

## Not run: 
movie_review_data1$cleanText = clean_text(movie_review_data1$text)

# Using language to predict "Positive" vs. "Negative" reviews
movie_model_valence = language_model(movie_review_data1,
                                     outcome = "valence",
                                     outcomeType = "binary",
                                     text = "cleanText")

summary(movie_model_valence)

# Using language to predict 1-10 scale ratings,
# but using both unigrams and bigrams, as well as a proportion weighting scheme
movie_model_rating = language_model(movie_review_data1,
                                    outcome = "rating",
                                    outcomeType = "continuous",
                                    text = "cleanText",
                                    ngrams = "1:2",
                                    dfmWeightScheme = "prop")

summary(movie_model_rating)
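
# Estimating a p-value via a permutation test, shuffling outcomes at the
# group level. The grouping column "reviewerID" is hypothetical; substitute
# a column from your own data. This can take a while to run.
movie_model_perm = language_model(movie_review_data1,
                                  outcome = "rating",
                                  outcomeType = "continuous",
                                  text = "cleanText",
                                  permutePValue = TRUE,
                                  permutationK = 500,
                                  permuteByGroup = "reviewerID")

summary(movie_model_perm)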

## End(Not run)
