language_model: Create Language Model

Description Usage Arguments Details Value References Examples

View source: R/language_model.R

Description

This function creates a regression model using input text as predictors, and a specified variable as the outcome.

Usage

language_model(
  input,
  outcome,
  outcomeType,
  text,
  ngrams = "1",
  dfmWeightScheme = "count",
  lossMeasure = "deviance",
  lambda = "lambda.min",
  parallelCores = NULL,
  permutePValue = FALSE,
  permutationK = 1000,
  permuteByGroup = NULL,
  progressBar = TRUE
)

Arguments

input

A dataframe containing a column with text data (character strings) and an outcome variable (numeric or two-level factor)

outcome

A string giving the column name of the outcome variable in input

outcomeType

A string specifying the type of outcome variable: either "binary" or "continuous"

text

A string giving the column name of the text data in input

ngrams

A string defining the ngrams to serve as predictors in the model. Defaults to "1". For more information, see the tokens_ngrams function in the quanteda package

dfmWeightScheme

A string defining the weight scheme you wish to use for constructing a document-frequency matrix. Default is "count". For more information, see the dfm_weight function in the quanteda package

lossMeasure

A string defining the loss measure to use. Must be one of the options given by cv.glmnet. Default is "deviance".

lambda

A string defining the lambda value to be used. Default is "lambda.min". For more information, see the cv.glmnet function in the glmnet package

parallelCores

An integer defining the number of cores to use in parallel processing for model creation. Defaults to NULL (no parallel processing).

permutePValue

If TRUE, a permutation test is run to estimate a p-value for the model (i.e. whether the language provided significantly predicts the outcome variable); see the Examples section for an illustrative call. Defaults to FALSE. Warning: this can take a while depending on the size of the dataset and the number of permutations!

permutationK

The number of permutations to run in a permutation test. Only used if permutePValue = TRUE. Defaults to 1000.

permuteByGroup

A string consisting of the column name of a grouping variable in the dataset (often a participant ID). When supplied, permutations shuffle outcomes at the group level rather than the trial level. Default is NULL (no group variable considered).

progressBar

Show a progress bar. Defaults to TRUE.

Details

This is the core function of the languagePredictR package. It largely follows the analysis laid out in Dobbins & Kantner (2019; see References).

In the broadest terms, this serves as a wrapper for the quanteda (text analysis) and glmnet (modeling) packages.
The input text is converted into a document-frequency matrix (sometimes called a document-feature matrix) where each row represents a string of text, and each column represents a word that appears in the entire text corpus.
Each cell is populated by a value defined by the dfmWeightScheme. For example, the default, "count", means that each word column contains a value representing the number of times that word appears in the given text string.
This matrix is then used to train a regression algorithm appropriate to the outcome variable (standard linear regression for continuous variables, logistic regression for binary variables).
See the documentation for the cv.glmnet function in the glmnet package for more information.
10-fold cross validation is currently implemented to reduce overfitting to the data.
Additionally, a LASSO constraint is used (following Tibshirani, 1996; see References) to eliminate weakly predictive variables. This reduces the number of predictors (i.e. word ngrams) to a sparse, interpretable set.
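
As an illustrative sketch only (not the package's internal code), a similar pipeline can be assembled directly from quanteda and glmnet; the data frame and column names below are assumed for the example:

library(quanteda)
library(glmnet)

# Build a document-feature matrix from the cleaned text (count weighting)
toks <- tokens(my_data$cleanText)            # 'my_data' and 'cleanText' are assumed names
dfmat <- dfm(toks)                           # each cell = number of times a word appears

# Fit a cross-validated LASSO regression (cv.glmnet uses 10 folds by default)
fit <- cv.glmnet(x = as(dfmat, "dgCMatrix"), # sparse matrix of word counts as predictors
                 y = my_data$rating,         # hypothetical continuous outcome
                 family = "gaussian",        # use "binomial" for a binary outcome
                 alpha = 1,                  # alpha = 1 applies the LASSO penalty
                 type.measure = "deviance")

coef(fit, s = "lambda.min")                  # words with nonzero coefficients are retained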

Value

An object of class "langModel"

References

Dobbins, I. G., & Kantner, J. (2019). The language of accurate recognition memory. *Cognition, 192*, 103988.
Tibshirani, R. (1996). Regression Shrinkage and Selection Via the Lasso. *Journal of the Royal Statistical Society: Series B (Methodological), 58*(1), 267-288.

Examples

## Not run: 
movie_review_data1$cleanText = clean_text(movie_review_data1$text)

# Using language to predict "Positive" vs. "Negative" reviews
movie_model_valence = language_model(movie_review_data1,
                                     outcome = "valence",
                                     outcomeType = "binary",
                                     text = "cleanText")

summary(movie_model_valence)

# Using language to predict 1-10 scale ratings,
# but using both unigrams and bigrams, as well as a proportion weighting scheme
movie_model_rating = language_model(movie_review_data1,
                                    outcome = "rating",
                                    outcomeType = "continuous",
                                    text = "cleanText",
                                    ngrams = "1:2",
                                    dfmWeightScheme = "prop")

summary(movie_model_rating)
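
# Estimating a p-value via a permutation test, shuffling outcomes at the
# group level. The grouping column "reviewerID" is hypothetical; substitute
# a column from your own data. This can take a while to run.
movie_model_perm = language_model(movie_review_data1,
                                  outcome = "rating",
                                  outcomeType = "continuous",
                                  text = "cleanText",
                                  permutePValue = TRUE,
                                  permutationK = 500,
                                  permuteByGroup = "reviewerID")

summary(movie_model_perm)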

## End(Not run)
