trainNB: Train Naive Bayes

View source: R/trainNB.R

trainNB R Documentation

Train Naive Bayes

Description

Trains a multiclass Naive Bayes classifier

Usage

trainNB(coding, train_matrix, smoothing = c("normalized", "simple",
  "parameterized", "none"), alpha = 2, beta = 10)

Arguments

coding

Numeric vector of training document codings

train_matrix

A quanteda document-feature matrix whose number of rows equals the length of coding

smoothing

Type of Laplacian smoothing for term priors. See 'Details'.

alpha

Smoothing hyperparameter for 'parameterized' smoothing

beta

Smoothing hyperparameter for 'parameterized' smoothing

Details

The smoothing method defaults to 'normalized', which applies the per-class word vector normalization advocated by Frank and Bouckaert (2006).

Using 'simple' employs the simple version of Laplacian smoothing described in Metsis et al. (2006): the prior probability of a term appearing, given a class, is the frequency of the term in the class plus 1, divided by the count of documents in the class plus 2.
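
For illustration only (the variable names below are not part of the package), the 'simple' prior can be computed as:

## Hedged sketch of 'simple' smoothing with illustrative counts
n_jc <- 4    ## documents in class c that contain term j
n_c  <- 20   ## documents in class c
(n_jc + 1) / (n_c + 2)   ## smoothed prior, here 5/22 = 0.227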

Using 'parameterized' applies a version of smoothing mentioned in O'Neil & Schutt (2013) for multiclass Naive Bayes: the prior probability of a term appearing, given a class, is the frequency of the term in the class plus alpha minus 1, divided by the count of documents in the class plus alpha plus beta minus 2.
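
As an illustrative sketch with the default alpha = 2 and beta = 10 (again, the variable names are not part of the package):

## Hedged sketch of 'parameterized' smoothing with the default hyperparameters
n_jc <- 4; n_c <- 20     ## illustrative counts, as above
alpha <- 2; beta <- 10
(n_jc + alpha - 1) / (n_c + alpha + beta - 2)   ## here 5/30 = 0.167

Note that setting alpha = 2 and beta = 2 reproduces the 'simple' rule above.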

Using 'none' is inadvisable. In this case the prior probability of a term appearing, given a class, is simply the frequency of the term in the class divided by the count of documents in the class. This will likely generate zero priors, and the logarithm of a zero prior is undefined (negative infinity), which breaks classification.
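
A minimal illustration of why zero priors are harmful (hypothetical counts, not package internals):

## With no smoothing, a term never seen in a class gets a zero prior ...
n_jc <- 0; n_c <- 20
p <- n_jc / n_c   ## 0
log(p)            ## -Inf: a document containing this term can never be
                  ## assigned to the class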

Value

A list with the elements

w_0c

Constant portion of NB classification probabilities.

w_jc

Portion of NB classification probabilities that varies with test document word appearances.

nc

Frequency of each category in training documents (named numeric vector)

theta_c

Unsmoothed prior class probabilities (named numeric vector)

Author(s)

Matt W. Loftis

References

Frank, E. and Bouckaert, R.R. (2006) Naive Bayes for Text Classification with Unbalanced Classes. Knowledge Discovery in Databases: PKDD 2006, 503-510.

Metsis, V., Androutsopoulos, I. and Paliouras, G. (2006) Spam Filtering with Naive Bayes – Which Naive Bayes? CEAS 2006 – Third Conference on Email and Anti-Spam, July 27-28, 2006, Mountain View, California, USA.

O'Neil, C. and Schutt, R. (2013) Doing Data Science: Straight Talk from the Frontline. O'Reilly.

Examples

## Load data and create document-feature matrices
train_corpus <- quanteda::corpus(x = training_agendas$text)
train_matrix <- quanteda::dfm(train_corpus,
                    language = "danish",
                    stem = TRUE,
                    removeNumbers = FALSE)

test_corpus <- quanteda::corpus(x = test_agendas$text)
test_matrix <- quanteda::dfm(test_corpus,
                   language = "danish",
                   stem = TRUE,
                   removeNumbers = FALSE)

## Convert matrix of frequencies to matrix of indicators
train_matrix@x[train_matrix@x > 1] <- 1
test_matrix@x[test_matrix@x > 1] <- 1

## Dropping training features not in the test set
train_matrix <- train_matrix[, (colnames(train_matrix) %in% colnames(test_matrix))]

est <- trainNB(training_agendas$coding, train_matrix)
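
## Inspect the fitted components described under 'Value'
## (exact values depend on the training data)
str(est)        ## list with w_0c, w_jc, nc, and theta_c
est$nc          ## frequency of each category in the training documents
est$theta_c     ## unsmoothed prior class probabilities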

