trainNB | R Documentation
Description

Trains a multiclass Naive Bayes classifier.
Usage

trainNB(coding, train_matrix, smoothing = c("normalized", "simple",
  "parameterized", "none"), alpha = 2, beta = 10)
Arguments

coding: Numeric vector of training document codings.

train_matrix: A document-feature matrix (dfm) of training documents.

smoothing: Type of Laplacian smoothing for term priors. See 'Details'.

alpha: Smoothing hyperparameter for 'parameterized' smoothing.

beta: Smoothing hyperparameter for 'parameterized' smoothing.
Details

The smoothing method defaults to 'normalized', which applies the per-class word vector normalization advocated by Frank and Bouckaert (2006).

'simple' employs the basic Laplacian smoothing described in Metsis et al. (2006): the prior probability of a term appearing, given a class, is (frequency of term in class + 1) / (count of documents in class + 2).

'parameterized' uses the smoothing described in O'Neil and Schutt (2013) for multiclass Naive Bayes: the prior probability of a term appearing, given a class, is (frequency of term in class + alpha - 1) / (count of documents in class + alpha + beta - 2).

'none' is inadvisable. In this case, the prior probability of a term appearing, given a class, is simply (frequency of term in class) / (count of documents in class). Any term that never appears in a class then receives a zero prior, which drives the class's posterior probability to zero for every document containing that term.
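The three explicit formulas above can be checked on toy counts. This is an illustrative sketch only; `n_jc` and `n_c` are hypothetical names for the counts described in 'Details', not package internals:

```r
## Toy counts (hypothetical; not taken from the package)
n_jc  <- 3    # documents in class c that contain term j
n_c   <- 10   # total documents in class c
alpha <- 2    # default alpha
beta  <- 10   # default beta

## 'simple': (n_jc + 1) / (n_c + 2)
p_simple <- (n_jc + 1) / (n_c + 2)

## 'parameterized': (n_jc + alpha - 1) / (n_c + alpha + beta - 2)
p_parameterized <- (n_jc + alpha - 1) / (n_c + alpha + beta - 2)

## 'none': n_jc / n_c  (zero whenever n_jc == 0)
p_none <- n_jc / n_c
```

With these counts, 'simple' gives 4/12, 'parameterized' gives 4/20, and 'none' gives 3/10.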
Value

A list with the elements:

w_0c: Constant portion of the NB classification probabilities.

w_jc: Portion of the NB classification probabilities that varies with test document word appearances.

nc: Frequency of each category in the training documents (named numeric vector).

theta_c: Unsmoothed prior class probabilities (named numeric vector).
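In the standard linear form of Naive Bayes, these elements combine to score a test document: the log-score for class c is w_0c plus the sum of w_jc over the terms present in the document. The sketch below illustrates this with made-up weights; the values and the scoring step are assumptions for illustration, not output of trainNB:

```r
## Hypothetical per-class weights (not real trainNB output)
w_0c <- c(A = -1.2, B = -0.9)             # constant term for each class
w_jc <- rbind(A = c(0.5, -0.3, 0.1),      # per-term weights, class A
              B = c(-0.2, 0.4, 0.3))      # per-term weights, class B

## Binary indicator vector for which terms appear in a test document
x <- c(1, 0, 1)

## Log-score per class: w_0c + sum over present terms of w_jc
scores <- w_0c + as.vector(w_jc %*% x)

## Predicted class is the one with the highest score
predicted <- names(which.max(scores))
```

Here class A scores -1.2 + 0.5 + 0.1 = -0.6 and class B scores -0.9 - 0.2 + 0.3 = -0.8, so the document is assigned to class A.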
Author(s)

Matt W. Loftis
References

Frank, E. and Bouckaert, R.R. (2006) Naive Bayes for Text Classification with Unbalanced Classes. Knowledge Discovery in Databases: PKDD 2006, 503-510.

Metsis, V., Androutsopoulos, I. and Paliouras, G. (2006) Spam Filtering with Naive Bayes – Which Naive Bayes? CEAS 2006 - Third Conference on Email and Anti-Spam, July 27-28, 2006, Mountain View, California, USA.

O'Neil, C. and Schutt, R. (2013) Doing Data Science: Straight Talk from the Frontline. O'Reilly.
Examples

## Load data and create document-feature matrices
train_corpus <- quanteda::corpus(x = training_agendas$text)
train_matrix <- quanteda::dfm(train_corpus,
                              language = "danish",
                              stem = TRUE,
                              removeNumbers = FALSE)
test_corpus <- quanteda::corpus(x = test_agendas$text)
test_matrix <- quanteda::dfm(test_corpus,
                             language = "danish",
                             stem = TRUE,
                             removeNumbers = FALSE)
## Convert matrices of term frequencies to matrices of 0/1 indicators
## (operating directly on the sparse matrices' non-zero entries)
train_matrix@x[train_matrix@x > 1] <- 1
test_matrix@x[test_matrix@x > 1] <- 1
## Drop training features that do not appear in the test set
train_matrix <- train_matrix[, colnames(train_matrix) %in% colnames(test_matrix)]
est <- trainNB(training_agendas$coding, train_matrix)