Fit a multinomial or Bernoulli Naive Bayes model, given a dfm and some training labels.
textmodel_nb(x, y, smooth = 1, prior = c("uniform", "docfreq",
  "termfreq"), distribution = c("multinomial", "Bernoulli"))

x 
the dfm on which the model will be fit. Does not need to contain only the training documents. 
y 
vector of training labels associated with each document identified
in 
smooth 
smoothing parameter for feature counts by class 
prior 
prior distribution on texts; one of "uniform", "docfreq", or "termfreq" 
distribution 
count model for text features, can be "multinomial" or "Bernoulli" 
textmodel_nb() returns a list consisting of the following (where I is the total number of documents, J is the total number of features, and k is the total number of training classes):
call 
original function call 
PwGc 
k × J; probability of the word given the class (empirical likelihood) 
Pc 
k-length named numeric vector of class prior probabilities 
PcGw 
k × J; posterior class probability given the word 
Pw 
J × 1; baseline probability of the word 
x 
the I × J training dfm 
y 
the I-length vector of training labels 
distribution 
the distribution argument 
prior 
the prior argument 
smooth 
the value of the smoothing parameter 
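To make the relationship between smooth and PwGc concrete, here is a minimal base R sketch that computes smoothed multinomial likelihoods for the four labelled documents of the Manning, Raghavan, and Schütze example. It assumes the textbook formula, (count + smooth) / (class total + smooth × J), and is not taken from the quanteda source; the names docs, labels, count_class, and counts are hypothetical.

```r
# Labelled training documents from the IIR example, as token vectors
docs <- list(
  d1 = c("Chinese", "Beijing", "Chinese"),
  d2 = c("Chinese", "Chinese", "Shanghai"),
  d3 = c("Chinese", "Macao"),
  d4 = c("Tokyo", "Japan", "Chinese")
)
labels <- c("Y", "Y", "Y", "N")
vocab <- sort(unique(unlist(docs)))   # the J features
classes <- sort(unique(labels))       # the k classes
smooth <- 1                           # same role as the smooth argument

# Count how often each feature occurs in the documents of one class
count_class <- function(cls) {
  tokens <- unlist(docs[labels == cls], use.names = FALSE)
  as.vector(table(factor(tokens, levels = vocab)))
}

# k x J matrix of feature counts by class
counts <- t(vapply(classes, count_class, integer(length(vocab))))
colnames(counts) <- vocab

# Smoothed likelihood: each row is a probability distribution over features
# (assumed textbook formula, not necessarily quanteda's internal code)
PwGc <- (counts + smooth) / (rowSums(counts) + smooth * length(vocab))
print(round(PwGc, 4))
```

Under these assumptions, P(Chinese | Y) is 6/14 = 3/7 and P(Chinese | N) is 2/9, which are the values given in the textbook's worked example.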
Prior distributions refer to the prior probabilities assigned to the training classes, and the choice of prior distribution affects the calculation of the fitted probabilities. The default is uniform priors, which sets the unconditional probability of observing any one class to be the same as that of observing any other class.
"Document frequency" means that the class priors will be taken from the relative proportions of the class documents used in the training set. This approach is so common that it is assumed in many examples, such as the worked example from Manning, Raghavan, and Schütze (2008) below. It is not the default in quanteda, however, since there may be nothing informative in the relative numbers of documents used to train a classifier other than the relative availability of the documents. When training classes are balanced in their number of documents (usually advisable), however, then the empirically computed "docfreq" would be equivalent to "uniform" priors.
Setting prior to "termfreq" makes the priors equal to the proportions of total feature counts found in the grouped documents in each training class, so that the classes with the largest number of features are assigned the largest priors. If the total count of features in each training class were the same, then "uniform" and "termfreq" would be the same.
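The three prior options can be compared side by side with a small base R sketch. It uses the four labelled documents from the Manning, Raghavan, and Schütze example and computes each prior directly from the descriptions above; it is an illustration under those assumptions rather than the package's own code, and the names n_docs, n_feat, and priors are hypothetical.

```r
# Labelled training documents and their classes
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese")
labels <- c("Y", "Y", "Y", "N")

# Number of documents in each class
n_docs <- vapply(split(labels, labels), length, integer(1))

# Total feature count in each class (whitespace tokenization assumed)
n_feat <- vapply(split(lengths(strsplit(txt, " ")), labels), sum, integer(1))

# One row per prior type, one column per class
priors <- rbind(
  uniform  = setNames(rep(1 / length(n_docs), length(n_docs)), names(n_docs)),
  docfreq  = n_docs / sum(n_docs),
  termfreq = n_feat / sum(n_feat)
)
print(round(priors, 4))
```

With three "Y" documents holding 8 features and one "N" document holding 3, "docfreq" gives 3/4 and 1/4, while "termfreq" gives 8/11 and 3/11.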
Kenneth Benoit
Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press (Chapter 13). Available at https://nlp.stanford.edu/IRbook/pdf/irbookonlinereading.pdf.
Jurafsky, D. & Martin, J.H. (2018). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of September 23, 2018 (Chapter 6, Naive Bayes). Available at https://web.stanford.edu/~jurafsky/slp3/.
## Example from 13.1 of _An Introduction to Information Retrieval_
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
trainingset <- dfm(txt, tolower = FALSE)
## document 5 has no label (NA), so it serves as the test document
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

## replicate IIR p261 prediction for test set (document 5)
(tmod1 <- textmodel_nb(trainingset, y = trainingclass, prior = "docfreq"))
summary(tmod1)
coef(tmod1)
predict(tmod1)

# contrast with other priors
predict(textmodel_nb(trainingset, y = trainingclass, prior = "uniform"))
predict(textmodel_nb(trainingset, y = trainingclass, prior = "termfreq"))

## replicate IIR p264 Bernoulli Naive Bayes
tmod2 <- textmodel_nb(trainingset, y = trainingclass, distribution = "Bernoulli",
                      prior = "docfreq")
predict(tmod2, newdata = trainingset[5, ])
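As a hand check on the Bernoulli case, the following base R sketch reproduces the textbook calculation for document 5 without calling quanteda. It assumes the Bernoulli formula from Manning, Raghavan, and Schütze, (documents containing the feature + smooth) / (documents in class + 2 × smooth), with document-frequency priors; whether the package computes its values in exactly this way is an assumption, and all object names are hypothetical.

```r
# Labelled training documents and the unlabelled test document
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese")
labels <- c("Y", "Y", "Y", "N")
d5 <- c("Chinese", "Chinese", "Chinese", "Tokyo", "Japan")

tokens <- strsplit(txt, " ")
vocab <- sort(unique(unlist(tokens)))
smooth <- 1

# I x J presence/absence matrix (1 if the feature occurs in the document)
present <- t(vapply(tokens, function(d) as.numeric(vocab %in% d),
                    numeric(length(vocab))))
colnames(present) <- vocab

# Per class: number of documents containing each feature, and class sizes
doc_counts <- rowsum(present, labels)
n_docs <- vapply(split(labels, labels), length, integer(1))

# Smoothed probability that a feature is present, given the class
PwGc <- (doc_counts + smooth) / (n_docs + 2 * smooth)

# Bernoulli likelihood: present features contribute p, absent ones 1 - p
in_d5 <- vocab %in% d5
lik <- apply(PwGc, 1, function(p) prod(ifelse(in_d5, p, 1 - p)))

prior <- n_docs / sum(n_docs)   # "docfreq" prior
score <- prior * lik            # unnormalized posterior
print(round(score, 4))
print(round(score / sum(score), 4))
```

The unnormalized scores come out at roughly 0.005 for "Y" and 0.022 for "N", so under the Bernoulli model document 5 is assigned to "N", in line with the textbook result.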
