# textmodel_nb: Naive Bayes classifier for texts

*In quanteda.textmodels: Scaling Models and Classifiers for Textual Data*

## Description

Fit a multinomial or Bernoulli Naive Bayes model, given a dfm and some training labels.

## Usage

```r
textmodel_nb(
  x,
  y,
  smooth = 1,
  prior = c("uniform", "docfreq", "termfreq"),
  distribution = c("multinomial", "Bernoulli")
)
```

## Arguments

- `x`: the dfm on which the model will be fit. Does not need to contain only the training documents.
- `y`: vector of training labels associated with each document in `x`. (These will be converted to factors if not already factors.)
- `smooth`: smoothing parameter for feature counts, added to the feature frequency totals by training class.
- `prior`: prior distribution on texts; one of `"uniform"`, `"docfreq"`, or `"termfreq"`. See Prior Distributions below.
- `distribution`: count model for text features; can be `"multinomial"` or `"Bernoulli"`. To fit a "binary multinomial" model, first convert the dfm to a binary matrix using `quanteda::dfm_weight(x, scheme = "boolean")`.

## Value

textmodel_nb() returns a list consisting of the following (where N is the total number of documents, V is the total number of features, and k is the total number of training classes):

- `call`: the original function call
- `param`: the k × V matrix of class conditional posterior estimates
- `x`: the N × V training dfm `x`
- `y`: the N-length training class vector `y`; documents with NA labels are not used in fitting but are retained in the saved `x` matrix
- `distribution`: character; the distribution of `x` for the NB model
- `priors`: numeric; the class prior probabilities
- `smooth`: numeric; the value of the smoothing parameter

## Prior distributions

Prior distributions refer to the prior probabilities assigned to the training classes, and the choice of prior distribution affects the calculation of the fitted probabilities. The default is uniform priors, which set the unconditional probability of observing any one class to be the same as that of observing any other class.

"Document frequency" means that the class priors will be taken from the relative proportions of the class documents in the training set. This approach is so common that it is assumed in many examples, such as the worked example from Manning, Raghavan, and Schütze (2008) below. It is not the default in quanteda, however, since the relative numbers of documents used to train a classifier may reflect nothing more informative than the relative availability of those documents. When the training classes are balanced in their number of documents (usually advisable), the empirically computed "docfreq" priors are equivalent to "uniform" priors.

Setting prior to "termfreq" makes the priors equal to the proportions of total feature counts found in the grouped documents in each training class, so that the classes with the largest total feature counts are assigned the largest priors. If the total count of features in each training class were the same, then "termfreq" would be equivalent to "uniform" priors.
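The three priors can be illustrated in base R on the worked example from Manning, Raghavan, and Schütze (2008) used in the Examples below (training documents d1–d4, three labeled Y and one labeled N). This is a hand computation of what `textmodel_nb()` does internally, not a call to the package:

```r
# training labels and per-document token counts for d1-d4 of the IIR example
y      <- c(d1 = "Y", d2 = "Y", d3 = "Y", d4 = "N")
ntoken <- c(d1 = 3,   d2 = 3,   d3 = 2,   d4 = 3)

classes <- sort(unique(y))                       # "N", "Y"

# "uniform": every class gets the same prior probability
prior_uniform <- setNames(rep(1 / length(classes), length(classes)), classes)

# "docfreq": proportion of training documents in each class
prior_docfreq <- table(y)[classes] / length(y)   # N = 0.25, Y = 0.75

# "termfreq": proportion of total training tokens in each class
prior_termfreq <- tapply(ntoken, y, sum)[classes] / sum(ntoken)  # N = 3/11, Y = 8/11
```

With balanced classes, `prior_docfreq` would collapse to `prior_uniform`, as noted above.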

## Smoothing parameter

The smooth value is added to the feature frequencies, aggregated by training class, to avoid zero frequencies in any class. This has the effect of giving more weight to infrequent term occurrences.
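Continuing the same worked example, the smoothed multinomial class-conditional probabilities can be computed by hand in base R: `smooth` is added to each aggregated class count, giving P(w|c) = (count(w, c) + smooth) / (tokens in c + smooth × V), where V is the vocabulary size. This sketch reproduces the estimates shown by `summary()` and `coef()` in the example output:

```r
# aggregated feature counts by class for training docs d1-d4 (IIR example)
vocab   <- c("Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan")
count_Y <- c(Chinese = 5, Beijing = 1, Shanghai = 1, Macao = 1, Tokyo = 0, Japan = 0)
count_N <- c(Chinese = 1, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1)

smooth <- 1
V <- length(vocab)

# add-one (Laplace) smoothed class-conditional probabilities
p_w_Y <- (count_Y + smooth) / (sum(count_Y) + smooth * V)  # e.g. Chinese: 6/14 = 0.4286
p_w_N <- (count_N + smooth) / (sum(count_N) + smooth * V)  # e.g. Chinese: 2/9  = 0.2222
```

Without smoothing, Tokyo and Japan would have zero probability in class Y, and any document containing them could never be classified Y; smoothing shifts probability mass toward such unobserved or infrequent terms.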

## Author

Kenneth Benoit

## References

Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press (Chapter 13). Available at https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf.

Jurafsky, D. & Martin, J.H. (2018). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of September 23, 2018 (Chapter 6, Naive Bayes). Available at https://web.stanford.edu/~jurafsky/slp3/.

## See Also

predict.textmodel_nb()

## Examples

```r
## Example from 13.1 of _An Introduction to Information Retrieval_
library("quanteda")
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
x <- dfm(tokens(txt), tolower = FALSE)
y <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

## replicate IIR p261 prediction for test set (document 5)
(tmod1 <- textmodel_nb(x, y, prior = "docfreq"))
summary(tmod1)
coef(tmod1)
predict(tmod1, type = "prob")
predict(tmod1)

# contrast with other priors
predict(textmodel_nb(x, y, prior = "uniform"))
predict(textmodel_nb(x, y, prior = "termfreq"))

## replicate IIR p264 Bernoulli Naive Bayes
tmod2 <- textmodel_nb(x, y, distribution = "Bernoulli", prior = "docfreq")
predict(tmod2, newdata = x[5, ], type = "prob")
predict(tmod2, newdata = x[5, ])
```
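As a cross-check on the IIR p264 Bernoulli result, the posterior for document d5 can be reproduced in base R without quanteda. The document frequencies, the `"docfreq"` priors, and the add-one smoothing (1 added to each class document-frequency count, 2 added to the denominator) follow the worked example:

```r
# Bernoulli Naive Bayes by hand for the IIR p264 example (training docs d1-d4)
vocab <- c("Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan")

# number of training documents in each class that contain each term
df_Y <- c(Chinese = 3, Beijing = 1, Shanghai = 1, Macao = 1, Tokyo = 0, Japan = 0)
df_N <- c(Chinese = 1, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1)
n_Y <- 3; n_N <- 1                           # class document counts
prior <- c(N = n_N, Y = n_Y) / (n_Y + n_N)   # "docfreq" priors: 0.25, 0.75

# Laplace-smoothed P(term present | class)
p_Y <- (df_Y + 1) / (n_Y + 2)
p_N <- (df_N + 1) / (n_N + 2)

# test document d5 = "Chinese Chinese Chinese Tokyo Japan": presence/absence only
present <- vocab %in% c("Chinese", "Tokyo", "Japan")

# Bernoulli likelihood multiplies over present AND absent terms
lik <- function(p) prod(ifelse(present, p, 1 - p))
score <- c(N = unname(prior["N"]) * lik(p_N),
           Y = unname(prior["Y"]) * lik(p_Y))
posterior <- score / sum(score)              # P(N | d5) ~ 0.8089, as in IIR p264
```

Note that, unlike the multinomial model, the absent terms (Beijing, Shanghai, Macao) contribute factors of 1 − P(term | class) to each likelihood, and repeated occurrences of Chinese in d5 count only once.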

### Example output

Package version: 2.1.2

```
Call:
textmodel_nb.dfm(x = x, y = y, prior = "docfreq")

Distribution: multinomial ; priors: 0.25 0.75 ; smoothing value: 1 ; 4 training documents ; 6 fitted features.

Call:
textmodel_nb.dfm(x = x, y = y, prior = "docfreq")

Class Priors:
(showing first 2 elements)
   N    Y
0.25 0.75

Estimated Feature Scores:
  Chinese Beijing Shanghai  Macao   Tokyo   Japan
N  0.2222  0.1111   0.1111 0.1111 0.22222 0.22222
Y  0.4286  0.1429   0.1429 0.1429 0.07143 0.07143

                 N          Y
Chinese  0.2222222 0.42857143
Beijing  0.1111111 0.14285714
Shanghai 0.1111111 0.14285714
Macao    0.1111111 0.14285714
Tokyo    0.2222222 0.07142857
Japan    0.2222222 0.07142857

            N         Y
d1 0.06516267 0.9348373
d2 0.06516267 0.9348373
d3 0.11850060 0.8814994
d4 0.62587672 0.3741233
d5 0.31024139 0.6897586

d1 d2 d3 d4 d5
 Y  Y  Y  N  Y
Levels: N Y

d1 d2 d3 d4 d5
 Y  Y  Y  N  N
Levels: N Y

d1 d2 d3 d4 d5
 Y  Y  Y  N  Y
Levels: N Y

           N         Y
d5 0.8089332 0.1910668

d5
 N
Levels: N Y
```

quanteda.textmodels documentation built on April 6, 2021, 9:06 a.m.