# sidiwang/NaiveBayes: Naive Bayes Classification and Prediction

## Introduction

The NaiveBayes package provides an efficient implementation of the popular Naive Bayes classifier. It is fast, user friendly, and written in base R and Rcpp. Like many other classifier packages, the general function `NaiveBayes()` detects the class of each feature in the dataset. The `predict()` function takes a fitted NaiveBayes model and a new data set and produces the classifications: either the raw probabilities generated by the model or the predicted classes themselves.

### What is Naive Bayes?

Naive Bayes is one of the simplest and most popular machine learning classification algorithms. It applies Bayes' theorem of probability to predict the class of unseen data, under the assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature (i.e. it assumes all of your predictors X are independent).

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes is known to perform competitively with highly sophisticated classification methods.

### What is Bayes' Theorem?

Bayes' theorem provides a way of calculating the posterior probability:

$P(C_k|x)=P(x|C_k)P(C_k)/P(x)$

where:

$P(C_k|x)$ is the posterior probability of the class ($C_k$, target) given the predictor ($x$, attributes).

$P(C_k)$ is the prior probability of the class.

$P(x|C_k)$ is the likelihood, i.e. the probability of the predictor given the class.

$P(x)$ is the prior probability of the predictor.
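As a quick illustration, the posterior can be computed directly from these four quantities. The numbers below (a toy spam-filter scenario) are made up for illustration only:

```r
p_c      <- 0.2   # prior P(C): probability a message is spam
p_x_c    <- 0.6   # likelihood P(x | C): "free" appears in spam
p_x_notc <- 0.05  # P(x | not C): "free" appears in non-spam

# evidence P(x), by the law of total probability
p_x <- p_x_c * p_c + p_x_notc * (1 - p_c)

# Bayes' theorem: P(C | x) = P(x | C) P(C) / P(x)
posterior <- p_x_c * p_c / p_x
posterior
# 0.75
```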

### How does the Naive Bayes algorithm work?

Using the chain rule for repeated applications of the definition of conditional probability:

$P(C_k,x_1,...,x_n) = P(x_1, ..., x_n,C_k) = P(x_1|x_2,...,x_n,C_k)P(x_2|x_3,...,x_n,C_k)...P(x_n|C_k)P(C_k)$

Now assume that all features in $x$ are mutually independent, conditional on the category $C_k$. Under this assumption, we have:

$P(C_k|x_1,...,x_n)=1/Z \times P(x_1|C_k)\times P(x_2|C_k)\times ...\times P(x_n|C_k)\times P(C_k)$

where the evidence $Z = \sum_k P(C_k)P(x|C_k)$ is a scaling factor depending only on $x_1,...,x_n$, i.e. a constant if the values of the feature variables are known (Wikipedia: Naive Bayes classifier).
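The factorized formula above can be checked by hand. The sketch below uses made-up probabilities for two binary features and two classes:

```r
prior <- c(A = 0.5, B = 0.5)   # P(C_k)
p_x1  <- c(A = 0.8, B = 0.3)   # P(x1 = 1 | C_k)
p_x2  <- c(A = 0.4, B = 0.9)   # P(x2 = 1 | C_k)

# observe x1 = 1, x2 = 1: multiply the prior by each conditional likelihood
unnorm <- prior * p_x1 * p_x2          # P(C_k) * prod_i P(x_i | C_k)

# dividing by the evidence Z = sum of the unnormalized terms
posterior <- unnorm / sum(unnorm)
posterior
```

Class A wins here (0.16 vs. 0.135 before normalization), even though class B explains the second feature better, because the product weighs all features jointly.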

### Pros and Cons of Naive Bayes

Pros:

• Easy and fast for predicting the class of test data; also performs well in multi-class prediction.

• When the independence assumption holds, a Naive Bayes classifier performs better than other models such as logistic regression, and needs less training data.

• Performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed.

Cons:

• If a categorical variable has a category in the test data that was not observed in the training data, the model assigns it zero probability and cannot make a prediction. To solve this, we can use a smoothing technique such as Laplace (add-one) smoothing.

• The assumption of independent predictors: in real life, it is almost impossible to get a set of predictors that are completely independent.
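The zero-frequency problem and its smoothing fix can be sketched in a few lines. The counts below are made up; level `"c"` never appears in the training data for this class:

```r
counts <- c(a = 7, b = 3, c = 0)  # training counts of each level in one class

# without smoothing, P(c | class) = 0 zeroes out the whole product
unsmoothed <- counts / sum(counts)

# add-one (Laplace) smoothing gives every level a pseudo-count,
# so no conditional probability is ever exactly zero
smoothed <- (counts + 1) / sum(counts + 1)
smoothed
```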

## Installation

```r
library(devtools)
install_github("sidiwang/NaiveBayes", build_vignettes = TRUE)
```


After successful installation, the package can be used with:

```r
library(NaiveBayes)
```


## Main Functions

The general function `NaiveBayes()` detects the class of each feature in the dataset and assumes a normal distribution for continuous variables. The prediction function `predict.NaiveBayes()` can be called as in many other classification packages: `predict(model_name, newdata, ...)`

## Numerical Underflow

To avoid numerical underflow, i.e. when $n \gg 0$ in

$P(C_k|x_1,...,x_n)=1/Z \times P(x_1|C_k)\times P(x_2|C_k)\times ...\times P(x_n|C_k)\times P(C_k),$

these calculations are performed on the log scale:

$log(P(C_k|x_1,...,x_n)) \propto log(P(C_k))+\sum_1^n log(P(x_i|C_k))$

Lastly, the class with the highest log-posterior probability is chosen as the prediction, which is equivalent to `predict(..., type = "class")`.

If instead the conditional class probabilities $P(C_k|X=x)$ are of main interest, which is equivalent to `predict(..., type = "prob")`, then the log-posterior probabilities are transformed back to the original space and normalized.

To speed up the calculation for large datasets, this package further simplifies the formula above into matrix multiplication in R.
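The log-scale trick described above can be sketched as follows. This is an illustrative reimplementation with hypothetical priors and per-feature likelihoods, not the package's internal code:

```r
# log-priors and per-feature log-likelihoods for one observation (made up)
log_prior <- log(c(A = 0.5, B = 0.5))
log_lik   <- rbind(A = log(c(0.8, 0.4)),   # log P(x_i | C = A)
                   B = log(c(0.3, 0.9)))   # log P(x_i | C = B)

# unnormalized log-posterior: log P(C_k) + sum_i log P(x_i | C_k)
log_post <- log_prior + rowSums(log_lik)

# equivalent of type = "class": pick the class with the largest log-posterior
class_pred <- names(which.max(log_post))

# equivalent of type = "prob": transform back to the original space and
# normalize, subtracting the maximum first to guard against underflow
prob <- exp(log_post - max(log_post))
prob <- prob / sum(prob)
```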

## General Usage

• model fitting style 1: `NaiveBayes(formula, data, ...)`

```r
# simulate data
n <- 100
set.seed(1)
data <- data.frame(class   = sample(c("classA", "classB"), n, TRUE),
                   bern    = sample(LETTERS[1:2], n, TRUE),
                   cat     = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE, FALSE), n, TRUE),
                   norm    = rnorm(n),
                   count   = rpois(n, lambda = c(5, 15)))

# fit model
nb <- NaiveBayes(class ~ ., data)
# check output
nb
```

• model fitting style 2: `NaiveBayes(x, y, ...)`

```r
# prepare data
data(iris)
x <- iris[, -5]
y <- iris[, 5]

# fit model
nb2 <- NaiveBayes(x, y)
# check output
nb2
```


## Prediction

```r
# prepare data
set.seed(2)
iris_shuffle <- iris[sample(nrow(iris)), ]
training <- iris_shuffle[1:130, ]
x <- training[, -5]
y <- training[, 5]

testing <- iris_shuffle[131:150, -5]

# fit model
nb3 <- NaiveBayes(x, y)
# predict (type = "class")
prediction <- predict(nb3, testing)
# check output
prediction

# predict (type = "raw")
prediction_raw <- predict(nb3, testing, type = "raw")
# check output
prediction_raw
```


## Performance Comparison

```r
# load data: 2213 variables and 100 observations
data(tweet1)
x <- as.data.frame(tweet1[, -1])
y <- tweet1[, 1]

library(e1071)
library(bench)
library(rmarkdown)
library(ggplot2)
library(tidyr)
library(ggbeeswarm)

# check that the outputs are the same
model_a <- NaiveBayes::NaiveBayes(x, y)
model_b <- e1071::naiveBayes(x, y)

# Compare fitted model output: since this dataset contains too many variables,
# we only compare the output frequency tables of 4 randomly selected variables.
# Since each method organizes its output differently, we only check the results
# on one randomly selected row. Each time the code below is run, the checked
# variables and output row are RANDOM.

# for fairness, we RANDOMLY select four variable indices and one output row
selected_index <- sample(1:2213, 4)
selected_row <- sample(c("negative", "neutral", "positive"), 1)
results <- 0
for (i in selected_index) {
  result <- all.equal(model_a$results[[i]][selected_row, ],
                      model_b$tables[[i]][selected_row, ],
                      tolerance = 1.5e-5)
  results <- result + results
}

ifelse(results == 4, "results are all equal", "results are not equal")

# compare model fitting performance
# again, we randomly select one variable to compare
# NOTE: the warning "Some expressions had a GC in every iteration; so filtering
# is disabled." may pop up. The plot below shows that, very often, it is
# e1071's naiveBayes that causes this warning. Since GC behavior cannot be
# controlled by R users, this warning should not be regarded as an error in
# our package.
idx <- sample(1:2213, 1)

result <- bench::mark(
  NaiveBayes::NaiveBayes(x, y)$results[[idx]][selected_row, ],
  e1071::naiveBayes(x, y)$tables[[idx]][selected_row, ]
)

paged_table(result)

plot(result)
```


As can be seen from the table and plots above, the NaiveBayes package performs significantly better than the `naiveBayes` function in package e1071, with less memory allocation as well.

This verifies that rewriting R code in Rcpp can increase its efficiency.

(We verified the comparison conclusion above many times before publishing this vignette; if you get a contrary outcome by chance, please run it a few more times and compare.)




sidiwang/NaiveBayes documentation built on Nov. 26, 2019, 9 a.m.