The NaiveBayes package provides an efficient, user-friendly implementation of the popular Naive Bayes classifier, written in base R and Rcpp. Like many other classifier packages, the general function NaiveBayes() detects the class of each feature in the dataset. The predict function applies a fitted NaiveBayes model to a new data set to produce classifications, either as the raw probabilities generated by the model or as the predicted classes themselves.
Naive Bayes is one of the simplest and most popular machine learning classification algorithms. It applies Bayes' theorem to predict the class of unseen data under the assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature (i.e., it assumes the predictors in X are all independent).
A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability:
$P(C_k|x)=P(x|C_k)P(C_k)/P(x)$
where:
$P(C_k|x)$ is the posterior probability of the class ($C_k$, target) given the predictor ($x$, attributes).
$P(C_k)$ is the prior probability of the class.
$P(x|C_k)$ is the likelihood, i.e. the probability of the predictor given the class.
$P(x)$ is the prior probability of the predictor (the evidence).
Using the chain rule for repeated applications of the definition of conditional probability:
$P(C_k,x_1,...,x_n) = P(x_1, ..., x_n,C_k) = P(x_1|x_2,...,x_n,C_k)P(x_2|x_3,...,x_n,C_k)...P(x_n|C_k)P(C_k)$
Assuming that all features in $x$ are mutually independent conditional on the category $C_k$, we have:
$P(C_k|x_1,...,x_n)=1/Z \times P(x_1|C_k)\times P(x_2|C_k)\times ...\times P(x_n|C_k)\times P(C_k)$
where the evidence $Z = \sum_k P(C_k)P(x|C_k)$ is a scaling factor that depends only on $x_1,...,x_n$; that is, a constant if the values of the feature variables are known (Wikipedia: Naive Bayes classifier).
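As a concrete illustration of this factorization, the following base R sketch (with made-up priors and likelihoods, not values produced by this package) computes the posterior for two classes and two conditionally independent binary features:

```r
# Toy example: two classes, two conditionally independent binary features.
# Priors P(C_k) and per-class likelihoods P(x_i | C_k) are made-up numbers.
prior <- c(A = 0.6, B = 0.4)
p_x1  <- c(A = 0.8, B = 0.3)  # P(x1 = TRUE | C_k)
p_x2  <- c(A = 0.5, B = 0.9)  # P(x2 = TRUE | C_k)

# Unnormalized posterior for an observation with x1 = TRUE and x2 = TRUE
unnorm <- prior * p_x1 * p_x2
posterior <- unnorm / sum(unnorm)  # dividing by the evidence Z normalizes
posterior
```

Dividing by `sum(unnorm)` plays the role of $1/Z$ above, so the two entries sum to one.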
Pros:
Easy and fast at predicting the class of test data; it also performs well in multi-class prediction.
When the independence assumption holds, Naive Bayes performs better compared to other models such as logistic regression, and it needs less training data.
It performs well with categorical input variables compared to numerical ones. For numerical variables, a normal distribution is assumed.
Cons:
If a categorical variable has a category in the test data that was not observed in the training data, the model assigns it zero probability and cannot make a prediction. To solve this, we can use a smoothing technique.
The assumption of independent predictors: in real life, it is almost impossible to get a set of predictors that are completely independent.
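The smoothing fix mentioned above is commonly Laplace (add-one) smoothing. A minimal base R sketch with hypothetical counts, showing how it removes the zero-probability problem:

```r
# Counts of a categorical feature within one class; category "c" was never seen.
counts <- c(a = 7, b = 3, c = 0)

# Without smoothing, P(c | class) = 0, which zeroes out the whole product.
p_unsmoothed <- counts / sum(counts)

# Laplace smoothing adds 1 to every cell before normalizing.
alpha <- 1
p_smoothed <- (counts + alpha) / (sum(counts) + alpha * length(counts))
p_smoothed  # every category now has non-zero probability
```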
This package can be downloaded from the GitHub repository NaiveBayes with:
library(devtools)
install_github("sidiwang/NaiveBayes", build_vignettes = TRUE)
After successful installation, the package can be used with:
library(NaiveBayes)
The general function NaiveBayes()
detects the class of each feature in the dataset and assumes a normal distribution for continuous variables.
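Concretely, the normality assumption means that the class-conditional likelihood $P(x|C_k)$ of a continuous feature is evaluated with the normal density, using per-class mean and standard deviation estimates. The sketch below uses made-up estimates rather than this package's internals:

```r
# Hypothetical per-class estimates for one continuous feature.
mu    <- c(classA = 0.0, classB = 2.0)
sigma <- c(classA = 1.0, classB = 1.5)

# Likelihood P(x | C_k) of observing x = 1.2 under each class.
x <- 1.2
lik <- dnorm(x, mean = mu, sd = sigma)
lik
```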
The prediction function predict.NaiveBayes()
can be called like many other classification packages: predict(model_name, newdata, ...)
To avoid numerical underflow, i.e. when $n \gg 0$ in $P(C_k|x_1,...,x_n)=1/Z \times P(x_1|C_k)\times P(x_2|C_k)\times ...\times P(x_n|C_k)\times P(C_k)$,
these calculations are performed on the log scale:
$\log(P(C_k|x_1,...,x_n)) \propto \log(P(C_k))+\sum_{i=1}^{n} \log(P(x_i|C_k))$
Lastly, the class with the highest log-posterior probability is chosen as the prediction, which is equivalent to predict(..., type = "class").
If instead the conditional class probabilities $P(C_k|X=x)$ are of main interest, which is equivalent to predict(..., type = "prob"), then the log-posterior probabilities are transformed back to the original space and normalized.
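The back-transformation can be done stably with the log-sum-exp trick: subtract the maximum log-posterior before exponentiating, then normalize. A base R sketch with hypothetical log-posterior values (not taken from the package):

```r
# Hypothetical unnormalized log-posteriors, log P(C_k) + sum_i log P(x_i | C_k);
# calling exp() directly on values this small underflows to 0.
log_post <- c(classA = -750, classB = -755, classC = -770)

# Log-sum-exp: shift by the maximum so the largest term becomes exp(0) = 1.
m <- max(log_post)
prob <- exp(log_post - m) / sum(exp(log_post - m))
prob  # normalized conditional class probabilities
```

Without the shift, every `exp(log_post)` term would be exactly zero and the normalization would return NaN.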
To speed up the calculation for large datasets, this package further simplifies the formula above into matrix multiplication in R.
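For example, with 0/1 indicator features, the sum of log-likelihoods for every observation and class can be obtained with matrix products instead of a loop over features. The sketch below (made-up probabilities, not the package's actual code) illustrates the idea:

```r
# X: n observations x p binary features (0/1); hypothetical data.
X <- matrix(c(1, 0, 1,
              0, 1, 1), nrow = 2, byrow = TRUE)

# log P(x_j = 1 | C_k) for each feature (rows) and class (columns); made-up values.
logp1 <- matrix(log(c(0.8, 0.3,
                      0.5, 0.9,
                      0.6, 0.4)), nrow = 3, byrow = TRUE)
logp0 <- log(1 - exp(logp1))  # log P(x_j = 0 | C_k)

log_prior <- log(c(0.6, 0.4))

# Two matrix products replace a loop over features:
# entry [i, k] holds log P(C_k) + sum_j log P(x_ij | C_k).
log_post <- X %*% logp1 + (1 - X) %*% logp0 +
  matrix(log_prior, nrow(X), 2, byrow = TRUE)
log_post
```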
NaiveBayes(formula, data,...)
# Simulate data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE, FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5, 15)))
# fit model
nb <- NaiveBayes(class ~ ., data)
# check output
nb
NaiveBayes(x, y,...)
# prepare data:
data(iris)
x = iris[, -5]
y = iris[, 5]
# fit model
nb2 <- NaiveBayes(x, y)
# check output
nb2
# prepare data:
set.seed(2)
iris_shuffle = iris[sample(nrow(iris)), ]
training = iris_shuffle[1:130, ]
x = training[, -5]
y = training[, 5]
testing = iris_shuffle[131:150, -5]
# fit model
nb3 <- NaiveBayes(x, y)
# predict (type = "class")
prediction = predict(nb3, testing)
# check output
prediction
# predict (type = "raw")
prediction_raw = predict(nb3, testing, type = "raw")
# check output
prediction_raw
# load data, with 2213 variables and 100 observations.
data(tweet1)
x = as.data.frame(tweet1[, -1])
y = tweet1[, 1]
# load other packages
library(e1071)
library(bench)
library(rmarkdown)
library(ggplot2)
library(tidyr)
library(ggbeeswarm)
# check the outputs are the same
model_a = NaiveBayes::NaiveBayes(x, y)
model_b = e1071::naiveBayes(x, y)
# compare fitted model output: since this dataset contains too many variables, we only compare
# the output frequency tables of 4 randomly selected variables. Since each method organizes its
# output differently, we only check the results on one randomly selected row. Each time the code
# below is run, the checked variables and output row are RANDOM.
# for fairness, we RANDOMLY select four variable indices and one output row for testing
selected_index = sample(1:2213, 4)
selected_row = sample(c("negative", "neutral", "positive"), 1)
results = 0
for (i in selected_index) {
  result = all.equal(model_a$results[[i]][selected_row, ],
                     model_b$tables[[i]][selected_row, ],
                     tolerance = 1.5e-5)
  results = result + results
}
ifelse(results == 4, "results are all equal", "results are not equal")
# compare model-fitting performance
# again, we randomly select one variable to compare
# NOTE: this warning may appear: "Warning: Some expressions had a GC in every iteration; so filtering is disabled."
# The plot below shows that, very often, it is e1071's naiveBayes that causes this warning. Since this
# is not a coding issue on our side, and GC behavior cannot be controlled by R users, the warning should
# not be regarded as an error in our package.
idx = sample(1:2213, 1)
result = bench::mark(NaiveBayes::NaiveBayes(x, y)$results[[idx]][selected_row, ],
                     e1071::naiveBayes(x, y)$tables[[idx]][selected_row, ])
paged_table(result)
plot(result)
As can be seen from the table and plots above, the NaiveBayes package performs significantly better than the naiveBayes function in the e1071 package, with less memory allocation as well.
This verifies that rewriting R code in Rcpp can increase its efficiency.
(We verified the comparison above many times before publishing this vignette; if you get a contrary outcome by chance, please rerun the benchmark a few more times and compare.)