How to use CountVectorizer in R ?
In superml: Build Machine Learning Models Like Using Python's Scikit-Learn Library in R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this tutorial, we'll look at how to create bag of words model (token occurence count matrix) in R in two simple steps with superml. Superml borrows speed gains using parallel computation and optimised functions from data.table R package. Bag of words model is often use to analyse text pattern using word occurences in a given text.

Install

You can install latest cran version using (recommended):

install.packages("superml")

You can install the developmemt version directly from github using:

devtools::install_github("saraswatmks/superml")

Caveats on superml installation

For machine learning, superml is based on the existing R packages. Hence, while installing the package, we don't install all the dependencies. However, while training any model, superml will automatically install the package if its not found. Still, if you want to install all dependencies at once, you can simply do:

install.packages("superml", dependencies=TRUE)

Sample Data

First, we'll create a sample data. Feel free to run it alongside in your laptop and check the computation.

library(superml)

# should be a vector of texts
sents <-  c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work')

# generate more sentences
n <- 10
sents <- rep(sents, n) 
length(sents)

For sample, we've generated 50 documents. Let's create the features now. For ease, superml uses the similar API layout as python scikit-learn.

# initialise the class
cfv <- CountVectorizer$new(max_features = 10, remove_stopwords = FALSE)

# generate the matrix
cf_mat <- cfv$fit_transform(sents)

head(cf_mat, 3)

Few observations:

remove_stopwords = FALSE defaults to TRUE. We set it to FALSE since most of the words in our dummy sents are stopwords.
max_features = 10 select the top 10 features (tokens) based on frequency.

Now, let's generate the matrix using its ngram_range features.

# initialise the class
cfv <- CountVectorizer$new(max_features = 10, remove_stopwords = FALSE, ngram_range = c(1, 3))

# generate the matrix
cf_mat <- cfv$fit_transform(sents)

head(cf_mat, 3)

Few observations:

ngram_range = c(1,3) set the lower and higher range respectively of the resulting ngram tokens.

Usage for a Machine Learning Model

In order to use Count Vectorizer as an input for a machine learning model, sometimes it gets confusing as to which method fit_transform, fit, transform should be used to generate features for the given data. Here's a way to do:

library(data.table)
library(superml)

# use sents from above
sents <-  c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work',
          'how does it work')

# create dummy data
train <- data.table(text = sents, target = rep(c(0,1), 3))
test <- data.table(text = sample(sents), target = rep(c(0,1), 3))

Let's see how the data looks like:

head(train, 3)

head(test, 3)

Now, we generate features for train-test data:

# initialise the class
cfv <- CountVectorizer$new(max_features = 12, remove_stopwords = FALSE, ngram_range = c(1,3))

# we fit on train data
cfv$fit(train$text)

train_cf_features <- cfv$transform(train$text)
test_cf_features <- cfv$transform(test$text)

dim(train_cf_features)
dim(test_cf_features)

We generate 12 features for each of the given data. Let's see how they look:

head(train_cf_features, 3)

head(test_cf_features, 3)

Finally, to train a machine learning model on this, you can simply do:

# ensure the input to classifier is a data.table or data.frame object
x_train <- data.table(cbind(train_cf_features, target = train$target))
x_test <- data.table(test_cf_features)


xgb <- RFTrainer$new(n_estimators = 10)
xgb$fit(x_train, "target")

predictions <- xgb$predict(x_test)
predictions

Summary

In this tutorial, we discussed how to use superml's countvectorizer (also known as bag of words model) to create word counts matrix and train a machine learning model on it.

Any scripts or data that you put into this service are public.

superml documentation built on May 29, 2024, 1:09 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

superml
Build Machine Learning Models Like Using Python's Scikit-Learn Library in R

How to use CountVectorizer in R ?
In superml: Build Machine Learning Models Like Using Python's Scikit-Learn Library in R

Install

Caveats on superml installation

Sample Data

Usage for a Machine Learning Model

Summary

Try the superml package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

superml Build Machine Learning Models Like Using Python's Scikit-Learn Library in R

How to use CountVectorizer in R ? In superml: Build Machine Learning Models Like Using Python's Scikit-Learn Library in R

Install

Caveats on superml installation

Sample Data

Usage for a Machine Learning Model

Summary

Try the superml package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

superml
Build Machine Learning Models Like Using Python's Scikit-Learn Library in R

How to use CountVectorizer in R ?
In superml: Build Machine Learning Models Like Using Python's Scikit-Learn Library in R