CountVectorizer: Count Vectorizer
In superml: Build Machine Learning Models Like Using Python's Scikit-Learn Library in R

Count Vectorizer

Description

Creates a CountVectorizer model.

Details

Given a list of texts, it builds a bag-of-words model and returns a sparse matrix of token counts.
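
To make the bag-of-words idea concrete, here is a minimal base-R sketch. It is an illustration only, not superml's implementation: the `docs` vector is made up, the result is a dense matrix rather than a sparse one, and punctuation cleaning and stopword removal are left out.

```r
# Illustrative bag-of-words count matrix in base R (not superml's code).
docs <- c("the cat sat", "the cat saw the dog")

# Tokenize: lowercase, then split on single spaces
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# Vocabulary: sorted unique tokens across all documents
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per vocabulary term
counts <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab
counts
```

Each cell holds how often the column's term occurs in the row's document, which is the same shape of result the class is described as returning.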

Public fields

sentences: a list containing sentences
max_df: When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value lies between 0 and 1.
min_df: When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value lies between 0 and 1.
max_features: Build a vocabulary that considers only the top max_features terms, ordered by term frequency across the corpus.
ngram_range: The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split: splitting criterion for strings, default: " "
lowercase: convert all characters to lowercase before tokenizing
regex: regular expression to use for text cleaning
remove_stopwords: a list of stopwords to use; by default, the built-in list of standard English stopwords is used
model: internal attribute which stores the count model
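
The min_df and max_df thresholds can be pictured with a small base-R sketch. It assumes document frequency means the share of documents containing a term, which matches the stated 0 to 1 range; the `docs` vector and the threshold values are hypothetical, and this is not superml's own code.

```r
# Illustrative document-frequency filter in base R (not superml's code).
docs <- c("red apple", "red car", "red sky", "green apple")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Document frequency: share of documents containing each term at least once
df <- vapply(
  vocab,
  function(term) mean(vapply(tokens, function(tok) term %in% tok, logical(1))),
  numeric(1)
)

min_df <- 0.3  # hypothetical threshold
max_df <- 0.7  # hypothetical threshold

# Drop terms strictly below min_df or strictly above max_df
kept <- names(df)[df >= min_df & df <= max_df]
kept
```

Here "red" appears in three of four documents (0.75) and is cut by max_df, the terms seen in a single document (0.25) are cut by min_df, and only "apple" (0.5) survives.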

Methods

Method `new()`

Usage

CountVectorizer$new(
  min_df,
  max_df,
  max_features,
  ngram_range,
  regex,
  remove_stopwords,
  split,
  lowercase
)

Arguments

min_df: numeric. When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value lies between 0 and 1.
max_df: numeric. When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value lies between 0 and 1.
max_features: integer. Build a vocabulary that considers only the top max_features terms, ordered by term frequency across the corpus.
ngram_range: vector. The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex: character. Regular expression to use for text cleaning.
remove_stopwords: list. A list of stopwords to use; by default, the built-in list of standard English stopwords is used.
split: character. Splitting criterion for strings, default: " "
lowercase: logical. Convert all characters to lowercase before tokenizing, default: TRUE
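
The effect of ngram_range on word n-grams can be sketched in base R as follows. The helper `word_ngrams()` is hypothetical and written only for this illustration; it is not a superml function and may differ from how the package builds n-grams internally.

```r
# Illustrative word n-gram builder in base R (hypothetical helper, not superml's code).
word_ngrams <- function(tokens, ngram_range = c(1, 1)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    # Skip n-gram sizes longer than the token sequence
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("alone", "in", "the", "dark")
word_ngrams(tokens, c(1, 1))  # unigrams only
word_ngrams(tokens, c(1, 2))  # unigrams and bigrams
word_ngrams(tokens, c(2, 2))  # bigrams only
```

With four tokens, c(1, 1) gives four unigrams, c(2, 2) gives three bigrams, and c(1, 2) gives all seven.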

Details

Create a new 'CountVectorizer' object.

Returns

A 'CountVectorizer' object.

Examples

library(superml)
cv <- CountVectorizer$new(min_df = 0.1)

Method `fit()`

Usage

CountVectorizer$fit(sentences)

Arguments

sentences: a list of text sentences

Details

Fits the CountVectorizer model on the given sentences.

Returns

NULL

Examples

library(superml)
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df = 0.1)
cv$fit(sents)

Method `fit_transform()`

Usage

CountVectorizer$fit_transform(sentences)

Arguments

sentences: a list of text sentences

Details

Fits the CountVectorizer model and returns a sparse matrix of token counts.

Returns

A sparse matrix containing the token counts for each given sentence.

Examples

library(superml)
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df = 0.1)
cv_count_matrix <- cv$fit_transform(sents)

Method `transform()`

Usage

CountVectorizer$transform(sentences)

Arguments

sentences: a list of new text sentences

Details

Returns a matrix of token counts for the given sentences, using the fitted model.

Returns

A sparse matrix containing the token counts for each given sentence.

Examples

library(superml)
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night", 'mothers day')
cv <- CountVectorizer$new(min_df = 0.1)
cv$fit(sents)                               # learn the vocabulary
cv_count_matrix <- cv$transform(new_sents)  # count tokens in new sentences
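
Conceptually, transforming new sentences means counting only the terms already in the fitted vocabulary. The base-R sketch below illustrates that idea under the assumption that out-of-vocabulary tokens are ignored, as in scikit-learn's CountVectorizer; the source does not state how superml treats them, and the `vocab` vector is a made-up stand-in for a fitted vocabulary.

```r
# Illustrative "transform" against a fixed vocabulary in base R (not superml's code).
vocab <- c("alone", "dark", "lot", "mothers")  # hypothetical fitted vocabulary
new_docs <- c("dark at night", "mothers day")

tokens <- strsplit(tolower(new_docs), " ", fixed = TRUE)

# factor() with fixed levels turns unseen tokens into NA, which table() drops
counts <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab
counts
```

The columns stay aligned with the fitted vocabulary, so tokens such as "at", "night" and "day" contribute nothing to the counts.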

Method `clone()`

The objects of this class are cloneable with this method.

Usage

CountVectorizer$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

superml documentation built on May 29, 2024, 1:09 a.m.
