| CountVectorizer | R Documentation |
Creates a CountVectorizer model.
Given a list of text, it generates a bag of words model and returns a sparse matrix consisting of token counts.
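As a rough illustration of the bag-of-words idea described above, the following base R sketch builds a dense count matrix by hand. It is not the package's implementation: the variable names (`tokens`, `vocab`, `counts`) are hypothetical, punctuation handling and stopword removal are skipped, and the result is a plain dense matrix rather than a sparse one.

```r
# Minimal bag-of-words sketch in base R (illustrative only).
sents <- c("i am alone in dark", "alone in the dark", "many mothers in the lot")

# Tokenize: lowercase, then split on a single space.
tokens <- strsplit(tolower(sents), " ", fixed = TRUE)

# Vocabulary: sorted unique tokens across the whole corpus.
vocab <- sort(unique(unlist(tokens)))

# Count matrix: one row per sentence, one column per vocabulary term.
counts <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab
```

Each row of `counts` then holds the number of times every vocabulary term occurs in the corresponding sentence.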
sentences: a list containing sentences
max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold; the value lies between 0 and 1.
min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold; the value lies between 0 and 1.
max_features: Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range: The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split: splitting criteria for strings, default: " "
lowercase: convert all characters to lowercase before tokenizing
regex: regular expression to use for text cleaning
remove_stopwords: a list of stopwords to use; by default it uses its inbuilt list of standard stopwords
model: internal attribute which stores the count model
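The min_df and max_df thresholds described above act on document frequency, that is, the share of documents in which a term appears at least once. A minimal sketch of that filtering in base R follows; `filter_by_df` is a hypothetical helper written for this illustration and is not part of the class.

```r
# Hypothetical helper: keep terms whose document frequency is neither
# strictly lower than min_df nor strictly higher than max_df.
filter_by_df <- function(tokens, min_df = 0, max_df = 1) {
  terms <- sort(unique(unlist(tokens)))
  # Document frequency: fraction of documents containing the term at least once.
  df <- vapply(
    terms,
    function(term) mean(vapply(tokens, function(tok) term %in% tok, logical(1))),
    numeric(1)
  )
  terms[df >= min_df & df <= max_df]
}

tokens <- strsplit(c("alone in dark", "alone in the dark", "in the lot"),
                   " ", fixed = TRUE)
kept <- filter_by_df(tokens, min_df = 0.5, max_df = 0.9)
```

Here "in" occurs in every document (frequency 1) and is dropped by max_df, while "lot" occurs in one of three documents and is dropped by min_df.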
new()
CountVectorizer$new(min_df, max_df, max_features, ngram_range, regex, remove_stopwords, split, lowercase)
min_df: numeric. When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold; the value lies between 0 and 1.
max_df: numeric. When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold; the value lies between 0 and 1.
max_features: integer. Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range: vector. The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex: character. Regular expression to use for text cleaning.
remove_stopwords: list. A list of stopwords to use; by default it uses its inbuilt list of standard English stopwords.
split: character. Splitting criteria for strings, default: " "
lowercase: logical. Convert all characters to lowercase before tokenizing, default: TRUE
Create a new 'CountVectorizer' object.
A 'CountVectorizer' object.
cv = CountVectorizer$new(min_df=0.1)
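To make the ngram_range argument more concrete, here is a small base R sketch of how word n-grams can be generated for every n between the lower and upper bound. `word_ngrams` is a hypothetical helper for illustration; the class may build its n-grams differently.

```r
# Hypothetical helper: build word n-grams for every n in ngram_range.
word_ngrams <- function(tokens, ngram_range = c(1, 1)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    # Skip sizes longer than the token sequence itself.
    if (n > length(tokens)) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("alone", "in", "the", "dark")
uni_bi <- word_ngrams(tokens, c(1, 2))   # unigrams and bigrams
bi_only <- word_ngrams(tokens, c(2, 2))  # bigrams only
```

With c(1, 2) the result contains the four single words followed by the three adjacent word pairs; with c(2, 2) only the pairs remain.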
fit()
CountVectorizer$fit(sentences)
sentences: a list of text sentences
Fits the CountVectorizer model on the given sentences.
NULL
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
fit_transform()
CountVectorizer$fit_transform(sentences)
sentences: a list of text sentences
Fits the CountVectorizer model and returns a sparse matrix of token counts.
A sparse matrix containing the token counts for each given sentence.
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)
transform()
CountVectorizer$transform(sentences)
sentences: a list of new text sentences
Returns a matrix of token counts for the given sentences.
A sparse matrix containing the token counts for each given sentence.
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)
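The difference between fitting and transforming can be sketched in base R as follows. The sketch assumes that transform counts only terms already in the fitted vocabulary and ignores unseen ones, which is the usual bag-of-words convention but is not stated in this page. The vocabulary below and the helper `count_with_vocab` are hypothetical.

```r
# Vocabulary assumed to have been learned at fit time (illustrative only).
vocab <- c("alone", "dark", "lot", "mothers")

# Hypothetical helper: count only tokens that belong to the fitted vocabulary.
count_with_vocab <- function(sentences, vocab) {
  tokens <- strsplit(tolower(sentences), " ", fixed = TRUE)
  counts <- t(vapply(
    tokens,
    # Tokens outside `vocab` become NA in the factor and are not counted.
    function(tok) as.integer(table(factor(tok, levels = vocab))),
    integer(length(vocab))
  ))
  colnames(counts) <- vocab
  counts
}

new_counts <- count_with_vocab(c("dark at night", "mothers day"), vocab)
```

In this sketch "at", "night" and "day" contribute nothing because they were not in the vocabulary, so the column layout of `new_counts` matches the fitted one.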
clone()
The objects of this class are cloneable with this method.
CountVectorizer$clone(deep = FALSE)
deep: Whether to make a deep clone.