mlr_pipeops_textvectorizer | R Documentation |
Computes a bag-of-words representation from one or more columns.
Columns of type character are split up into words.
Uses quanteda::dfm() and quanteda::dfm_trim() from the 'quanteda' package.
TF-IDF computation works similarly to quanteda::dfm_tfidf() but has been adjusted for the train/test data split using quanteda::docfreq() and quanteda::dfm_weight().
In short:
Per default, produces a bag-of-words representation.
If n is set to values > 1, ngrams are computed.
If dfm_trim parameters are set, the bag-of-words is trimmed.
The scheme_tf parameter controls term-frequency (per-document, i.e. per-row) weighting.
The scheme_df parameter controls the document-frequency (per token, i.e. per-column) weighting.
Parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight.
Which function a parameter belongs to can be read from that parameter's tags; parameters tagged tokenizer are arguments passed on to quanteda::dfm().
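The tags mentioned above can be inspected on the parameter set. The following is a minimal sketch, assuming the 'mlr3pipelines', 'quanteda' and 'stopwords' packages are installed; the helper params_with_tag() is hypothetical and only serves the illustration, and the exact tag names other than tokenizer (here "dfm_trim") are an assumption.

```r
library("mlr3pipelines")

op = po("textvectorizer")

# Named list: parameter id -> character vector of tags
tags = op$param_set$tags

# Hypothetical helper: ids of all parameters that carry a given tag
params_with_tag = function(tag) {
  names(Filter(function(x) tag %in% x, tags))
}

params_with_tag("tokenizer")
params_with_tag("dfm_trim")  # tag name assumed; returns NULL or character(0) if unused
```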
Defaults to a bag-of-words representation with token counts as matrix entries.
To perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse".
The scheme_df parameter is initialized to "unary", which disables document-frequency weighting.
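As an illustration of switching from plain counts to TF-IDF-style weights, the sketch below sets scheme_df to "inverse". It assumes 'mlr3', 'mlr3pipelines', 'quanteda' and 'stopwords' are installed; the toy text column is made up, and stopwords_language is set to "none" so that single-letter tokens are not filtered out.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Toy data (made up): three random letters per row
set.seed(1)
dt = data.table(txt = replicate(150, paste0(sample(letters, 3), collapse = " ")))
task = tsk("iris")$cbind(dt)

# scheme_df = "inverse" enables inverse-document-frequency weighting
op = po("textvectorizer", param_vals = list(
  stopwords_language = "none",
  scheme_df = "inverse"
))
out = op$train(list(task))[[1]]

# The 'txt' column is replaced by one weighted column per token
head(out$data())
```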
The pipeop works as follows:
Words are tokenized using quanteda::tokens.
Ngrams are computed using quanteda::tokens_ngrams.
A document-feature matrix is computed using quanteda::dfm.
The document-feature matrix is trimmed using quanteda::dfm_trim during train-time.
The document-feature matrix is re-weighted (similar to quanteda::dfm_tfidf) if scheme_df is not set to "unary".
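The steps above can be sketched directly with quanteda. This is an illustration of the listed function calls on a made-up three-document corpus, not the PipeOp's internal code; in particular, the final re-weighting with sweep() is an assumption about how term frequencies and document frequencies might be combined.

```r
library("quanteda")

# Made-up corpus
txt = c(d1 = "the quick brown fox", d2 = "the lazy dog", d3 = "the quick dog")

# 1. Tokenize
toks = tokens(txt, what = "word")

# 2. Compute unigrams and bigrams
toks = tokens_ngrams(toks, n = 1:2)

# 3. Build the document-feature matrix
m = dfm(toks, tolower = TRUE)

# 4. Trim: keep only features that occur at least twice in total
m = dfm_trim(m, min_termfreq = 2)

# 5. Re-weight: term frequency per document times inverse document frequency per feature
tf = dfm_weight(m, scheme = "count")
idf = docfreq(m, scheme = "inverse")
weighted = sweep(as.matrix(tf), 2, idf, "*")
weighted
```

Computing docfreq() separately from dfm_weight() is what makes a train/test split possible: the document frequencies can be taken from the training data and reused when weighting new documents.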
R6Class object inheriting from PipeOpTaskPreproc/PipeOp.
PipeOpTextVectorizer$new(id = "textvectorizer", param_vals = list())
id :: character(1)
Identifier of resulting object, default "textvectorizer".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
Input and output channels are inherited from PipeOpTaskPreproc.
The output is the input Task with all affected features converted to a bag-of-words representation.
The $state is a list with element 'cols': A vector of extracted columns.
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:
return_type :: character(1)
Whether to return an integer representation ("integer_sequence"), a factor representation ("factor_sequence") or a bag-of-words ("bow").
If set to "integer_sequence", tokens are replaced by an integer and padded/truncated to sequence_length.
If set to "factor_sequence", tokens are replaced by a factor and padded/truncated to sequence_length.
If set to "bow", a possibly weighted bag-of-words matrix is returned.
Defaults to "bow".
stopwords_language :: character(1)
Language to use for stopword filtering. Needs to be either "none", a language identifier listed in stopwords::stopwords_getlanguages("snowball") ("de", "en", ...) or "smart".
"none" disables language-specific stopwords.
"smart" corresponds to stopwords::stopwords(source = "smart"), which contains English stopwords and also removes one-character strings. Initialized to "smart".
extra_stopwords :: character
Extra stopwords to remove. Must be a character vector containing individual tokens to remove. Initialized to character(0).
When n is set to values greater than 1, this can also contain stop-ngrams.
tolower :: logical(1)
Convert to lower case? See quanteda::dfm. Default: TRUE.
stem :: logical(1)
Perform stemming? See quanteda::dfm. Default: FALSE.
what :: character(1)
Tokenization splitter. See quanteda::tokens. Default: "word".
remove_punct :: logical(1)
See quanteda::tokens. Default: FALSE.
remove_url :: logical(1)
See quanteda::tokens. Default: FALSE.
remove_symbols :: logical(1)
See quanteda::tokens. Default: FALSE.
remove_numbers :: logical(1)
See quanteda::tokens. Default: FALSE.
remove_separators :: logical(1)
See quanteda::tokens. Default: TRUE.
split_hyphens :: logical(1)
See quanteda::tokens. Default: FALSE.
n :: integer
Vector of ngram lengths. See quanteda::tokens_ngrams. Initialized to 1, deviating from the base function's default.
Note that this can be a vector of multiple values, to construct ngrams of multiple orders.
skip :: integer
Vector of skips. See quanteda::tokens_ngrams. Default: 0. Note that this can be a vector of multiple values.
sparsity :: numeric(1)
Desired sparsity of the 'dfm' matrix. See quanteda::dfm_trim. Default: NULL.
max_termfreq :: numeric(1)
Maximum term frequency in the 'dfm' matrix. See quanteda::dfm_trim. Default: NULL.
min_termfreq :: numeric(1)
Minimum term frequency in the 'dfm' matrix. See quanteda::dfm_trim. Default: NULL.
termfreq_type :: character(1)
How to assess term frequency. See quanteda::dfm_trim. Default: "count".
scheme_df :: character(1)
Weighting scheme for document frequency. See quanteda::docfreq. Initialized to "unary" (1 for each document, deviating from base function default).
smoothing_df :: numeric(1)
See quanteda::docfreq. Default: 0.
k_df :: numeric(1)
k parameter given to quanteda::docfreq (see there). Default: 0.
threshold_df :: numeric(1)
See quanteda::docfreq. Default: 0. Only considered for scheme_df = "count".
base_df :: numeric(1)
The base for logarithms in quanteda::docfreq (see there). Default: 10.
scheme_tf :: character(1)
Weighting scheme for term frequency. See quanteda::dfm_weight. Default: "count".
k_tf :: numeric(1)
k parameter given to quanteda::dfm_weight (see there). Default: 0.5.
base_tf :: numeric(1)
The base for logarithms in quanteda::dfm_weight (see there). Default: 10.
sequence_length :: integer(1)
The length of the integer sequence. Defaults to Inf, i.e. all texts are padded to the length of the longest text. Only relevant for return_type = "integer_sequence".
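To make the n, return_type and sequence_length parameters concrete, here is a minimal sketch, assuming 'mlr3', 'mlr3pipelines', 'quanteda' and 'stopwords' are installed. The toy text column is made up, and the exact names of the generated columns are not guaranteed here, so the code only prints them.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Toy data (made up): three random letters per row
set.seed(1)
dt = data.table(txt = replicate(150, paste0(sample(letters, 3), collapse = " ")))
task = tsk("iris")$cbind(dt)

# Bag-of-words over unigrams and bigrams
op_ngram = po("textvectorizer", param_vals = list(
  stopwords_language = "none",
  n = 1:2
))
bow = op_ngram$train(list(task))[[1]]

# Integer sequences, padded or truncated to two tokens per text
op_seq = po("textvectorizer", param_vals = list(
  stopwords_language = "none",
  return_type = "integer_sequence",
  sequence_length = 2L
))
seqs = op_seq$train(list(task))[[1]]

head(bow$feature_names)
head(seqs$data())
```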
See Description. Internally uses the quanteda package. Calls quanteda::tokens, quanteda::tokens_ngrams and quanteda::dfm. During training, quanteda::dfm_trim is also called. Tokens not seen during training are dropped during prediction.
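The handling of unseen tokens can be checked with a small experiment. The sketch below assumes 'mlr3', 'mlr3pipelines', 'quanteda' and 'stopwords' are installed; the data and the regression target y are made up for illustration.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Made-up training and prediction data; "blue" never occurs during training
train_dt = data.table(txt = c("red apple", "green apple", "red pear"), y = c(1, 2, 3))
test_dt = data.table(txt = "blue apple", y = 4)

train_task = as_task_regr(train_dt, target = "y")
test_task = as_task_regr(test_dt, target = "y")

op = po("textvectorizer", param_vals = list(stopwords_language = "none"))
train_out = op$train(list(train_task))[[1]]
pred_out = op$predict(list(test_task))[[1]]

# Prediction output has the same feature columns as the training output;
# no column is created for the unseen token "blue"
train_out$feature_names
pred_out$data()
```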
Only methods inherited from PipeOpTaskPreproc/PipeOp.
https://mlr-org.com/pipeops.html
Other PipeOps: PipeOp, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_threshold, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson
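library("mlr3")
library("mlr3pipelines")  # provides po() and the textvectorizer PipeOp
library("data.table")

# create some text data: 150 rows of three random letters each
dt = data.table(
  txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)

# train: the 'txt' column is replaced by its bag-of-words representation
pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))
pos$train(list(task))[[1]]$data()

# predict on a single row; clone first because $filter() modifies a task in place
one_line_of_iris = task$clone()$filter(13)
one_line_of_iris$data()
pos$predict(list(one_line_of_iris))[[1]]$data()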