textfeatures has been removed from Suggests. (#255)
step_textfeatures() no longer returns a politeness feature. (#254)
step_untokenize() and step_normalization() now return factors instead of strings. (#247)
step_clean_names() now throws an informative error if needed non-standard role columns are missing during bake(). (#235)
The keep_original_cols argument has been added to step_tokenmerge. This change should mean that every step that produces new columns has the keep_original_cols argument. (#242)
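A minimal sketch of the new argument; the column names, toy data, and the assumption that the merged column takes the default "tokenmerge" prefix are all illustrative:

```r
library(recipes)
library(textrecipes)

# hypothetical toy data
df <- tibble::tibble(
  title = c("red fox", "lazy dog"),
  body  = c("the quick brown fox", "jumps over the lazy dog")
)

rec <- recipe(~ title + body, data = df) |>
  step_tokenize(title, body) |>
  # keep_original_cols = FALSE drops `title` and `body` once their tokens
  # have been merged into a single column (assumed to be named "tokenmerge")
  step_tokenmerge(title, body, keep_original_cols = FALSE) |>
  step_tf(tokenmerge)

bake(prep(rec), new_data = NULL)
```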
Many internal changes to improve consistency and provide slight speed increases.
Fixed bug where step_dummy_hash() and step_texthash() would add new columns before old columns. (#235)
Fixed bug where vocabulary_size wasn't tunable in step_tokenize_bpe(). (#239)
Steps with tunable arguments now have those arguments listed in the documentation.
All steps that add new columns will now informatively error if a name collision occurs.
Fixed bug where step_tf() wasn't tunable for the weight argument.
Setting token = "tweets" in step_tokenize() has been deprecated due to tokenizers::tokenize_tweets() being deprecated. (#209)
step_sequence_onehot(), step_dummy_hash(), and step_dummy_texthash() now return integers. step_tf() returns integers when weight_scheme is "binary" or "raw count".
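For example (a sketch with a hypothetical text column), the integer output can be requested like this:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("a b b", "b c"))

rec <- recipe(~ text, data = df) |>
  step_tokenize(text) |>
  # "binary" marks token presence/absence; "raw count" keeps counts.
  # Both weighting schemes are returned as integer columns.
  step_tf(text, weight_scheme = "binary")

bake(prep(rec), new_data = NULL)
```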
All steps now have required_pkgs() methods.
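A hedged illustration, assuming required_pkgs() is called on a recipe object as it is elsewhere in the recipes ecosystem:

```r
library(recipes)
library(textrecipes)

rec <- recipe(~ text, data = tibble::tibble(text = "hello world")) |>
  step_tokenize(text) |>
  step_tfidf(text)

# character vector of packages needed to prep/bake this recipe,
# e.g. when it is evaluated on a parallel worker
required_pkgs(rec)
```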
Removed if (require(...)) code.
Removed use of okc_text in the vignette.
Fixed a bug in the printing of tokenlists.
step_tfidf() now correctly saves the idf values and applies them to the testing data set.
tidy.step_tfidf() now returns calculated IDF weights.
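A small sketch of retrieving those weights after prepping; the toy data is illustrative and the exact columns of the tidy output may differ:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("a b b", "b c c c"))

rec <- recipe(~ text, data = df) |>
  step_tokenize(text) |>
  step_tfidf(text) |>
  prep()

# after prep(), the tidy() method for step_tfidf() (step number 2 here)
# reports the IDF weights estimated from the training data
tidy(rec, number = 2)
```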
step_dummy_hash() generates binary indicators (possibly signed) from simple factor or character vectors.
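A minimal sketch; the toy data and the num_terms value are illustrative:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(
  animal = c("cat", "dog", "cat", "ferret"),
  y      = c(1, 0, 1, 0)
)

rec <- recipe(y ~ animal, data = df) |>
  # num_terms controls how many hashed indicator columns are produced;
  # signed = TRUE lets the indicators take the sign of the hash
  step_dummy_hash(animal, num_terms = 8, signed = TRUE)

bake(prep(rec), new_data = NULL)
```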
step_tokenize() has gained a couple of cousin functions: step_tokenize_bpe(), step_tokenize_sentencepiece(), and step_tokenize_wordpiece(), which wrap {tokenizers.bpe}, {sentencepiece}, and {wordpiece} respectively (#147).
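A hedged sketch of one of the new steps; the {wordpiece} package is assumed to be installed and the column name is hypothetical:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("tokenization is fun", "subword units help"))

rec <- recipe(~ text, data = df) |>
  # drop-in alternative to step_tokenize(), backed by {wordpiece}
  step_tokenize_wordpiece(text) |>
  step_tf(text)

bake(prep(rec), new_data = NULL)
```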
Added all_tokenized() and all_tokenized_predictors() to more easily select tokenized columns (#132).
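For instance (hypothetical column names), the selector saves re-listing the token columns:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(
  title = c("red fox", "lazy dog"),
  body  = c("the quick brown fox", "jumps over the lazy dog"),
  y     = c(1, 0)
)

rec <- recipe(y ~ ., data = df) |>
  step_tokenize(title, body) |>
  # picks up every tokenized predictor without naming the columns again
  step_tf(all_tokenized_predictors())

bake(prep(rec), new_data = NULL)
```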
Use show_tokens() to more easily debug a recipe involving tokenization.
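A minimal sketch, assuming show_tokens() is piped a recipe and given the column to inspect:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("This is some text.", "Debugging a tokenizer"))

recipe(~ text, data = df) |>
  step_tokenize(text) |>
  # preps the recipe under the hood and shows the tokens in `text`
  show_tokens(text)
```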
Reorganize documentation for all recipe step tidy methods (#126).
Steps now have a dedicated subsection detailing what happens when tidy() is applied. (#163)
All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect (#141).
step_ngram() has been given a speed increase to put it in line with the performance of other packages.
step_tokenize() will now try to error if vocabulary size is too low when using engine = "tokenizers.bpe" (#119).
Warning given by step_tokenfilter() when filtering failed to apply now correctly refers to the right argument name (#137).
step_tf() now returns 0 instead of NaN when there aren't any tokens present (#118).
step_tokenfilter() now has a new argument filter_fun that takes a function which can be used to filter tokens. (#164)
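A hedged sketch of the new argument; the predicate used here is only an illustration:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("a tiny cat sat", "an enormous dog slept"))

rec <- recipe(~ text, data = df) |>
  step_tokenize(text) |>
  # filter_fun receives a character vector of tokens and should return a
  # logical vector of the same length; here only tokens longer than
  # 3 characters are kept
  step_tokenfilter(text, filter_fun = function(x) nchar(x) > 3) |>
  step_tf(text)

bake(prep(rec), new_data = NULL)
```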
tidy.step_stem() now correctly shows if a custom stemmer was used.
Added keep_original_cols argument to step_lda, step_texthash(), step_tf(), step_tfidf(), step_word_embeddings(), step_dummy_hash(), step_sequence_onehot(), and step_textfeatures() (#139).
prefix argument now creates names according to the pattern prefix_variablename_name/number. (#124)
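As an illustration (toy data; the exact token columns depend on the training data), a step_tf() call on a column named text now produces names such as tf_text_cat:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("cat dog", "dog"))

rec <- recipe(~ text, data = df) |>
  step_tokenize(text) |>
  step_tf(text)

# column names follow prefix_variablename_name/number,
# e.g. "tf_text_cat", "tf_text_dog"
names(bake(prep(rec), new_data = NULL))
```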
Fixed bug in step_tokenfilter() and step_sequence_onehot() that sometimes caused crashes in R 4.1.0.
step_lda() now takes a tokenlist instead of a character variable. See readme for more detail.
step_sequence_onehot() now takes tokenlists as input.
step_tokenize().
step_tokenize().
Added step_clean_names() and step_clean_levels(). (#101)
step_ngram() gained an argument min_num_tokens to be able to return multiple n-grams together. (#90)
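A minimal sketch of combining the two arguments (toy data; column name hypothetical):

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = "the quick brown fox")

rec <- recipe(~ text, data = df) |>
  step_tokenize(text) |>
  # num_tokens = 3 with min_num_tokens = 1 returns uni-, bi-, and tri-grams
  # together instead of tri-grams only
  step_ngram(text, num_tokens = 3, min_num_tokens = 1) |>
  step_tf(text)

bake(prep(rec), new_data = NULL)
```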
Added step_text_normalization() to perform unicode normalization on character vectors. (#86)
step_word_embeddings() got an argument aggregation_default to specify the value used in cases where no words match the embedding.
step_tokenize() got an engine argument to specify packages other than tokenizers to tokenize with.
spacyr has been added as an engine to step_tokenize().
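A hedged sketch of switching engines; spacyr and a working spaCy installation are assumed, and the step is only specified here, not prepped:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("I am a little teapot", "short and stout"))

rec <- recipe(~ text, data = df) |>
  # the engine argument switches the tokenization backend;
  # "spacyr" requires the spacyr package and a spaCy installation
  step_tokenize(text, engine = "spacyr")
```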
step_lemma() has been added to extract the lemma attribute from tokenlists.
step_pos_filter() has been added to allow filtering of tokens based on their part-of-speech tags.
step_ngram() has been added to generate ngrams from tokenlists.
step_stem() now correctly uses the options argument. (Thanks to @grayskripko for finding the bug, #64)
step_word2vec() has been changed to step_lda() to reflect what is actually happening.
step_word_embeddings() has been added. It allows the use of pre-trained word embeddings to convert token columns to vectors in a high-dimensional "meaning" space. (@jonthegeek, #20)
step_tfidf() calculations are slightly changed due to a flaw in the original implementation (https://github.com/dselivanov/text2vec/issues/280).
step_textfeatures() has been added; it allows multiple numerical features to be pulled from text.
step_sequence_onehot() has been added; it allows one-hot encoding of sequences of fixed width.
step_word2vec() has been added; it calculates word2vec dimensions.
step_tokenmerge() has been added; it combines multiple list columns into one list column.
step_texthash() now correctly accepts the signed argument.
step_tf() and step_tfidf().
First CRAN version.