- The `kgram_freqs` class is now called `sbo_kgram_freqs`. The constructor `kgram_freqs()` is still available as an alias to `sbo_kgram_freqs()`.
- The former `sbo_preds` class is now replaced by two classes:
    - `sbo_predictor`: for interactive use
    - `sbo_predtable`: for storing text predictors out of memory (e.g. `save()` to file)
- `sbo_predictor` and `sbo_predtable` objects are obtained through the homonymous constructors, which are now S3 generics accepting `character` input, as well as `sbo_kgram_freqs` and `sbo_predtable` (for the `sbo_predictor()` constructor) class objects. In particular, this makes it possible to train a text predictor directly, without storing the intermediate `sbo_dictionary` and `kgram_freqs` objects.
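As a rough illustration of the constructors described above, the sketch below trains a storable prediction table directly from character input and then builds an interactive predictor from it. It assumes the `sbo` package is installed. The `twitter_train` example dataset, the formula form `max_size ~ 1000` for `dict`, and the other argument values are assumptions based on the package documentation, not on these notes.

```r
# Sketch only: the sbo calls run only if the package is installed.
if (requireNamespace("sbo", quietly = TRUE)) {
  library(sbo)

  # Small subset of the example corpus, to keep training quick
  train <- head(sbo::twitter_train, 2000)

  # Train a prediction table straight from character input,
  # without building the dictionary and k-gram frequencies by hand
  tbl <- sbo_predtable(train, N = 3, dict = max_size ~ 1000,
                       .preprocess = sbo::preprocess, EOS = ".?!:;")

  # The table is the object meant for storage (e.g. save() to file)
  save(tbl, file = tempfile(fileext = ".rda"))

  # Build the interactive predictor from the stored table
  p <- sbo_predictor(tbl)
  print(predict(p, "i love"))
}
```

Keeping the table and the predictor as separate objects means only the table needs to be written to disk, and the predictor can be rebuilt from it in a new session.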
- The `dict` argument of `kgram_freqs()` and `kgram_freqs_fast()` has changed: it now accepts either an `sbo_dictionary`, a `character` or a `formula` (see also 'New features').
- The `sbo_predictor` implementation dramatically improves the speed of `predict()` (by a factor of 10). A single call to `predict()` now allocates a few kB of RAM, whereas it previously allocated a few MB (cf. issue #10).
- [...] of `sbo_kgram_freqs` and `sbo_pred*` objects is now stored via attributes (#11).
- New `sbo_dictionary` class.
- New `word_coverage` class, with generic constructors and a preconfigured `plot()` method.
- Dictionaries in `kgram_freqs()` and `sbo_pred*()` can now also be built with a fixed target coverage fraction of the training corpus.
- New `prune()` generic function for reducing the k-gram order of `kgram_freqs` and `sbo_predtable` objects.
- New `summary()` methods for `sbo_kgram_freqs` and `sbo_pred*` objects; correspondingly, the output of `print()` has been simplified considerably (#5).
- `sbo_kgram_freqs`, `sbo_dictionary`, `sbo_predictor` and `sbo_predtable` objects can be constructed either through the homonymous constructors, or through the aliases `kgram_freqs()`, `dictionary()`, `predictor()`, `predtable()`.
- `sbo` now has `SystemRequirements: C++11`, for correct integration with C++11 code (in particular `std::unordered_map`).
- Predictor construction (`sbo_predictor()`) is now considerably faster, due to optimizations in the algorithm for building Stupid Back-Off prediction tables.
- The `predict.kgram_freqs()` and `predict.sbo_predictor()` methods have been fixed, including:
    - Proper handling of unknown words
    - Consistent handling of ties in prediction probabilities.
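To make the Stupid Back-Off terminology used in these notes more concrete, here is a minimal base R sketch of back-off scoring on toy counts. It is not the package's implementation: the counts, the helper names `sbo_score()` and `predict_top()`, the back-off penalty of 0.4, and the alphabetical tie-breaking rule are all assumptions chosen for illustration.

```r
# Toy k-gram counts (hypothetical data), keyed by space-separated words
counts <- c("i" = 4, "love" = 4, "you" = 2, "cats" = 2,
            "i love" = 4, "love you" = 2, "love cats" = 2)
vocab <- c("i", "love", "you", "cats")
total <- sum(counts[vocab])   # total unigram count

# Stupid Back-Off score of `word` given a character vector `context`:
# use the relative frequency if the full k-gram was observed, otherwise
# drop the first context word and multiply by the penalty `lambda`.
sbo_score <- function(word, context, lambda = 0.4) {
  if (length(context) == 0) return(unname(counts[word]) / total)
  ctx <- paste(context, collapse = " ")
  full <- paste(ctx, word)
  if (!is.na(counts[full]) && !is.na(counts[ctx])) {
    return(unname(counts[full] / counts[ctx]))
  }
  lambda * sbo_score(word, context[-1], lambda)
}

# Top-n predictions; ties in score are broken alphabetically (an assumption)
predict_top <- function(context, n = 3) {
  scores <- vapply(vocab, sbo_score, numeric(1), context = context)
  ord <- order(-scores, names(scores))
  names(scores)[ord][seq_len(n)]
}

predict_top("love")   # context seen in the toy counts
predict_top("dogs")   # unknown context word: falls back to unigram scores
```

In this sketch an unknown context word simply has no counts, so scoring backs off to shorter contexts, and a fixed secondary sort key keeps the ranking of tied candidates reproducible.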
- `eval_sbo_predictor()` now works by sampling a single sentence from each document in the test corpus.
- [...] `Depends` and `Imports` package fields.
- [...] `erase` argument in `preprocess()` and `kgram_freqs_fast()` (cf. issue #17).
- [...] `kgramFreqs` class, as per §1.6.4 of the "Writing R Extensions" guide.
- `kgram_freqs_fast()`, for fast and memory-efficient k-gram tokenization using the default text preprocessing utility.
- `kgram_freqs()`, `get_word_freqs()`, `preprocess()`, and `predict.sbo_preds()` have been entirely rewritten in C++.
- `tokenize_sentences()` function for sentence-level tokenization.
- `kgram_freqs()` now accepts any user-defined single-character EOS token, through the `EOS` argument.
- `preproc` argument to `kgram_freqs()` and `get_word_freqs()`, for custom training corpus preprocessing.
- The `dict` argument of `kgram_freqs()` now also accepts numeric values, making it possible to build a dictionary directly from the training corpus.
- `predict` method for the `sbo_kgram_freqs` class.
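The idea of building a dictionary directly from the training corpus can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's code: the toy corpus and the helper names `dict_top_n()` and `dict_coverage()` are hypothetical, and whitespace splitting stands in for real preprocessing. The second helper mirrors the fixed target coverage fraction mentioned earlier in these notes.

```r
# Toy training corpus (hypothetical data)
corpus <- c("the cat sat", "the dog sat", "the cat ran")

# Word frequencies, most frequent first
words <- unlist(strsplit(corpus, " ", fixed = TRUE))
freqs <- sort(table(words), decreasing = TRUE)

# Dictionary made of the `n` most frequent words
dict_top_n <- function(n) {
  names(freqs)[seq_len(min(n, length(freqs)))]
}

# Smallest dictionary whose words cover at least a fraction `target`
# of all word occurrences in the corpus
dict_coverage <- function(target) {
  coverage <- cumsum(as.numeric(freqs)) / sum(freqs)
  names(freqs)[seq_len(which(coverage >= target)[1])]
}

dict_top_n(1)
dict_coverage(0.75)
```

A size cap gives a predictable memory footprint, while a coverage target adapts the dictionary size to how concentrated the word distribution of the corpus is.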