kgram_freqs | R Documentation |
Extract k-gram frequency counts from a text or a connection.
Principal methods of kgram_freqs objects:
query(): query k-gram counts from the table. See query.
probability(): compute word continuation and sentence probabilities using Maximum Likelihood estimates. See probability.
language_model(): build a k-gram language model using various probability smoothing techniques. See language_model.
kgram_freqs(object, ...)
## S3 method for class 'numeric'
kgram_freqs(
object,
.preprocess = identity,
.tknz_sent = identity,
dict = NULL,
...
)
## S3 method for class 'kgram_freqs'
kgram_freqs(object, ...)
## S3 method for class 'character'
kgram_freqs(
object,
N,
.preprocess = identity,
.tknz_sent = identity,
dict = NULL,
open_dict = is.null(dict),
verbose = FALSE,
...
)
## S3 method for class 'connection'
kgram_freqs(
object,
N,
.preprocess = identity,
.tknz_sent = identity,
dict = NULL,
open_dict = is.null(dict),
verbose = FALSE,
max_lines = Inf,
batch_size = max_lines,
...
)
process_sentences(
text,
freqs,
.preprocess = attr(freqs, ".preprocess"),
.tknz_sent = attr(freqs, ".tknz_sent"),
open_dict = TRUE,
in_place = TRUE,
verbose = FALSE,
...
)
## S3 method for class 'character'
process_sentences(
text,
freqs,
.preprocess = attr(freqs, ".preprocess"),
.tknz_sent = attr(freqs, ".tknz_sent"),
open_dict = TRUE,
in_place = TRUE,
verbose = FALSE,
...
)
## S3 method for class 'connection'
process_sentences(
text,
freqs,
.preprocess = attr(freqs, ".preprocess"),
.tknz_sent = attr(freqs, ".tknz_sent"),
open_dict = TRUE,
in_place = TRUE,
verbose = FALSE,
max_lines = Inf,
batch_size = max_lines,
...
)
object |
any type allowed by the available methods. The type defines the behaviour of kgram_freqs(). See ‘Details’. |
... |
further arguments passed to or from other methods. |
.preprocess |
a function taking a character vector as input and returning a character vector as output. Optional preprocessing transformation applied to text before k-gram tokenization. See ‘Details’. |
.tknz_sent |
a function taking a character vector as input and returning a character vector as output. Optional sentence tokenization step applied to text after preprocessing and before k-gram tokenization. See ‘Details’. |
dict |
anything coercible to class dictionary. Optional pre-specified word dictionary. |
N |
a length one integer. Maximum order of k-grams to be considered. |
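open_dict |
TRUE or FALSE. If TRUE, new words encountered during k-gram processing are added to the dictionary; if FALSE, the dictionary is closed. See ‘Details’. |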
verbose |
Print current progress to the console. |
max_lines |
a length one positive integer or Inf. |
batch_size |
a length one positive integer less than or equal to max_lines. |
text |
a character vector or a connection. Source of text from which k-gram frequencies are to be extracted. |
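freqs |
a kgram_freqs class object. |
in_place |
TRUE or FALSE. Whether freqs is modified in place or left untouched, with the new counts added to a copy. See ‘Details’. |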
The function kgram_freqs() is a generic constructor for objects of class kgram_freqs, i.e. k-gram frequency tables. The constructor from integer returns an empty 'kgram_freqs' of fixed order, with an optional predefined dictionary (which can be empty) and .preprocess and .tknz_sent functions to be used as defaults in other kgram_freqs methods. The constructor from kgram_freqs returns a copy of an existing object, and it is provided because, in general, kgram_freqs objects have reference semantics, as discussed below.
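The two constructors described above can be sketched as follows. This is a minimal illustration, assuming the kgrams package is installed and that the numeric method accepts the order as its first argument, as the usage section suggests; the variable names f and g are chosen for this example only.

```r
library(kgrams)

# Empty table of order 2; counts are added afterwards.
f <- kgram_freqs(2)
process_sentences("a b", f)   # modifies 'f' in place

# Copy constructor: 'g' starts with the same counts as 'f'.
g <- kgram_freqs(f)
process_sentences("a", g)     # intended to change 'g' only

# Compare the unigram count of "a" in the two tables.
query(f, "a")
query(g, "a")
```

If the copy is independent, as the text states, the second count should be larger than the first by one.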
The following discussion focuses on the process_sentences() generic, as well as on the character and connection methods of the constructor kgram_freqs(). These functions extract k-gram frequency counts from a text source, which may be either a character vector or a connection. The second option is useful if one wants to avoid loading the full text corpus in physical memory, and it allows processing text from different sources such as files, compressed files or URLs.
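As a runnable counterpart to reading from a connection, the sketch below writes a small corpus to a temporary file and reads it back in batches. The file contents and the values chosen for max_lines and batch_size are assumptions made for this illustration; the connection is passed unopened, following the pattern used in the examples of this page.

```r
library(kgrams)

# Temporary corpus, one sentence per line (made up for this example).
path <- tempfile(fileext = ".txt")
writeLines(c("a b b a", "b a", "a a b"), path)

# Read at most two lines, one line per batch.
f <- kgram_freqs(file(path), 2, max_lines = 2, batch_size = 1)

# Unigram counts from the lines that were read.
query(f, c("a", "b"))

unlink(path)
```

With max_lines = 2, the third line of the file should not contribute to the counts.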
The returned object is of class kgram_freqs (a thin wrapper around the internal C++ class where all k-gram computations take place). kgram_freqs objects have methods for querying bare k-gram frequencies (query) and maximum likelihood estimates of sentence probabilities or word continuation probabilities (see probability). More importantly, kgram_freqs objects are used to create language_model objects, which support various probability smoothing techniques.
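A short sketch of this workflow, from counts to a smoothed model, is given below. It assumes the kgrams package is installed; the "add_k" smoother with parameter k and the %|% operator (word given context) are taken from the package's other help pages and are not described on this one, so treat them as assumptions.

```r
library(kgrams)

f <- kgram_freqs("a b b a a", 2)

# Bare counts for a unigram and a bigram.
query(f, c("a", "b b"))

# Language model built from the same counts, here with add-k smoothing.
m <- language_model(f, "add_k", k = 1)

# Smoothed probability that "b" follows "a".
probability("b" %|% "a", m)
```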
The function kgram_freqs() is used to construct a new kgram_freqs object, initializing it with the k-gram counts from the text input, whereas process_sentences() is used to add k-gram counts from a new text to an existing kgram_freqs object, freqs. In this second case, the initial object freqs can either be modified in place (for in_place == TRUE, the default) or left untouched, with the counts added to a copy (in_place == FALSE); see the examples below. The final object is returned invisibly when modifying in place, visibly in the second case. It is worth mentioning that modifying a kgram_freqs object freqs in place will also affect language_model objects created from freqs with language_model(), which will also be updated with the new information. If one wants to avoid this behaviour, one can make copies using either the kgram_freqs() copy constructor, or the in_place = FALSE argument.
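The effect of in-place updates on derived language models can be sketched as follows. This assumes the kgrams package is installed and uses the "ml" smoother and the %|% operator from the package's other help pages; m_shared and m_frozen are names invented for this example.

```r
library(kgrams)

f <- kgram_freqs("a b", 2)

# Model tied to 'f', and a model built from an independent copy of 'f'.
m_shared <- language_model(f, "ml")
m_frozen <- language_model(kgram_freqs(f), "ml")

# Add new counts to 'f' in place.
process_sentences("a a", f)

# Probability that "b" follows "a" under each model.
probability("b" %|% "a", m_shared)
probability("b" %|% "a", m_frozen)
```

If the behaviour is as described above, only m_shared should reflect the sentence added after the models were created.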
The dict argument allows one to provide an initial set of known words. Subsequently, one can either work with such a closed dictionary (open_dict == FALSE), or extend the dictionary with all new words encountered during k-gram processing (open_dict == TRUE).
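The difference between a closed and an open dictionary can be sketched as below. Two assumptions are made here that this page does not confirm: that a plain character vector is accepted as dict (it only says "anything coercible to class dictionary"), and that dictionary() extracts the dictionary of a table and query() reports whether a word belongs to it.

```r
library(kgrams)

known <- c("a", "b")

# Closed dictionary: "c" is not added to the known words.
f_closed <- kgram_freqs("a b c", 2, dict = known, open_dict = FALSE)

# Open dictionary: "c" is added when it is first encountered.
f_open <- kgram_freqs("a b c", 2, dict = known, open_dict = TRUE)

# Is "c" part of each dictionary after processing?
query(dictionary(f_closed), "c")
query(dictionary(f_open), "c")
```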
The .preprocess and .tknz_sent functions are applied before k-gram counting takes place, and are in principle arbitrary transformations of the original text. After preprocessing and sentence tokenization, each line of the transformed input is presented to the k-gram counting algorithm as a separate sentence. These sentences are implicitly padded with N - 1 Begin-Of-Sentence (BOS) tokens at the beginning and one End-Of-Sentence (EOS) token at the end, as illustrated in the examples. For basic usage, this package offers the utilities preprocess and tknz_sent. Notice that, strictly speaking, there is some redundancy in these two arguments, as the processed input to the k-gram counting algorithm is .tknz_sent(.preprocess(text)).
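The composition .tknz_sent(.preprocess(text)) can be made concrete with two small functions. The helpers to_lower() and split_sentences() below are hypothetical, written in base R for this illustration only; they are not the package's own preprocess and tknz_sent utilities.

```r
library(kgrams)

# Hypothetical preprocessing step: lower-case the text.
to_lower <- function(x) tolower(x)

# Hypothetical sentence tokenizer: split on ., ? and !, drop empty pieces.
split_sentences <- function(x) {
    s <- trimws(unlist(strsplit(x, "[.?!]+")))
    s[nzchar(s)]
}

text <- "Hello world! How are you?"

# What the counting algorithm receives: one sentence per entry.
split_sentences(to_lower(text))   # "hello world" "how are you"

f <- kgram_freqs(text, 2, .preprocess = to_lower, .tknz_sent = split_sentences)

# With N = 2, each sentence gets one BOS and one EOS token.
query(f, c(BOS() %+% "hello", "you" %+% EOS(), "world how"))
```

Because the two sentences are separate entries, the bigram "world how", which straddles the sentence boundary, should not be counted.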
They appear explicitly as separate arguments for two main reasons:
The presence of .tknz_sent is a reminder of the fact that sentences have to be explicitly separated in different entries of the processed input, in order for kgram_freqs() to append the correct Begin-Of-Sentence and End-Of-Sentence paddings to each sentence.
At prediction time (e.g. with probability), by default only .preprocess is applied when computing conditional probabilities, whereas both .preprocess() and .tknz_sent() are applied when computing absolute sentence probabilities.
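The two kinds of probability mentioned in the second point can be sketched as follows. The example assumes the kgrams package is installed and relies on the %|% operator from the probability help page; using tolower as .preprocess is a choice made here to make the preprocessing step visible.

```r
library(kgrams)

f <- kgram_freqs("a b b a a", 2, .preprocess = tolower)

# Conditional (word continuation) probability of "b" after "a".
p_lower <- probability("b" %|% "a", f)

# Same query in upper case: .preprocess is applied first by default.
p_upper <- probability("B" %|% "A", f)
c(p_lower, p_upper)

# Absolute probability of a whole sentence, paddings included.
probability("a b", f)
```

If .preprocess is applied at prediction time as described, the two conditional probabilities should coincide.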
A kgram_freqs class object: k-gram frequency table storing k-gram counts from text. For process_sentences(), the updated kgram_freqs object is returned invisibly if in_place is TRUE, visibly otherwise.
Valerio Gherardi
query, probability, language_model, dictionary
# Build a k-gram frequency table from a character vector
f <- kgram_freqs("a b b a a", 3)
f
summary(f)
query(f, c("a", "b")) # c(3, 2)
query(f, c("a b", "a" %+% EOS(), BOS() %+% "a b")) # c(1, 1, 1)
query(f, "a b b a") # NA (counts for k-grams of order k > 3 are not known)
process_sentences("b", f)
query(f, c("a", "b")) # c(3, 3): 'f' is updated in place
f1 <- process_sentences("b", f, in_place = FALSE)
query(f, c("a", "b")) # c(3, 3): 'f' is unchanged, the new counts went to a copy
query(f1, c("a", "b")) # c(3, 4): the new 'f1' stores the updated counts
# Build a k-gram frequency table from a file connection
## Not run:
f <- kgram_freqs(file("my_text_file.txt"), 3)
## End(Not run)
# Build a k-gram frequency table from an URL connection
## Not run:
f <- kgram_freqs(url("http://my.website/my_text_file.txt"), 3)
## End(Not run)