| kgram_freqs | R Documentation | 
Extract k-gram frequency counts from a text or a connection.
Principal methods supported by objects of class kgram_freqs:
query(): query k-gram counts from the table.
See query.
probability(): compute word continuation and sentence probabilities
using Maximum Likelihood estimates. See probability.
language_model(): build a k-gram language model using various
probability smoothing techniques. See language_model.
kgram_freqs(object, ...)
## S3 method for class 'numeric'
kgram_freqs(
  object,
  .preprocess = identity,
  .tknz_sent = identity,
  dict = NULL,
  ...
)
## S3 method for class 'kgram_freqs'
kgram_freqs(object, ...)
## S3 method for class 'character'
kgram_freqs(
  object,
  N,
  .preprocess = identity,
  .tknz_sent = identity,
  dict = NULL,
  open_dict = is.null(dict),
  verbose = FALSE,
  ...
)
## S3 method for class 'connection'
kgram_freqs(
  object,
  N,
  .preprocess = identity,
  .tknz_sent = identity,
  dict = NULL,
  open_dict = is.null(dict),
  verbose = FALSE,
  max_lines = Inf,
  batch_size = max_lines,
  ...
)
process_sentences(
  text,
  freqs,
  .preprocess = attr(freqs, ".preprocess"),
  .tknz_sent = attr(freqs, ".tknz_sent"),
  open_dict = TRUE,
  in_place = TRUE,
  verbose = FALSE,
  ...
)
## S3 method for class 'character'
process_sentences(
  text,
  freqs,
  .preprocess = attr(freqs, ".preprocess"),
  .tknz_sent = attr(freqs, ".tknz_sent"),
  open_dict = TRUE,
  in_place = TRUE,
  verbose = FALSE,
  ...
)
## S3 method for class 'connection'
process_sentences(
  text,
  freqs,
  .preprocess = attr(freqs, ".preprocess"),
  .tknz_sent = attr(freqs, ".tknz_sent"),
  open_dict = TRUE,
  in_place = TRUE,
  verbose = FALSE,
  max_lines = Inf,
  batch_size = max_lines,
  ...
)
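| object | any type allowed by the available methods. The type defines the behaviour of kgram_freqs(): constructor of an empty table (numeric), copy constructor (kgram_freqs), or constructor from text (character or connection). See ‘Details’. | 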
| ... | further arguments passed to or from other methods. | 
| .preprocess | a function taking a character vector as input and returning a character vector as output. Optional preprocessing transformation applied to text before k-gram tokenization. See ‘Details’. | 
| .tknz_sent | a function taking a character vector as input and returning a character vector as output. Optional sentence tokenization step applied to text after preprocessing and before k-gram tokenization. See ‘Details’. | 
| dict | anything coercible to class dictionary. Optional pre-specified word dictionary. | 
| N | a length one integer. Maximum order of k-grams to be considered. | 
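| open_dict | TRUE or FALSE. If TRUE, new words encountered during k-gram processing are added to the dictionary; if FALSE, the dictionary is kept closed. Defaults to TRUE when no dict is supplied. See ‘Details’. | 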
| verbose | Print current progress to the console. | 
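| max_lines | a length one positive integer or Inf. Maximum number of lines to be read from the connection. | 
| batch_size | a length one positive integer less than or equal to max_lines. Number of lines read from the connection in each batch. | 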
| text | a character vector or a connection. Source of text from which k-gram frequencies are to be extracted. | 
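| freqs | a kgram_freqs object, to which the k-gram counts from text are added. | 
| in_place | TRUE or FALSE. If TRUE (the default), freqs is modified in place; if FALSE, the counts are added to a copy. See ‘Details’. | 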
The function kgram_freqs() is a generic constructor for
objects of class kgram_freqs, i.e. k-gram frequency tables. The
constructor from integer returns an empty 'kgram_freqs' of fixed
order, with an optional
predefined dictionary (which can be empty) and .preprocess and
.tknz_sent functions to be used as defaults in other kgram_freqs
methods. The constructor from kgram_freqs returns a copy of an
existing object, and it is provided because, in general, kgram_freqs
objects have reference semantics, as discussed below.
The following discussion focuses on the process_sentences() generic, as
well as on the character and connection methods of the
constructor kgram_freqs(). These functions extract k-gram
frequency counts from a text source, which may be either a character vector
or a connection. The second option is useful if one wants to avoid loading
the full text corpus into physical memory, and makes it possible to process
text from different sources such as files, compressed files or URLs.
The returned object is of class kgram_freqs (a thin wrapper
around the internal C++ class where all k-gram computations take place).
kgram_freqs objects have methods for querying bare k-gram frequencies
(query) and maximum likelihood estimates of sentence
probabilities or word continuation probabilities
(see probability). More importantly,
kgram_freqs objects are used to create language_model
objects, which support various probability smoothing techniques.
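As a minimal sketch of this workflow, the snippet below builds a frequency table, wraps it in a language model and asks for a word continuation probability. It assumes that the kgrams package is attached, that "ml" is accepted as the name of the maximum likelihood smoother, and that the %|% operator conditions a word on a context as described in probability; no particular numeric result is claimed.

```r
library(kgrams)

# Frequency table of k-grams up to order 3 from a toy one-sentence corpus
f <- kgram_freqs("a b b a a", 3)

# Language model built on top of the frequency table
# (assumption: "ml" selects the unsmoothed maximum likelihood estimator)
m <- language_model(f, smoother = "ml")

# Probability that the word "b" follows the context "a"
p <- probability("b" %|% "a", m)
p
```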
The function kgram_freqs() is used to construct a new
kgram_freqs object, initializing it with the k-gram counts from
the text input, whereas process_sentences() is used to
add k-gram counts from a new text to an existing
kgram_freqs object, freqs. In this second case, the initial
object freqs can either be modified in place
(in_place == TRUE, the default) or left untouched, with the new counts
added to a copy (in_place == FALSE); see the examples below.
The final object is returned invisibly when modifying in place,
visibly in the second case. It is worth mentioning that modifying
a kgram_freqs object freqs in place will also affect
language_model objects created from freqs with
language_model(), which will also be updated with the new information.
If one wants to avoid this behaviour, one can make copies using either the
kgram_freqs() copy constructor, or the in_place = FALSE
argument.
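The difference between in-place updates and copies can be sketched as follows. The example only uses functions documented on this page; the name snapshot is illustrative, and the counts in the comments follow from the toy text and from the copy semantics described above.

```r
library(kgrams)

f <- kgram_freqs("a b b a a", 3)

# Copy constructor: 'snapshot' is an independent copy of 'f'
snapshot <- kgram_freqs(f)

# Add the one-word sentence "b" to 'f', modifying it in place
process_sentences("b", f)

query(f, c("a", "b"))         # c(3, 3): 'f' has seen the extra "b"
query(snapshot, c("a", "b"))  # expected c(3, 2): the copy is not affected
```

Taking such a copy before building a language model is one way to keep the model insulated from later in-place updates of the original table.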
The dict argument allows one to provide an initial set of known
words. Subsequently, one can either work with such a closed dictionary
(open_dict == FALSE), or extend the dictionary with all
new words encountered during k-gram processing
(open_dict == TRUE).
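A rough sketch of the two modes is given below. It assumes that a plain character vector is accepted as dict (the argument description says "anything coercible to class dictionary") and that dictionary() can be used to inspect the dictionary stored in a kgram_freqs object; the printed output is not reproduced here.

```r
library(kgrams)

text <- "a b c"
known <- c("a", "b")  # initial dictionary (assumed coercible from character)

# Closed dictionary: the new word "c" is not added to the known words
f_closed <- kgram_freqs(text, 2, dict = known, open_dict = FALSE)

# Open dictionary: "c" is added when it is first encountered
f_open <- kgram_freqs(text, 2, dict = known, open_dict = TRUE)

# Inspect the resulting dictionaries
dictionary(f_closed)
dictionary(f_open)
```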
The .preprocess and .tknz_sent functions are applied
before k-gram counting takes place, and are in principle
arbitrary transformations of the original text.
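To give an idea of what such transformations might look like, here is a base-R sketch of a custom preprocessing function and a custom sentence tokenizer, applied in that order. Both function names are hypothetical and the rules they implement (lower-casing, stripping most punctuation, splitting on ".", "?" and "!") are only one possible choice.

```r
# Hypothetical preprocessing: drop everything except letters, digits,
# whitespace and sentence-ending punctuation, then lower-case.
my_preprocess <- function(x) {
  tolower(gsub("[^[:alnum:][:space:].?!]", "", x))
}

# Hypothetical sentence tokenizer: split on runs of ".", "?" or "!",
# returning one sentence per element of the output vector.
my_tknz_sent <- function(x) {
  s <- trimws(unlist(strsplit(x, "[.?!]+")))
  s[nzchar(s)]
}

text <- c("Hello, World! How are you?", "Fine.")
sentences <- my_tknz_sent(my_preprocess(text))
sentences
# "hello world" "how are you" "fine"
```

Functions of this shape (character vector in, character vector out) can be passed as the .preprocess and .tknz_sent arguments.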
After preprocessing and sentence tokenization, each line of the
transformed input is presented to the k-gram counting algorithm as a separate
sentence. These sentences are implicitly padded
with N - 1 Begin-Of-Sentence (BOS) tokens and one End-Of-Sentence (EOS)
token, as illustrated in the examples. For basic
usage, this package offers the utilities preprocess and
tknz_sent. Notice that, strictly speaking, there is
some redundancy in these two arguments, as the processed input to the k-gram
counting algorithm is .tknz_sent(.preprocess(text)).
They appear explicitly as separate arguments for two main reasons:
 The presence of .tknz_sent is a reminder of the
fact that sentences have to be explicitly separated in different entries
of the processed input, in order for kgram_freqs() to append the
correct Begin-Of-Sentence and End-Of-Sentence paddings to each sentence.
 At prediction time (e.g. with probability), by default only
.preprocess is applied when computing conditional probabilities,
whereas both .preprocess() and .tknz_sent() are
applied when computing sentence absolute probabilities.
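The sentence padding described above can be made concrete with a small base-R sketch that does not use the package. The helpers pad_sentence() and count_kgrams() and the placeholder tokens "<BOS>" and "<EOS>" are hypothetical (the package exposes its own tokens through BOS() and EOS()), and the counting loop is only an illustration of the idea, not the package's algorithm.

```r
# Placeholder padding tokens, used only for this illustration
bos <- "<BOS>"
eos <- "<EOS>"

# Pad a single sentence with N - 1 BOS tokens and one EOS token
pad_sentence <- function(sentence, N) {
  words <- strsplit(trimws(sentence), "[[:space:]]+")[[1]]
  c(rep(bos, N - 1), words, eos)
}

# Count all k-grams of a given order k in a vector of tokens
count_kgrams <- function(tokens, k) {
  if (length(tokens) < k) return(table(character(0)))
  starts <- seq_len(length(tokens) - k + 1)
  kgrams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
    character(1)
  )
  table(kgrams)
}

padded <- pad_sentence("a b b a a", N = 3)
padded
# "<BOS>" "<BOS>" "a" "b" "b" "a" "a" "<EOS>"

unigrams <- count_kgrams(padded, 1)
bigrams  <- count_kgrams(padded, 2)
trigrams <- count_kgrams(padded, 3)

unigrams[["a"]]          # 3
unigrams[["b"]]          # 2
bigrams[["a <EOS>"]]     # 1
trigrams[["<BOS> a b"]]  # 1
```

For this toy sentence the counts agree with those shown by query() in the examples.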
A kgram_freqs class object: k-gram frequency table storing
k-gram counts from text. For process_sentences(), the updated
kgram_freqs object is returned invisibly if in_place is
TRUE, visibly otherwise.
Valerio Gherardi
query, probability, language_model, dictionary
# Build a k-gram frequency table from a character vector
f <- kgram_freqs("a b b a a", 3)
f
summary(f)
query(f, c("a", "b")) # c(3, 2)
query(f, c("a b", "a" %+% EOS(), BOS() %+% "a b")) # c(1, 1, 1)
query(f, "a b b a") # NA (counts for k-grams of order k > 3 are not known)
process_sentences("b", f)
query(f, c("a", "b")) # c(3, 3): 'f' is updated in place
f1 <- process_sentences("b", f, in_place = FALSE)
query(f, c("a", "b")) # c(3, 3): 'f' is unchanged, the update went to a copy
query(f1, c("a", "b")) # c(3, 4): the new 'f1' stores the updated counts
# Build a k-gram frequency table from a file connection
## Not run: 
f <- kgram_freqs(file("my_text_file.txt"), 3)
## End(Not run)
# Build a k-gram frequency table from a URL connection
## Not run: 
f <- kgram_freqs(url("http://my.website/my_text_file.txt"), 3)
## End(Not run)