R/ngram_asweka.r
In ngram: Fast n-Gram 'Tokenization'

Documented in ngram_asweka

#' Weka-like n-gram Tokenization
#' 
#' An n-gram tokenizer with identical output to the \code{NGramTokenizer}
#' function from the RWeka package.
#' 
#' @details
#' This n-gram tokenizer behaves similarly in both input and return to 
#' the tokenizer in RWeka.  Unlike the tokenizer \code{ngram()}, the
#' return is not a special class of external pointers; it is a vector,
#' and therefore can be serialized via \code{save()} or \code{saveRDS()}.
#' 
#' @param str 
#' The input text.
#' @param min,max
#' The minimum and maximum 'n' as in 'n-gram'.
#' @param sep
#' A set of separator characters for the "words".  See details for
#' information about how this works; it works a little differently
#' from \code{sep} arguments in R functions.
#' 
#' @return
#' A vector of n-grams listed in decreasing blocks of n, in order within a
#' block.  The output matches that of RWeka's n-gram tokenizer.
#' 
#' @examples
#' library(ngram)
#' 
#' str = "A B A C A B B"
#' ngram_asweka(str, min=2, max=4)
#' 
#' @seealso \code{\link{ngram}}
#' @keywords Tokenization
#' @useDynLib ngram R_ng_asweka
#' @name Tokenize-AsWeka
#' @rdname tokenize-asweka
#' @export
ngram_asweka = function(str, min=2, max=2, sep=" ")
{
  check.is.string(str)
  check.is.posint(min)
  check.is.posint(max)
  check.is.string(sep)
  
  .Call(R_ng_asweka, str, as.integer(min), as.integer(max), sep)
}