tokenizers: Fast, Consistent Tokenization of Natural Language Text

context("Encodings")

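test_that("Encodings work on Windows", {
  # Input with non-ASCII characters (é, ñ).
  input <- "César Moreira Nuñez"
  reference <- c("césar", "moreira", "nuñez")
  # Encoding() reports "unknown" for pure-ASCII strings such as "moreira";
  # tokens containing non-ASCII characters should be marked as UTF-8.
  reference_enc <- c("UTF-8", "unknown", "UTF-8")

  # With n = 1 (and k = 0 for skip n-grams), all three tokenizers should
  # return the same lowercased word tokens.
  output_n1 <- tokenize_ngrams(input, n = 1, simplify = TRUE)
  output_words <- tokenize_words(input, simplify = TRUE)
  output_skip <- tokenize_skip_ngrams(input, n = 1, k = 0, simplify = TRUE)

  # Check token values.
  expect_equal(output_n1, reference)
  expect_equal(output_words, reference)
  expect_equal(output_skip, reference)

  # Check the declared encoding of each token.
  expect_equal(Encoding(output_n1), reference_enc)
  expect_equal(Encoding(output_words), reference_enc)
  expect_equal(Encoding(output_skip), reference_enc)
})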

lmullen/tokenizers documentation built on March 28, 2024, 11:12 a.m.

tests/testthat/test-encoding.R