tokenizers: Fast, Consistent Tokenization of Natural Language Text

context("PTB tokenizer")

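test_that("PTB tokenizer works as expected", {
  # docs_l, docs_c, and bad_list are fixtures defined outside this file
  # (presumably a list and a character vector holding the same named
  # documents, plus a malformed list).
  out_l <- tokenize_ptb(docs_l)
  out_c <- tokenize_ptb(docs_c)
  out_1 <- tokenize_ptb(docs_c[1], simplify = TRUE)

  # Output is a list of character vectors; simplify = TRUE on a single
  # document returns the bare character vector instead.
  expect_is(out_l, "list")
  expect_is(out_l[[1]], "character")
  expect_is(out_c, "list")
  expect_is(out_c[[1]], "character")
  expect_is(out_1, "character")

  # List and character inputs must tokenize identically.
  expect_identical(out_l, out_c)
  expect_identical(out_l[[1]], out_1)
  expect_identical(out_c[[1]], out_1)

  # Document names carry over to the output.
  expect_named(out_l, names(docs_l))
  expect_named(out_c, names(docs_c))

  # Invalid input is rejected with an error.
  expect_error(tokenize_ptb(bad_list))
})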

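test_that("PTB tokenizer produces correct output", {
  # The first "\n" is a real newline and acts as whitespace. The "\\n"
  # sequences are a literal backslash followed by "n", so they stay inside
  # their tokens ("me\\ntwo", "them.\\nThanks").
  sents <-
    c(paste0("Good muffins cost $3.88\nin New York. ",
             "Please buy me\\ntwo of them.\\nThanks."),
      "They'll save and invest more.",
      "hi, my name can't hello,")
  # Only the final period of each string is split off; contractions are
  # split into stem and clitic ("They" + "'ll", "ca" + "n't").
  expected <-
    list(c("Good", "muffins", "cost", "$", "3.88", "in", "New", "York.",
           "Please", "buy", "me\\ntwo", "of", "them.\\nThanks", "."),
         c("They", "'ll", "save", "and", "invest", "more", "."),
         c("hi", ",", "my", "name", "ca", "n't", "hello", ","))
  expect_identical(tokenize_ptb(sents), expected)

  # lowercase = TRUE lowercases the tokens.
  expect_identical(
    tokenize_ptb("This can't work.", lowercase = TRUE, simplify = TRUE),
    c("this", "ca", "n't", "work", ".")
  )
})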

lmullen/tokenizers documentation built on March 28, 2024, 11:12 a.m.

tests/testthat/test-ptb.R