mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines

context("tokenizer")

## C_tokenize(c("fsfsabcdsssfdsffsfsd", "abcdssfdsffsfsd"), "[sf]")
## stringi::stri_split_regex(c("fsfsabcdsssfdsffsfsd", "abcdssfdsffsfsd"), "[sf]")
## C_tokenize(c("aℵζsdfd γκkkdaβδbdc αbbf"), "[ α-ϵ]+")
## C_tokenize(c("aℵb aβb α β", "abcdssfdsffsfsd"), "[ b β]+")

test_that("NULL separator results in no segmentation", {
  corpus <- c("some sentence", "another sentence")
  v <- vocab(corpus, regex = NULL)
  expect_equal(v$term, corpus)
  expect_equal(v, vocab(corpus, regex = ""))
})

test_that("Tokenizer tokenizes with ASCII tokens", {
  expect_equal(C_tokenize(c("fsfsabcdsssfdsffsfsd", "abcdssfdsffsfsd"), "[sf]+"),
               list(c("abcd", "d", "d"), c("abcd", "d", "d")))
})

test_that("Tokenizer tokenizes with Unicode tokens", {
  expect_equal(C_tokenize(c("aℵb aβb α β", "abcdssfdsffsfsd"), "[b ]+"),
               list(c("aℵ", "aβ", "α", "β"), c("a", "cdssfdsffsfsd")))
})

test_that("utf8 string corpus is correctly tokenized with space", {
  corpus <- c("a α b β", "aℵb aβb α β", "ASCII string")
  v <- vocab(corpus, regex = " ")
  expect_equal(v$term, c("α", "β", "a",  "b", "aℵb", "aβb", "ASCII", "string"))
})

test_that("Tokenizer tokenizes with Unicode tokens and Unicode regex", {
  skip_on_os("windows")
  expect_equal(C_tokenize(c("aℵb aβb α β", "abcdssfdsffsfsd"), "[ b β]+"),
               list(c("aℵ", "a", "α"), c("a", "cdssfdsffsfsd")))
})

test_that("Tokenizer handles missing values with Unicode regex", {
  skip_on_os("windows")
  expect_equal(C_tokenize(c(NA, "aℵb aβb α β", NA, "abcdssfdsffsfsd"), "[ b β]+"),
               list(NA_character_, c("aℵ", "a", "α"), NA_character_,  c("a", "cdssfdsffsfsd")))
})

test_that("utf8 string corpus is correctly tokenized with Unicode regex", {
  skip_on_os("windows")
  corpus <- c("a α b β", "aℵb aβb α β", "ASCII string")
  v <- vocab(corpus, regex = "[ abβ]")
  expect_equal(v$term, c("α", "ℵ", "ASCII", "string"))
  ## str(vocab(c(NA, "aℵb aβb α β", NA, "abcdssfdsffsfsd"), regex = "[ b β]"))
})

test_that("Tokenizer tokenizes with Unicode range regexp", {
  skip_on_os("windows")
  expect_equal(C_tokenize(c("aℵζsdfd γκkkdaβδbdc γδϵbbf"), "[ α-ϵ]+"),
               list(c("aℵ", "sdfd", "kkda", "bdc", "bbf")))
})

vspinu/mlvocab documentation built on June 11, 2021, 7:37 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

vspinu/mlvocab
Vocabulary and Corpus Preprocessing for Natural Language Pipelines

tests/testthat/test_tokenizer.R
In vspinu/mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines

R Package Documentation

Browse R Packages

We want your feedback!

vspinu/mlvocab Vocabulary and Corpus Preprocessing for Natural Language Pipelines

tests/testthat/test_tokenizer.R In vspinu/mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines

R Package Documentation

Browse R Packages

We want your feedback!

vspinu/mlvocab
Vocabulary and Corpus Preprocessing for Natural Language Pipelines

tests/testthat/test_tokenizer.R
In vspinu/mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines