Removes sentences shorter than a specified length from a corpus or a character vector.
x | corpus or character object whose sentences will be selected.
what | units of trimming,
min_ntoken, max_ntoken | minimum and maximum lengths in word tokens (excluding punctuation)
exclude_pattern | a stringi regular expression whose match (at the sentence level) will be used to exclude sentences
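To make the interaction of these arguments concrete, here is a minimal base R sketch of sentence-level trimming. It is not quanteda's implementation: the helper name `trim_sentences_sketch` is hypothetical, the sentence splitter is a naive regular expression, and word tokens are approximated as whitespace-separated chunks containing at least one letter or digit.

```r
# Hypothetical helper, NOT part of quanteda: a naive illustration of
# trimming sentences by token count and by a regular expression.
trim_sentences_sketch <- function(x, min_ntoken = 1, max_ntoken = Inf,
                                  exclude_pattern = NULL) {
  vapply(x, function(doc) {
    # Naive sentence split (assumption): break on whitespace that follows ., ! or ?
    sents <- strsplit(doc, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
    # Approximate word tokens: whitespace-separated chunks with a letter or digit,
    # so chunks made only of punctuation are not counted
    ntok <- vapply(strsplit(sents, "\\s+"),
                   function(tok) sum(grepl("[[:alnum:]]", tok)),
                   integer(1))
    keep <- ntok >= min_ntoken & ntok <= max_ntoken
    # Drop any sentence that matches the exclusion pattern
    if (!is.null(exclude_pattern)) {
      keep <- keep & !grepl(exclude_pattern, sents, perl = TRUE)
    }
    # Re-join the surviving sentences; a fully emptied document becomes ""
    paste(sents[keep], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

txt_sketch <- c("PAGE 1. This is a single sentence. Short sentence. Three word sentence.",
                "PAGE 2. Very short! Shorter.")
trim_sentences_sketch(txt_sketch, min_ntoken = 3)
trim_sentences_sketch(txt_sketch, exclude_pattern = "^PAGE \\d+")
```

The sketch filters each document independently and re-joins the surviving sentences, so the result always has one element per input document. The real functions may segment sentences and count tokens differently.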
Returns a corpus or character vector equal in length to the input. If
the input was a corpus, then all docvars and metadata are preserved.
For documents whose sentences have been removed entirely, a null string
("") will be returned.
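Because emptied documents come back as "" rather than being dropped, they can be located afterwards with base R. The snippet below is only a sketch: the vector `trimmed` is made-up illustrative data standing in for a trimmed result, not actual output from these functions.

```r
# Made-up stand-in for a trimmed character vector (assumption, not real output);
# "" marks a document whose sentences were all removed
trimmed <- c(doc1 = "This is a single sentence.",
             doc2 = "",
             doc3 = "Very long sentence, with multiple parts.")

# Names of the documents that were emptied by trimming
emptied <- names(trimmed)[!nzchar(trimmed)]

# Keep only the documents that still contain text
kept <- trimmed[nzchar(trimmed)]
```

Filtering on `nzchar()` afterwards is optional; keeping the empty strings preserves the one-to-one alignment with the original documents.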
txt <- c("PAGE 1. This is a single sentence. Short sentence. Three word sentence.",
"PAGE 2. Very short! Shorter.",
"Very long sentence, with multiple parts, separated by commas. PAGE 3.")
corp <- corpus(txt, docvars = data.frame(serial = 1:3))
texts(corp)
# exclude sentences shorter than 3 tokens
texts(corpus_trim(corp, min_ntoken = 3))
# exclude sentences that start with "PAGE <digit(s)>"
texts(corpus_trim(corp, exclude_pattern = "^PAGE \\d+"))
# trimming character objects
char_trim(txt, "sentences", min_ntoken = 3)
char_trim(txt, "sentences", exclude_pattern = "sentence\\.")