Description
Get or specify the process by which text gets transformed into a sequence of tokens or sentences.
Usage
text_filter(x = NULL, ...)
text_filter(x) <- value
## S3 method for class 'corpus_text'
text_filter(x = NULL, ...)
## S3 method for class 'data.frame'
text_filter(x = NULL, ...)
## Default S3 method:
text_filter(x = NULL, ...,
            map_case = TRUE, map_quote = TRUE,
            remove_ignorable = TRUE,
            combine = NULL,
            stemmer = NULL, stem_dropped = FALSE,
            stem_except = NULL,
            drop_letter = FALSE, drop_number = FALSE,
            drop_punct = FALSE, drop_symbol = FALSE,
            drop = NULL, drop_except = NULL,
            connector = "_",
            sent_crlf = FALSE,
            sent_suppress = corpus::abbreviations_en)
Arguments
x
text or corpus object.
value
text filter object, or NULL for the default.
...
further arguments passed to or from other methods.
map_case
a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
map_quote
a logical value indicating whether to replace curly single quotes and other Unicode apostrophe characters with ASCII apostrophe (U+0027).
remove_ignorable
a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
combine
a character vector of multi-word phrases to combine, or NULL.
stemmer
a character value giving the name of a Snowball stemming algorithm (see stem_snowball for choices), a custom stemming function, or NULL to leave words unchanged.
stem_dropped
a logical value indicating whether to stem words in the "drop" list.
stem_except
a character vector of exception words to exempt from stemming, or NULL.
drop_letter
a logical value indicating whether to replace "letter" tokens with NA.
drop_number
a logical value indicating whether to replace "number" tokens with NA.
drop_punct
a logical value indicating whether to replace "punct" (punctuation) tokens with NA.
drop_symbol
a logical value indicating whether to replace "symbol" tokens with NA.
drop
a character vector of types to replace with NA, or NULL.
drop_except
a character vector of types to exempt from the drop rules specified by the drop_letter, drop_number, drop_punct, drop_symbol, and drop arguments, or NULL.
connector
a character to use as a connector in lieu of white space for types that stem to multi-word phrases.
sent_crlf
a logical value indicating whether to break sentences on carriage returns or line feeds.
sent_suppress
a character vector of sentence break suppressions.
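As a rough illustration of the drop_* arguments described above, the sketch below builds a filter that drops punctuation and number tokens and applies it to the sample sentence used in the package examples. The variable names (f, f_keep, toks) are illustrative only, and the exact token boundaries may vary with the installed version of the corpus package.

```r
library(corpus)

# Start from the default filter and turn on two of the drop_* flags.
f <- text_filter(drop_punct = TRUE, drop_number = TRUE)

# Tokens that fall into a dropped category are replaced with NA
# rather than removed, so token positions are preserved.
toks <- text_tokens("Marnie the Dog is #1 on the internet.", f)
print(toks)

# drop_except exempts specific types from the drop rules; here the
# sentence-final period is kept even though punctuation is dropped.
f_keep <- text_filter(drop_punct = TRUE, drop_except = ".")
print(text_tokens("Marnie the Dog is #1 on the internet.", f_keep))
```

Because dropped tokens become NA instead of disappearing, downstream code that counts or indexes tokens should handle missing values explicitly.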
Details
The set of properties in a text filter determines the tokenization and sentence-breaking rules. See the documentation for text_tokens and text_split for details on the tokenization process.
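The filter also governs sentence breaking. The following is a minimal sketch, assuming text_split is called with units = "sentences" and a filter argument as in the package documentation; the sample text and variable names are made up for illustration.

```r
library(corpus)

# A single sentence that happens to contain a line feed.
txt <- "Split across\na line."

# Default filter: sent_crlf = FALSE, so the line feed is not
# treated as a sentence boundary.
s_default <- text_split(txt, units = "sentences")

# Same text, but with sentence breaks on carriage returns and line feeds.
s_crlf <- text_split(txt, units = "sentences",
                     filter = text_filter(sent_crlf = TRUE))

# Compare the number of sentence pieces under each filter.
print(nrow(s_default))
print(nrow(s_crlf))
```

Passing the filter explicitly leaves any filter stored on the text object untouched, which is convenient for one-off comparisons like this.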
Value
text_filter retrieves an object's text filter, optionally with modifications to some of its properties.
text_filter<- sets an object's text filter. Setting the text filter on a character object is not allowed; the object must have type "corpus_text" or be a data frame with a "text" column of type "corpus_text".
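The restriction on text_filter<- can be checked without halting a script. The sketch below is one possible way to do that with tryCatch; the variable names and the "ok"/"error" labels are illustrative and not part of the package.

```r
library(corpus)

f <- text_filter(map_case = FALSE)

# Allowed: the object has type "corpus_text".
x <- as_corpus_text("Hello World")
text_filter(x) <- f
print(text_filter(x)$map_case)

# Not allowed: a plain character vector. Capture the error
# instead of stopping the script.
y <- "hello world"
result <- tryCatch({
    text_filter(y) <- f
    "ok"
}, error = function(e) "error")
print(result)
```

Converting with as_corpus_text first, as in the first half of the sketch, is the straightforward way to attach a filter to character data.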
See Also
as_corpus_text, text_tokens, text_split, abbreviations, stopwords.
Examples
# text filter with default options set
text_filter()
# specify some options but leave others unchanged
f <- text_filter(map_case = FALSE, drop = stopwords_en)
# set the text filter property
x <- as_corpus_text(c("Marnie the Dog is #1 on the internet."))
text_filter(x) <- f
text_tokens(x) # by default, uses x's text_filter to tokenize
# change a filter property
f2 <- text_filter(x, map_case = TRUE)
# equivalent to:
# f2 <- text_filter(x)
# f2$map_case <- TRUE
text_tokens(x, f2) # override text_filter(x)
# setting text_filter on a data frame is allowed if it has a
# column named "text" of type "corpus_text"
d <- data.frame(text = x)
text_filter(d) <- f2
text_tokens(d)
# but you can't set text filters on character objects
y <- "hello world"
## Not run: text_filter(y) <- f2 # gives an error
d2 <- data.frame(text = "hello world", stringsAsFactors = FALSE)
## Not run: text_filter(d2) <- f2 # gives an error
Text filter with the following options:
map_case: TRUE
map_quote: TRUE
remove_ignorable: TRUE
stemmer: NULL
stem_dropped: FALSE
stem_except: NULL
combine: chr [1:155] "A." "A.D." "a.m." "A.M." "A.S." "AA." "AB." ...
drop_letter: FALSE
drop_number: FALSE
drop_punct: FALSE
drop_symbol: FALSE
drop: NULL
drop_except: NULL
sent_crlf: FALSE
sent_suppress: chr [1:155] "A." "A.D." "a.m." "A.M." "A.S." "AA." ...
[[1]]
[1] "Marnie" NA "Dog" NA "#" "1"
[7] NA NA "internet" "."
[[1]]
[1] "marnie" NA "dog" NA "#" "1"
[7] NA NA "internet" "."
[[1]]
[1] "marnie" NA "dog" NA "#" "1"
[7] NA NA "internet" "."