text_filter: Text Filters
In corpus: Text Corpus Analysis


Get or specify the process by which text gets transformed into a sequence of tokens or sentences.

text_filter(x = NULL, ...)
text_filter(x) <- value

## S3 method for class 'corpus_text'
text_filter(x = NULL, ...)

## S3 method for class 'data.frame'
text_filter(x = NULL, ...)

## Default S3 method:
text_filter(x = NULL, ...,
            map_case = TRUE, map_quote = TRUE,
            remove_ignorable = TRUE,
            combine = NULL,
            stemmer = NULL, stem_dropped = FALSE,
            stem_except = NULL,
            drop_letter = FALSE, drop_number = FALSE,
            drop_punct = FALSE, drop_symbol = FALSE,
            drop = NULL, drop_except = NULL,
            connector = "_",
            sent_crlf = FALSE,
            sent_suppress = corpus::abbreviations_en)

`x`	text or corpus object.
`value`	text filter object, or `NULL` for the default.
`...`	further arguments passed to or from other methods.
`map_case`	a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
`map_quote`	a logical value indicating whether to replace curly single quotes and other Unicode apostrophe characters with ASCII apostrophe (U+0027).
`remove_ignorable`	a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
`combine`	a character vector of multi-word phrases to combine, or `NULL`; see ‘Combining words’.
`stemmer`	a character value giving the name of a Snowball stemming algorithm (see `stem_snowball` for choices), a custom stemming function, or `NULL` to leave words unchanged.
`stem_dropped`	a logical value indicating whether to stem words in the `drop` list.
`stem_except`	a character vector of exception words to exempt from stemming, or `NULL`. If left unspecified, `stem_except` is set equal to the `drop` argument.
`drop_letter`	a logical value indicating whether to replace `"letter"` tokens (cased letters, kana, ideographic, letter-like numeric characters and other letters) with `NA`.
`drop_number`	a logical value indicating whether to replace `"number"` tokens (decimal digits, words appearing to be numbers, and other numeric characters) with `NA`.
`drop_punct`	a logical value indicating whether to replace `"punct"` tokens (punctuation) with `NA`.
`drop_symbol`	a logical value indicating whether to replace `"symbol"` tokens (emoji, math, currency, URLs, and other symbols) with `NA`.
`drop`	a character vector of types to replace with `NA`, or `NULL`.
`drop_except`	a character vector of types to exempt from the drop rules specified by the `drop_letter`, `drop_number`, `drop_punct`, `drop_symbol`, and `drop` arguments, or `NULL`.
`connector`	a character to use as a connector in lieu of white space for types that stem to multi-word phrases.
`sent_crlf`	a logical value indicating whether to break sentences on carriage returns or line feeds.
`sent_suppress`	a character vector of sentence break suppressions.
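
As a rough illustration of how the drop arguments interact, the sketch below drops punctuation and English stop words while exempting one word through `drop_except`. The sample sentence and the choice of "the" as the exempted word are assumptions made for this example only.

```r
library(corpus)

# Drop punctuation tokens and English stop words, but exempt "the"
# from every drop rule (illustrative choice, not a recommendation).
f <- text_filter(drop_punct = TRUE,
                 drop = stopwords_en,
                 drop_except = "the")

# text_tokens returns a list with one character vector per text;
# dropped tokens appear as NA rather than being removed.
tokens <- text_tokens("The dog is on the internet.", f)[[1]]
print(tokens)
```

Dropped tokens keep their positions as `NA`, so token offsets stay aligned with the original text.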

The set of properties in a text filter determines the tokenization and sentence breaking rules. See the documentation for text_tokens and text_split for details on the tokenization process.
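
The sentence-breaking properties can be explored in the same way as the token properties. The following is a minimal sketch, assuming a two-line sample string, that compares the number of sentences text_split reports with and without `sent_crlf`; the variable names are hypothetical.

```r
library(corpus)

# Sample text with a line feed and no sentence-final punctuation
# on the first line (made up for this example).
txt <- "first line\nsecond line"

# Split into sentences with the default filter (sent_crlf = FALSE)
default_split <- text_split(txt, "sentences")

# Split again, this time allowing breaks on carriage returns and line feeds
crlf_split <- text_split(txt, "sentences",
                         filter = text_filter(sent_crlf = TRUE))

# Compare the number of sentences found under each filter
print(nrow(default_split))
print(nrow(crlf_split))
```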

text_filter retrieves an object's text filter, optionally with modifications to some of its properties.

text_filter<- sets an object's text filter. Setting the text filter on a character object is not allowed; the object must have type "corpus_text" or be a data frame with a "text" column of type "corpus_text".
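
The restriction on character objects can be handled by converting to "corpus_text" first. The sketch below, which assumes a made-up sample string and hypothetical variable names, catches the error from the disallowed assignment and then performs the same assignment on a converted object.

```r
library(corpus)

y <- "hello world"
f <- text_filter(map_case = FALSE)

# Setting a filter on a plain character vector is not allowed,
# so wrap the attempt in tryCatch and record what happened.
result <- tryCatch({
    text_filter(y) <- f
    "ok"
}, error = function(e) "error")
print(result)

# Convert to corpus_text first; the assignment is then permitted.
x <- as_corpus_text(y)
text_filter(x) <- f

# Read the property back from the object's filter
print(text_filter(x)$map_case)
```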

as_corpus_text, text_tokens, text_split, abbreviations, stopwords.

# text filter with default options set
text_filter()

# specify some options but leave others unchanged
f <- text_filter(map_case = FALSE, drop = stopwords_en)

# set the text filter property
x <- as_corpus_text(c("Marnie the Dog is #1 on the internet."))
text_filter(x) <- f
text_tokens(x) # by default, uses x's text_filter to tokenize

# change a filter property
f2 <- text_filter(x, map_case = TRUE)
# equivalent to:
# f2 <- text_filter(x)
# f2$map_case <- TRUE

text_tokens(x, f2) # override text_filter(x)

# setting text_filter on a data frame is allowed if it has a
# column named "text" of type "corpus_text"
d <- data.frame(text = x)
text_filter(d) <- f2
text_tokens(d)

# but you can't set text filters on character objects
y <- "hello world"
## Not run: text_filter(y) <- f2 # gives an error

d2 <- data.frame(text = "hello world", stringsAsFactors = FALSE)
## Not run: text_filter(d2) <- f2 # gives an error

Text filter with the following options:

	map_case: TRUE
	map_quote: TRUE
	remove_ignorable: TRUE
	stemmer: NULL
	stem_dropped: FALSE
	stem_except: NULL
	combine:  chr [1:155] "A." "A.D." "a.m." "A.M." "A.S." "AA." "AB." ...
	drop_letter: FALSE
	drop_number: FALSE
	drop_punct: FALSE
	drop_symbol: FALSE
	drop: NULL
	drop_except: NULL
	sent_crlf: FALSE
	sent_suppress:  chr [1:155] "A." "A.D." "a.m." "A.M." "A.S." "AA." ...
[[1]]
 [1] "Marnie"   NA         "Dog"      NA         "#"        "1"       
 [7] NA         NA         "internet" "."       

[[1]]
 [1] "marnie"   NA         "dog"      NA         "#"        "1"       
 [7] NA         NA         "internet" "."       

[[1]]
 [1] "marnie"   NA         "dog"      NA         "#"        "1"       
 [7] NA         NA         "internet" "."