texts: Get or assign corpus texts

View source: R/corpus-methods-quanteda.R

Description

Get or replace the texts in a corpus, optionally concatenating them by group. Also works for a plain character vector, provided groups is a factor.

Usage

texts(x, groups = NULL, spacer = " ")

texts(x) <- value

## S3 method for class 'corpus'
as.character(x, ...)

Arguments

x

a corpus or character object

groups

either a character vector containing the names of document variables to be used for grouping, or a factor (or an object that can be coerced into a factor) whose length or number of rows equals the number of documents. Documents whose grouping value is NA are dropped. See groups for details.

spacer

the string inserted between texts when they are concatenated by groups. (Default is two spaces.)

value

character vector of the new texts

...

unused

Details

as.character(x), where x is a corpus, is equivalent to calling texts(x).

Value

For texts, a character vector of the texts in the corpus.

For texts <-, the corpus with its texts replaced by value.

For as.character, the same character vector that texts(x) returns.

Note

Texts that share a value of groups are concatenated into a single text. The order in which texts are aggregated within a group is not specified.
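
The grouping behaviour described here can be approximated in base R. The sketch below is an illustration only, not the package's implementation: group_texts is a hypothetical helper, and its spacer default is an assumption chosen for the example. It shows texts with a shared group value being pasted together and elements with an NA group being dropped. Unlike texts(), this sketch happens to keep the original document order within each group.

```r
# Hypothetical helper (not part of quanteda): concatenate texts by group.
group_texts <- function(x, groups, spacer = " ") {
  stopifnot(length(x) == length(groups))
  # split() coerces groups to a factor and drops elements whose group is NA
  pieces <- split(x, groups)
  # collapse each group's texts into one string, separated by spacer
  vapply(pieces, paste, character(1), collapse = spacer)
}

txt <- c(d1 = "First text.", d2 = "Second text.",
         d3 = "Third text.", d4 = "Ungrouped text.")
grouped <- group_texts(txt, c("A", "B", "A", NA))
print(grouped)
```

The result has one element per group level, named by the level, so nchar(grouped) gives per-group character counts in the same way as the grouped calls in the Examples section.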

As a matter of good text analysis practice, you are strongly encouraged not to modify the substance of the texts in a corpus. Processing of this kind is better performed through downstream operations. For instance, do not lowercase the texts in a corpus, or you will never be able to recover the original case. Instead, apply tokens_tolower() after applying tokens() to a corpus, or use the option tolower = TRUE in dfm().
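
The principle can be sketched in base R without the package. This is an analogy under stated assumptions, not the package's workflow: strsplit() on whitespace is a crude stand-in for tokens(), tolower() stands in for tokens_tolower(), and the sample texts are invented for illustration. The point is that the source texts are never overwritten, so the original case stays recoverable.

```r
# Source texts: kept untouched throughout.
original <- c(doc1 = "The Senate met in Washington.",
              doc2 = "WE the People")

# Downstream step 1: tokenize on whitespace (stand-in for tokens()).
toks <- strsplit(original, "\\s+")

# Downstream step 2: lowercase the tokens only (stand-in for tokens_tolower()).
toks_lower <- lapply(toks, tolower)

print(toks_lower$doc2)
print(original[["doc2"]])  # the original case is still available
```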

Examples

nchar(texts(corpus_subset(data_corpus_inaugural, Year < 1806)))

# grouping on a document variable
nchar(texts(corpus_subset(data_corpus_inaugural, Year < 1806), groups = "President"))

# grouping the first five documents: by a document variable, then by a factor
nchar(texts(data_corpus_inaugural[1:5],
      groups = "President"))
nchar(texts(data_corpus_inaugural[1:5],
      groups = factor(c("W", "W", "A", "J", "J"))))

# replacing all texts: convert British spellings to American ones
corp <- corpus(c("We must prioritise honour in our neighbourhood.",
                 "Aluminium is a valourous metal."))
texts(corp) <-
    stringi::stri_replace_all_regex(texts(corp),
                                   c("ise", "([nlb])our", "nium"),
                                   c("ize", "$1or", "num"),
                                   vectorize_all = FALSE)
texts(corp)

# replacing a single text by position
texts(corp)[2] <- "New text number 2."
texts(corp)

koheiw/quanteda.core documentation built on Sept. 21, 2020, 3:44 p.m.