corpus | R Documentation |
Creates a corpus object from available sources. The currently available sources are:
a character vector, consisting of one document per element; if the elements are named, these names will be used as document names.
a data.frame (or a tibble tbl_df), whose default document id is a variable identified by docid_field; the text of the document is a variable identified by text_field; and other variables are imported as document-level meta-data. This matches the format of data.frames constructed by the readtext package.
a kwic object constructed by kwic().
a tm VCorpus or SimpleCorpus class object, with the fixed metadata fields imported as docvars and corpus-level metadata imported as meta information.
a corpus object.
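As a minimal sketch of the first source type, the following builds a corpus from a named character vector. The texts and the names doc_a and doc_b are hypothetical, chosen only for illustration.

```r
library(quanteda)

# Hypothetical example texts; the element names become the document names.
txt <- c(doc_a = "The first short document.",
         doc_b = "A second, slightly longer document.")
corp_chr <- corpus(txt)

docnames(corp_chr)  # document names taken from names(txt)
ndoc(corp_chr)      # number of documents in the corpus
```

If the vector is unnamed, document names are assigned automatically instead.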
corpus(x, ...)

## S3 method for class 'corpus'
corpus(
  x,
  docnames = quanteda::docnames(x),
  docvars = quanteda::docvars(x),
  meta = quanteda::meta(x),
  ...
)

## S3 method for class 'character'
corpus(
  x,
  docnames = NULL,
  docvars = NULL,
  meta = list(),
  unique_docnames = TRUE,
  ...
)

## S3 method for class 'data.frame'
corpus(
  x,
  docid_field = "doc_id",
  text_field = "text",
  meta = list(),
  unique_docnames = TRUE,
  ...
)

## S3 method for class 'kwic'
corpus(
  x,
  split_context = TRUE,
  extract_keyword = TRUE,
  meta = list(),
  concatenator = " ",
  ...
)

## S3 method for class 'Corpus'
corpus(x, ...)
x | a valid corpus source object |
... | not used directly |
docnames | Names to be assigned to the texts. Defaults to the names of the character vector (if any); |
docvars | a data.frame of document-level variables associated with each text |
meta | a named list that will be added to the corpus as corpus-level, user meta-data. This can later be accessed or updated using |
unique_docnames | logical; if |
docid_field | optional column index of a document identifier; defaults to "doc_id", but if this is not found, then will use the rownames of the data.frame; if the rownames are not set, it will use the default sequence based on |
text_field | the character name or numeric index of the source |
split_context | logical; if |
extract_keyword | logical; if |
concatenator | character between tokens, default is the whitespace. |
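The data.frame arguments can be illustrated with a short sketch. The data frame and its column names (id, body, year) are hypothetical; the point is only that docid_field and text_field select the identifier and text columns, while remaining columns become document variables.

```r
library(quanteda)

# Hypothetical data frame with non-default column names.
speeches <- data.frame(
  id   = c("speech1", "speech2"),
  body = c("We will lower taxes.", "We will build more schools."),
  year = c(2019L, 2020L),
  stringsAsFactors = FALSE
)

# Point corpus() at the identifier and text columns explicitly.
corp_df <- corpus(speeches, docid_field = "id", text_field = "body")

docnames(corp_df)  # taken from the id column
docvars(corp_df)   # remaining columns, here including year
```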
The texts and document variables of corpus objects can also be accessed using index notation and the $ operator for accessing or assigning docvars. For details, see [.corpus().
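A brief sketch of index notation and the $ operator, assuming quanteda >= 2.0. The texts and the docvar names group and score are hypothetical.

```r
library(quanteda)

corp <- corpus(
  c(d1 = "Text one.", d2 = "Text two.", d3 = "Text three."),
  docvars = data.frame(group = c("a", "b", "a"), stringsAsFactors = FALSE)
)

corp$group                # read an existing document variable
corp$score <- c(1, 2, 3)  # assign a new document variable
corp_sub <- corp[1:2]     # keep only the first two documents

ndoc(corp_sub)
names(docvars(corp))
```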
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.
For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
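The advice above might look like the following in practice: a sketch that reads and updates a corpus only through extractor and replacement functions, never through attributes(). The text, the meta field names, and the new document name are hypothetical.

```r
library(quanteda)

corp_acc <- corpus(c(a = "Some text."),
                   meta = list(source = "hypothetical example"))

as.character(corp_acc)    # the texts, as a character vector
meta(corp_acc, "source")  # one corpus-level meta-data field

# Replacement functions update the object without touching its internals.
docnames(corp_acc) <- "renamed"
meta(corp_acc, "note") <- "added via meta<-"

docnames(corp_acc)
meta(corp_acc)
```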
corpus, docvars(), meta(), as.character.corpus(), ndoc(), docnames()
# create a corpus from texts
corpus(data_char_ukimmig2010)
# create a corpus from texts and assign document variables
summary(corpus(data_char_ukimmig2010,
               docvars = data.frame(party = names(data_char_ukimmig2010))), 5)
# import a tm VCorpus
if (requireNamespace("tm", quietly = TRUE)) {
    data(crude, package = "tm")    # load in a tm example VCorpus
    vcorp <- corpus(crude)
    summary(vcorp)

    data(acq, package = "tm")
    summary(corpus(acq), 5)

    vcorp2 <- tm::VCorpus(tm::VectorSource(data_char_ukimmig2010))
    corp <- corpus(vcorp2)
    summary(corp)
}
# construct a corpus from a data.frame
dat <- data.frame(letter_factor = factor(rep(letters[1:3], each = 2)),
                  some_ints = 1L:6L,
                  some_text = paste0("This is text number ", 1:6, "."),
                  stringsAsFactors = FALSE,
                  row.names = paste0("fromDf_", 1:6))
dat
summary(corpus(dat, text_field = "some_text",
               meta = list(source = "From a data.frame called dat.")))