as.sento_corpus: Convert a quanteda or tm corpus object into a sento_corpus...

Description

Converts the most common quanteda and tm corpus objects into a sento_corpus object. Suitable metadata found in the input is carried over as features: for a quanteda corpus it is taken from docvars(x); for a tm corpus, only meta(x, type = "indexed") metadata is considered.

Usage

as.sento_corpus(x, dates = NULL, do.clean = FALSE)

Arguments

x

a quanteda corpus object, a tm SimpleCorpus or a tm VCorpus object. For tm corpora, every corpus element should consist of a single "content" character vector as the document unit.

dates

an optional sequence of dates in "yyyy-mm-dd" format, with one entry per document in the input corpus, used to define the "date" column. If dates = NULL, the "date" metadata element of the input corpus is used when available; it must be in the same "yyyy-mm-dd" format.

do.clean

see sento_corpus.
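
The requirements on the dates argument (one "yyyy-mm-dd" entry per document) can be checked before conversion. The following is a minimal base R sketch, not part of the package: n_docs is a hypothetical stand-in for the number of documents in the input corpus, and the start date is arbitrary.

```r
# Hypothetical document count; in practice this would be the number of
# documents in the input corpus.
n_docs <- 5L

# One date per document, formatted as "yyyy-mm-dd" strings.
dates <- format(seq(as.Date("2021-01-01"), by = "day", length.out = n_docs),
                "%Y-%m-%d")

# Sanity checks before passing `dates` to as.sento_corpus():
# the length matches the number of documents, every entry has the
# expected shape, and every entry parses as a valid calendar date.
stopifnot(length(dates) == n_docs)
stopifnot(all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dates)))
stopifnot(!anyNA(as.Date(dates, format = "%Y-%m-%d")))
```

A vector built this way could then be supplied as the second argument, as in as.sento_corpus(x, dates).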

Value

A sento_corpus object, as returned by the sento_corpus function.

Author(s)

Samuel Borms

See Also

corpus, SimpleCorpus, VCorpus, sento_corpus

Examples

data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")

# reshuffle usnews data.frame for use in quanteda and tm
dates <- usnews$date
# add a non-numeric column (sento_corpus features are expected to be numeric)
usnews$wrong <- "notNumeric"
colnames(usnews)[c(1, 3)] <- c("doc_id", "text")

# conversion from a quanteda corpus
qcorp <- quanteda::corpus(usnews,
                          text_field = "text", docid_field = "doc_id")
corp1 <- as.sento_corpus(qcorp)
corp2 <- as.sento_corpus(qcorp, sample(dates)) # overwrites "date" column

# conversion from a tm SimpleCorpus corpus (DataframeSource)
tmSCdf <- tm::SimpleCorpus(tm::DataframeSource(usnews))
corp3 <- as.sento_corpus(tmSCdf)

# conversion from a tm SimpleCorpus corpus (DirSource)
tmSCdir <- tm::SimpleCorpus(tm::DirSource(txt))
corp4 <- as.sento_corpus(tmSCdir, dates[1:length(tmSCdir)])

# conversion from a tm VCorpus corpus (DataframeSource)
tmVCdf <- tm::VCorpus(tm::DataframeSource(usnews))
corp5 <- as.sento_corpus(tmVCdf)

# conversion from a tm VCorpus corpus (DirSource)
tmVCdir <- tm::VCorpus(tm::DirSource(reuters),
                       list(reader = tm::readReut21578XMLasPlain))
corp6 <- as.sento_corpus(tmVCdir, dates[1:length(tmVCdir)])

sborms/sentometrics documentation built on Aug. 21, 2021, 6:40 a.m.