polmineR-package | R Documentation |
A library for corpus analysis using the Corpus Workbench (CWB) as an efficient back end for indexing and querying large corpora.
polmineR()
The package offers functionality to flexibly create partitions and to carry out basic statistical operations (count, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. Respective data structures (document term matrices, term co-occurrence matrices etc.) can be created based on the indexed corpora.
A session registry directory (see registry()) combines the registry
files for corpora that may reside anywhere on the system. Upon loading
'polmineR', the files in the registry directory defined by the
environment variable CORPUS_REGISTRY are copied to the session registry
directory. To see whether the environment variable CORPUS_REGISTRY is set,
use the Sys.getenv() function. Corpora wrapped in R data packages can be
activated using the function use().
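The check described above can be sketched with base R alone. This is a minimal illustration, not part of the package: the variable name `registry_dir` is an assumption introduced here, and the messages are only examples.

```r
# Look up the environment variable; an unset variable yields "".
registry_dir <- Sys.getenv("CORPUS_REGISTRY", unset = "")

# nzchar() is TRUE only for a non-empty string.
if (nzchar(registry_dir)) {
  message("CORPUS_REGISTRY is set to: ", registry_dir)
} else {
  message("CORPUS_REGISTRY is not set.")
}
```

Running this before loading 'polmineR' shows the system registry that would be copied to the session registry directory.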
The package includes a draft shiny app that can be called using polmineR().
polmineR.p_attribute: The default positional attribute.
polmineR.left: Default value for left context.
polmineR.lineview: A logical value to activate lineview output of kwic().
polmineR.pagelength: Maximum number of lines to show when preparing output using DT::datatable() (defaults to 10L).
polmineR.meta: Default metadata (s-attributes) to show.
polmineR.mc: Whether to use multiple cores.
polmineR.cores: Number of cores to use. Passed as argument cl into mclapply().
polmineR.browse: Whether to show output in browser.
polmineR.buttons: A logical value, whether to display buttons when preparing an htmlwidget using DT::datatable().
polmineR.specialChars:
polmineR.cutoff: Maximum number of characters to display when preparing html output.
polmineR.mdsub: A list of pairs of character vectors defining regular expression substitutions applied as part of preprocessing documents for html display. Intended usage: Remove characters that would be misinterpreted as markdown formatting instructions.
polmineR.corpus_registry: The system corpus registry directory defined by the environment variable CORPUS_REGISTRY before the polmineR package has been loaded. The polmineR package uses a temporary registry directory to be able to use corpora stored at multiple locations in one session. The path to the system corpus registry directory captures this setting to keep it available if necessary.
polmineR.shiny: A logical value, whether polmineR is used in the context of a shiny app. Used to control the appearance of progress bars depending on whether a shiny app is running or not.
polmineR.warn.size: When generating HTML table widgets (e.g. when preparing kwic output to be displayed in RStudio's Viewer pane), the function DT::datatable() that is used internally will issue a warning by default if the object size of the table is greater than 1500000. The warning addresses a client-server scenario that is not applicable in the context of a local RStudio session, so you may want to turn it off. Internally, the warning can be suppressed by setting the option "DT.warn.size" to FALSE. The polmineR option "polmineR.warn.size" is processed by functions calling DT::datatable() to set and reset the value of "DT.warn.size". Please note: The formulation of the warning does not match the scenario of a local RStudio session, but it may still be useful to get a warning when tables are large and slow to process. Therefore, the default value of the setting is FALSE.
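The options listed above are ordinary R options, so they can be set and restored with base R. The following is a minimal sketch: the values 10L and 25L are arbitrary illustrations, and the helper `with_dt_warn_size()` is a hypothetical name introduced here to show the set-and-reset pattern described for "DT.warn.size". It is not polmineR's actual implementation.

```r
# Set two of the options listed above. options() returns the previous
# values invisibly, so they can be restored later.
old <- options(polmineR.left = 10L, polmineR.pagelength = 25L)

getOption("polmineR.left")        # value now in effect for this session
getOption("polmineR.pagelength")

# Restore whatever was set before (unset options are removed again).
options(old)

# Hypothetical helper: copy "polmineR.warn.size" into "DT.warn.size"
# while an expression is evaluated, then reset "DT.warn.size".
with_dt_warn_size <- function(expr) {
  old_dt <- options(DT.warn.size = getOption("polmineR.warn.size", FALSE))
  on.exit(options(old_dt), add = TRUE)  # reset even if expr fails
  force(expr)                           # evaluate with the option in place
}
```

A call such as `with_dt_warn_size(getOption("DT.warn.size"))` returns the temporary value, and the previous value of "DT.warn.size" is back in place once the call returns.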
Andreas Blaette (andreas.blaette@uni-due.de)
Jockers, Matthew L. (2014): Text Analysis with R for Students of Literature. Cham et al.: Springer.
Baker, Paul (2006): Using Corpora in Discourse Analysis. London: Continuum.
# The REUTERS corpus included in the RcppCWB package is used in examples
use(pkg = "RcppCWB", corpus = "REUTERS") # activate REUTERS corpus
r <- corpus("REUTERS")
if (interactive()) show_info(r)
# The package includes GERMAPARLMINI as sample data
use("polmineR") # activate GERMAPARLMINI
gparl <- corpus("GERMAPARLMINI")
if (interactive()) show_info(gparl)
# Core methods
count("REUTERS", query = "oil")
count("REUTERS", query = c("oil", "barrel"))
count("REUTERS", query = '"Saudi" "Arab.*"', breakdown = TRUE, cqp = TRUE)
dispersion("REUTERS", query = "oil", s_attribute = "id")
k <- kwic("REUTERS", query = "oil")
coocs <- cooccurrences("REUTERS", query = "oil")
# Core methods applied to partition
kuwait <- partition("REUTERS", places = "kuwait", regex = TRUE)
C <- count(kuwait, query = "oil")
D <- dispersion(kuwait, query = "oil", s_attribute = "id")
K <- kwic(kuwait, query = "oil", meta = "id")
CO <- cooccurrences(kuwait, query = "oil")
# Go back to full text
p <- partition("REUTERS", id = 127)
if (interactive()) read(p)
h <- html(p)
h_highlighted <- highlight(h, highlight = list(yellow = "oil"))
if (interactive()) h_highlighted
# Generate term document matrix (not run by default to save time)
pb <- partition_bundle("REUTERS", s_attribute = "id")
cnt <- count(pb, p_attribute = "word")
tdm <- as.TermDocumentMatrix(cnt, col = "count")