Set up an object of the partition
class. Frequency lists are computed and kept
in the stat slot if pAttribute is not NULL.
partition(.Object, ...)
## S4 method for signature 'character'
partition(.Object, def = NULL, name = "",
encoding = NULL, pAttribute = NULL, meta = NULL, regex = FALSE,
xml = "flat", id2str = TRUE, type = NULL, mc = FALSE,
verbose = TRUE, ...)
## S4 method for signature 'list'
partition(.Object, ...)
## S4 method for signature 'environment'
partition(.Object, slots = c("name", "corpus", "size",
"pAttribute"))
## S4 method for signature 'partition'
partition(.Object, def = NULL, name = "",
regex = FALSE, pAttribute = NULL, id2str = TRUE, type = NULL,
verbose = TRUE, mc = FALSE, ...)
.Object |
character-vector - the CWB-corpus to be used |
... |
parameters passed into the partition-method |
def |
list consisting of a set of character vectors (see details and examples) |
name |
name of the new partition, defaults to "" |
encoding |
encoding of the corpus (typically "LATIN1" or "UTF-8"); if NULL, the encoding provided in the registry file of the corpus (charset="...") will be used |
pAttribute |
the pAttribute(s) for which term frequencies shall be retrieved |
meta |
a character vector |
regex |
logical (defaults to FALSE); if TRUE, the s-attributes provided will be handled as regular expressions; the length of the character vectors with s-attributes then needs to be 1 |
xml |
either 'flat' (default) or 'nested' |
id2str |
whether to turn token ids to strings (set FALSE to minimize object.size / memory consumption) |
type |
character vector (length 1) specifying the type of corpus / partition (e.g. "plpr") |
mc |
whether to use multicore (for counting terms) |
verbose |
logical, defaults to TRUE |
slots |
character vector |
The function sets up a partition based on a list of s-attributes with respective values.
The s-attributes defining the partition are a list, e.g. list(text_type="speech", text_year="2013").
The values of the list may contain regular expressions. To use regular expression syntax, set the
parameter regex to TRUE. Regular expressions are passed into grep, i.e. the regex syntax
used in R needs to be used (double backslashes etc.).
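As an illustration of the regex syntax involved, here is a minimal base R sketch. It does not call partition itself; the vector speakers is a hypothetical stand-in for the values of an s-attribute such as text_name, and grep is applied to it directly.

```r
# Toy s-attribute values (hypothetical, not taken from a real corpus)
speakers <- c("Angela Dorothea Merkel", "Volker Kauder", "Dr. Angela Merkel")

# A pattern of the kind that would be supplied together with regex = TRUE
hits <- grep(".*Merkel", speakers, value = TRUE)

# Backslashes have to be doubled in R string literals:
# "\\." matches a literal dot, whereas "." matches any character
titled <- grep("^Dr\\.", speakers, value = TRUE)

print(hits)
print(titled)
```

The same pattern strings can be tried out this way before they are used as s-attribute values in a partition call.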
The XML imported into the CWB may be "flat" or "nested". This needs to be indicated with the
parameter xml (default is "flat"). If you generate a partition based on a
flat XML structure, some performance gain may be achieved when ordering the s-attributes
with decreasingly restrictive conditions. If you have a nested XML, it is mandatory that the
order of the s-attributes provided reflects the hierarchy of the XML: the top-level elements
need to be positioned at the beginning of the list with the s-attributes, the most restrictive
elements at the end.
If pAttribute is not NULL, a count of tokens in the corpus will be performed and kept in the
stat slot of the partition object. The length of the pAttribute character vector may be 1
or more. If two or more p-attributes are provided, the occurrence of combinations will be counted.
A typical scenario is to combine the p-attributes "word" or "lemma" and "pos".
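What counting combinations of p-attributes means can be sketched in base R. This is an illustration only and not how the package computes the stat slot; the data frame tokens and its contents are made up for the example.

```r
# Toy token stream with two p-attributes (hypothetical data)
tokens <- data.frame(
  word = c("the", "house", "the", "house", "houses"),
  pos  = c("DET", "NOUN", "DET", "VERB", "NOUN"),
  stringsAsFactors = FALSE
)

# One p-attribute: frequency of each word form
single <- as.data.frame(table(word = tokens$word), stringsAsFactors = FALSE)

# Two p-attributes: frequency of each observed word/pos combination
combined <- aggregate(
  count ~ word + pos,
  data = transform(tokens, count = 1L),
  FUN = sum
)

print(single)
print(combined)
```

With a single p-attribute, "house" is counted twice; with the word/pos combination, the noun and the verb reading are counted separately.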
An object of the S4 class 'partition'
Andreas Blaette
## Not run:
use(polmineR.sampleCorpus)
spd <- partition(
"PLPRBTTXT", text_party="SPD", text_type="speech"
)
kauder <- partition(
"PLPRBTTXT", text_name="Volker Kauder", pAttribute="word"
)
merkel <- partition(
"PLPRBTTXT", text_name=".*Merkel",
pAttribute="word", regex=TRUE
)
sAttributes(merkel, "text_date")
sAttributes(merkel, "text_name")
merkel <- partition(
"PLPRBTTXT", text_name="Angela Dorothea Merkel",
text_date="2009-11-10", text_type="speech", pAttribute="word"
)
merkel <- subset(merkel, !word %in% punctuation)
merkel <- subset(merkel, !word %in% tm::stopwords("de"))
# a certain defined time segment
if (require("chron")){
firstDay <- "2009-10-28"
lastDay <- "2009-11-11"
days <- strftime(
chron::seq.dates(
from = strftime(firstDay, format="%m/%d/%Y"),
to = strftime(lastDay, format="%m/%d/%Y"),
by="days"),
format="%Y-%m-%d"
)
period <- partition("PLPRBTTXT", text_date=days)
}
## End(Not run)