parse.corpus: Perform pre-processing (tokenization, n-gram extracting,...
In computationalstylistics/stylo: Stylometric Multivariate Analyses

parse.corpus

R Documentation

Perform pre-processing (tokenization, n-gram extracting, etc.)

Description

A high-level function that controls a number of other functions responsible for dealing with a raw corpus stored as a list, including deleting markup, sampling from texts, converting samples to n-grams, etc. It is built on top of a number of functions and thus it accepts a large number of arguments. The only obligatory argument, however, is an R object containing a raw corpus: it is either an object of the class stylo.corpus, or a list of vectors, their elements being particular texts.

Usage

parse.corpus(input.data, markup.type = "plain",
                      corpus.lang = "English", splitting.rule = NULL,
                      sample.size = 10000, sampling = "no.sampling",
                      sample.overlap = 0, number.of.samples = 1,
                      sampling.with.replacement = FALSE, features = "w", 
                      ngram.size = 1, preserve.case = FALSE,
                      encoding = "UTF-8")

Arguments

`input.data`	a list (preferably of the class `stylo.corpus`) containing a raw corpus, i.e. a vector of texts.
`markup.type`	choose one of the following values: `plain` (nothing will happen), `html` (all tags will be deleted as well as HTML header), `xml` (TEI header, any text between <note> </note> tags, and all the tags will be deleted), `xml.drama` (as above; additionally, speakers' names will be deleted, i.e. strings within the <speaker> </speaker> tags), `xml.notitles` (as above; but, additionally, all the chapter/section (sub)titles will be deleted, i.e. strings within each of the <head> </head> tags); see `delete.markup` for further details.
`corpus.lang`	an optional argument indicating the language of the texts analyzed; the values that will affect the function's behavior are: `English.contr`, `English.all`, `Latin.corr` (type `help(txt.to.words.ext)` for explanation). The default value is `English`.
`splitting.rule`	if you are not satisfied with the default language settings (or your input string of characters is not a regular text, but a sequence of, say, dance movements represented using symbolic signs), you can indicate your custom splitting regular expression here. This option will overwrite the above language settings. For further details, refer to `help(txt.to.words)`.
`sample.size`	desired size of samples, expressed in number of words; default value is 10,000.
`sampling`	one of three values: `no.sampling` (default), `normal.sampling`, `random.sampling`. See `make.samples` for explanation.
`sample.overlap`	if this option is used, a reference text is segmented into consecutive, equal-sized samples that are allowed to partially overlap. If one specifies the `sample.size` parameter of 5,000 and the `sample.overlap` of 1,000, for example, the first sample of a text contains words 1–5,000, the second 4,001–9,000, the third sample 8,001–13,000, and so forth.
`number.of.samples`	optional argument which will be used only if `random.sampling` was chosen; it specifies the number of random samples to be drawn (default: 1).
`sampling.with.replacement`	optional argument which will be used only if `random.sampling` was chosen; it specifies the method used to randomly harvest words from texts.
`features`	an option for specifying the desired type of features: `w` for words, `c` for characters (default: `w`). See `txt.to.features` for further details.
`ngram.size`	an optional argument (integer) specifying the value of n, or the size of n-grams to be produced. If this argument is missing, the default value of 1 is used. See `txt.to.features` for further details.
`preserve.case`	whether or not to preserve the original case of characters; if `FALSE` (the default), all characters in the corpus are lowercased.
`encoding`	useful if you use Windows and non-ASCII alphabets: French, Polish, Hebrew, etc. In such a situation, it is quite convenient to convert your text files into Unicode and to set this option to `encoding = "UTF-8"`. On Linux and Mac, you are always expected to use Unicode, thus you don't need to set anything.
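
The arithmetic behind the `sample.overlap` example above can be checked with a few lines of base R. This is only a sketch of the window positions, not the code that `make.samples` runs internally; the text length of 13,000 words is a hypothetical value chosen so that the samples fit exactly, and the variable names `step`, `starts` and `ends` are introduced here for illustration.

```r
# Values taken from the `sample.overlap` description; text.length is hypothetical.
sample.size    <- 5000
sample.overlap <- 1000
text.length    <- 13000   # assumed length of the text, in words

# Each sample starts (sample.size - sample.overlap) words after the previous one.
step   <- sample.size - sample.overlap
starts <- seq(from = 1, to = text.length - sample.size + 1, by = step)
ends   <- starts + sample.size - 1

# One row per sample: index, first word, last word.
print(data.frame(sample = seq_along(starts), first.word = starts, last.word = ends))
```

With these values the windows begin at words 1, 4,001 and 8,001, matching the description. How words left over at the end of a text are treated is not covered by this sketch; see `make.samples` for that.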

Value

The function returns an object of the class stylo.corpus. It is a list containing as elements the samples (entire texts or sampled subsets) split into words/characters and combined into n-grams (if applicable).
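
To get a feel for the returned structure without waiting for a full corpus to be processed, one can try the function on a tiny hand-made list. The sketch below assumes the stylo package is installed (it is skipped otherwise); the two short texts and the names `toy.corpus`, `text_A`, `text_B` and `parsed` are made up for illustration, and only arguments listed in the Usage section are used.

```r
# Run only if stylo is available; the toy texts below are invented examples.
if (requireNamespace("stylo", quietly = TRUE)) {
  # A raw corpus as a plain list: one element per text.
  toy.corpus <- list(
    text_A = "The quick brown fox jumps over the lazy dog.",
    text_B = "A stitch in time saves nine, as the saying goes."
  )

  # Word bigrams, default language settings, no sampling.
  parsed <- stylo::parse.corpus(toy.corpus, features = "w", ngram.size = 2)

  print(class(parsed))      # class of the returned object
  print(head(parsed[[1]]))  # first few features of the first text
}
```

Changing `features` to `"c"` or raising `ngram.size` in the call above is a quick way to compare the kinds of features the function can produce.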

Author(s)

Maciej Eder

Examples

## Not run: 
library(stylo)
data(novels)
# depending on the size of the corpus, it might take a while:
parse.corpus(novels)

## End(Not run)

computationalstylistics/stylo documentation built on Jan. 4, 2025, 1:56 p.m.
