parse.corpus | R Documentation |
A high-level function that controls a number of other functions responsible
for dealing with a raw corpus stored as a list, including deleting markup,
sampling from texts, converting samples to n-grams, etc. It is built on top
of a number of functions and thus requires a large number of arguments.
The only obligatory argument, however, is an R object containing a raw corpus:
either an object of the class stylo.corpus, or a list of vectors,
their elements being particular texts.
parse.corpus(input.data, markup.type = "plain",
corpus.lang = "English", splitting.rule = NULL,
sample.size = 10000, sampling = "no.sampling",
sample.overlap = 0, number.of.samples = 1,
sampling.with.replacement = FALSE, features = "w",
ngram.size = 1, preserve.case = FALSE,
encoding = "UTF-8")
input.data |
a list (preferably of the class |
markup.type |
choose one of the following values: |
corpus.lang |
an optional argument indicating the language of the texts
analyzed; the values that will affect the function's behavior are:
|
splitting.rule |
if you are not satisfied with the default language
settings (or your input string of characters is not a regular text,
but a sequence of, say, dance movements represented using symbolic signs),
you can indicate your custom splitting regular expression here. This
option will override the above language settings. For further details,
refer to |
sample.size |
desired size of samples, expressed in number of words; default value is 10,000. |
sampling |
one of three values: |
sample.overlap |
if this option is used, a reference text is segmented
into consecutive, equal-sized samples that are allowed to partially
overlap. If one specifies the |
number.of.samples |
optional argument which will be used only if
|
sampling.with.replacement |
optional argument which will be used only
if |
features |
an option for specifying the desired type of features:
|
ngram.size |
an optional argument (integer) specifying the value of n,
or the size of n-grams to be produced. If this argument is missing,
the default value of 1 is used. See |
preserve.case |
whether or not to preserve the original case of characters; by default (FALSE), all characters in the corpus are lowercased. |
encoding |
useful if you use Windows and non-ASCII alphabets: French,
Polish, Hebrew, etc. In such a situation, it is quite convenient to
convert your text files into Unicode and to set this option to
|
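To make the interplay of splitting.rule, sample.size, sample.overlap and ngram.size more concrete, here is a minimal base-R sketch of the three steps (splitting, overlapping sampling, n-gram building). It is an illustration only and does not reproduce the package's internal code; the helper names split_text, make_overlapping_samples and make_word_ngrams are hypothetical, and the sketch assumes that the overlap is expressed as a number of words shared by neighbouring samples.

```r
# Illustrative base-R sketch only; NOT the package's own implementation.
# All helper names below are hypothetical.

# 1. Split a raw string into lowercased tokens using a custom regular
#    expression (here: any run of non-letter characters is a delimiter).
split_text <- function(text, rule = "[^[:alpha:]]+") {
  tokens <- unlist(strsplit(tolower(text), rule))
  tokens[nchar(tokens) > 0]  # drop empty strings left by leading delimiters
}

# 2. Cut a token vector into consecutive, equal-sized samples that may
#    partially overlap. ASSUMPTION: `overlap` is a number of tokens shared
#    by neighbouring samples. A trailing remainder shorter than `size`
#    is discarded.
make_overlapping_samples <- function(tokens, size, overlap = 0) {
  stopifnot(size > 0, overlap >= 0, overlap < size)
  if (length(tokens) < size) return(list())
  step <- size - overlap
  starts <- seq(1, length(tokens) - size + 1, by = step)
  lapply(starts, function(s) tokens[s:(s + size - 1)])
}

# 3. Combine consecutive tokens into word n-grams joined by a space.
make_word_ngrams <- function(tokens, n = 1) {
  if (n <= 1) return(tokens)
  if (length(tokens) < n) return(character(0))
  idx <- seq_len(length(tokens) - n + 1)
  vapply(idx,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

text    <- "The quick brown fox jumps over the lazy dog"
tokens  <- split_text(text)
samples <- make_overlapping_samples(tokens, size = 4, overlap = 2)
bigrams <- lapply(samples, make_word_ngrams, n = 2)
```

With a sample size of 4 and an overlap of 2, the nine-word sentence yields three samples, each starting two words after the previous one; the final word is dropped because it cannot fill a complete sample.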
The function returns an object of the class stylo.corpus. It is a list
containing as elements the samples (entire texts or sampled subsets) split into
words/characters and combined into n-grams (if applicable).
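As a rough illustration of the input and output shapes described above, the following sketch passes a small hand-made list to the function and inspects the result. The toy texts and the object names toy.corpus and parsed are made up for this example, and the code assumes the stylo package is installed (it is skipped otherwise).

```r
# Minimal usage sketch; runs only if the stylo package is available.
if (requireNamespace("stylo", quietly = TRUE)) {
  # hypothetical toy corpus: a named list, one character vector per text
  toy.corpus <- list(
    text_A = "It was the best of times, it was the worst of times.",
    text_B = "Call me Ishmael. Some years ago, never mind how long precisely."
  )

  # word bigrams, with the default sampling = "no.sampling"
  parsed <- stylo::parse.corpus(toy.corpus, features = "w", ngram.size = 2)

  class(parsed)  # class of the returned object
  str(parsed)    # structure of the returned list
}
```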
Maciej Eder
load.corpus.and.parse, delete.markup, txt.to.words, txt.to.words.ext,
txt.to.features, make.samples
## Not run:
data(novels)
# depending on the size of the corpus, it might take a while:
parse.corpus(novels)
## End(Not run)