txt.to.words.ext: Split text into words: extended version


txt.to.words.ext    R Documentation

Split text into words: extended version

Description

Function for splitting a string of characters into single words, removing punctuation etc., and preserving some language-dependent idiosyncrasies, such as common contractions in English.

Usage

txt.to.words.ext(input.text, corpus.lang = "English", splitting.rule = NULL, 
                 preserve.case = FALSE)

Arguments

input.text

a string of characters, usually a text.

corpus.lang

an optional argument specifying the language of the texts analyzed. Values that will affect the function's output are: English.contr, English.all, Latin.corr (their meaning is explained below), JCK for Japanese, Chinese and Korean, as well as other for a variety of non-Latin scripts, including Cyrillic, Greek, Arabic, Hebrew, Coptic, Georgian etc. The default value is English.

splitting.rule

if you are not satisfied with the default language settings (or your input string of characters is not a regular text, but a sequence of, say, dance movements represented using symbolic signs), you can supply a custom regular expression for splitting here. This option overrides the above language settings. For further details, refer to help(txt.to.words).

preserve.case

Whether or not to preserve the original case of the characters in the corpus. If FALSE (the default), all characters are lowercased.
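
As an illustration of how the last two arguments interact, here is a minimal base-R sketch. It is not stylo's implementation: the helper split_by_rule is hypothetical, and it simply assumes that a custom rule is a regular expression handed to strsplit, with lowercasing applied first unless case is preserved.

```r
# Hypothetical helper (not part of stylo): split a string on a custom
# regular expression, optionally lowercasing it first.
split_by_rule = function(input.text, splitting.rule, preserve.case = FALSE) {
  if (!preserve.case) {
    input.text = tolower(input.text)
  }
  tokens = unlist(strsplit(input.text, splitting.rule))
  # drop empty strings produced by leading or repeated delimiters
  tokens[nchar(tokens) > 0]
}

# a sequence of symbolic "movements" delimited by semicolons
moves = "Step-L;Step-R;Turn;Step-L"
split_by_rule(moves, "[;]+")
split_by_rule(moves, "[;]+", preserve.case = TRUE)
```

With a rule of this kind, the hyphens inside each symbol survive because only the semicolon is treated as a delimiter.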

Details

Function for splitting a given input text into single words (chains of characters delimited by spaces or punctuation marks). It is built on top of the function txt.to.words and is designed to handle some language-dependent text features during the tokenization process. In most languages, this is irrelevant. However, it might be important when working with English or Latin texts: English.contr treats contractions as single, atomic words, i.e. strings such as "don't", "you've" etc. will not be split into two strings; English.all keeps the contractions (as above), and also prevents the function from splitting compound words (mother-in-law, double-decker, etc.); Latin.corr: since some editions do not distinguish the letters v/u, this setting provides a consistent conversion to "u" in the whole string. The option preserve.case lets you specify whether you wish to lowercase all characters in the corpus.
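
The effect of the three language-specific settings can be approximated in base R. The sketch below is illustrative only and makes no claim about stylo's internal regular expressions: split_sketch and its keep argument are hypothetical, and the character classes are assumptions chosen to mimic the behaviour described above on plain ASCII input.

```r
# Hypothetical helper (not stylo code): lowercase the text, then split on
# every run of characters that are neither letters nor listed in `keep`.
split_sketch = function(input.text, keep = "") {
  rule = paste0("[^[:alpha:]", keep, "]+")
  tokens = unlist(strsplit(tolower(input.text), rule))
  # drop empty strings produced by leading delimiters
  tokens[nchar(tokens) > 0]
}

sample.text = "That's how your money-making is justified here."

# roughly like "English": apostrophes and hyphens both act as delimiters
split_sketch(sample.text)
# roughly like "English.contr": apostrophes are kept inside words
split_sketch(sample.text, keep = "'")
# roughly like "English.all": apostrophes and hyphens are both kept
split_sketch(sample.text, keep = "'-")

# roughly like "Latin.corr": convert every v to u before splitting
split_sketch(gsub("v", "u", tolower("Veni, vidi, vici")))
```

Comparing the three English calls shows where the tokens differ: "that's" is split into "that" and "s" in the first call only, and "money-making" stays whole in the third call only.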

Author(s)

Maciej Eder, Mike Kestemont

See Also

txt.to.words, txt.to.features, make.ngrams

Examples

txt.to.words.ext("Nel mezzo del cammin di nostra vita / mi ritrovai per 
    una selva oscura, che la diritta via era smarrita.")

# to see the difference between particular options for English,
# consider the following sentence from Joseph Conrad's "Nostromo":
sample.text = "That's how your money-making is justified here."
txt.to.words.ext(sample.text, corpus.lang = "English")
txt.to.words.ext(sample.text, corpus.lang = "English.contr")
txt.to.words.ext(sample.text, corpus.lang = "English.all")

stylo documentation built on May 29, 2024, 1:37 a.m.