txt.to.words.ext: Split text into words: extended version



Description

Function for splitting a string of characters into single words, removing punctuation etc., and preserving some language-dependent idiosyncrasies, such as common contractions in English.


Usage

txt.to.words.ext(input.text, corpus.lang = "English", splitting.rule = NULL, 
                 preserve.case = FALSE)



Arguments

input.text: a string of characters, usually a text.


corpus.lang: an optional argument specifying the language of the texts analyzed. Values that will affect the function's output are: English.contr, English.all, Latin.corr (their meaning is explained below), JCK for Japanese, Chinese and Korean, as well as other for a variety of non-Latin scripts, including Cyrillic, Greek, Arabic, Hebrew, Coptic, Georgian etc. The default value is English.


splitting.rule: if you are not satisfied with the default language settings (or your input string of characters is not a regular text, but a sequence of, say, dance movements represented using symbolic signs), you can supply a custom splitting regular expression here. This option overrides the language settings above. For further details, refer to help(txt.to.words).


preserve.case: whether to keep the original case of the input. If FALSE (the default), all characters in the corpus are lowercased.
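To make the splitting.rule argument more concrete, here is a minimal base-R sketch of what a custom splitting regular expression does to a non-textual sequence. The input string and the rule are made up for illustration, and the strsplit() call only mimics the idea; it is not claimed to be the package's internal code. The final, guarded call passes the same rule to txt.to.words.ext if stylo happens to be installed.

```r
# Hypothetical input: symbolic "dance movements" separated by "|" or spaces
moves <- "step|turn|step hop|turn"

# A custom splitting rule (an assumption for this example):
# split on one or more "|" characters or whitespace
rule <- "[|[:space:]]+"

# Base-R illustration of applying such a rule
tokens <- unlist(strsplit(moves, rule))
tokens <- tokens[nchar(tokens) > 0]   # drop empty strings, if any
print(tokens)

# The same rule handed to the documented function (only if stylo is available)
if (requireNamespace("stylo", quietly = TRUE)) {
  print(stylo::txt.to.words.ext(moves, splitting.rule = rule))
}
```

Because a custom rule takes precedence over corpus.lang, a call like this should not depend on the language setting.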


Details

Function for splitting a given input text into single words (chains of characters delimited with spaces or punctuation marks). It is built on top of the function txt.to.words and is designed to manage some language-dependent text features during the tokenization process. In most languages, this is irrelevant. However, it might be important when working with English or Latin texts:

English.contr treats contractions as single, atomic words, i.e. strings such as "don't", "you've" etc. will not be split into two strings.

English.all keeps the contractions (as above), and also prevents the function from splitting compound words (mother-in-law, double-decker, etc.).

Latin.corr: since some editions do not distinguish the letters v/u, this setting provides a consistent conversion to "u" in the whole string.

The option preserve.case lets you specify whether you wish to lowercase all characters in the corpus.
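The v/u normalization described for Latin.corr can be pictured with a short base-R sketch. This is only an illustration of the idea under simple assumptions (lowercase first, then replace every "v" with "u", then split on non-letters); the sample line and variable names are chosen for the example, and the package's actual implementation may differ in detail. The guarded call at the end runs the documented function itself when stylo is installed.

```r
# Sample Latin line (chosen for illustration)
latin <- "Arma virumque cano, Troiae qui primus ab oris"

# Lowercase, then convert every "v" to "u"
normalized <- gsub("v", "u", tolower(latin))

# Split on anything that is not a letter
latin.tokens <- unlist(strsplit(normalized, "[^[:alpha:]]+"))
latin.tokens <- latin.tokens[nchar(latin.tokens) > 0]
print(latin.tokens)

# The documented setting, if stylo is available
if (requireNamespace("stylo", quietly = TRUE)) {
  print(stylo::txt.to.words.ext(latin, corpus.lang = "Latin.corr"))
}
```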


Author(s)

Maciej Eder, Mike Kestemont

See Also

txt.to.words, txt.to.features, make.ngrams


Examples

txt.to.words.ext("Nel mezzo del cammin di nostra vita / mi ritrovai per 
    una selva oscura, che la diritta via era smarrita.")

# to see the difference between particular options for English,
# consider the following sentence from Joseph Conrad's "Nostromo":
sample.text = "That's how your money-making is justified here."
txt.to.words.ext(sample.text, corpus.lang = "English")
txt.to.words.ext(sample.text, corpus.lang = "English.contr")
txt.to.words.ext(sample.text, corpus.lang = "English.all")

Example output

### stylo version: 0.7.3 ###

If you plan to cite this software (please do!), use the following reference:
    Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
    a package for computational text analysis. R Journal 8(1): 107-121.

To get full BibTeX entry, type: citation("stylo")
 [1] "nel"      "mezzo"    "del"      "cammin"   "di"       "nostra"  
 [7] "vita"     "mi"       "ritrovai" "per"      "una"      "selva"   
[13] "oscura"   "che"      "la"       "diritta"  "via"      "era"     
[19] "smarrita"
[1] "that"      "s"         "how"       "your"      "money"     "making"   
[7] "is"        "justified" "here"     
[1] "that^s"    "how"       "your"      "money"     "making"    "is"       
[7] "justified" "here"     
[1] "that^s"       "how"          "your"         "money-making" "is"          
[6] "justified"    "here"        
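Note that in the output above the kept contractions appear with a caret ("that^s") in place of the apostrophe. One way to obtain tokens of this shape, shown here purely as a base-R sketch and not as the package's own code, is to swap apostrophes that sit between two letters for a placeholder character before splitting on everything else.

```r
sample.text <- "That's how your money-making is justified here."

# Lowercase, then protect apostrophes between letters with "^"
protected <- gsub("([[:alpha:]])'([[:alpha:]])", "\\1^\\2", tolower(sample.text))

# Split on any run of characters that are neither letters nor the placeholder
contr.tokens <- unlist(strsplit(protected, "[^[:alpha:]^]+"))
contr.tokens <- contr.tokens[nchar(contr.tokens) > 0]
print(contr.tokens)
```

With this sketch the hyphen in "money-making" is still treated as a delimiter, which mirrors the English.contr result shown above rather than English.all.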

stylo documentation built on Dec. 6, 2020, 5:06 p.m.