koRpus started in February 2011, primarily with the goal of examining how similar different texts are. Since then,
it has quickly grown into an R package which implements dozens of formulae for readability and lexical diversity, as well as wrappers for language corpus databases and
a tokenizer/POS tagger.
At the very beginning of almost every analysis with this package, the text you want to examine has to be sliced into its components, and the components
must be identified and named. That is, the text has to be split into its semantic parts (tokens): words, numbers, and punctuation marks. After that, each token will
be tagged regarding its part-of-speech (POS). For both of these steps,
koRpus can use the third party software TreeTagger [@schmid_TT_1994].
Especially for Windows users, installation of TreeTagger might be a little more complex -- e.g., it depends on Perl^[For a free implementation try https://strawberryperl.com], and you need a tool to extract .tar.gz archives.^[Like https://7-zip.org] Detailed installation instructions are beyond the scope of this vignette.
If you don't want to use TreeTagger,
koRpus provides a simple tokenizer of its own called
tokenize(). While the tokenizing itself works quite well,
tokenize() is not as elaborate as TreeTagger when it comes to POS tagging, as it can merely tell words from numbers, punctuation, and abbreviations. Although this is sufficient for most readability formulae, you can't evaluate word classes in detail. If that's what you want, a TreeTagger installation is needed.
Some of the readability formulae depend on special word lists [like @bormuth_cloze_1968; @dale_formula_1948; @spache_new_1953]. For copyright reasons these lists are not included as of now. This means that as long as you don't have copies of these lists, you can't calculate these particular measures, but of course you can calculate all others. The expected format for such a list is a simple text file with one word per line, preferably in UTF-8 encoding.
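To give you an idea, a hypothetical call could look like the following sketch. Here, tagged.text is the tagged text object created later in this vignette, and the two file names are placeholders for your own copies of the respective lists:

# a sketch only: the word list files are hypothetical placeholders
readbl.wl <- readability(
  tagged.text,
  index=c("Dale.Chall", "Spache"),
  word.lists=list(
    Dale.Chall="dale_chall_word_list.txt",
    Spache="spache_word_list.txt"
  )
)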
The frequency analysis functions in this package can look up how often each word in a text is used in its language, given that a corpus database is provided. Databases in Celex format are supported, as is the Leipzig Corpora Collection [@quasthoff_LCC_2006] file format. To use such a database with this package, you simply need to download one of its .zip/.tar files.
If you want to estimate the language of a text, reference texts in known languages are needed. In
koRpus, the Universal Declaration of Human Rights with its more than 350 translations is used.
From now on it is assumed that the above requirements are correctly installed and working. If an optional component is used it will be noted. Further, we'll need a sample text to analyze. We'll use the section on defense mechanisms of Phasmatodea from Wikipedia for this purpose.
In order to do some analysis, you need to load a language support package for each language you would like to work with. For instance, in this vignette we're analyzing an English sample text. Language support packages for
koRpus are named koRpus.lang.**, where ** is a two-character ID for the respective language, like
en for English.^[Unfortunately, these language packages did not get the approval of the CRAN maintainers and are officially hosted at https://undocumeantit.github.io/repos/l10n/. For your convenience the function
install.koRpus.lang() can be used to easily install them anyway.]
# install the language support package
install.koRpus.lang("en")
# load the package
library(koRpus.lang.en)
When koRpus itself is loaded, it will list all language packages found on your system. To get a list of all installable packages, call available.koRpus.lang().
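For example:

# list all language support packages that can be installed
available.koRpus.lang()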
As explained earlier, splitting the text up into its basic components can be done by TreeTagger. To achieve this and have the results available in R, the function
treetag() is used.
At the very least you must provide it with the text, of course, and name the language it is written in. In addition to that you must specify where you installed TreeTagger. If you look at the package documentation you'll see that
treetag() understands a number of options to configure TreeTagger, but in most cases using one of the built-in presets should suffice. TreeTagger comes with batch/shell scripts for installed languages, and the presets of
treetag() are basically just R implementations of these scripts.
tagged.text <- treetag(
  "sample_text.txt",
  treetagger="manual",
  lang="en",
  TT.options=list(
    path="~/bin/treetagger/",
    preset="en"
  ),
  doc_id="sample"
)
The first argument (file name) and
lang should explain themselves. The
treetagger option can either take the full path to one of the original TreeTagger scripts mentioned above, or the keyword "manual", which causes the options defined by
TT.options to be evaluated. To use a preset, just put the
path to your local TreeTagger installation and a valid
preset name here.^[Presets are defined in the language support packages, usually named like their respective two-character language identifier. Refer to their documentation.] The document ID is optional and can be omitted.
The resulting S4 object is of a class called
kRp.text. If you call the object directly you get a shortened view of its main content:
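# printing the object invokes its show() method
tagged.text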
Once you've come this far, i.e., having a valid object of class
kRp.text, all following analyses should run smoothly.
Should treetag() fail, you should first re-run it with the extra option
debug=TRUE. Most interestingly, that will print the contents of
sys.tt.call, which is the TreeTagger command given to your operating system for execution. With that it should be possible to examine where exactly the erroneous behavior starts.
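For example, re-using the call from above with debugging enabled:

tagged.text <- treetag(
  "sample_text.txt",
  treetagger="manual",
  lang="en",
  TT.options=list(
    path="~/bin/treetagger/",
    preset="en"
  ),
  doc_id="sample",
  debug=TRUE
)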
If you don't need detailed word class analysis, you should be fine using
koRpus' own function
tokenize(). As you can see,
tokenize() comes to the same results regarding the tokens, but is rather limited in recognizing word classes:
(tokenized.text <- tokenize(
  "sample_text.txt",
  lang="en",
  doc_id="sample"
))
For this class of objects,
koRpus provides some comfortable methods to extract the portions you're interested in. For example, the main results are to be found in the slot
tokens. In addition to TreeTagger's original output (token, tag and lemma)
treetag() also automatically counts letters and assigns tokens to global word classes. To get these results as a data.frame, use the getter method taggedText().
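For instance, to look at the first rows only (head() simply shortens the output):

head(taggedText(tagged.text))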
In case you want to access a subset of the data in the resulting object, e.g., only the column with number of letters or the first five rows of
tokens, you'll be happy to know there are special
[[ methods for these kinds of objects:
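A minimal sketch, assuming these methods behave like their data.frame counterparts on the tokens slot:

# only the column with the number of letters
head(tagged.text[["lttr"]])
# the first five rows of the tokens data frame
tagged.text[1:5, ]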
These [[ methods are basically useful shortcut replacements for getter methods like taggedText().
All results of both treetag() and
tokenize() also provide various descriptive statistics calculated from the analyzed text. You can get them by calling
describe() on the object:
(txt_desc <- describe(tagged.text))
txt_desc_lttr <- txt_desc[["lttr.distrib"]]
Amongst others, you will find several indices describing the number of characters:
all.chars: Counts each character, including all space characters
normalized.space: Like all.chars, but clusters of space characters (incl. line breaks) are counted only as one character
chars.no.space: Counts all characters except any space characters
letters.only: Counts only letters, excluding(!) digits (which are counted separately as digits)
You'll also find the number of
sentences, as well as average word and sentence lengths, and tables describing how the word length is distributed throughout the text (
lttr.distrib). For instance, txt_desc_lttr["num",3] gives the number of words with exactly three letters,
txt_desc_lttr["cum.sum",3] the number with three or less, and
txt_desc_lttr["cum.inv",3] the number with more than three. The last three rows of the table show the respective percentages.
To analyze the lexical diversity of our text we can now simply hand over the tagged text object to the
lex.div() method. You can call it on the object with no further arguments (like
lex.div(tagged.text)), but in this example we'll limit the analysis to a few measures:^[For information on the measures shown see @tweedie_how_1998, @mccarthy_vocd_2007, @mccarthy_mtld_2010.]
lex.div(
  tagged.text,
  measure=c("TTR", "MSTTR", "MATTR", "HD-D", "MTLD", "MTLD-MA"),
  char=c("TTR", "MATTR", "HD-D", "MTLD", "MTLD-MA")
)
Let's look at some particular parts: at first we are informed of the total number of types, tokens, and lemmas (if available). After that the actual results are printed, using the package's
show() method for this particular kind of object. As you can see, it prints the actual value of each measure before a summary of the characteristics.^[Characteristics can be looked at to examine each measure's dependency on text length. They are calculated by computing each measure repeatedly, beginning with only the first token, then adding the next, progressing until the full text was analyzed.]
Some measures return more information than just their actual index value. For instance, when the Mean Segmental Type-Token Ratio is calculated, you'll be informed how much of your text was dropped and hence not examined. A small feature tool of this package,
segment.optimizer(), automatically recommends a different segment size if that would decrease the number of lost tokens.
By default, lex.div() calculates every measure of lexical diversity that is implemented. Of course this is fully configurable, e.g., to completely skip the calculation of characteristics just add the option
char=NULL. If you're only interested in one particular measure, it might be more convenient to call the according wrapper function instead of
lex.div(). For example, to calculate only the measures proposed by @maas_ueber_1972:
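# a minimal sketch using the maas() wrapper provided by koRpus
maas.res <- maas(tagged.text)
maas.res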
All wrapper functions have characteristics turned off by default. The following example demonstrates how to calculate and plot the classic type-token ratio with characteristics. The resulting plot shows the typical degradation of TTR values with increasing text length:
ttr.res <- TTR(tagged.text, char=TRUE)
plot(ttr.res@TTR.char, type="l", main="TTR degradation over text length")
Since this package is intended for research, it is possible to directly influence all relevant values of each measure and examine the effects. For example, as mentioned before,
segment.optimizer() recommended a change of segment size for MSTTR to drop fewer words, which is easily done:
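# hypothetical example: replace 95 with whatever segment size
# segment.optimizer() suggested for your text
MSTTR(tagged.text, segment=95)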
Please refer to the documentation for more detailed information on the available measures and their references.
This package has rudimentary support to import corpus databases.^[The package also has a function called
read.corp.custom() which can be used to process language corpora yourself, and store the results in an object of class
kRp.corp.freq, which is the class returned by
read.corp.celex() as well. That is, if you can't get any already analyzed corpus database but have a huge language corpus at hand, you can create your own frequency database. But be warned that depending on corpus size and your hardware, this might take ages. On the other hand,
read.corp.custom() will provide inverse document frequency (idf) values for all types, which is necessary to compute tf-idf with
freq.analysis().] That is, it can read frequency data for words into an R object and use this object for further analysis. Next to the Celex database format (read.corp.celex()), it can read the LCC flatfile format^[Actually, it understands two different LCC formats, both the older .zip and the newer .tar archive format.] (read.corp.LCC()). The latter might be of special interest, because the needed database archives can be freely downloaded. Once you've downloaded one of these archives, it can be comfortably imported:
LCC.en <- read.corp.LCC("~/downloads/corpora/eng_news_2010_1M-text.tar")
read.corp.LCC() will automatically extract the files it needs from the archive. Alternatively, you can specify the path to the unpacked archive as well. To work with the imported data directly, the tool
query() was added to the package. It helps you to comfortably look up certain words, or ranges of interesting values:
query(LCC.en, "word", "what")
##     num word  freq         pct pmio    log10 rank.avg rank.min rank.rel.avg
## 160 210 what 16396 0.000780145  780 2.892095   260759   260759     99.95362
##     rank.rel.min
## 160     99.95362
query(LCC.en, "pmio", c(780, 790))
##     num  word  freq          pct pmio    log10 rank.avg rank.min rank.rel.avg
## 156 206  many 16588 0.0007892806  789 2.897077   260763   260763     99.95515
## 157 207   per 16492 0.0007847128  784 2.894316   260762   260762     99.95477
## 158 208  down 16468 0.0007835708  783 2.893762   260761   260761     99.95439
## 159 209 since 16431 0.0007818103  781 2.892651   260760   260760     99.95400
## 160 210  what 16396 0.0007801450  780 2.892095   260759   260759     99.95362
##     rank.rel.min
## 156     99.95515
## 157     99.95477
## 158     99.95439
## 159     99.95400
## 160     99.95362
We can now conduct a full frequency analysis of our text:
freq.analysis.res <- freq.analysis(tagged.text, corp.freq=LCC.en)
The resulting object holds a lot of information, even if no corpus data was used (i.e.,
corp.freq=NULL). To begin with, it contains the two slots desc and
lang, which are copied from the analyzed tagged text object. In this way analysis results can always be converted back into
kRp.text objects.^[This can easily be done by calling
as(freq.analysis.res, "kRp.text").] However, if corpus data was provided, the tagging results gained three new columns:
##        token tag     lemma lttr [...] pmio rank.avg rank.min [...]
## 30        an  DT        an    2       3817 99.98735 99.98735
## 31    attack  NN    attack    6        163 99.70370 99.70370
## 32       has VBZ      have    3       4318 99.98888 99.98888
## 33      been VBN        be    4       2488 99.98313 99.98313
## 34 initiated VBN  initiate    9         11 97.32617 97.32137
## 35         (   (         (    1        854 99.96013 99.96013
## 36 secondary  JJ secondary    9         21 98.23846 98.23674
## 37   defense  NN   defense    7        210 99.77499 99.77499
## 38         )   )         )    1        856 99.96052 99.96052
## [...]
Perhaps most informatively,
pmio shows how often the respective token appears in a million tokens, according to the corpus data. Adding to this, the previously introduced slot
desc now contains some more descriptive statistics on our text, and if we provided a corpus database, the slot
freq.analysis lists summaries of various frequency information that was calculated.
If the corpus object also provided inverse document frequency data (i.e., values in a column called idf),
freq.analysis() will automatically compute tf-idf statistics and put them in a column called tfidf.
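If you want to inspect these values, a sketch (assuming the column name tfidf, and using the conversion described in the footnote above):

# convert back to kRp.text and look at the tf-idf values of the first tokens
tokens_df <- taggedText(as(freq.analysis.res, "kRp.text"))
head(tokens_df[["tfidf"]])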
Amongst others, the descriptives now also give easy access to character vectors with all words (
$all.words) and all lemmata (
$all.lemmata), all tokens sorted^[This sorting depends on proper POS-tagging, so this will only contain useful data if you used
treetag() instead of
tokenize().] into word classes (e.g., all verbs in
$classes$verb), or the number of words in each sentence:
##  [1] 34 10 37 16 44 31 14 31 34 23 17 43 40 47 22 19 65 29
As a practical example, the list
$classes has proven to be very helpful to debug the results of TreeTagger, which is remarkably accurate, but of course not free from making a mistake now and then. By looking through
$classes, where all tokens are grouped according to the global word class TreeTagger attributed to them, at least obvious errors (like names mistakenly taken for a pronoun) are easily found:^[And can then be corrected by using the function correct.tag().]
## $conjunction
##  [1] "both" "and" "and" "and" "and" "or" "or" "and" "and" "or"
## [11] "and" "or" "and" "or" "and" "and" "and" "and"
##
## $number
##  [1] "20" "one"
##
## $determiner
##  [1] "an" "the" "an" "The" "the" "the" "some"
##  [8] "that" "Some" "the" "a" "a" "a" "the"
## [15] "that" "the" "the" "Another" "which" "the" "a"
## [22] "that" "a" "The" "a" "the" "that" "a"
## [...]
The package comes with implementations of several readability formulae. Some of them depend on the number of syllables in the text.^[Whether this is the case can be looked up in the documentation.] To achieve this, the method
hyphen() takes objects of class
kRp.text and applies a hyphenation algorithm [@liang_word_1983] to each word. This algorithm was originally developed for automatic word hyphenation in $\LaTeX$, and is gracefully misused here to fulfill a slightly different service.^[The
hyphen() method was originally implemented as part of the
koRpus package, but was later split off into its own package called sylly.]
(hyph.txt.en <- hyphen(tagged.text))
This separate hyphenation step can actually be skipped, as
readability() will do it automatically if needed. But similar to TreeTagger,
hyphen() will most likely not produce perfect results. As a rule of thumb, if in doubt it seems to behave rather conservatively; that is, it underestimates the real number of syllables in a text. This, however, would of course affect the results of several readability formulae.
So, the more accurate the end results should be, the less you should rely on the automatic hyphenation alone. But it sure is a good starting point, for there is a method called
correct.hyph() to help you clean these results of errors later on. The most straightforward way to do this is to open the data frame returned by
hyphenText(hyph.txt.en), which has two columns,
word (the hyphenated words) and
syll (the number of syllables), in a spreadsheet editor:^[For example, this can be comfortably done with RKWard: https://rkward.kde.org]
You can then manually correct wrong hyphenations by removing or inserting "-" as hyphenation indicators, and call
correct.hyph() without further arguments, which will cause it to recount all syllables:
hyph.txt.en <- correct.hyph(hyph.txt.en)
But the method can also be used to alter entries directly, which might be simpler and cleaner than manual changes:
hyph.txt.en <- correct.hyph(hyph.txt.en, word="mech-a-nisms", hyphen="mech-a-ni-sms")
## Changed
##
##   syll         word
## 2    3 mech-a-nisms
## 6    3 mech-a-nisms
##
## into
##
##   syll          word
## 2    4 mech-a-ni-sms
## 6    4 mech-a-ni-sms
The hyphenated text object can now be given to
readability(), to calculate the measures of interest:^[Please note that as of version 0.04-18, the correctness of some of these calculations has not been extensively validated yet. The package was released nonetheless, also to find outstanding bugs in the implemented measures. Any information on the validity of its results is very welcome!]
readbl.txt <- readability(tagged.text, hyphen=hyph.txt.en)
Like lex.div(), by default
readability() calculates almost^[Measures which rely on word lists will be skipped if no list is provided.] all available measures:
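# print the full results
readbl.txt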
To get a more condensed overview of the results, try the
summary() method. It also supports an additional flat format, which basically turns the table into a named numeric vector,
using the raw values (because all indices have raw values, but only a few have more than that). This format comes in very handy when you
want to use the output in further calculations:
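# condensed overview of the results
summary(readbl.txt)
# flat format: a named numeric vector of the raw values
summary(readbl.txt, flat=TRUE)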
If you're interested in a particular formula, again a wrapper function might be more convenient:
flesch.res <- flesch(tagged.text, hyphen=hyph.txt.en)
lix.res <- LIX(tagged.text) # LIX doesn't need syllable count
lix.res
It is possible to calculate the readability measures from the relevant key values directly, rather than analyze an actual text, by using
readability.num() instead of
readability(). If you need to re-analyze a particular text, this can be considerably faster. To make this convenient, all objects returned by
readability() can directly be fed to
readability.num(), since all relevant data is present in the desc slot.
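A minimal sketch of such a re-analysis, relying on the behavior just described:

# recalculate the measures from the key values of a previous analysis
readbl.num <- readability.num(readbl.txt)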
Another feature of this package is the detection of the language a text was (most probably) written in. This is done by gzipping reference texts in known languages, gzipping them again with the addition of a small sample of the text in the unknown language, and determining the case where the additional sample causes the smallest increase in file size [as described in @benedetto_gzip_2002]. By default, the compressed objects will be created in memory only.
To use the function
guess.lang(), you first need to download the reference material. In this implementation, the Universal Declaration of Human Rights in unicode formatting is used, because the document holds the world record of being the text translated into the most languages, and is publicly available. Please get the zipped archive with all translations in .txt format. You can, but don't have to, unzip the archive. The text whose language should be guessed must also be in a unicode .txt file:
guessed <- guess.lang(
  file.path(find.package("koRpus"), "tests", "testthat", "sample_text.txt"),
  udhr.path="~/downloads/udhr_txt.zip"
)
summary(guessed)
## Estimated language: English
##   Identifier: eng
##       Region: Europe
##
## 435 different languages were checked.
##
## Distribution of compression differences:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   136.0   168.0   176.0   190.7   184.0   280.0
##
##   SD: 38.21
##
## Top 5 guesses:
##                       name iso639-3 bcp47 region diff  diff.std
## 1                  English      eng    en Europe  136 -1.430827
## 2                    Scots      sco   sco Europe  136 -1.430827
## 3         Pidgin, Nigerian      pcm   pcm Africa  144 -1.221473
## 4 Catalan-Valencian-Balear      cat    ca Europe  152 -1.012119
## 5                   French      fra    fr Europe  152 -1.012119
##
## Last 5 guesses:
##                     name iso639-3   bcp47 region diff diff.std
## 431              Burmese      mya      my   Asia  280 2.337547
## 432                 Shan      shn     shn   Asia  280 2.337547
## 433                Tamil      tam      ta   Asia  280 2.337547
## 434 Vietnamese (Han nom)      vie vi-Hani   Asia  280 2.337547
## 435         Chinese, Yue      yue     yue   Asia  280 2.337547
The language support of this package has a modular design. There are some pre-built language packages in the
l10n repository, and with a little effort you should be able to add new languages yourself. You need the package sources for this; then, basically, you will have to add a new file to them and rebuild/reinstall the package. More details on this topic can be found in
inst/README.languages. Once you've got a new language to work with
koRpus, I'd be happy to include your module in the official distribution.
Despite its name, the scope of
koRpus is single texts. If you would like to run analyses on a full corpus of texts, have a look at the plugin package tm.plugin.koRpus.
The APA style used in this vignette was kindly provided by the CSL project, licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license.