Using the koRpus Package for Text Analysis


What is koRpus?

Work on koRpus started in February 2011, primarily with the goal of examining how similar different texts are. Since then, it has grown into an R package that implements dozens of formulae for readability and lexical diversity, as well as wrappers for language corpus databases and a tokenizer/POS tagger.



At the very beginning of almost every analysis with this package, the text you want to examine has to be sliced into its components, and the components must be identified and named. That is, it has to be split into its semantic parts (tokens): words, numbers, and punctuation marks. After that, each token is tagged with its part of speech (POS). For both of these steps, koRpus can use the third-party software TreeTagger [@schmid_TT_1994].

Especially for Windows users, installation of TreeTagger might be a little more complex -- e.g., it depends on Perl^[For a free implementation try], and you need a tool to extract .tar.gz archives.^[Like] Detailed installation instructions are beyond the scope of this vignette.

If you don't want to use TreeTagger, koRpus provides a simple tokenizer of its own called tokenize(). While the tokenizing itself works quite well, tokenize() is not as elaborate as TreeTagger when it comes to POS tagging, as it can merely tell words from numbers, punctuation and abbreviations. Although this is sufficient for most readability formulae, you can't evaluate word classes in detail. If that's what you want, a TreeTagger installation is needed.
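To illustrate what "telling words from numbers and punctuation" means, here is a minimal base R sketch. It is not how tokenize() is implemented; the helper name rough_tokens and the regular expressions are assumptions made only for this illustration.

```r
# Hypothetical helper, not part of koRpus: a very rough tokenizer
rough_tokens <- function(txt) {
  # extract runs of letters/digits, or single punctuation characters
  tokens <- regmatches(txt, gregexpr("[[:alnum:]]+|[[:punct:]]", txt))[[1]]
  # classify each token by its character content only
  wclass <- ifelse(
    grepl("^[[:digit:]]+$", tokens), "number",
    ifelse(grepl("^[[:punct:]]$", tokens), "punctuation", "word")
  )
  data.frame(token = tokens, wclass = wclass, stringsAsFactors = FALSE)
}

rough <- rough_tokens("Stick insects have 6 legs, and many species can fly.")
rough
```

A classification like this cannot distinguish nouns from verbs, which is why detailed word class analysis needs a real POS tagger.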

Word lists

Some of the readability formulae depend on special word lists [like @bormuth_cloze_1968; @dale_formula_1948; @spache_new_1953]. For copyright reasons these lists are not included as of now. This means that as long as you don't have copies of these lists, you can't calculate these particular measures, although all others remain available. To use a list with this package, it is expected to be a simple text file with one word per line, preferably in UTF-8 encoding.
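As a small sketch of the expected file format, the following base R code writes and re-reads such a list. The file name and the three words are made up for illustration and are not taken from any of the cited word lists.

```r
# Hypothetical file name and words, only to show the expected format
word_list_file <- file.path(tempdir(), "my_word_list.txt")
con <- file(word_list_file, open = "w", encoding = "UTF-8")
writeLines(c("about", "above", "across"), con)
close(con)

# read it back: one word per line
word_list <- readLines(word_list_file, encoding = "UTF-8")
word_list
```

How such a file is handed to readability() is described in that function's documentation.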

Language corpora

The frequency analysis functions in this package can look up how often each word in a text is used in its language, given that a corpus database is provided. Databases in Celex format are supported, as is the Leipzig Corpora Collection [@quasthoff_LCC_2006] file format. To use such a database with this package, you simply need to download one of the .zip/.tar files.

Translated Human Rights Declaration

If you want to estimate the language of a text, reference texts in known languages are needed. In koRpus, the Universal Declaration of Human Rights with its more than 350 translations is used.

A sample session

From now on it is assumed that the above requirements are correctly installed and working. If an optional component is used it will be noted. Further, we'll need a sample text to analyze. We'll use the section on defense mechanisms of Phasmatodea from Wikipedia for this purpose.

Loading a language package

In order to do some analysis, you need to load a language support package for each language you would like to work with. For instance, in this vignette we're analyzing an English sample text. Language support packages for koRpus are named koRpus.lang.**, where ** is a two-character ID for the respective language, like en for English.^[Unfortunately, these language packages did not get the approval of the CRAN maintainers and are officially hosted outside of CRAN. For your convenience, the function install.koRpus.lang() can be used to easily install them anyway.]

# install the language support package
install.koRpus.lang("en")
# load the package
library(koRpus.lang.en)

When koRpus itself is loaded, it lists all language packages found on your system. To get a list of all installable packages, call available.koRpus.lang().

Tokenizing and POS tagging

As explained earlier, splitting the text up into its basic components can be done by TreeTagger. To achieve this and have the results available in R, the function treetag() is used.


At the very least you must provide it with the text, of course, and name the language it is written in. In addition to that you must specify where you installed TreeTagger. If you look at the package documentation you'll see that treetag() understands a number of options to configure TreeTagger, but in most cases using one of the built-in presets should suffice. TreeTagger comes with batch/shell scripts for installed languages, and the presets of treetag() are basically just R implementations of these scripts.

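# "sample_text.txt" and the TreeTagger path are placeholders, adjust them to your setup
tagged.text <- treetag("sample_text.txt", treetagger="manual", lang="en",
  TT.options=list(path="~/bin/treetagger/", preset="en"), doc_id="sample")
# alternatively, load a previously tagged object: tagged.text <- dget("sample_text_treetagged_dput.txt")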

The first argument (file name) and lang should explain themselves. The treetagger option can either take the full path to one of the original TreeTagger scripts mentioned above, or the keyword "manual", which will cause the interpretation of what is defined by TT.options. To use a preset, just put the path to your local TreeTagger installation and a valid preset name here.^[Presets are defined in the language support packages, usually named like their respective two-character language identifier. Refer to their documentation.] The document ID is optional and can be omitted.

The resulting S4 object is of a class called kRp.text. If you call the object directly you get a shortened view of its main content:


Once you've come this far, i.e., having a valid object of class kRp.text, all following analyses should run smoothly.


If treetag() should fail, you should first re-run it with the extra option debug=TRUE. Most interestingly, that will print the TreeTagger command given to your operating system for execution. With that it should be possible to examine where exactly the erroneous behavior starts.

Alternative: tokenize()

If you don't need detailed word class analysis, you should be fine using koRpus' own function tokenize(). As you can see, tokenize() comes to the same results regarding the tokens, but is rather limited in recognizing word classes:

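# "sample_text.txt" is a placeholder for the path to your own text file
(tokenized.text <- tokenize("sample_text.txt", lang="en", doc_id="sample"))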

Accessing data from koRpus objects

For this class of objects, koRpus provides some convenient methods to extract the portions you're interested in. For example, the main results are to be found in the slot tokens. In addition to TreeTagger's original output (token, tag and lemma), treetag() also automatically counts letters and assigns tokens to global word classes. To get these results as a data.frame, use the getter method taggedText():


In case you want to access a subset of the data in the resulting object, e.g., only the column with the number of letters or the first five rows of tokens, you'll be happy to know there are special [ and [[ methods for these kinds of objects:

head(tagged.text[["lttr"]], n=50)

The [ and [[ methods are basically useful shortcuts for taggedText().

Descriptive statistics

All results of both treetag() and tokenize() also provide various descriptive statistics calculated from the analyzed text. You can get them by calling describe() on the object:

(txt_desc <- describe(tagged.text))
txt_desc_lttr <- txt_desc[["lttr.distrib"]]

Amongst others, you will find several indices describing the number of characters:

You'll also find the number of words and sentences, as well as average word and sentence lengths, and tables describing how the word length is distributed throughout the text (lttr.distrib). For instance, txt_desc_lttr["num",3] holds the number of words with three letters, txt_desc_lttr["cum.sum",3] the number with three or fewer, and txt_desc_lttr["cum.inv",3] the number with more than three. The last three lines show the corresponding percentages.
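The relation between the rows num, cum.sum and cum.inv can be sketched in base R as follows. The word lengths are made up, and this is only an illustration of what the three rows mean, not the code describe() uses internally.

```r
# Toy word lengths (made up for illustration, not taken from the sample text)
word_lengths <- c(3, 5, 2, 3, 7, 4, 3, 1, 6, 2)

lens <- seq_len(max(word_lengths))
num <- tabulate(word_lengths, nbins = max(word_lengths))  # words per length
cum.sum <- cumsum(num)                                     # words up to this length
cum.inv <- length(word_lengths) - cum.sum                  # words longer than this

toy_distrib <- rbind(num = num, cum.sum = cum.sum, cum.inv = cum.inv)
colnames(toy_distrib) <- lens
toy_distrib
```

In this toy table, column 3 says that three words have exactly three letters, six have three or fewer, and four have more than three.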

Lexical diversity (type token ratios)

To analyze the lexical diversity of our text we can now simply hand over the tagged text object to the lex.div() method. You can call it on the object with no further arguments (like lex.div(tagged.text)), but in this example we'll limit the analysis to a few measures:^[For information on the measures shown see @tweedie_how_1998, @mccarthy_vocd_2007, @mccarthy_mtld_2010.]

lex.div(
  tagged.text,
  measure=c("TTR", "MSTTR", "MATTR", "HD-D", "MTLD", "MTLD-MA"),
  char=c("TTR", "MATTR", "HD-D", "MTLD", "MTLD-MA")
)

Let's look at some particular parts: At first we are informed of the total number of types, tokens and lemmas (if available). After that the actual results are being printed, using the package's show() method for this particular kind of object. As you can see, it prints the actual value of each measure before a summary of the characteristics.^[Characteristics can be looked at to examine each measure's dependency on text length. They are calculated by computing each measure repeatedly, beginning with only the first token, then adding the next, progressing until the full text was analyzed.]
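The idea behind characteristics can be sketched with the plain type-token ratio in base R. The token vector is made up, and the sketch ignores preprocessing details such as case handling that lex.div() may apply.

```r
# Toy token vector (made up); real input would come from a tagged text
toy_tokens <- c("the", "insect", "and", "the", "leaf", "and", "the", "twig")

# type-token ratio: number of distinct tokens divided by number of tokens
ttr <- function(tokens) length(unique(tokens)) / length(tokens)

# characteristics: TTR of the first token, then the first two, and so on
ttr_char <- vapply(
  seq_along(toy_tokens),
  function(n) ttr(toy_tokens[seq_len(n)]),
  numeric(1)
)
round(ttr_char, 2)
```

The last element of ttr_char equals the TTR of the full vector, and the earlier elements show how the value changes as the text grows.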

Some measures return more information than just their actual index value. For instance, when the Mean Segmental Type-Token Ratio is calculated, you'll be informed how much of your text was dropped and hence not examined. A small helper tool of koRpus, segment.optimizer(), automatically recommends a different segment size if this could decrease the number of lost tokens.

By default, lex.div() calculates every measure of lexical diversity that was implemented. Of course this is fully configurable, e.g., to completely skip the calculation of characteristics just add the option char=NULL. If you're only interested in one particular measure, it might be more convenient to call the corresponding wrapper function instead of lex.div(). For example, to calculate only the measures proposed by @maas_ueber_1972:


All wrapper functions have characteristics turned off by default. The following example demonstrates how to calculate and plot the classic type-token ratio with characteristics. The resulting plot shows the typical degradation of TTR values with increasing text length:

ttr.res <- TTR(tagged.text, char=TRUE)
plot(ttr.res@TTR.char, type="l", main="TTR degradation over text length")

Since this package is intended for research, it is possible to directly influence all relevant values of each measure and examine the effects. For example, as mentioned before, segment.optimizer() recommended a change of segment size for MSTTR to drop fewer words, which is easily done:

MSTTR(tagged.text, segment=92)
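Why the segment size affects the number of dropped tokens can be sketched in base R. The function msttr_sketch is a hypothetical helper for this illustration, not the package's MSTTR() implementation, and the token vector is made up.

```r
# Hypothetical helper: mean TTR over complete segments, leftover tokens are dropped
msttr_sketch <- function(tokens, segment) {
  n_seg <- length(tokens) %/% segment    # number of complete segments
  dropped <- length(tokens) %% segment   # tokens that don't fill a segment
  used <- tokens[seq_len(n_seg * segment)]
  seg_id <- rep(seq_len(n_seg), each = segment)
  seg_ttr <- tapply(used, seg_id, function(x) length(unique(x)) / length(x))
  list(MSTTR = mean(seg_ttr), dropped = dropped)
}

# 23 made-up tokens cycling through five types
toy_tokens <- rep(c("a", "b", "c", "d", "e"), length.out = 23)
res_10 <- msttr_sketch(toy_tokens, segment = 10)
res_11 <- msttr_sketch(toy_tokens, segment = 11)
c(dropped_10 = res_10$dropped, dropped_11 = res_11$dropped)
```

With 23 tokens, a segment size of 10 leaves three tokens unexamined, while a size of 11 leaves only one, which is the kind of trade-off a segment size recommendation is about.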

Please refer to the documentation for more detailed information on the available measures and their references.

Frequency analysis

Importing language corpora data

This package has rudimentary support to import corpus databases.^[The package also has a function called read.corp.custom() which can be used to process language corpora yourself, and store the results in an object of class kRp.corp.freq, which is the class returned by read.corp.LCC() and read.corp.celex() as well. That is, if you can't get any already analyzed corpus database but have a huge language corpus at hand, you can create your own frequency database. But be warned that depending on corpus size and your hardware, this might take ages. On the other hand, read.corp.custom() will provide inverse document frequency (idf) values for all types, which is necessary to compute tf-idf with freq.analysis().] That is, it can read frequency data for words into an R object and use this object for further analysis. Besides the Celex database format (read.corp.celex()), it can read the LCC flatfile format^[Actually, it understands two different LCC formats, both the older .zip and the newer .tar archive format.] (read.corp.LCC()). The latter might be of special interest, because the needed database archives can be freely downloaded. Once you've downloaded one of these archives, it can be conveniently imported:

LCC.en <- read.corp.LCC("~/downloads/corpora/eng_news_2010_1M-text.tar")

read.corp.LCC() will automatically extract the files it needs from the archive. Alternatively, you can specify the path to the unpacked archive as well. To work with the imported data directly, the tool query() was added to the package. It helps you to conveniently look up certain words, or ranges of interesting values:

query(LCC.en, "word", "what")
##     num word  freq         pct pmio    log10 rank.avg rank.min rank.rel.avg
## 160 210 what 16396 0.000780145  780 2.892095   260759   260759     99.95362
##     rank.rel.min
## 160     99.95362
query(LCC.en, "pmio", c(780, 790))
##     num  word  freq          pct pmio    log10 rank.avg rank.min rank.rel.avg
## 156 206  many 16588 0.0007892806  789 2.897077   260763   260763     99.95515
## 157 207   per 16492 0.0007847128  784 2.894316   260762   260762     99.95477
## 158 208  down 16468 0.0007835708  783 2.893762   260761   260761     99.95439
## 159 209 since 16431 0.0007818103  781 2.892651   260760   260760     99.95400
## 160 210  what 16396 0.0007801450  780 2.892095   260759   260759     99.95362
##     rank.rel.min
## 156     99.95515
## 157     99.95477
## 158     99.95439
## 159     99.95400
## 160     99.95362
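The relation between the columns pct, pmio and log10 in the output above can be checked with a few lines of base R. The use of floor() is an assumption that happens to reproduce the values shown; the package may compute them differently.

```r
# pct values copied from the query() output above
pct <- c(what = 0.0007801450, many = 0.0007892806)

# occurrences per million tokens, rounded down (assumption, see above)
pmio <- floor(pct * 1e6)
pmio

# decadic logarithm of the per-million value
round(log10(pmio), 6)
```

For "what" this gives 780 and 2.892095, and for "many" 789 and 2.897077, matching the printed rows.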

Conduct a frequency analysis

We can now conduct a full frequency analysis of our text:

freq.analysis.res <- freq.analysis(tagged.text, corp.freq=LCC.en)

The resulting object holds a lot of information, even if no corpus data was used (i.e., corp.freq=NULL). To begin with, it contains the two slots tokens and lang, which are copied from the analyzed tagged text object. In this way analysis results can always be converted back into kRp.text objects.^[This can easily be done by calling as(freq.analysis.res, "kRp.text").] However, if corpus data was provided, the tagging results gained three new columns:

##        token tag     lemma lttr  [...] pmio rank.avg rank.min
## 30        an  DT        an    2        3817 99.98735 99.98735
## 31    attack  NN    attack    6         163 99.70370 99.70370
## 32       has VBZ      have    3        4318 99.98888 99.98888
## 33      been VBN        be    4        2488 99.98313 99.98313
## 34 initiated VBN  initiate    9          11 97.32617 97.32137
## 35         (   (         (    1         854 99.96013 99.96013
## 36 secondary  JJ secondary    9          21 98.23846 98.23674
## 37   defense  NN   defense    7         210 99.77499 99.77499
## 38         )   )         )    1         856 99.96052 99.96052

Perhaps most informatively, pmio shows how often the respective token appears in a million tokens, according to the corpus data. Adding to this, the previously introduced slot desc now contains some more descriptive statistics on our text, and if we provided a corpus database, the slot freq.analysis lists summaries of various frequency information that was calculated.

If the corpus object also provided inverse document frequency (i.e., values in column idf) data, freq.analysis() will automatically compute tf-idf statistics and put them in a column called tfidf.
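For readers unfamiliar with tf-idf, here is a base R sketch of the textbook definition on a made-up mini corpus. The logarithm base and the normalization of term frequencies are assumptions of this sketch; the text above does not state which variant freq.analysis() uses.

```r
# Toy corpus of three "documents" (made up)
docs <- list(
  d1 = c("stick", "insect", "leaf"),
  d2 = c("leaf", "tree", "leaf"),
  d3 = c("insect", "wing")
)

# idf: log of (number of documents / number of documents containing the type)
types <- unique(unlist(docs))
doc_freq <- vapply(
  types,
  function(w) sum(vapply(docs, function(d) w %in% d, logical(1))),
  numeric(1)
)
idf <- log10(length(docs) / doc_freq)

# tf: relative frequency of each type within document d2
tf <- table(docs$d2) / length(docs$d2)

# tf-idf: a type scores high if it is frequent here but rare elsewhere
tfidf <- as.vector(tf) * idf[names(tf)]
tfidf
```

Although "leaf" is the most frequent word in d2, "tree" gets the higher score because it occurs in no other document.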

New to the desc slot

Amongst others, the descriptives now also give easy access to character vectors with all words ($all.words) and all lemmata ($all.lemmata), all tokens sorted^[This sorting depends on proper POS-tagging, so this will only contain useful data if you used treetag() instead of tokenize().] into word classes (e.g., all verbs in $classes$verb), or the number of words in each sentence:

##  [1] 34 10 37 16 44 31 14 31 34 23 17 43 40 47 22 19 65 29

As a practical example, the list $classes has proven to be very helpful for debugging the results of TreeTagger, which is remarkably accurate, but of course not free from making a mistake now and then. By looking through $classes, where all tokens are grouped according to the global word class TreeTagger attributed to them, at least obvious errors (like names mistakenly taken for a pronoun) are easily found:^[And can then be corrected by using the function correct.tag().]

## $conjunction
##  [1] "both" "and"  "and"  "and"  "and"  "or"   "or"   "and"  "and"  "or"  
## [11] "and"  "or"   "and"  "or"   "and"  "and"  "and"  "and" 
## $number
## [1] "20"  "one"
## $determiner
##  [1] "an"      "the"     "an"      "The"     "the"     "the"     "some"   
##  [8] "that"    "Some"    "the"     "a"       "a"       "a"       "the"    
## [15] "that"    "the"     "the"     "Another" "which"   "the"     "a"      
## [22] "that"    "a"       "The"     "a"       "the"     "that"    "a"      
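Grouping tokens by word class, as in the output above, can be sketched with base R's split(). The small data frame is made up for illustration and only mimics the token and word class columns of a tagged text.

```r
# Toy tagging result (made up), with one word class per token
toy_tagged <- data.frame(
  token  = c("both", "an", "attack", "and", "the", "defense"),
  wclass = c("conjunction", "determiner", "noun",
             "conjunction", "determiner", "noun"),
  stringsAsFactors = FALSE
)

# one character vector of tokens per word class
toy_classes <- split(toy_tagged$token, toy_tagged$wclass)
toy_classes$conjunction
```

Scanning such a list class by class makes misplaced tokens stand out quickly.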


Readability

The package comes with implementations of several readability formulae. Some of them depend on the number of syllables in the text.^[Whether this is the case can be looked up in the documentation.] To achieve this, the method hyphen() takes objects of class kRp.text and applies a hyphenation algorithm [@liang_word_1983] to each word. This algorithm was originally developed for automatic word hyphenation in $\LaTeX$, and is gracefully misused here to fulfill a slightly different service.^[The hyphen() method was originally implemented as part of the koRpus package, but was later split off into its own package called sylly.]

(hyph.txt.en <- hyphen(tagged.text))
hyph.txt.en <- dget("sample_text_hyphenated_dput.txt")

This separate hyphenation step can actually be skipped, as readability() will do it automatically if needed. But similar to TreeTagger, hyphen() will most likely not produce perfect results. As a rule of thumb, if in doubt it seems to behave rather conservatively, that is, it underestimates the real number of syllables in a text. This, however, would of course affect the results of several readability formulae.

So, the more accurate the end results should be, the less you should rely on the automatic hyphenation alone. But it sure is a good starting point, for there is a method called correct.hyph() to help you clean these results of errors later on. The most straightforward way to do this is to call hyphenText(hyph.txt.en), which will get you a data frame with two columns, word (the hyphenated words) and syll (the number of syllables), and to open it in a spreadsheet editor:^[For example, this can be comfortably done with RKWard:]


You can then manually correct wrong hyphenations by removing or inserting "-" as hyphenation indicators, and call correct.hyph() without further arguments, which will cause it to recount all syllables:

hyph.txt.en <- correct.hyph(hyph.txt.en)

But the method can also be used to alter entries directly, which might be simpler and cleaner than manual changes:

hyph.txt.en <- correct.hyph(hyph.txt.en, word="mech-a-nisms", hyphen="mech-a-ni-sms")
## Changed
##   syll         word
## 2    3 mech-a-nisms
## 6    3 mech-a-nisms
##   into
##   syll          word
## 2    4 mech-a-ni-sms
## 6    4 mech-a-ni-sms
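The recounting step amounts to counting the parts separated by "-". A minimal base R sketch, using the hypothetical helper name count_syll and the two spellings from the output above:

```r
# Hypothetical helper: number of syllables = number of parts between hyphens
count_syll <- function(hyphenated) {
  lengths(strsplit(hyphenated, "-", fixed = TRUE))
}

syll <- count_syll(c("mech-a-nisms", "mech-a-ni-sms"))
syll
```

This yields 3 and 4, the same syllable counts shown before and after the correction.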

The hyphenated text object can now be given to readability(), to calculate the measures of interest:^[Please note that as of version 0.04-18, the correctness of some of these calculations has not been extensively validated yet. The package was released nonetheless, also to find outstanding bugs in the implemented measures. Any information on the validity of its results is very welcome!]

readbl.txt <- readability(tagged.text, hyphen=hyph.txt.en)

Similar to lex.div(), by default readability() calculates almost^[Measures which rely on word lists will be skipped if no list is provided.] all available measures:


To get a more condensed overview of the results try the summary() method:


The summary() method supports an additional flat format, which basically turns the table into a named numeric vector, using the raw values (because all indices have raw values, but only a few more than that). This format comes in very handy when you want to use the output in further calculations:

summary(readbl.txt, flat=TRUE)

If you're interested in a particular formula, again a wrapper function might be more convenient:

flesch.res <- flesch(tagged.text, hyphen=hyph.txt.en)
lix.res <- LIX(tagged.text)   # LIX doesn't need syllable count

Readability from numeric data

It is possible to calculate the readability measures from the relevant key values directly, rather than analyze an actual text, by using readability.num() instead of readability(). If you need to reanalyze a particular text, this can be considerably faster. Therefore, all objects returned by readability() can directly be fed to readability.num(), since all relevant data is present in the desc slot.
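To show why a handful of key values is enough, here is a base R sketch of two well-known formulae, Flesch Reading Ease and LIX, computed from counts alone. The function names and the input numbers are made up for this illustration; the sketch does not use readability.num() and does not reproduce its argument structure.

```r
# Flesch Reading Ease (English) from word, sentence and syllable counts
flesch_sketch <- function(words, sentences, syllables) {
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# LIX from word and sentence counts and the number of long words
# (long words: more than six letters)
lix_sketch <- function(words, sentences, long_words) {
  words / sentences + 100 * long_words / words
}

# made-up key values
flesch_val <- flesch_sketch(words = 100, sentences = 5, syllables = 150)
lix_val <- lix_sketch(words = 100, sentences = 5, long_words = 20)
c(Flesch = flesch_val, LIX = lix_val)
```

LIX needs no syllable count at all, which is why it can be calculated without a hyphenation step.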

Language detection

Another feature of this package is the detection of the language a text was (most probably) written in. This is done by gzipping reference texts in known languages, gzipping them again with addition of a small sample of the text in unknown language, and determining the case where the additional sample causes the smallest increase in file size [as described in @benedetto_gzip_2002]. By default, the compressed objects will be created in memory only.
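The compression idea can be sketched in base R with memCompress(). The two reference texts are Article 1 of the Universal Declaration of Human Rights in English and Dutch, the "unknown" sample is Article 3 in English. This is only an illustration of the principle, not the implementation of guess.lang(), and with texts this short the outcome is not reliable.

```r
# size in bytes of a gzip-compressed string
gz_size <- function(txt) length(memCompress(charToRaw(txt), type = "gzip"))

# short reference texts in known languages
refs <- c(
  english = paste(
    "All human beings are born free and equal in dignity and rights.",
    "They are endowed with reason and conscience and should act towards",
    "one another in a spirit of brotherhood."
  ),
  dutch = paste(
    "Alle mensen worden vrij en gelijk in waardigheid en rechten geboren.",
    "Zij zijn begiftigd met verstand en geweten, en behoren zich jegens",
    "elkander in een geest van broederschap te gedragen."
  )
)
unknown <- "Everyone has the right to life, liberty and security of person."

# how much does the compressed size grow when the sample is appended?
size_increase <- vapply(
  refs,
  function(ref) gz_size(paste(ref, unknown)) - gz_size(ref),
  numeric(1)
)
size_increase

# the reference with the smallest increase is the best guess
names(which.min(size_increase))
```

The more the sample shares character sequences with a reference text, the better it compresses together with it, so a small increase suggests the same language.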

To use the function guess.lang(), you first need to download the reference material. In this implementation, the Universal Declaration of Human Rights in Unicode formatting is used, because the document holds the world record of being the text translated into the most languages, and is publicly available. Please get the zipped archive with all translations in .txt format. You can, but don't have to, unzip the archive. The text to find the language of must also be in a Unicode .txt file:

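guessed <- guess.lang("sample_text.txt", udhr.path="~/downloads/udhr_txt.zip")  # both paths are placeholders
summary(guessed)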
##   Estimated language: English
##           Identifier: eng
##               Region: Europe
## 435 different languages were checked.
## Distribution of compression differences:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   136.0   168.0   176.0   190.7   184.0   280.0 
##   SD: 38.21 
## Top 5 guesses:
##                         name iso639-3 bcp47 region diff  diff.std
## 1                    English      eng    en Europe  136 -1.430827
## 2                      Scots      sco   sco Europe  136 -1.430827
## 3           Pidgin, Nigerian      pcm   pcm Africa  144 -1.221473
## 4   Catalan-Valencian-Balear      cat    ca Europe  152 -1.012119
## 5                     French      fra    fr Europe  152 -1.012119
## Last 5 guesses:
##                         name iso639-3   bcp47 region diff diff.std
## 431                  Burmese      mya      my   Asia  280 2.337547
## 432                     Shan      shn     shn   Asia  280 2.337547
## 433                    Tamil      tam      ta   Asia  280 2.337547
## 434     Vietnamese (Han nom)      vie vi-Hani   Asia  280 2.337547
## 435             Chinese, Yue      yue     yue   Asia  280 2.337547

Extending koRpus

The language support of this package has a modular design. There are some pre-built language packages in the l10n repository, and with a little effort you should be able to add new languages yourself. You need the package sources for this; then you basically have to add a new file to them and rebuild/reinstall the package. More details on this topic can be found in inst/README.languages. Once you have a new language working with koRpus, I'd be happy to include your module in the official distribution.

Analyzing full corpora

Despite its name, the scope of koRpus is single texts. If you would like to analyze a full corpus of texts, have a look at the plugin package tm.plugin.koRpus.


The APA style used in this vignette was kindly provided by the CSL project, licensed under Creative Commons Attribution-ShareAlike 3.0 Unported license.


koRpus documentation built on May 18, 2021, 1:13 a.m.