readability-methods: Measure readability

Description

These methods calculate several readability indices.

Usage

readability(txt.file, ...)

## S4 method for signature 'kRp.text'
readability(
  txt.file,
  hyphen = NULL,
  index = c("ARI", "Bormuth", "Coleman", "Coleman.Liau", "Dale.Chall",
    "Danielson.Bryan", "Dickes.Steiwer", "DRP", "ELF", "Farr.Jenkins.Paterson", "Flesch",
    "Flesch.Kincaid", "FOG", "FORCAST", "Fucks", "Gutierrez", "Harris.Jacobson",
    "Linsear.Write", "LIX", "nWS", "RIX", "SMOG", "Spache", "Strain", "Traenkle.Bailer",
    "TRI", "Tuldava", "Wheeler.Smith"),
  parameters = list(),
  word.lists = list(Bormuth = NULL, Dale.Chall = NULL, Harris.Jacobson = NULL, Spache =
    NULL),
  fileEncoding = "UTF-8",
  sentc.tag = "sentc",
  nonword.class = "nonpunct",
  nonword.tag = c(),
  quiet = FALSE,
  keep.input = NULL,
  as.feature = FALSE
)

## S4 method for signature 'missing'
readability(txt.file, index)

## S4 method for signature 'kRp.readability,ANY,ANY,ANY'
x[i]

## S4 method for signature 'kRp.readability'
x[[i]]

Arguments

txt.file

An object of class kRp.text.

...

Additional arguments for the generics.

hyphen

An object of class kRp.hyphen. If NULL, the text will be hyphenated automatically. All syllable handling will be skipped automatically if it's not needed for the selected indices.

index

A character vector, indicating which indices should actually be computed. If set to "all", then all available indices will be tried (meaning all variations of all measures). If set to "fast", a subset of the default values is used that is known to compute fast (currently, this only excludes "FOG"). You can also set it to "validation" to get information on the current status of validation.

parameters

A list with named magic numbers, defining the relevant parameters for each index. If none are given, the default values are used.

word.lists

A named list providing the word lists for indices which need one. If NULL or missing, the indices will be skipped and a warning is given. Actual word lists can be provided as either a vector (or matrix or data.frame with only one column), or as a file name, where this file must contain one word per line. Alternatively, you can provide the number of words which are not on the list, directly.

fileEncoding

A character string defining the character encoding of the word.lists in case they are provided as files, like "Latin1" or "UTF-8".

sentc.tag

A character vector with POS tags which indicate a sentence ending. The default value "sentc" has special meaning and will cause the result of kRp.POS.tags(lang, tags="sentc", list.tags=TRUE) to be used.

nonword.class

A character vector with word classes which should be ignored for readability analysis. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, tags=c("punct","sentc"), list.classes=TRUE) to be used. Will only be of consequence if hyphen is not set!

nonword.tag

A character vector with POS tags which should be ignored for readability analysis. Will only be of consequence if hyphen is not set!

quiet

Logical. If FALSE, short status messages will be shown. TRUE will also suppress all potential warnings regarding the validation status of measures.

keep.input

Logical. If FALSE, neither the object provided by (or generated from) txt.file nor hyphen will be kept in the output object. By default (NULL) they are kept if the input was not already of the needed object class (e.g., kRp.text) or missing, to allow for re-use without the need to tag or hyphenate the text again. If TRUE, they are always kept. In cases where you want smaller object sizes, set this to FALSE to always drop these slots.

as.feature

Logical, whether the output should be just the analysis results or the input object with the results added as a feature. Use corpusReadability to get the results from such an aggregated object.

x

An object of class kRp.readability.

i

Defines the row selector ([) or the name to match ([[).

Details

In the following formulae, W stands for the number of words, St for the number of sentences, C for the number of characters (usually meaning letters), Sy for the number of syllables, W_{3Sy} for the number of words with at least three syllables, W_{<3Sy} for the number of words with less than three syllables, W^{1Sy} for words with exactly one syllable, W_{6C} for the number of words with at least six letters, and W_{-WL} for the number of words which are not on a certain word list (explained where needed).

"ARI":

Automated Readability Index:

ARI = 0.5 * W / St + 4.71 * C / W - 21.43

If parameters is set to ARI="NRI", the revised parameters from the Navy Readability Indexes are used:

ARI_NRI = 0.4 * W / St + 6 * C / W - 27.4

If parameters is set to ARI="simple", the simplified formula is calculated:

ARI_simple = W / St + 9 * C / W

Wrapper function: ARI
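The three ARI variants above can be checked by hand from raw counts. The following is a minimal base R sketch, not koRpus's own implementation; the helper name ari_from_counts and its variant argument are made up for illustration.

```r
# Hypothetical helper, not part of koRpus: ARI from raw counts.
# W = words, St = sentences, C = characters (letters)
ari_from_counts <- function(W, St, C, variant = c("default", "NRI", "simple")) {
  variant <- match.arg(variant)
  wps <- W / St   # average sentence length in words
  cpw <- C / W    # average word length in characters
  switch(variant,
    default = 0.5 * wps + 4.71 * cpw - 21.43,
    NRI     = 0.4 * wps + 6 * cpw - 27.4,
    simple  = wps + 9 * cpw
  )
}

# 100 words in 5 sentences with 450 letters
ari_from_counts(W = 100, St = 5, C = 450)
ari_from_counts(W = 100, St = 5, C = 450, variant = "NRI")
ari_from_counts(W = 100, St = 5, C = 450, variant = "simple")
```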

"Bormuth":

Bormuth Mean Cloze & Grade Placement:

B_MC = 0.886593 - (0.08364 * C / W) + 0.161911 * (W_-WL / W)^3 - 0.21401 * (W / St) + 0.000577 * (W / St)^2 - 0.000005 * (W / St)^3

Note: This index needs the long Dale-Chall list of 3000 familiar (English) words to compute W_-WL. That is, you must have a copy of this word list and provide it via the word.lists=list(Bormuth=<your.list>) parameter!

B_GP = 4.275 + 12.881 * B_MC - (34.934 * B_MC^2) + (20.388 * B_MC^3) + (26.194 * C_CS - 2.046 * C_CS^2) - (11.767 * C_CS^3) - (44.285 * B_MC * C_CS) + (97.620 * (B_MC * C_CS)^2) - (59.538 * (B_MC * C_CS)^3)

Where C_CS represents the cloze criterion score (35% by default).

Wrapper function: bormuth

"Coleman":

Coleman's Readability Formulas:

C_1 = 1.29 * (100 * W^1Sy / W) - 38.45

C_2 = 1.16 * (100 * W^1Sy / W) + 1.48 * (100 * St / W) - 37.95

C_3 = 1.07 * (100 * W^1Sy / W) + 1.18 * (100 * St / W) + 0.76 * (100 * W_pron / W) - 34.02

C_4 = 1.04 * (100 * W^1Sy / W) + 1.06 * (100 * St / W) + 0.56 * (100 * W_pron / W) - 0.36 * (100 * W_prep / W) - 26.01

Where W_pron is the number of pronouns, and W_prep the number of prepositions.

Wrapper function: coleman

"Coleman.Liau":

First estimates cloze percentage, then calculates grade equivalent:

CL_ECP = 141.8401 - 0.214590 * 100 * C / W + 1.079812 * 100 * St / W

CL_grade = -27.4004 * CL_ECP / 100 + 23.06395

The short form is also calculated:

CL_short = 5.88 * C / W - 29.6 * St / W - 15.8

Wrapper function: coleman.liau
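To see how the cloze estimate, the grade equivalent and the short form relate, the formulas can be evaluated from raw counts. This is a minimal base R sketch under the formulas as stated above; the helper name coleman_liau_from_counts is hypothetical and not a koRpus function.

```r
# Hypothetical helper, not part of koRpus: Coleman-Liau values from raw counts.
# W = words, St = sentences, C = characters (letters)
coleman_liau_from_counts <- function(W, St, C) {
  ecp   <- 141.8401 - 0.214590 * 100 * C / W + 1.079812 * 100 * St / W
  grade <- -27.4004 * ecp / 100 + 23.06395
  short <- 5.88 * C / W - 29.6 * St / W - 15.8
  c(ECP = ecp, grade = grade, short = short)
}

# 100 words in 5 sentences with 450 letters
cl <- coleman_liau_from_counts(W = 100, St = 5, C = 450)
cl
```

For these counts the grade equivalent and the short form differ by less than 0.01.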

"Dale.Chall":

New Dale-Chall Readability Formula. By default the revised formula (1995) is calculated:

DC_new = 64 - 0.95 * 100 * W_-WL / W - 0.69 * W / St

This will result in a cloze score which is then looked up in a grading table. If parameters is set to Dale.Chall="old", the original formula (1948) is used:

DC_old = 0.1579 * 100 * W_-WL / W + 0.0496 * W / St + 3.6365

If parameters is set to Dale.Chall="PSK", the revised parameters by Powers-Sumner-Kearl (1958) are used:

DC_PSK = 0.1155 * 100 * W_-WL / W + 0.0596 * W / St + 3.2672

Note: This index needs the long Dale-Chall list of 3000 familiar (English) words to compute W_-WL. That is, you must have a copy of this word list and provide it via the word.lists=list(Dale.Chall=<your.list>) parameter!

Wrapper function: dale.chall
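The role of W_-WL can be illustrated with a toy example. The sketch below uses plain base R, a made-up five-word list (not the actual Dale-Chall list) and assumes case-insensitive matching; it then plugs the count into the 1948 formula. How koRpus matches tokens against a list internally may differ.

```r
# Toy data for illustration only; toy_list is NOT the Dale-Chall word list.
tokens   <- c("The", "cat", "sat", "on", "the", "ultramarine", "mat")
toy_list <- c("the", "cat", "sat", "on", "mat")

W  <- length(tokens)  # number of words
St <- 1               # one sentence

# W_-WL: words that are not on the list (assumption: compared in lower case)
not_on_list <- sum(!tolower(tokens) %in% toy_list)
pct_not_on_list <- 100 * not_on_list / W

# original Dale-Chall formula (1948) as given above
dc_old <- 0.1579 * pct_not_on_list + 0.0496 * W / St + 3.6365
dc_old
```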

"Danielson.Bryan":

DB_1 = ( 1.0364 * C / Bl) + ( 0.0194 * C / St ) - 0.6059

DB_2 = 131.059 - ( 10.364 * C / Bl ) - ( 0.194 * C / St )

Where Bl is the number of blanks between words. This implementation does not actually count blanks but estimates them as W - 1. C is interpreted as literally all characters.

Wrapper function: danielson.bryan

"Dickes.Steiwer":

Dickes-Steiwer Handformel:

DS = 235.95993 - (73.021 * C / W) - (12.56438 * W / St) - (50.03293 * TTR)

Where TTR refers to the type-token ratio, which will be calculated case-insensitive by default.

Wrapper function: dickes.steiwer
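What a case-insensitive type-token ratio means can be shown in a few lines of base R. The helper ttr and its case.sens argument are hypothetical names for this sketch, not the koRpus implementation.

```r
# Hypothetical helper, not a koRpus function: type-token ratio of a token vector.
ttr <- function(tokens, case.sens = FALSE) {
  if (!case.sens) {
    tokens <- tolower(tokens)  # fold case so "Der" and "der" count as one type
  }
  length(unique(tokens)) / length(tokens)
}

tokens <- c("Der", "Hund", "sieht", "der", "Katze", "nach")
ttr(tokens)                    # 5 types / 6 tokens
ttr(tokens, case.sens = TRUE)  # 6 types / 6 tokens
```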

"DRP":

Degrees of Reading Power. Uses the Bormuth Mean Cloze Score:

DRP = (1 - B_MC) * 100

This formula itself has no parameters.

Note: The Bormuth index needs the long Dale-Chall list of 3000 familiar (English) words to compute W_-WL. That is, you must have a copy of this word list and provide it via the word.lists=list(Bormuth=<your.list>) parameter!

Wrapper function: DRP

"ELF":

Fang's Easy Listening Formula:

ELF = W_2Sy / St

Wrapper function: ELF

"Farr.Jenkins.Paterson":

A simplified version of Flesch Reading Ease:

FJP = -31.517 - 1.015 * W / St + 1.599 * W^1Sy / W

If parameters is set to Farr.Jenkins.Paterson="PSK", the revised parameters by Powers-Sumner-Kearl (1958) are used:

FJP_PSK = 8.4335 + 0.0923 * W / St - 0.0648 * W^1Sy / W

Wrapper function: farr.jenkins.paterson

"Flesch":

Flesch Reading Ease:

F_EN = 206.835 - 1.015 * W / St - 84.6 * Sy / W

Certain internationalisations of the parameters are also implemented. They can be used by setting the Flesch parameter to one of the following language abbreviations.

"de" (Amstad's Verständlichkeitsindex):

F_DE = 180 - W / St - 58.5 * Sy / W

"es" (Fernandez-Huerta):

F_ES = 206.835 - 1.02 * W / St - 60 * Sy / W

"es-s" (Szigriszt):

F_ES-S = 206.835 - W / St - 62.3 * Sy / W

"nl" (Douma):

F_NL = 206.835 - 0.93 * W / St - 77 * Sy / W

"nl-b" (Brouwer Leesindex):

F_NL-B = 195 - 2 * W / St - 67 * Sy / W

"fr" (Kandel-Moles):

F_FR = 209 - 1.15 * W / St - 68 * Sy / W

If parameters is set to Flesch="PSK", the revised parameters by Powers-Sumner-Kearl (1958) are used to calculate a grade level:

F_PSK = 0.0778 * W / St + 4.55 * Sy / W - 2.2029

Wrapper function: flesch
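All language variants of the reading ease score share one shape: a constant minus weighted average sentence length minus weighted average syllables per word. The base R sketch below collects the coefficients listed above in a lookup table. The names flesch_params, flesch_from_counts, const, asl and asw are chosen for this illustration and should not be read as the package's internal names.

```r
# Illustrative lookup table built from the formulas above (not koRpus internals).
# const = constant, asl = weight of W / St, asw = weight of Sy / W
flesch_params <- list(
  "en"   = c(const = 206.835, asl = 1.015, asw = 84.6),
  "de"   = c(const = 180,     asl = 1,     asw = 58.5),
  "es"   = c(const = 206.835, asl = 1.02,  asw = 60),
  "es-s" = c(const = 206.835, asl = 1,     asw = 62.3),
  "nl"   = c(const = 206.835, asl = 0.93,  asw = 77),
  "nl-b" = c(const = 195,     asl = 2,     asw = 67),
  "fr"   = c(const = 209,     asl = 1.15,  asw = 68)
)

# Hypothetical helper: reading ease from raw counts.
# W = words, St = sentences, Sy = syllables
flesch_from_counts <- function(W, St, Sy, lang = "en") {
  p <- flesch_params[[lang]]
  if (is.null(p)) stop("unknown language abbreviation: ", lang)
  unname(p["const"] - p["asl"] * W / St - p["asw"] * Sy / W)
}

# 100 words in 5 sentences with 150 syllables
flesch_from_counts(100, 5, 150)
flesch_from_counts(100, 5, 150, lang = "de")
flesch_from_counts(100, 5, 150, lang = "fr")
```

The PSK variant is left out of the table because it has a different form: it adds both terms and yields a grade level rather than a reading ease score.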

"Flesch.Kincaid":

Flesch-Kincaid Grade Level:

FK = 0.39 * W / St + 11.8 * Sy / W - 15.59

Wrapper function: flesch.kincaid

"FOG":

Gunning Frequency of Gobbledygook:

FOG = 0.4 * ( W / St + 100 * W_3Sy / W )

If parameters is set to FOG="PSK", the revised parameters by Powers-Sumner-Kearl (1958) are used:

FOG_PSK = 3.0680 + ( 0.0877 * W / St ) + ( 0.0984 * 100 * W_3Sy / W )

If parameters is set to FOG="NRI", the new FOG count from the Navy Readability Indexes is used:

FOG_new = ( ( W_<3Sy + 3 * W_3Sy ) / ( 100 * St / W ) - 3 ) / 2

If the text was POS-tagged accordingly, proper nouns and combinations of only easy words will not be counted as hard words, and the syllables of verbs ending in "-ed", "-es" or "-ing" will be counted without these suffixes.

Due to the need to re-hyphenate combined words after splitting them up, this formula takes considerably longer to compute than most others. It will be omitted if you set index="fast" instead of the default.

Wrapper function: FOG

"FORCAST":

FORCAST = 20 - ( W^1Sy * 150 / W ) / 10

If parameters is set to FORCAST="RGL", the parameters for the precise reading grade level are used (see Klare, 1975, pp. 84–85):

FORCAST_RGL = 20.43 - 0.11 * W^1Sy * 150 / W

Wrapper function: FORCAST

"Fucks":

Fucks' Stilcharakteristik (Fucks, 1955, as cited in Briest, 1974):

Fucks = ( Sy / W ) * ( W / St )

This simple formula has no parameters.

Wrapper function: fucks

"Gutierrez":

Gutiérrez de Polini's Fórmula de comprensibilidad (Gutiérrez, 1972, as cited in Fernández, 2016) for Spanish:

Gutierrez = 95.2 - 9.7 * C / W - 0.35 * W / St

Wrapper function: gutierrez

"Harris.Jacobson":

Revised Harris-Jacobson Readability Formulas (Harris & Jacobson, 1974):

For primary-grade material:

HJ_1 = 0.094 * 100 * W_-WL / W + 0.168 * W / St + 0.502

For material above third grade:

HJ_2 = 0.140 * 100 * W_-WL / W + 0.153 * W / St + 0.560

For material below fourth grade:

HJ_3 = 0.158 * W / St + 0.055 * 100 * W_6C / W + 0.355

For material below fourth grade:

HJ_4 = 0.070 * 100 * W_-WL / W + 0.125 * W / St + 0.037 * 100 * W_6C / W + 0.497

For material above third grade:

HJ_5 = 0.118 * 100 * W_-WL / W + 0.134 * W / St + 0.032 * 100 * W_6C / W + 0.424

Note: This index needs the short Harris-Jacobson word list for grades 1 and 2 (English) to compute W_{-WL}. That is, you must have a copy of this word list and provide it via the word.lists=list(Harris.Jacobson=<your.list>) parameter!

Wrapper function: harris.jacobson

"Linsear.Write" (O'Hayre, undated, see Klare, 1975, p. 85):

LW_raw = ( 100 - 100 * W_<3Sy / W + ( 3 * 100 * W_3Sy / W ) ) / ( 100 * St / W )

LW(LW_raw <= 20) = ( LW_raw - 2 ) / 2

LW(LW_raw > 20) = LW_raw / 2

Wrapper function: linsear.write

"LIX":

Björnsson's Läsbarhetsindex. Originally proposed for Swedish texts, calculated by:

LIX = W / St + (W_7C * 100) / W

Texts with a LIX < 25 are considered very easy, around 40 normal, and > 55 very difficult to read.

Wrapper function: LIX

"nWS":

Neue Wiener Sachtextformeln (Bamberger & Vanecek, 1984):

nWS_1 = 19.35 * W_3Sy / W + 0.1672 * W / St + 12.97 * W_6C / W - 3.27 * W^1Sy / W - 0.875

nWS_2 = 20.07 * W_3Sy / W + 0.1682 * W / St + 13.73 * W_6C / W - 2.779

nWS_3 = 29.63 * W_3Sy / W + 0.1905 * W / St - 1.1144

nWS_4 = 27.44 * W_3Sy / W + 0.2656 * W / St - 1.693

Wrapper function: nWS

"RIX":

Anderson's Readability Index. A simplified version of LIX:

RIX = W_7C / St

Texts with a RIX < 1.8 are considered very easy, around 3.7 normal, and > 7.2 very difficult to read.

Wrapper function: RIX
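LIX and RIX use the same long-word count, so they are easy to compare side by side. The base R sketch below evaluates both from raw counts; the helper names and the argument name long_words (standing for the long-word count in the two formulas) are made up for this illustration.

```r
# Hypothetical helpers, not part of koRpus.
# W = words, St = sentences, long_words = long-word count from the formulas above
lix_from_counts <- function(W, St, long_words) {
  W / St + (long_words * 100) / W
}
rix_from_counts <- function(St, long_words) {
  long_words / St
}

# 100 words in 5 sentences, 20 of them long
lix_from_counts(W = 100, St = 5, long_words = 20)  # 20 + 20 = 40
rix_from_counts(St = 5, long_words = 20)           # 20 / 5 = 4
```

With these counts the text lands at a LIX of 40, which the scale above describes as normal.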

"SMOG":

Simple Measure of Gobbledygook. By default calculates formula D by McLaughlin (1969):

SMOG = 1.043 * sqrt( W_3Sy * 30 / St ) + 3.1291

If parameters is set to SMOG="C", formula C will be calculated:

SMOG_C = 0.9986 * sqrt( W_3Sy * 30 / St + 5 ) + 2.8795

If parameters is set to SMOG="simple", the simplified formula is used:

SMOG_simple = sqrt( W_3Sy * 30 / St ) + 3

If parameters is set to SMOG="de", the formula adapted to German texts ("Qu", Bamberger & Vanecek, 1984, p. 78) is used:

SMOG_de = sqrt( W_3Sy * 30 / St ) - 2

Wrapper function: SMOG
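The four SMOG variants differ only in their constants and in the term added under the square root. A minimal base R sketch, with a hypothetical helper name smog_from_counts and a variant argument invented for this illustration:

```r
# Hypothetical helper, not part of koRpus: SMOG variants from raw counts.
# W3Sy = words with at least three syllables, St = sentences
smog_from_counts <- function(W3Sy, St, variant = c("D", "C", "simple", "de")) {
  variant <- match.arg(variant)
  per30 <- W3Sy * 30 / St  # polysyllabic words scaled to 30 sentences
  switch(variant,
    D      = 1.043 * sqrt(per30) + 3.1291,
    C      = 0.9986 * sqrt(per30 + 5) + 2.8795,
    simple = sqrt(per30) + 3,
    de     = sqrt(per30) - 2
  )
}

# 16 polysyllabic words in 30 sentences, so the square root is exactly 4
smog_from_counts(16, 30)
smog_from_counts(16, 30, variant = "simple")
smog_from_counts(16, 30, variant = "de")
```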

"Spache":

Spache Revised Formula (1974):

Spache = 0.121 * W / St + 0.082 * 100 * W_-WL / W + 0.659

If parameters is set to Spache="old", the original parameters (Spache, 1953) are used:

Spache_old = 0.141 * W / St + 0.086 * 100 * W_-WL / W + 0.839

Note: The revised index needs the revised Spache word list (see Klare, 1975, p. 73), and the old index the short Dale-Chall list of 769 familiar (English) words to compute W_{-WL}. That is, you must have a copy of this word list and provide it via the word.lists=list(Spache=<your.list>) parameter!

Wrapper function: spache

"Strain":

Strain Index. This index was proposed in [1]:

S = Sy * 1 / ( St / 3 ) * 1 / 10

Wrapper function: strain

"Traenkle.Bailer":

Tränkle-Bailer Formeln. These two formulas were the result of a re-examination of the ones proposed by Dickes-Steiwer. They try to avoid the usage of the type-token ratio, which is dependent on text length (Tränkle & Bailer, 1984):

TB1 = 224.6814 - ( 79.8304 * C / W ) - (12.24032 * W / St ) - (1.292857 * 100 * W_prep / W )

TB2 = 234.1063 - ( 96.11069 * C / W ) - ( 2.05444 * 100 * W_prep / W ) - (1.02805 * 100 * W_conj / W )

Where W_{prep} refers to the number of prepositions, and W_{conj} to the number of conjunctions.

Wrapper function: traenkle.bailer

"TRI":

Kuntzsch's Text-Redundanz-Index. Intended mainly for German newspaper comments.

TRI = ( 0.449 * W^1Sy ) - ( 2.467 * Ptn ) - ( 0.937 * Frg ) - 14.417

Where Ptn is the number of punctuation marks and Frg the number of foreign words.

Wrapper function: TRI

"Tuldava":

Tuldava's Text Difficulty Formula. Supposed to be rather independent of specific languages (Grzybek, 2010).

TD = Sy / W * ln( W / St )

Wrapper function: tuldava

"Wheeler.Smith":

Intended for English texts in primary grades 1–4 (Wheeler & Smith, 1954):

WS = W / St * 10 * W_2Sy / W

If parameters is set to Wheeler.Smith="de", the calculation stays the same, but grade placement is done according to Bamberger & Vanecek (1984), that is, for German texts.

Wrapper function: wheeler.smith

By default, if the text still has to be tagged, the language definition is queried by calling get.kRp.env(lang=TRUE) internally. If the text has already been tagged, the language definition of that tagged object is used instead. Set force.lang=get.kRp.env(lang=TRUE), or any other valid value, only if you want to override this default behaviour. See kRp.POS.tags for all supported languages.

Value

Depending on as.feature, either an object of class kRp.readability, or an object of class kRp.text with the added feature readability containing it.

Note

To get a printout of the default parameters as they are set if no other parameters are specified, call readability(parameters="dput"). If you want to provide different parameters, you must provide either a complete set for an index or one of the special parameters mentioned in the index descriptions above (e.g., "PSK", if appropriate).

References

Anderson, J. (1981). Analysing the readability of english and non-english texts in the classroom with Lix. In Annual Meeting of the Australian Reading Association, Darwin, Australia.

Anderson, J. (1983). Lix and Rix: Variations on a little-known readability index. Journal of Reading, 26(6), 490–496.

Bamberger, R. & Vanecek, E. (1984). Lesen–Verstehen–Lernen–Schreiben. Wien: Jugend und Volk.

Briest, W. (1974). Kann man Verständlichkeit messen? Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 27, 543–563.

Coleman, M. & Liau, T.L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2), 283–284.

Dickes, P. & Steiwer, L. (1977). Ausarbeitung von Lesbarkeitsformeln für die deutsche Sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 9(1), 20–28.

DuBay, W.H. (2004). The Principles of Readability. Costa Mesa: Impact Information. WWW: http://www.impact-information.com/impactinfo/readability02.pdf; 22.03.2011.

Farr, J.N., Jenkins, J.J. & Paterson, D.G. (1951). Simplification of Flesch Reading Ease formula. Journal of Applied Psychology, 35(5), 333–337.

Fernández, A. M. (2016, November 30). Fórmula de comprensibilidad de Gutiérrez de Polini. https://legible.es/blog/comprensibilidad-gutierrez-de-polini/

Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233.

Grzybek, P. (2010). Text difficulty and the Arens-Altmann law. In Peter Grzybek, Emmerich Kelih, Ján Mačutek (Eds.), Text and Language. Structures – Functions – Interrelations. Quantitative Perspectives. Wien: Praesens, 57–70.

Harris, A.J. & Jacobson, M.D. (1974). Revised Harris-Jacobson readability formulas. In 18th Annual Meeting of the College Reading Association, Bethesda.

Klare, G.R. (1975). Assessing readability. Reading Research Quarterly, 10(1), 62–102.

McLaughlin, G.H. (1969). SMOG grading – A new readability formula. Journal of Reading, 12(8), 639–646.

Powers, R.D., Sumner, W.A., & Kearl, B.E. (1958). A recalculation of four adult readability formulas. Journal of Educational Psychology, 49(2), 99–105.

Smith, E.A. & Senter, R.J. (1967). Automated readability index. AMRL-TR-66-22. Wright-Patterson AFB, Ohio: Aerospace Medical Division.

Spache, G. (1953). A new readability formula for primary-grade reading materials. The Elementary School Journal, 53, 410–413.

Tränkle, U. & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von Lesbarkeitsformeln für die deutsche Sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 16(3), 231–244.

Wheeler, L.R. & Smith, E.H. (1954). A practical readability formula for the classroom teacher in the primary grades. Elementary English, 31, 397–399.

[1] https://strainindex.wordpress.com/2007/09/25/hello-world/

Examples

# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # call readability() on a tokenized text
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # if you call readability() without arguments,
  # you will get its results directly
  rdb.results <- readability(tokenized.obj)

  # there are [ and [[ methods for these objects
  rdb.results[["ARI"]]

  # alternatively, you can also store those results as a
  # feature in the object itself
  tokenized.obj <- readability(
    tokenized.obj,
    as.feature=TRUE
  )
  # results are now part of the object
  hasFeature(tokenized.obj)
  corpusReadability(tokenized.obj)
} else {}

koRpus documentation built on May 18, 2021, 1:13 a.m.