These methods analyze the lexical diversity/complexity of a text corpus.
lex.div(txt, ...)
## S4 method for signature 'kRp.text'
lex.div(
txt,
segment = 100,
factor.size = 0.72,
min.tokens = 9,
MTLDMA.steps = 1,
rand.sample = 42,
window = 100,
case.sens = FALSE,
lemmatize = FALSE,
detailed = FALSE,
measure = c("TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D",
"MTLD", "MTLD-MA"),
char = c("TTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD",
"MTLD-MA"),
char.steps = 5,
log.base = 10,
force.lang = NULL,
keep.tokens = FALSE,
type.index = FALSE,
corp.rm.class = "nonpunct",
corp.rm.tag = c(),
as.feature = FALSE,
quiet = FALSE
)
## S4 method for signature 'character'
lex.div(
txt,
segment = 100,
factor.size = 0.72,
min.tokens = 9,
MTLDMA.steps = 1,
rand.sample = 42,
window = 100,
case.sens = FALSE,
lemmatize = FALSE,
detailed = FALSE,
measure = c("TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D",
"MTLD", "MTLD-MA"),
char = c("TTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD",
"MTLD-MA"),
char.steps = 5,
log.base = 10,
force.lang = NULL,
keep.tokens = FALSE,
type.index = FALSE,
corp.rm.class = "nonpunct",
corp.rm.tag = c(),
quiet = FALSE
)
## S4 method for signature 'missing'
lex.div(txt, measure)
## S4 method for signature 'kRp.TTR,ANY,ANY,ANY'
x[i]
## S4 method for signature 'kRp.TTR'
x[[i]]
txt |
An object of class |
... |
Only used for the method generic. |
segment |
An integer value for MSTTR, defining how many tokens should form one segment. |
factor.size |
A real number between 0 and 1, defining the MTLD factor size. |
min.tokens |
An integer value, how many tokens a full factor must at least have to be considered for the MTLD-MA result. |
MTLDMA.steps |
An integer value for MTLD-MA, defining the step size for the moving window, in tokens. The original proposal uses an increment of 1. Increasing this value speeds up computation, but the result only remains a good estimate if the text is long enough. |
rand.sample |
An integer value, how many tokens should be assumed to be drawn for calculating HD-D. |
window |
An integer value for MATTR, defining how many tokens the moving window should include. |
case.sens |
Logical, whether types should be counted case-sensitively. |
lemmatize |
Logical, whether analysis should be carried out on the lemmatized tokens rather than all running word forms. |
detailed |
Logical, whether full details of the analysis should be calculated. This currently affects MTLD and MTLD-MA, defining if all factors should be kept in the object. This slows down calculations considerably. |
measure |
A character vector defining the measures which should be calculated. Valid elements are |
char |
A character vector defining whether data for plotting characteristic curves should be calculated. Valid elements are
|
char.steps |
An integer value defining the step size for characteristic curves, in tokens. |
log.base |
A numeric value defining the base of the logarithm. See |
force.lang |
A character string defining the language to be assumed for the text, by force. See details. |
keep.tokens |
Logical. If |
type.index |
Logical. If |
corp.rm.class |
A character vector with word classes which should be dropped. The default value
|
corp.rm.tag |
A character vector with POS tags which should be dropped. |
as.feature |
Logical, whether the output should be just the analysis results or the input object with the results added as a feature. Use |
quiet |
Logical. If |
x |
An object of class |
i |
Defines the row selector ( |
lex.div calculates a variety of proposed indices for lexical diversity. In the following formulae, N refers to the total number of tokens, and V to the number of types:
"TTR"
:The ordinary Type-Token Ratio:
TTR = V / N
Wrapper function: TTR
"MSTTR"
:For the Mean Segmental Type-Token Ratio (sometimes referred to as Split TTR) tokens are split up into segments of the given size, TTR for each segment is calculated and the mean of these values returned. Tokens at the end which do not make a full segment are ignored. The number of dropped tokens is reported.
Wrapper function: MSTTR
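The segmenting step described above can be sketched in a few lines of base R. This is only an illustration, not the koRpus implementation: `msttr_sketch` is a hypothetical helper, and it assumes the text is already available as a plain character vector of tokens.

```r
# Illustrative MSTTR sketch (hypothetical helper, not part of koRpus).
msttr_sketch <- function(tokens, segment = 100) {
  n_segments <- length(tokens) %/% segment
  if (n_segments < 1) {
    # not even one full segment: nothing to average
    return(list(MSTTR = NA_real_, dropped = length(tokens)))
  }
  # keep only the tokens that fill complete segments
  used <- tokens[seq_len(n_segments * segment)]
  # label each kept token with the number of its segment
  seg_id <- rep(seq_len(n_segments), each = segment)
  # TTR of each segment: types divided by tokens
  ttr_per_segment <- tapply(
    used, seg_id, function(x) length(unique(x)) / length(x)
  )
  list(
    MSTTR = mean(ttr_per_segment),
    dropped = length(tokens) - length(used)
  )
}

# seven tokens, segments of three: one token is dropped
msttr_sketch(c("a", "b", "a", "c", "d", "d", "e"), segment = 3)
```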
"MATTR"
:The Moving-Average Type-Token Ratio (Covington & McFall, 2010) calculates TTRs for a defined number of tokens (called the "window"), starting at the beginning of the text and moving this window over the text, until the last token is reached. The mean of these TTRs is the MATTR.
Wrapper function: MATTR
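As a rough illustration of the moving window, the following base R sketch computes one TTR per window position and averages them. `mattr_sketch` is a hypothetical name, the input is assumed to be a plain character vector of tokens, and returning NA for texts shorter than the window is an assumption of this sketch, not documented koRpus behaviour.

```r
# Illustrative MATTR sketch (hypothetical helper, not part of koRpus).
mattr_sketch <- function(tokens, window = 100) {
  n <- length(tokens)
  if (n < window) {
    # assumption of this sketch: no full window, no result
    return(NA_real_)
  }
  # every possible starting position of a full window
  starts <- seq_len(n - window + 1)
  ttrs <- vapply(starts, function(i) {
    w <- tokens[i:(i + window - 1)]
    length(unique(w)) / window
  }, numeric(1))
  mean(ttrs)
}

# windows: (a,b,a), (b,a,c), (a,c,d)
mattr_sketch(c("a", "b", "a", "c", "d"), window = 3)
```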
"C"
:Herdan's C (Herdan, 1960, as cited in Tweedie & Baayen, 1998; sometimes referred to as LogTTR):
C = lg(V) / lg(N)
Wrapper function: C.ld
"R"
:Guiraud's Root TTR (Guiraud, 1954, as cited in Tweedie & Baayen, 1998):
R = V / sqrt(N)
Wrapper function: R.ld
"CTTR"
:Carroll's Corrected TTR:
CTTR = V / sqrt(2N)
Wrapper function: CTTR
"U"
:Dugast's Uber Index (Dugast, 1978, as cited in Tweedie & Baayen, 1998):
U = lg(N)^2 / (lg(N) - lg(V))
Wrapper function: U.ld
"S"
:Summer's index:
S = lg(lg(V)) / lg(lg(N))
Wrapper function: S.ld
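The measures that depend only on N and V (TTR, C, R, CTTR, U and S) can be computed side by side. The sketch below is a plain base R illustration under stated assumptions: `simple_indices` is a hypothetical helper, tokens are a character vector, no case folding or lemmatization is applied, and U is read as lg(N)^2 divided by the difference lg(N) - lg(V).

```r
# Illustrative sketch of the indices based only on N and V
# (hypothetical helper, not part of koRpus).
simple_indices <- function(tokens, log.base = 10) {
  N <- length(tokens)           # number of tokens
  V <- length(unique(tokens))   # number of types
  lgN <- log(N, base = log.base)
  lgV <- log(V, base = log.base)
  c(
    TTR  = V / N,
    C    = lgV / lgN,
    R    = V / sqrt(N),
    CTTR = V / sqrt(2 * N),
    U    = lgN^2 / (lgN - lgV),
    S    = log(lgV, base = log.base) / log(lgN, base = log.base)
  )
}

# 100 tokens, 10 types
simple_indices(rep(letters[1:10], times = 10))
```

Note that U is undefined when every token is a distinct type (V equals N), because the denominator becomes zero.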
"K"
:Yule's K (Yule, 1944, as cited in Tweedie & Baayen, 1998) is calculated by:
K = 10^4 * (sum(fX*X^2) - N) / N^2
where N is the number of tokens, X is a vector with the frequencies of each type, and fX is the number of types that occur with each frequency X.
Wrapper function: K.ld
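To make the roles of X and fX concrete, here is a small base R sketch that builds the frequency spectrum with `table()` and plugs it into the formula for K. `yule_k_sketch` is a hypothetical helper, and it reads fX as the number of types that occur exactly X times.

```r
# Illustrative sketch of Yule's K (hypothetical helper, not part of koRpus).
yule_k_sketch <- function(tokens) {
  N <- length(tokens)
  # how often each type occurs
  type_freq <- as.vector(table(tokens))
  # frequency spectrum: how many types share each frequency
  spectrum <- table(type_freq)
  X <- as.numeric(names(spectrum))
  fX <- as.numeric(spectrum)
  10^4 * (sum(fX * X^2) - N) / N^2
}

# frequencies: a = 3, b = 2, c = 1
yule_k_sketch(c("a", "a", "a", "b", "b", "c"))
```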
"Maas"
:Maas' indices (a, lg(V0) and lg_e(V0)):
a^2 = (lg(N) - lg(V)) / lg(N)^2
lg(V0) = lg(V) / sqrt(1 - (lg(V) / lg(N))^2)
Earlier versions (koRpus < 0.04-12) reported a^2, and not a. The measure was derived from a formula by Müller (1969, as cited in Maas, 1972). lg_e(V0) is equivalent to lg(V0), only with e as the base for the logarithms. Also calculated are a, lg(V0) (both not the same as before) and V' as measures of relative vocabulary growth while the text progresses. To calculate these measures, the first half of the text and the full text will be examined (see Maas, 1972, p. 67 ff. for details).
Wrapper function: maas
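The two basic Maas values can be sketched directly from N and V. The bracket placement used here is an assumption of this sketch: a^2 is taken as (lg(N) - lg(V)) / lg(N)^2, and lg(V0) as lg(V) divided by the square root of 1 - (lg(V) / lg(N))^2. `maas_sketch` is a hypothetical helper, and the relative vocabulary growth measures are not covered.

```r
# Illustrative sketch of Maas' a and lg(V0)
# (hypothetical helper, not part of koRpus).
maas_sketch <- function(N, V, log.base = 10) {
  lgN <- log(N, base = log.base)
  lgV <- log(V, base = log.base)
  a_squared <- (lgN - lgV) / lgN^2
  c(
    a    = sqrt(a_squared),
    lgV0 = lgV / sqrt(1 - (lgV / lgN)^2)
  )
}

# 100 tokens, 10 types, base 10 logarithms
maas_sketch(N = 100, V = 10)

# the same with e as the base, corresponding to lg_e(V0)
maas_sketch(N = 100, V = 10, log.base = exp(1))
```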
"MTLD"
:For the Measure of Textual Lexical Diversity (McCarthy & Jarvis, 2010), so-called factors are counted. Each factor is a consecutive stream of tokens which ends (and is then counted as a full factor) when the TTR value falls below the given factor size. The value of remaining partial factors is estimated by the ratio of their current TTR to the factor size threshold. The MTLD is the total number of tokens divided by the number of factors. The procedure is done twice, both forward and backward for all tokens, and the mean of both calculations is the final MTLD result.
Wrapper function: MTLD
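The factor counting can be illustrated with a short base R sketch. It is not the koRpus implementation: `mtld_one_direction` and `mtld_sketch` are hypothetical helpers, tokens are assumed to be a character vector, and the remaining partial factor is valued as (1 - TTR) / (1 - factor.size), which is how this sketch reads the proposal by McCarthy & Jarvis (2010).

```r
# Illustrative MTLD sketch (hypothetical helpers, not part of koRpus).
mtld_one_direction <- function(tokens, factor.size = 0.72) {
  n <- length(tokens)
  factors <- 0
  start <- 1   # first token of the factor currently being built
  ttr <- 1
  for (i in seq_len(n)) {
    seg <- tokens[start:i]
    ttr <- length(unique(seg)) / length(seg)
    if (ttr < factor.size) {
      # TTR fell below the threshold: count a full factor and start over
      factors <- factors + 1
      start <- i + 1
      ttr <- 1
    }
  }
  if (start <= n) {
    # assumption: value of the unfinished factor at the end of the text
    factors <- factors + (1 - ttr) / (1 - factor.size)
  }
  if (factors == 0) NA_real_ else n / factors
}

mtld_sketch <- function(tokens, factor.size = 0.72) {
  # forward and backward pass, then the mean of both
  mean(c(
    mtld_one_direction(tokens, factor.size),
    mtld_one_direction(rev(tokens), factor.size)
  ))
}

# every second token closes a factor, so 4 tokens / 2 factors
mtld_sketch(rep("a", 4))
```

In this sketch a text without any repetition never completes or even starts to fill a factor, so the result is NA rather than a division by zero.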
"MTLD-MA"
:The Moving-Average Measure of Textual Lexical Diversity (Jarvis, no year) combines factor counting and a moving window similar to MATTR: After each full factor the next one is calculated from one token after the last starting point. This is repeated until the end of text is reached for the first time. The average of all full factor lengths is the final MTLD-MA result. Factors below the min.tokens threshold are dropped.
Wrapper function: MTLD
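A possible reading of the MTLD-MA procedure is sketched below in base R. `mtld_ma_sketch` is a hypothetical helper, not koRpus code, and it makes two assumptions: a factor that cannot be completed before the end of the text stops the procedure, and "below the min.tokens threshold" means strictly fewer tokens than min.tokens.

```r
# Illustrative MTLD-MA sketch (hypothetical helper, not part of koRpus).
mtld_ma_sketch <- function(tokens, factor.size = 0.72,
                           min.tokens = 9, steps = 1) {
  n <- length(tokens)
  factor_lengths <- c()
  start <- 1
  while (start <= n) {
    completed <- FALSE
    for (i in start:n) {
      seg <- tokens[start:i]
      if (length(unique(seg)) / length(seg) < factor.size) {
        # full factor found: remember its length in tokens
        factor_lengths <- c(factor_lengths, length(seg))
        completed <- TRUE
        break
      }
    }
    # end of text reached without completing a factor: stop
    if (!completed) break
    # move the starting point forward by the step size
    start <- start + steps
  }
  kept <- factor_lengths[factor_lengths >= min.tokens]
  if (length(kept) == 0) NA_real_ else mean(kept)
}

# tiny example with a lowered threshold so that short factors count
mtld_ma_sketch(rep("a", 5), min.tokens = 1)
```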
"HD-D"
:The HD-D value can be interpreted as the idealized version of vocd-D (see McCarthy & Jarvis, 2007). For each type, the probability is computed (using the hypergeometric distribution) of drawing it at least once when randomly drawing a certain number of tokens from the text (42 by default). The sum of these probabilities makes up the HD-D value. The sum of probabilities relative to the drawn sample size (ATTR) is also reported.
Wrapper function: HDD
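The hypergeometric step can be written with `dhyper()` from base R's stats package: the probability of drawing a type at least once is one minus the probability of drawing it zero times. The sketch below is an illustration only; `hdd_sketch` is a hypothetical helper, and requiring at least rand.sample tokens is an assumption of this sketch.

```r
# Illustrative HD-D sketch (hypothetical helper, not part of koRpus).
hdd_sketch <- function(tokens, rand.sample = 42) {
  N <- length(tokens)
  # assumption: the text must be at least as long as the sample
  stopifnot(N >= rand.sample)
  # how often each type occurs in the text
  freq <- as.vector(table(tokens))
  # P(type drawn at least once) = 1 - P(type drawn zero times)
  p_at_least_once <- 1 - dhyper(0, m = freq, n = N - freq, k = rand.sample)
  hdd <- sum(p_at_least_once)
  c(HDD = hdd, ATTR = hdd / rand.sample)
}

# two types with two tokens each, drawing two tokens
hdd_sketch(c("a", "a", "b", "b"), rand.sample = 2)
```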
By default, if the text still has to be tagged, the language definition is queried by calling get.kRp.env(lang=TRUE) internally. If txt has already been tagged, the language definition of that tagged object is read and used by default. Set force.lang=get.kRp.env(lang=TRUE), or any other valid value, only if you want to forcibly overwrite this default behaviour. See kRp.POS.tags for all supported languages.
Depending on as.feature, either an object of class kRp.TTR, or an object of class kRp.text with the added feature lex_div containing it.
Covington, M.A. & McFall, J.D. (2010). Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100.
Maas, H.-D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73–96.
McCarthy, P.M. & Jarvis, S. (2007). vocd: A theoretical and empirical evaluation. Language Testing, 24(4), 459–488.
McCarthy, P.M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.
Tweedie, F.J. & Baayen, R.H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 32(5), 323–352.
kRp.POS.tags, kRp.text, kRp.TTR
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # call lex.div() on a tokenized text
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # if you call lex.div() without arguments,
  # you will get its results directly
  ld.results <- lex.div(tokenized.obj, char=c())
  # there are [ and [[ methods for these objects
  ld.results[["MSTTR"]]

  # alternatively, you can also store those results as a
  # feature in the object itself
  tokenized.obj <- lex.div(
    tokenized.obj,
    char=c(),
    as.feature=TRUE
  )
  # results are now part of the object
  hasFeature(tokenized.obj)
  corpusLexDiv(tokenized.obj)
} else {}