slma: Stable lexical marker analysis
In mclm: Mastering Corpus Linguistics Methods

slma	R Documentation

Stable lexical marker analysis

Description

This function conducts a stable lexical marker analysis.

Usage

slma(
  x,
  y,
  file_encoding = "UTF-8",
  sig_cutoff = qchisq(0.95, df = 1),
  small_pos = 1e-05,
  keep_intermediate = FALSE,
  verbose = TRUE,
  min_rank = 1,
  max_rank = 5000,
  keeplist = NULL,
  stoplist = NULL,
  ngram_size = NULL,
  max_skip = 0,
  ngram_sep = "_",
  ngram_n_open = 0,
  ngram_open = "[]",
  ...
)

Arguments

`x, y`	Character vector or `fnames` object with filenames for the two sets of documents.
`file_encoding`	Encoding of all the files to read.
`sig_cutoff`	Numeric value indicating the cutoff value for 'significance' in the stable lexical marker analysis. The default value is `qchisq(0.95, df = 1)`, which is about 3.84.
`small_pos`	Alternative (but sometimes inferior) approach to dealing with zero frequencies, compared to `haldane`. The argument `small_pos` only applies when `haldane` is set to `FALSE`. (See the Details section.) If `haldane` is `FALSE`, and there is at least one zero frequency in a contingency table, adding small positive values to the zero frequency cells is done systematically for all measures calculated for that table, not just for measures that need this to be done.
`keep_intermediate`	Logical. If `TRUE`, results from intermediate calculations are kept in the output as the "intermediate" element. This is necessary if you want to inspect the object with the `details()` method.
`verbose`	Logical. Whether progress should be printed to the console during analysis.
`min_rank, max_rank`	Minimum and maximum frequency rank in the first corpus (`x`) of the items to take into consideration as candidate stable markers. Only tokens or token n-grams with a frequency rank greater than or equal to `min_rank` and lower than or equal to `max_rank` will be included.
`keeplist`	List of types that must certainly be included in the list of candidate markers regardless of their frequency rank and of `stoplist`.
`stoplist`	List of types that must not be included in the list of candidate markers, although, if a type is included in `keeplist`, its inclusion in `stoplist` is disregarded.
`ngram_size`	Argument in support of ngrams/skipgrams (see also `max_skip`). If one wants to identify individual tokens, the value of `ngram_size` should be `NULL` or `1`. If one wants to retrieve token ngrams/skipgrams, `ngram_size` should be an integer indicating the size of the ngrams/skipgrams, e.g. `2` for bigrams or `3` for trigrams.
`max_skip`	Argument in support of skipgrams. This argument is ignored if `ngram_size` is `NULL` or `1`. If `ngram_size` is `2` or higher and `max_skip` is `0`, then regular ngrams are retrieved (although they may contain open slots; see `ngram_n_open`). If `ngram_size` is `2` or higher and `max_skip` is `1` or higher, then skipgrams are retrieved (which in the current implementation cannot contain open slots; see `ngram_n_open`). For instance, if `ngram_size` is `3` and `max_skip` is `2`, then 2-skip trigrams are retrieved. Or if `ngram_size` is `5` and `max_skip` is `3`, then 3-skip 5-grams are retrieved.
`ngram_sep`	Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.
`ngram_n_open`	If `ngram_size` is `2` or higher, and moreover `ngram_n_open` is a number higher than `0`, then ngrams with 'open slots' in them are retrieved. These ngrams with 'open slots' are generalizations of fully lexically specific ngrams (with the generalization being that one or more of the items in the ngram are replaced by a notation that stands for 'any arbitrary token'). For instance, if `ngram_size` is `4` and `ngram_n_open` is `1`, and if moreover the input contains a 4-gram `"it_is_widely_accepted"`, then the output will contain all modifications of `"it_is_widely_accepted"` in which one (since `ngram_n_open` is `1`) of the items in this n-gram is replaced by an open slot. The first and the last item inside an ngram are never turned into an open slot; only the items in between are candidates for being turned into open slots. Therefore, in the example, the output will contain `"it_[]_widely_accepted"` and `"it_is_[]_accepted"`. As a second example, if `ngram_size` is `5` and `ngram_n_open` is `2`, and if moreover the input contains a 5-gram `"it_is_widely_accepted_that"`, then the output will contain `"it_[]_[]_accepted_that"`, `"it_[]_widely_[]_that"`, and `"it_is_[]_[]_that"`.
`ngram_open`	Character string used to represent open slots in ngrams in the output of this function.
`...`	Additional arguments.
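
The open-slot behaviour described for `ngram_n_open` can be illustrated with a small base R sketch. The helper `open_slot_variants()` below is hypothetical and is not part of mclm; it only mimics the rule stated above (inner items may become open slots, the first and last items never do) and makes no claim about how `slma()` implements it internally.

```r
# Hypothetical helper, NOT part of mclm: lists the open-slot variants of a
# single n-gram, following the rule described for `ngram_n_open`.
open_slot_variants <- function(ngram, n_open, sep = "_", open = "[]") {
  tokens <- strsplit(ngram, sep, fixed = TRUE)[[1]]
  # Positions that may be opened: everything except the first and last item.
  inner <- seq_along(tokens)[-c(1, length(tokens))]
  if (n_open < 1 || length(inner) < n_open) return(character(0))
  # Every way of choosing `n_open` of the inner positions.
  combos <- combn(length(inner), n_open, simplify = FALSE)
  vapply(combos, function(idx) {
    out <- tokens
    out[inner[idx]] <- open
    paste(out, collapse = sep)
  }, character(1))
}

one_open <- open_slot_variants("it_is_widely_accepted", n_open = 1)
two_open <- open_slot_variants("it_is_widely_accepted_that", n_open = 2)
print(one_open)
print(two_open)
```

The two calls reproduce the examples given for `ngram_n_open`: two variants for the 4-gram with one open slot, and three variants for the 5-gram with two open slots.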

Details

A stable lexical marker analysis of the A-documents in x versus the B-documents in y starts from a separate keyword analysis for all possible document couples (a,b), with a an A-document and b a B-document. If there are n A-documents and m B-documents, then n*m keyword analyses are conducted. The 'stability' of a linguistic item x, as a marker for the collection of A-documents (when compared to the B-documents) corresponds to the frequency and consistency with which x is found to be a keyword for the A-documents across all aforementioned keyword analyses.

In any specific keyword analysis, x is considered a keyword for an A-document if G_signed is positive and moreover p_G is less than sig_cutoff (see assoc_scores() for more information on the measures). Item x is considered a keyword for the B-document if G_signed is negative and moreover p_G is less than sig_cutoff.
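
As a rough illustration of the document couples, the base R sketch below enumerates all (a,b) couples for toy file names and labels one item as a keyword for A, for B, or for neither in each couple. The file names, the `G_signed` values and the `significant` flags are all made up for illustration; `significant` simply stands in for the outcome of the significance test against `sig_cutoff`, and nothing here calls mclm.

```r
# Toy document sets (hypothetical file names).
a_docs <- c("a1.txt", "a2.txt", "a3.txt")  # n = 3 A-documents
b_docs <- c("b1.txt", "b2.txt")            # m = 2 B-documents

# One row per (a, b) couple: n * m rows in total.
couples <- expand.grid(a = a_docs, b = b_docs, stringsAsFactors = FALSE)

# Made-up results of the six keyword analyses for a single item.
couples$G_signed    <- c(5.2, -0.4, 7.9, -6.1, 4.4, 1.0)
couples$significant <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)  # assumed test outcome

# Positive and significant: keyword for A; negative and significant: keyword for B.
couples$keyword_for <- ifelse(!couples$significant, "neither",
                              ifelse(couples$G_signed > 0, "A", "B"))
print(couples)
```

With three A-documents and two B-documents this yields six couples, which is the n*m mentioned above.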

Value

An object of class slma, which is a named list with at least the following elements:

A scores dataframe with information about the stability of the chosen lexical items. (See below.)
An intermediate list with a register of intermediate values if keep_intermediate was TRUE.
Named items registering the values of the arguments with the same name, namely sig_cutoff, small_pos, x, and y.

The slma object has as_data_frame() and print methods as well as an ad-hoc details() method. Note that the print method simply prints the main dataframe.

Contents of the `scores` element

The scores element is a dataframe of which the rows are linguistic items for which a stable lexical marker analysis was conducted and the columns are different 'stability measures' and related statistics. By default, the linguistic items are sorted by decreasing 'stability' according to the S_lor measure.

Column	Name	Computation	Range of values
`S_abs`	Absolute stability	`S_att` - `S_rep`	-(n*m) -- (n*m)
`S_nrm`	Normalized stability	`S_abs` / (n*m)	-1 -- 1
`S_att`	Stability of attraction	Number of (a,b) couples in which the linguistic item is a keyword for the A-documents	0 -- n*m
`S_rep`	Stability of repulsion	Number of (a,b) couples in which the linguistic item is a keyword for the B-documents	0 -- n*m
`S_lor`	Log of odds ratio stability	Mean of `log_OR` across all (a,b) couples but setting to 0 the value when `p_G` is larger than `sig_cutoff`

S_lor is computed as a fraction whose numerator is the sum of all log_OR values across all (a,b) couples for which p_G is lower than sig_cutoff and whose denominator is n*m. For more on log_OR, see the Value section of assoc_scores(). The final three columns of the output are meant as a tool in support of the interpretation of the log_OR values. Considering all (a,b) couples for which p_G is smaller than sig_cutoff, lor_min, lor_max and lor_sd are the minimum, maximum and standard deviation of the log_OR values for each item.
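
The arithmetic behind these columns can be sketched in base R for a single item. All input values below are invented, and `significant` is an assumed stand-in for the per-couple significance decision; the sketch follows the computations listed in the table and is not taken from the mclm source.

```r
n <- 3  # number of A-documents (toy value)
m <- 2  # number of B-documents (toy value)

# Hypothetical per-couple results for one item, one entry per (a, b) couple.
g_signed    <- c(5.2, -0.4, 7.9, -6.1, 4.4, 1.0)
log_or      <- c(1.1, -0.2, 1.6, -1.3, 0.9, 0.3)
significant <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)  # assumed test outcome

S_att <- sum(significant & g_signed > 0)     # couples where the item marks A
S_rep <- sum(significant & g_signed < 0)     # couples where the item marks B
S_abs <- S_att - S_rep
S_nrm <- S_abs / (n * m)
S_lor <- sum(log_or[significant]) / (n * m)  # non-significant couples count as 0

# Summary of log_OR over the significant couples only.
lor_min <- min(log_or[significant])
lor_max <- max(log_or[significant])
lor_sd  <- sd(log_or[significant])

print(c(S_abs = S_abs, S_nrm = S_nrm, S_att = S_att, S_rep = S_rep, S_lor = S_lor))
```

Dividing by n*m rather than by the number of significant couples is what makes `S_lor` a mean in which non-significant couples contribute 0.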

Examples

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
slma_ex <- slma(a_corp, b_corp)

mclm documentation built on Oct. 3, 2022, 9:07 a.m.