slma | R Documentation |
This function conducts a stable lexical marker analysis.
slma(
  x,
  y,
  file_encoding = "UTF-8",
  sig_cutoff = qchisq(0.95, df = 1),
  small_pos = 1e-05,
  keep_intermediate = FALSE,
  verbose = TRUE,
  min_rank = 1,
  max_rank = 5000,
  keeplist = NULL,
  stoplist = NULL,
  ngram_size = NULL,
  max_skip = 0,
  ngram_sep = "_",
  ngram_n_open = 0,
  ngram_open = "[]",
  ...
)
x, y |
Character vector or |
file_encoding |
Encoding of all the files to read. |
sig_cutoff |
Numeric value indicating the cutoff value for 'significance'
in the stable lexical marker analysis. The default value is qchisq(0.95, df = 1), which is about 3.84. |
small_pos |
Alternative (but sometimes inferior) approach to dealing with
zero frequencies, compared to If |
keep_intermediate |
Logical. If |
verbose |
Logical. Whether progress should be printed to the console during analysis. |
min_rank, max_rank |
Minimum and maximum frequency rank in the first
corpus ( |
keeplist |
List of types that must certainly be included in the list of
candidate markers regardless of their frequency rank and of |
stoplist |
List of types that must not be included in the list of candidate
markers, although, if a type is included in |
ngram_size |
Argument in support of ngrams/skipgrams (see also If one wants to identify individual tokens, the value of |
max_skip |
Argument in support of skipgrams. This argument is ignored if
If If For instance, if |
ngram_sep |
Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function. |
ngram_n_open |
If For instance, if As a second example, if |
ngram_open |
Character string used to represent open slots in ngrams in the output of this function. |
... |
Additional arguments. |
A stable lexical marker analysis of the A-documents in x versus the
B-documents in y starts from a separate keyword analysis for all possible
document couples (a,b), with a an A-document and b a B-document. If there are
n A-documents and m B-documents, then n*m keyword analyses are conducted. The
'stability' of a linguistic item x, as a marker for the collection of
A-documents (when compared to the B-documents), corresponds to the frequency
and consistency with which x is found to be a keyword for the A-documents
across all aforementioned keyword analyses.
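The pairing of documents described above can be sketched in base R. This is only an illustration of the n*m couples, not the package's internal code; the file names are hypothetical placeholders.

```r
# Hypothetical file names standing in for the A- and B-documents.
a_docs <- c("a1.txt", "a2.txt", "a3.txt")
b_docs <- c("b1.txt", "b2.txt")

# Pair every A-document with every B-document.
couples <- expand.grid(a = a_docs, b = b_docs, stringsAsFactors = FALSE)

n <- length(a_docs)
m <- length(b_docs)

# One keyword analysis would be run per row of `couples`.
n_analyses <- nrow(couples)
print(couples)
```

With three A-documents and two B-documents, `couples` has one row for each of the n*m = 6 keyword analyses.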
In any specific keyword analysis, x is considered a keyword for an A-document
if G_signed is positive and moreover p_G is less than sig_cutoff (see
assoc_scores() for more information on the measures). Item x is considered a
keyword for the B-document if G_signed is negative and moreover p_G is less
than sig_cutoff.
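As a rough sketch of the decision rule above, the following base R code classifies one item in each of four couples. The G_signed and p_G values are made up, and `p_cutoff` is a hypothetical threshold on the p-value scale chosen for this illustration only; it is not the function's default sig_cutoff.

```r
# Made-up scores for a single item across four (a,b) couples.
G_signed <- c(12.4, 5.1, -7.9, 0.8)
p_G      <- c(0.0004, 0.024, 0.005, 0.37)

# Assumption: an illustrative cutoff on the p-value scale.
p_cutoff <- 0.05

# A couple counts only if the association is significant;
# the sign of G_signed then decides the direction.
is_sig <- p_G < p_cutoff
status <- ifelse(is_sig & G_signed > 0, "keyword for A",
          ifelse(is_sig & G_signed < 0, "keyword for B", "neither"))
print(status)
```

Here the first two couples mark the item as a keyword for the A-document, the third as a keyword for the B-document, and the fourth as neither.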
An object of class slma, which is a named list with at least the following
elements:
A scores dataframe with information about the stability of the chosen lexical items. (See below.)
An intermediate list with a register of intermediate values if keep_intermediate was TRUE.
Named items registering the values of the arguments with the same name, namely sig_cutoff, small_pos, x, and y.
The slma object has as_data_frame() and print methods as well as an ad-hoc
details() method. Note that the print method simply prints the main dataframe.
scores element
The scores element is a dataframe of which the rows are linguistic items
for which a stable lexical marker analysis was conducted and the columns are
different 'stability measures' and related statistics. By default, the
linguistic items are sorted by decreasing 'stability' according to the S_lor
measure.
Column | Name | Computation | Range of values |
S_abs | Absolute stability | S_att - S_rep | -(n*m) -- (n*m) |
S_nrm | Normalized stability | S_abs / (n*m) | -1 -- 1 |
S_att | Stability of attraction | Number of (a,b) couples in which the linguistic item is a keyword for the A-documents | 0 -- n*m |
S_rep | Stability of repulsion | Number of (a,b) couples in which the linguistic item is a keyword for the B-documents | 0 -- n*m |
S_lor | Log of odds ratio stability | Mean of log_OR across all (a,b) couples but setting to 0 the value when p_G is larger than sig_cutoff | |
S_lor is thus computed as a fraction: its numerator is the sum of all log_OR
values across the (a,b) couples for which p_G is lower than sig_cutoff, and
its denominator is n*m.
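The computations in the table can be illustrated with a small base R sketch for one item. All numbers are made up, n = m = 2 is assumed so that there are four couples, and `p_cutoff` is again a hypothetical p-value-scale threshold rather than the function's default.

```r
# Assumed toy setting: 2 A-documents and 2 B-documents, hence 4 couples.
n <- 2
m <- 2

# Made-up per-couple scores for a single item.
G_signed <- c(12.4, 5.1, -7.9, 0.8)
p_G      <- c(0.0004, 0.024, 0.005, 0.37)
log_OR   <- c(1.2, 0.8, -1.0, 0.1)

# Assumption: an illustrative cutoff on the p-value scale.
p_cutoff <- 0.05
is_sig <- p_G < p_cutoff

S_att <- sum(is_sig & G_signed > 0)      # couples where the item is a keyword for A
S_rep <- sum(is_sig & G_signed < 0)      # couples where the item is a keyword for B
S_abs <- S_att - S_rep                   # absolute stability
S_nrm <- S_abs / (n * m)                 # normalized stability
S_lor <- sum(log_OR[is_sig]) / (n * m)   # non-significant couples contribute 0

print(c(S_abs = S_abs, S_nrm = S_nrm, S_att = S_att, S_rep = S_rep, S_lor = S_lor))
```

Dividing by n*m rather than by the number of significant couples is what makes S_lor equivalent to a mean in which non-significant couples count as 0.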
For more on log_OR, see the Value section of assoc_scores(). The final three
columns of the output are meant as a tool in support of the interpretation of
the log_OR column. Considering all (a,b) couples for which p_G is smaller than
sig_cutoff, lor_min, lor_max and lor_sd are the minimum, maximum and standard
deviation of their log_OR values for each element.
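A minimal sketch of these three summary columns for a single item, assuming they summarize the log_OR values of the significant couples only. The values and the p-value-scale threshold `p_cutoff` are invented for illustration.

```r
# Made-up per-couple scores for a single item.
p_G    <- c(0.0004, 0.024, 0.005, 0.37)
log_OR <- c(1.2, 0.8, -1.0, 0.1)

# Assumption: an illustrative cutoff on the p-value scale.
p_cutoff <- 0.05

# Keep only the log odds ratios of the significant couples.
sig_lor <- log_OR[p_G < p_cutoff]

lor_min <- min(sig_lor)
lor_max <- max(sig_lor)
lor_sd  <- sd(sig_lor)

print(c(lor_min = lor_min, lor_max = lor_max, lor_sd = lor_sd))
```

A wide gap between lor_min and lor_max, or a large lor_sd, would suggest that the item's association varies considerably from couple to couple.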
a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
slma_ex <- slma(a_corp, b_corp)