View source: R/nauf_trms_fr_x_z.R
The nauf_contrasts function returns a list of contrasts applied to
factors in an object created using a function in the nauf package.
See 'Details'.
nauf_contrasts(object, inc_ordered = FALSE)
object |
A |
inc_ordered |
A logical indicating whether or not ordered factor
contrasts should also be returned (default FALSE). |
In the nauf package, NA values are used to encode when an
unordered factor is truly not applicable. This is different from
"not available" or "missing at random". The concept applies only to
unordered factors, and indicates that the factor is simply not meaningful
for an observation, or that while the observation may technically be
definable by one of the factor levels, the interpretation of its belonging to
that level isn't the same as for other observations. For imbalanced
observational data, coding unordered factors as NA may also be used to
control for a factor that is only contrastive within a subset of the data due
to the sampling scheme. To understand the output of the nauf_contrasts
function, the treatment of unordered factor contrasts in the nauf
package will first be discussed, using the plosives dataset included in
the package as an example.
In the plosives dataset, the factor ling is coded as either Monolingual,
indicating the observation is from a monolingual speaker of Spanish, or
Bilingual, indicating the observation is from a Spanish-Quechua bilingual
speaker. The dialect factor indicates the city the speaker is from (one of
Cuzco, Lima, or Valladolid). The Cuzco dialect has both monolingual and
bilingual speakers, but the Lima and Valladolid dialects have only monolingual
speakers. In the case of Valladolid, the dialect is not in contact with
Quechua, and so being monolingual in Valladolid does not mean the same
thing as it does in Cuzco, where it indicates
monolingual as opposed to bilingual. Lima has Spanish-Quechua
bilingual speakers, but the research questions the dataset serves to answer
are specific to monolingual speakers of Spanish in Lima. If we leave the
ling factor coded as is in the dataset and use named_contr_sum to create
the contrasts, we obtain the following:
dialect | ling | dialectCuzco | dialectLima | lingBilingual |
Cuzco | Bilingual | 1 | 0 | 1 |
Cuzco | Monolingual | 1 | 0 | -1 |
Lima | Monolingual | 0 | 1 | -1 |
Valladolid | Monolingual | -1 | -1 | -1 |
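The table above can be reproduced approximately with base R alone. The following is a minimal sketch, not the nauf implementation: it assumes that contr.sum yields the same deviations as named_contr_sum with the default scale, and the data frame d and the column renaming are introduced here purely for illustration.

```r
# One row per observed dialect/ling combination (illustrative data only).
d <- data.frame(
  dialect = factor(c("Cuzco", "Cuzco", "Lima", "Valladolid")),
  ling = factor(c("Bilingual", "Monolingual", "Monolingual", "Monolingual"))
)

# Plain sum contrasts from base R (assumed stand-in for named_contr_sum).
contrasts(d$dialect) <- contr.sum(3)
contrasts(d$ling) <- contr.sum(2)

mm <- model.matrix(~ dialect + ling, d)
# contr.sum does not name its columns, so label them as in the table.
colnames(mm) <- c("(Intercept)", "dialectCuzco", "dialectLima", "lingBilingual")
mm
```

Every Lima and Valladolid row carries -1 in the lingBilingual column, which is why the ling coefficient leaks into the predicted means for those dialects.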
With these contrasts, the regression coefficient dialectLima
would not
represent the difference between the intercept and the mean of the Lima
dialect; the mean of the Lima dialect would be the
(Intercept) + dialectLima - lingBilingual
. The interpretation of the
lingBilingual
coefficient is similarly affected, and the intercept
term averages over the predicted value for the non-existent groups of Lima
bilingual speakers and Valladolid bilingual speakers, losing the
interpretation as the corrected mean (insofar as there can be a corrected
mean in this type of imbalanced data). With the nauf
package, we can
instead code non-Cuzco speakers' observations as NA
for the
ling
factor (i.e. execute
plosives$ling[plosives$dialect != "Cuzco"] <- NA
). These NA
values are allowed to pass into the regression's model matrix, and are then
set to 0
, effectively creating the following contrasts:
dialect | ling | dialectCuzco | dialectLima | lingBilingual |
Cuzco | Bilingual | 1 | 0 | 1 |
Cuzco | Monolingual | 1 | 0 | -1 |
Lima | NA | 0 | 1 | 0 |
Valladolid | NA | -1 | -1 | 0 |
Because sum contrasts are used, a value of 0
for a dummy variable
averages over the effect of the factor, and the coefficient
lingBilingual
only affects the predicted value for observations where
dialect = Cuzco
. In a regression fit with these contrasts, the
coefficient dialectLima
represents what it should, namely the
difference between the intercept and the mean of the Lima dialect, and the
intercept is again the corrected mean. The lingBilingual
coefficient
is now the difference between Cuzco bilingual speakers and the corrected mean
of the Cuzco dialect, which is (Intercept) + dialectCuzco
.
These nauf
contrasts thus allow us to model all of the data in a
single model without sacrificing the interpretability of the results. In
sociolinguistics, this method is called slashing due to the use of a
forward slash in GoldVarb to indicate that a factor is not applicable.
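The NA-to-zero coding described above can be imitated in base R. This is a hedged sketch rather than the package's own code: it assumes contr.sum as a stand-in for named_contr_sum, passes NA values through with na.pass, and then zeroes them by hand. The names d, mf, and mm_na are illustrative.

```r
# ling is set to NA outside Cuzco, where it is not contrastive.
d <- data.frame(
  dialect = factor(c("Cuzco", "Cuzco", "Lima", "Valladolid")),
  ling = factor(c("Bilingual", "Monolingual", NA, NA))
)
contrasts(d$dialect) <- contr.sum(3)
contrasts(d$ling) <- contr.sum(2)

# Keep the NA rows in the model frame instead of dropping them.
mf <- model.frame(~ dialect + ling, d, na.action = na.pass)
mm_na <- model.matrix(~ dialect + ling, mf)
colnames(mm_na) <- c("(Intercept)", "dialectCuzco", "dialectLima", "lingBilingual")

# An NA factor value produces NA in its contrast columns; set those to 0.
mm_na[is.na(mm_na)] <- 0
mm_na
```

With the zeros in place, the lingBilingual column contributes nothing to the Lima and Valladolid rows, so dialectLima recovers its usual sum-contrast interpretation.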
This same methodology can be applied to other parts of the
plosives
dataset where a factor's interpretation is the same
for all observations, but is only contrastive within a subset of the data due
to the sampling scheme. The age
and ed
factors (speaker age
group and education level, respectively) are factors which can apply to
speakers regardless of their dialect, but in the dataset they are only
contrastive within the Cuzco dialect; all the Lima and Valladolid speakers
are 40 years old or younger with a university education (in the case of
Valladolid, the data come from an already-existing corpus; and in the case of
Lima, the data were collected as part of the same dataset as the Cuzco data,
but as a smaller control group). These factors can be treated just as the
ling
factor by setting them to NA
for observations from Lima
and Valladolid speakers. Similarly, there is no read speech data for the
Valladolid speakers, and so spont
could be coded as NA
for
observations from Valladolid speakers.
Using NA
values can also allow the inclusion of a random effects
structure which only applies to a subset of the data. The
plosives
dataset has data from both read (spont = FALSE
;
only Cuzco and Lima) and spontaneous (spont = TRUE
; all three
dialects) speech. For the read speech, there are exactly repeated measures
on 54 items, as indicated by the item
factor. For the
spontaneous speech, there are not exactly repeated measures, and so in this
subset, item
is coded as NA
. In a regression fit using
nauf_lmer
, nauf_glmer
, or nauf_glmer.nb
with item
as a grouping factor, the random effects model matrix is created for the read
speech just as it normally is, and for spontaneous speech observations all of
the columns are set to 0
so that the item
effects only affect
the fitted values for read speech observations. In this way, the noise
introduced by the read speech items can be accounted for while still
including all of the data in one model, and the same random effects for
speaker
can apply to all observations (both read and spontaneous),
which will lead to a more accurate estimation of the fixed, speaker, and item
effects since more information is available than if the read and spontaneous
speech were analyzed in separate models.
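The effect of an NA grouping factor on the random effects model matrix can be pictured with a small indicator matrix. The sketch below is only an illustration of the idea in base R; it does not show how nauf_lmer builds its matrices internally, and the item vector is invented for the example.

```r
# Three read speech observations (items i01, i02) and two spontaneous
# speech observations, for which item is NA.
item <- factor(c("i01", "i02", NA, "i01", NA))

# One indicator column per item level; comparisons with NA give NA.
Z <- outer(as.character(item), levels(item), "==")
colnames(Z) <- levels(item)

# Rows with item = NA get 0 in every column, so no item effect applies.
Z[is.na(Z)] <- FALSE
Z <- Z * 1
Z
```

Rows 3 and 5 are all zeros, so the item intercepts only affect the fitted values of the read speech rows.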
There are two situations in which unordered factors will need more than one set
of contrasts: (1) when an unordered factor with NA
values interacts
with another unordered factor, and some levels are collinear with NA
;
and (2) when an unordered factor is included as a slope for a random effects
grouping factor that has NA
values, but only a subset of the levels
for the slope factor occur when the grouping factor is not NA
. As an
example of an interaction requiring new contrasts, consider the interaction
dialect * spont
(that is, suppose we are interested in whether the
effect of spont
is different for Cuzco and Lima). We code
spont
as NA
when dialect = Valladolid
, as mentioned
above. This gives the following contrasts for the main effects:
dialect | spont | dialectCuzco | dialectLima | spontTRUE |
Cuzco | TRUE | 1 | 0 | 1 |
Cuzco | FALSE | 1 | 0 | -1 |
Lima | TRUE | 0 | 1 | 1 |
Lima | FALSE | 0 | 1 | -1 |
Valladolid | NA | -1 | -1 | 0 |
If we simply multiply these dialect and spont main effect
contrasts together to obtain the contrasts for the interaction (which is what
is done in the default model.matrix method), we get the
following contrasts:
dialect | spont | dialectCuzco:spontTRUE | dialectLima:spontTRUE |
Cuzco | TRUE | 1 | 0 |
Cuzco | FALSE | -1 | 0 |
Lima | TRUE | 0 | 1 |
Lima | FALSE | 0 | -1 |
Valladolid | NA | 0 | 0 |
However, these contrasts introduce an unnecessary parameter to the model,
which causes collinearity with the main effects since
spontTRUE = dialectCuzco:spontTRUE + dialectLima:spontTRUE
in all cases. The functions in the nauf package automatically recognize when
this occurs, and create a second set of contrasts for dialect in which
the Valladolid level is treated as if it were NA (through an
additional call to named_contr_sum):
dialect | dialect.c2.Cuzco |
Cuzco | 1 |
Lima | -1 |
Valladolid | 0 |
This second set of dialect contrasts is only used when it needs to be.
That is, in this case, these contrasts would be used in the creation of the
model matrix columns for the interaction term dialect:spont,
but not in the creation of the model matrix columns for the main effect terms
dialect and spont. When the second set of contrasts is
used, .c2. will appear between the name of the factor and the level so
it can be easily identified:
dialect | spont | dialectCuzco | dialectLima | spontTRUE | dialect.c2.Cuzco:spontTRUE |
Cuzco | TRUE | 1 | 0 | 1 | 1 |
Cuzco | FALSE | 1 | 0 | -1 | -1 |
Lima | TRUE | 0 | 1 | 1 | -1 |
Lima | FALSE | 0 | 1 | -1 | 1 |
Valladolid | NA | -1 | -1 | 0 | 0 |
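The collinearity argument can be checked numerically. The following sketch types in the contrast columns from the tables above by hand (it does not call any nauf function) and compares the rank of the design matrix built with the naive interaction columns against the one built with the .c2. contrast. The object names are illustrative.

```r
# Rows: Cuzco/TRUE, Cuzco/FALSE, Lima/TRUE, Lima/FALSE, Valladolid/NA.
dialectCuzco <- c(1, 1, 0, 0, -1)
dialectLima  <- c(0, 0, 1, 1, -1)
spontTRUE    <- c(1, -1, 1, -1, 0)
dialect_c2   <- c(1, 1, -1, -1, 0)  # second contrast set for dialect

# Naive interaction: products of the main effect contrasts.
X_naive <- cbind(1, dialectCuzco, dialectLima, spontTRUE,
                 dialectCuzco * spontTRUE, dialectLima * spontTRUE)

# Interaction built from the .c2. contrast instead.
X_c2 <- cbind(1, dialectCuzco, dialectLima, spontTRUE,
              dialect_c2 * spontTRUE)

c(columns = ncol(X_naive), rank = qr(X_naive)$rank)  # rank deficient
c(columns = ncol(X_c2), rank = qr(X_c2)$rank)        # full column rank
```

The naive matrix has six columns but rank five, while the matrix using the .c2. contrast has five columns and rank five, which is the redundancy the second contrast set removes.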
Turning now to an example of when a random slope requires new contrasts,
consider a random item
slope for dialect
. Because
dialect = Valladolid
only when item
is NA
, using the
main effect contrasts for dialect
for the item
slope would
result in collinearity with the item
intercept in the random effects
model matrix:
dialect | item | i01:(Intercept) | i01:dialectCuzco | i01:dialectLima |
Cuzco | i01 | 1 | 1 | 0 |
Cuzco | i02 | 0 | 0 | 0 |
Cuzco | NA | 0 | 0 | 0 |
Lima | i01 | 1 | 0 | 1 |
Lima | i02 | 0 | 0 | 0 |
Lima | NA | 0 | 0 | 0 |
Valladolid | NA | 0 | 0 | 0 |
This table shows the random effects model matrix for item i01 for all
possible scenarios, with the rows corresponding to (in order): a Cuzco
speaker producing the read speech plosive in item i01, a Cuzco speaker
producing a read speech plosive in another item, a Cuzco speaker
producing a spontaneous speech plosive, a Lima speaker producing the read
speech plosive in item i01, a Lima speaker producing a read speech
plosive in another item, a Lima speaker producing a spontaneous speech
plosive, and a Valladolid speaker producing a spontaneous speech plosive.
With the main effect contrasts for dialect,
i01:(Intercept) = i01:dialectCuzco + i01:dialectLima
in all cases, causing collinearity. Because this collinearity exists for all
read speech item random effects model matrices, the model is unidentifiable.
The functions in the nauf package automatically detect that this is the
case, and remedy the situation by creating a new set of contrasts used for
the item slope for dialect:
dialect | item | i01:(Intercept) | i01:dialect.c2.Cuzco |
Cuzco | i01 | 1 | 1 |
Cuzco | i02 | 0 | 0 |
Cuzco | NA | 0 | 0 |
Lima | i01 | 1 | -1 |
Lima | i02 | 0 | 0 |
Lima | NA | 0 | 0 |
Valladolid | NA | 0 | 0 |
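The same kind of rank check applies to the random effects columns for a single item. This sketch copies the i01 columns from the two tables above into plain matrices; it is an illustration only and does not reproduce how the package detects the problem.

```r
# Rows follow the tables: Cuzco i01, Cuzco i02, Cuzco NA,
# Lima i01, Lima i02, Lima NA, Valladolid NA.
intercept_i01 <- c(1, 0, 0, 1, 0, 0, 0)
cuzco_i01     <- c(1, 0, 0, 0, 0, 0, 0)
lima_i01      <- c(0, 0, 0, 1, 0, 0, 0)
c2_cuzco_i01  <- c(1, 0, 0, -1, 0, 0, 0)

# Main effect contrasts for the slope: three columns, one redundant.
Z_main <- cbind(intercept_i01, cuzco_i01, lima_i01)
# Second contrast set for the slope: two columns, none redundant.
Z_c2 <- cbind(intercept_i01, c2_cuzco_i01)

c(columns = ncol(Z_main), rank = qr(Z_main)$rank)
c(columns = ncol(Z_c2), rank = qr(Z_c2)$rank)
```

Because Valladolid never occurs with a non-NA item, only two dialects are distinguishable within an item, so one slope column is all that can be estimated alongside the intercept.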
If we were to, say, fit the model
intdiff ~ dialect * spont + (1 + dialect | item), then nauf would
additionally recognize that the same set of altered contrasts for
dialect is required in the fixed effects interaction term
dialect:spont and the item slope for dialect, and both
would be labeled with .c2. In other (rare) cases, more than two sets
of contrasts may be required for a factor, in which case they would be
labeled .c3., .c4., and so on.
In this way, users only need to code unordered factors as NA
in the
subsets of the data where they are not contrastive, and nauf
handles
the rest. Having described in detail what nauf
contrasts are, we now
return to the nauf_contrasts
function. The function can be used on
any nauf model, a nauf.terms object, or a
model frame made by nauf_model.frame. It returns a named list
with a matrix for each unordered factor in object, containing all
contrasts associated with the factor. For the model
intdiff ~ dialect * spont + (1 + dialect | item),
the result would be a list with elements dialect and spont that
contain the following matrices (see the 'Examples' section for code to
generate this list):
dialect | Cuzco | Lima | .c2.Cuzco |
Cuzco | 1 | 0 | 1 |
Lima | 0 | 1 | -1 |
Valladolid | -1 | -1 | 0 |
spont | TRUE |
TRUE | 1 |
FALSE | -1 |
NA | 0 |
The default is for the list of contrasts to only contain information about
unordered factors. If inc_ordered = TRUE
, then the contrast matrices
for any ordered factors in object
are also included.
A named list of contrasts for all unordered factors in object
,
and also optionally contrasts for ordered factors in object
. See
'Details'.
The argument ncs_scale
changes what value is used for
the sum contrast deviations. The default value of 1
would give the
contrast matrices in 'Details'. A value of ncs_scale = 0.5
, for example,
would result in replacing 1
with 0.5
and -1
with
-0.5
in all of the contrast matrices.
nauf_model.frame, nauf_model.matrix, nauf_glFormula, nauf_glm, and
nauf_glmer.
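library(nauf)

dat <- plosives
# spont is not contrastive for Valladolid (spontaneous speech only)
dat$spont[dat$dialect == "Valladolid"] <- NA

# contrasts with the default sum contrast deviations of 1
mf <- nauf_model.frame(intdiff ~ dialect * spont + (1 + dialect | item), dat)
nauf_contrasts(mf)

# contrasts with the sum contrast deviations scaled to 0.5
mf <- nauf_model.frame(intdiff ~ dialect * spont + (1 + dialect | item),
  dat, ncs_scale = 0.5)
nauf_contrasts(mf)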