nauf_contrasts: Not applicable unordered factor contrasts.
In CDEager/nauf: Regression with NA Values in Unordered Factors


The nauf_contrasts function returns a list of contrasts applied to factors in an object created using a function in the nauf package. See 'Details'.

nauf_contrasts(object, inc_ordered = FALSE)

`object`	A `nauf.terms` object, a model frame made with `nauf_model.frame`, a `nauf.glm` model (see `nauf_glm`), or a `nauf.lmerMod` or `nauf.glmerMod` model.
`inc_ordered`	A logical indicating whether or not ordered factor contrasts should also be returned (default `FALSE`).

In the nauf package, NA values are used to encode when an unordered factor is truly not applicable. This is different from "not available" or "missing at random". The concept applies only to unordered factors, and indicates either that the factor is simply not meaningful for an observation, or that, while the observation may technically be definable by one of the factor levels, the interpretation of its belonging to that level isn't the same as for other observations. For imbalanced observational data, coding unordered factors as NA may also be used to control for a factor that is only contrastive within a subset of the data due to the sampling scheme. To understand the output of the nauf_contrasts function, we first discuss the treatment of unordered factor contrasts in the nauf package, using the plosives dataset included in the package as an example.

In the plosives dataset, the factor ling is coded as either Monolingual, indicating the observation is from a monolingual speaker of Spanish, or Bilingual, indicating the observation is from a Spanish-Quechua bilingual speaker. The dialect factor indicates the city the speaker is from (one of Cuzco, Lima, or Valladolid). The Cuzco dialect has both monolingual and bilingual speakers, but the Lima and Valladolid dialects have only monolingual speakers. In the case of Valladolid, the dialect is not in contact with Quechua, and so being monolingual in Valladolid does not mean the same thing as it does in Cuzco, where it indicates monolingual as opposed to bilingual. Lima has Spanish-Quechua bilingual speakers, but the research questions the dataset serves to answer are specific to monolingual speakers of Spanish in Lima. If we leave the ling factor coded as is in the dataset and use named_contr_sum to create the contrasts, we obtain the following:

dialect	ling	dialectCuzco	dialectLima	lingBilingual
Cuzco	Bilingual	1	0	1
Cuzco	Monolingual	1	0	-1
Lima	Monolingual	0	1	-1
Valladolid	Monolingual	-1	-1	-1

With these contrasts, the regression coefficient dialectLima would not represent the difference between the intercept and the mean of the Lima dialect; the mean of the Lima dialect would be the (Intercept) + dialectLima - lingBilingual. The interpretation of the lingBilingual coefficient is similarly affected, and the intercept term averages over the predicted value for the non-existent groups of Lima bilingual speakers and Valladolid bilingual speakers, losing the interpretation as the corrected mean (insofar as there can be a corrected mean in this type of imbalanced data). With the nauf package, we can instead code non-Cuzco speakers' observations as NA for the ling factor (i.e. execute plosives$ling[plosives$dialect != "Cuzco"] <- NA). These NA values are allowed to pass into the regression's model matrix, and are then set to 0, effectively creating the following contrasts:

dialect	ling	dialectCuzco	dialectLima	lingBilingual
Cuzco	Bilingual	1	0	1
Cuzco	Monolingual	1	0	-1
Lima	NA	0	1	0
Valladolid	NA	-1	-1	0
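
The NA-to-zero step can be sketched in base R without the nauf package. This is only an illustration of the idea, not nauf's implementation: the helpers `sum_contr` and `code_factor` are hypothetical names, and the four-row data frame stands in for the cells of the table above.

```r
# One row per dialect/ling cell; ling is NA outside Cuzco.
cells <- data.frame(
  dialect = factor(c("Cuzco", "Cuzco", "Lima", "Valladolid")),
  ling    = factor(c("Bilingual", "Monolingual", NA, NA))
)

# Hypothetical helper: base R sum contrasts with level-named columns.
sum_contr <- function(lvs) {
  cm <- contr.sum(lvs)
  colnames(cm) <- lvs[-length(lvs)]
  cm
}

# Hypothetical helper: look up each observation's contrast row by its
# integer code. An NA code yields a row of NA, which is then set to 0.
code_factor <- function(f, name) {
  cm <- sum_contr(levels(f))
  x <- cm[as.integer(f), , drop = FALSE]
  x[is.na(x)] <- 0
  dimnames(x) <- list(NULL, paste0(name, colnames(cm)))
  x
}

X <- cbind(code_factor(cells$dialect, "dialect"),
           code_factor(cells$ling, "ling"))
print(cbind(cells, X))
```

The resulting columns dialectCuzco, dialectLima and lingBilingual should match the table above row for row.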

Because sum contrasts are used, a value of 0 for a dummy variable averages over the effect of the factor, and the coefficient lingBilingual only affects the predicted value for observations where dialect = Cuzco. In a regression fit with these contrasts, the coefficient dialectLima represents what it should, namely the difference between the intercept and the mean of the Lima dialect, and the intercept is again the corrected mean. The lingBilingual coefficient is now the difference between Cuzco bilingual speakers and the corrected mean of the Cuzco dialect, which is (Intercept) + dialectCuzco. These nauf contrasts thus allow us to model all of the data in a single model without sacrificing the interpretability of the results. In sociolinguistics, this method is called slashing due to the use of a forward slash in GoldVarb to indicate that a factor is not applicable.
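
These interpretation claims can be checked with a little linear algebra. With four cells and four coefficients, each coding is saturated, so the coefficients are determined exactly by the cell means. The sketch below assumes made-up cell means chosen only for illustration (they are not values from the plosives dataset), and the object names are hypothetical.

```r
cols <- c("(Intercept)", "dialectCuzco", "dialectLima", "lingBilingual")
rows <- c("Cuzco.Bilingual", "Cuzco.Monolingual", "Lima", "Valladolid")

# Coding with ling left as Monolingual outside Cuzco.
X_plain <- matrix(c(1,  1,  0,  1,
                    1,  1,  0, -1,
                    1,  0,  1, -1,
                    1, -1, -1, -1),
                  nrow = 4, byrow = TRUE, dimnames = list(rows, cols))

# Coding with ling set to NA (and hence 0) outside Cuzco.
X_nauf <- X_plain
X_nauf[c("Lima", "Valladolid"), "lingBilingual"] <- 0

# Hypothetical cell means, for illustration only.
mu <- c(10, 20, 30, 40)

# Solve for the coefficients that reproduce the cell means exactly.
b_plain <- setNames(solve(X_plain, mu), cols)
b_nauf  <- setNames(solve(X_nauf, mu), cols)

# Corrected mean: average of the three dialect means, with the Cuzco mean
# taken as the average of its two ling groups.
corrected_mean <- mean(c(mean(mu[1:2]), mu[3], mu[4]))

print(round(rbind(plain = b_plain, nauf = b_nauf), 3))
print(corrected_mean)
```

With these numbers, the intercept under the nauf coding equals the corrected mean (85/3) and (Intercept) + dialectLima recovers the Lima mean of 30. Under the plain coding the intercept is 25, and the Lima mean is only recovered as (Intercept) + dialectLima - lingBilingual.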

This same methodology can be applied to other parts of the plosives dataset where a factor's interpretation is the same for all observations, but is only contrastive within a subset of the data due to the sampling scheme. The age and ed factors (speaker age group and education level, respectively) are factors which can apply to speakers regardless of their dialect, but in the dataset they are only contrastive within the Cuzco dialect; all the Lima and Valladolid speakers are 40 years old or younger with a university education (in the case of Valladolid, the data come from an already-existing corpus; and in the case of Lima, the data were collected as part of the same dataset as the Cuzco data, but as a smaller control group). These factors can be treated just as the ling factor by setting them to NA for observations from Lima and Valladolid speakers. Similarly, there is no read speech data for the Valladolid speakers, and so spont could be coded as NA for observations from Valladolid speakers.

Using NA values can also allow the inclusion of a random effects structure which only applies to a subset of the data. The plosives dataset has data from both read (spont = FALSE; only Cuzco and Lima) and spontaneous (spont = TRUE; all three dialects) speech. For the read speech, there are exactly repeated measures on 54 items, as indicated by the item factor. For the spontaneous speech, there are not exactly repeated measures, and so in this subset, item is coded as NA. In a regression fit using nauf_lmer, nauf_glmer, or nauf_glmer.nb with item as a grouping factor, the random effects model matrix is created for the read speech just as it normally is, and for spontaneous speech observations all of the columns are set to 0 so that the item effects only affect the fitted values for read speech observations. In this way, the noise introduced by the read speech items can be accounted for while still including all of the data in one model, and the same random effects for speaker can apply to all observations (both read and spontaneous), which will lead to a more accurate estimation of the fixed, speaker, and item effects since more information is available than if the read and spontaneous speech were analyzed in separate models.
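
The effect of zeroing the random effects columns can be sketched for a random item intercept in base R. This is an illustration under simplifying assumptions, not the code used by nauf_lmer: the five observations and the item effects `u` are made up, and only two items are shown.

```r
# Hypothetical observations: three read-speech tokens with items and two
# spontaneous tokens where item is NA.
item <- factor(c("i01", "i02", "i01", NA, NA))

# Indicator (random intercept) matrix: one column per item level.
Z <- outer(as.integer(item), seq_along(levels(item)), "==") * 1
colnames(Z) <- levels(item)

# Rows for NA items come out as NA; setting them to 0 removes any item
# contribution from those observations.
Z[is.na(Z)] <- 0

# Made-up item effects, for illustration only.
u <- c(i01 = 0.5, i02 = -0.5)

# Item contribution to each observation's fitted value.
item_effect <- drop(Z %*% u)
print(item_effect)
```

The last two entries of `item_effect` are 0, so the spontaneous speech observations are unaffected by the item effects while still sharing every other term in the model.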

There are two situations in which unordered factors will need more than one set of contrasts: (1) when an unordered factor with NA values interacts with another unordered factor, and some levels are collinear with NA; and (2) when an unordered factor is included as a slope for a random effects grouping factor that has NA values, but only a subset of the levels for the slope factor occur when the grouping factor is not NA. As an example of an interaction requiring new contrasts, consider the interaction dialect * spont (that is, suppose we are interested in whether the effect of spont is different for Cuzco and Lima). We code spont as NA when dialect = Valladolid, as mentioned above. This gives the following contrasts for the main effects:

dialect	spont	dialectCuzco	dialectLima	spontTRUE
Cuzco	TRUE	1	0	1
Cuzco	FALSE	1	0	-1
Lima	TRUE	0	1	1
Lima	FALSE	0	1	-1
Valladolid	NA	-1	-1	0

If we simply multiply these dialect and spont main effect contrasts together to obtain the contrasts for the interaction (which is what is done in the default model.matrix method), we get the following contrasts:

dialect	spont	dialectCuzco:spontTRUE	dialectLima:spontTRUE
Cuzco	TRUE	1	0
Cuzco	FALSE	-1	0
Lima	TRUE	0	1
Lima	FALSE	0	-1
Valladolid	NA	0	0

However, these contrasts introduce an unnecessary parameter to the model which causes collinearity with the main effects, since spontTRUE = dialectCuzco:spontTRUE + dialectLima:spontTRUE in all cases. The functions in the nauf package automatically recognize when this occurs, and create a second set of contrasts for dialect in which the Valladolid level is treated as if it were NA (through an additional call to named_contr_sum):

dialect	dialect.c2.Cuzco
Cuzco	1
Lima	-1
Valladolid	0
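
One way to picture how such a second set could be built by hand is shown below. It is a sketch that assumes the result amounts to sum contrasts over the remaining levels plus a zero row for the dropped level; the package does this internally through named_contr_sum, and that call is not reproduced here. The object names are hypothetical.

```r
lvs <- c("Cuzco", "Lima", "Valladolid")
dropped <- "Valladolid"            # level treated as if it were NA

kept <- setdiff(lvs, dropped)
cm_kept <- contr.sum(kept)         # sum contrasts over Cuzco and Lima only

# Start from all zeros so the dropped level gets a zero row, then fill in
# the rows of the levels that remain contrastive.
c2 <- matrix(0, nrow = length(lvs), ncol = ncol(cm_kept),
             dimnames = list(lvs, paste0(".c2.", kept[-length(kept)])))
c2[kept, ] <- cm_kept
print(c2)
```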

This second set of dialect contrasts is only used when it needs to be. That is, in this case, these contrasts would be used in the creation of the model matrix columns for the interaction term dialect:spont, but not in the creation of the model matrix columns for the main effect terms dialect and spont. When the second set of contrasts is used, .c2. will appear between the name of the factor and the level so it can be easily identified:

dialect	spont	dialectCuzco	dialectLima	spontTRUE	dialect.c2.Cuzco:spontTRUE
Cuzco	TRUE	1	0	1	1
Cuzco	FALSE	1	0	-1	-1
Lima	TRUE	0	1	1	-1
Lima	FALSE	0	1	-1	1
Valladolid	NA	-1	-1	0	0
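
The difference between the two ways of building the interaction can be verified with a rank check in base R. The columns below are typed in from the tables above; the object names are hypothetical and nothing here calls the nauf package.

```r
# Main-effect columns for the five dialect/spont cells in the tables above.
main <- cbind(
  "(Intercept)" = 1,
  dialectCuzco  = c(1, 1, 0, 0, -1),
  dialectLima   = c(0, 0, 1, 1, -1),
  spontTRUE     = c(1, -1, 1, -1, 0)
)
dialect_c2 <- c(1, 1, -1, -1, 0)   # second set of dialect contrasts

# Naive interaction: products of the main-effect columns.
naive <- cbind(
  main,
  "dialectCuzco:spontTRUE" = main[, "dialectCuzco"] * main[, "spontTRUE"],
  "dialectLima:spontTRUE"  = main[, "dialectLima"] * main[, "spontTRUE"]
)

# Interaction built from the second set of dialect contrasts instead.
fixed <- cbind(
  main,
  "dialect.c2.Cuzco:spontTRUE" = dialect_c2 * main[, "spontTRUE"]
)

print(c(naive_cols = ncol(naive), naive_rank = qr(naive)$rank,
        fixed_cols = ncol(fixed), fixed_rank = qr(fixed)$rank))
```

The naive matrix has six columns but rank five, which is the collinearity described above, while the matrix using the .c2. column has five columns and full column rank.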

Turning now to an example of when a random slope requires new contrasts, consider a random item slope for dialect. Because dialect = Valladolid only when item is NA, using the main effect contrasts for dialect for the item slope would result in collinearity with the item intercept in the random effects model matrix:

dialect	item	i01:(Intercept)	i01:dialectCuzco	i01:dialectLima
Cuzco	i01	1	1	0
Cuzco	i02	0	0	0
Cuzco	NA	0	0	0
Lima	i01	1	0	1
Lima	i02	0	0	0
Lima	NA	0	0	0
Valladolid	NA	0	0	0

This table shows the random effects model matrix for item i01 for all possible scenarios, with the rows corresponding to (in order): a Cuzco speaker producing the read speech plosive in item i01, a Cuzco speaker producing a read speech plosive in another item, a Cuzco speaker producing a spontaneous speech plosive, a Lima speaker producing the read speech plosive in item i01, a Lima speaker producing a read speech plosive in another item, a Lima speaker producing a spontaneous speech plosive, and a Valladolid speaker producing a spontaneous speech plosive. With the main effect contrasts for dialect, i01:(Intercept) = i01:dialectCuzco + i01:dialectLima in all cases, causing collinearity. Because this collinearity exists for all read speech item random effects model matrices, the model is unidentifiable. The functions in the nauf package automatically detect that this is the case, and remedy the situation by creating a new set of contrasts used for the item slope for dialect:

dialect	item	i01:(Intercept)	i01:dialect.c2.Cuzco
Cuzco	i01	1	1
Cuzco	i02	0	0
Cuzco	NA	0	0
Lima	i01	1	-1
Lima	i02	0	0
Lima	NA	0	0
Valladolid	NA	0	0
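
The same kind of rank check applies to the random effects columns for item i01. The sketch below rebuilds the two tables above in base R for the seven scenarios; the object names are hypothetical, and this is not how lme4 or nauf construct the random effects model matrix internally.

```r
# The seven scenarios from the tables above.
dialect <- c("Cuzco", "Cuzco", "Cuzco", "Lima", "Lima", "Lima", "Valladolid")
item    <- c("i01", "i02", NA, "i01", "i02", NA, NA)

# Indicator for item i01; NA items contribute 0.
is_i01 <- as.numeric(!is.na(item) & item == "i01")

# Per-observation dialect codes: main-effect contrasts and the second set.
main_cuzco <- unname(c(Cuzco = 1, Lima = 0,  Valladolid = -1)[dialect])
main_lima  <- unname(c(Cuzco = 0, Lima = 1,  Valladolid = -1)[dialect])
c2_cuzco   <- unname(c(Cuzco = 1, Lima = -1, Valladolid = 0)[dialect])

Z_main <- cbind("i01:(Intercept)"  = is_i01,
                "i01:dialectCuzco" = is_i01 * main_cuzco,
                "i01:dialectLima"  = is_i01 * main_lima)
Z_c2   <- cbind("i01:(Intercept)"      = is_i01,
                "i01:dialect.c2.Cuzco" = is_i01 * c2_cuzco)

print(c(main_cols = ncol(Z_main), main_rank = qr(Z_main)$rank,
        c2_cols   = ncol(Z_c2),   c2_rank   = qr(Z_c2)$rank))
```

With the main effect contrasts there are three columns but only rank two; with the second set of contrasts the two columns are linearly independent.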

If we were to, say, fit the model intdiff ~ dialect * spont + (1 + dialect | item), then nauf would additionally recognize that the same set of altered contrasts for dialect is required in both the fixed effects interaction term dialect:spont and the item slope for dialect, and both would be labeled with .c2.. In other (rare) cases, more than two sets of contrasts may be required for a factor, in which case they would be labeled .c3., .c4., and so on.

In this way, users only need to code unordered factors as NA in the subsets of the data where they are not contrastive, and nauf handles the rest. Having described in detail what nauf contrasts are, we now return to the nauf_contrasts function. The function can be used on any nauf model object, a nauf.terms object, or a model frame made by nauf_model.frame. It returns a named list with a matrix for each unordered factor in object which contains all contrasts associated with the factor. For the model intdiff ~ dialect * spont + (1 + dialect | item), the result would be a list with elements dialect and spont that contain the following matrices (see the 'Examples' section for code to generate this list):

dialect	Cuzco	Lima	.c2.Cuzco
Cuzco	1	0	1
Lima	0	1	-1
Valladolid	-1	-1	0

spont	TRUE
TRUE	1
FALSE	-1
NA	0

The default is for the list of contrasts to only contain information about unordered factors. If inc_ordered = TRUE, then the contrast matrices for any ordered factors in object are also included.

A named list of contrasts for all unordered factors in object, and also optionally contrasts for ordered factors in object. See 'Details'.

The argument ncs_scale changes what value is used for the sum contrast deviations. The default value of 1 would give the contrast matrices in 'Details'. A value of ncs_scale = 0.5, for example, would result in replacing 1 with 0.5 and -1 with -0.5 in all of the contrast matrices.
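
The rescaling described here can be pictured with the matrices from 'Details' entered by hand. This sketch only multiplies those matrices by 0.5 in base R; it does not call nauf, and `contr_list` and `scaled` are hypothetical names.

```r
# Contrast matrices from 'Details', entered by hand.
contr_list <- list(
  dialect = matrix(c( 1,  0,  1,
                      0,  1, -1,
                     -1, -1,  0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("Cuzco", "Lima", "Valladolid"),
                                   c("Cuzco", "Lima", ".c2.Cuzco"))),
  spont = matrix(c(1, -1, 0), ncol = 1,
                 dimnames = list(c("TRUE", "FALSE", "NA"), "TRUE"))
)

# Rescale the deviations from 1 to 0.5, as described for ncs_scale = 0.5.
# Zero entries (including the NA row for spont) are unchanged.
scaled <- lapply(contr_list, function(cm) 0.5 * cm)
print(scaled)
```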

nauf_model.frame, nauf_model.matrix, nauf_glFormula, nauf_glm, and nauf_glmer.

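library(nauf)

# code spont as not applicable for Valladolid, which has no read speech
dat <- plosives
dat$spont[dat$dialect == "Valladolid"] <- NA

# contrasts with the default sum contrast deviation (ncs_scale = 1)
mf <- nauf_model.frame(intdiff ~ dialect * spont + (1 + dialect | item), dat)
nauf_contrasts(mf)

# contrasts with deviations of 0.5 and -0.5 rather than 1 and -1
mf <- nauf_model.frame(intdiff ~ dialect * spont + (1 + dialect | item),
  dat, ncs_scale = 0.5)
nauf_contrasts(mf)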