Using the nauf package"

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

The nauf package

It is often the case that a factor only makes sense in a subset of a dataset (i.e. for some observations a factor may simply not be meaningful or contrastive), or that with observational datasets there are no observations in some levels of an interaction term. There are also cases where a random effects grouping factor is only applicable in a subset the data, and it is desireable to model the noise introduced by the repeated measures on the group members within the subset of the data where the repeated measures exist. The nauf package allows unordered factors and random effects grouping factors to be coded as NA in the subsets of the data where they are not applicable or otherwise not contrastive. Sum contrasts are used for all unordered factors (using named_contr_sum in the standardize package), and then NA values are set to 0. This allows all of the data to be modeled together without creating collinearity or making the output difficult to interpret. It is highly recommended that regression variables be put on the same scale with the standardize function in the standardize package prior to using nauf functions (though this is not required for the functions to work). The "Using the standardize package" vignette also provides useful information on the difference between unordered and ordered factors.

NA values in random effects grouping factors

The plosives dataset (Eager, 2017) included in the nauf package contains measures of plosive strength for instances of intervocalic Spanish /p/, /t/, /k/, /b/, /d/ and /g/ from speakers from three dialects. For this first example we will focus on the voiceless plosives /ptk/ from the 30 speakers in the Cuzco dialect. The dataset contains a combination of experimental data (with exactly repeated measures on items from a read speech task) and observational data (without exactly repeated measures; from interviews with the same speakers). For the spontaneous speech (as indicated by the logical variable spont) the item factor is coded as NA.

library(nauf)

summary(plosives)

dat <- droplevels(subset(plosives, dialect == "Cuzco" & voicing == "Voiceless"))

xtabs(~ is.na(item) + spont, dat)

For our example, we want to model the voiceless duration of the plosives as a function of place of articulation, stress, and whether or not the speech was spontaneous. When modeling the read speech data (spont = FALSE), we want to account for noise introduced by the repeated measures on item, but we can't apply this random effects structure to the interview data (spont = TRUE). In addition to this, we want to account for the noise introduced by the individual speakers. Rather than leaving out the item effects to analyze all the data together, or keeping the item effects and analyzing the read and spontaneous speech separately, we can model both subsets together and have the item effects apply only within the read speech data using nauf. We just need to have item coded as NA when it is not applicable, which it already is. Then the nauf_lmer function takes care of the rest.

sobj <- standardize(vdur ~ spont + place + stress +
  (1 + spont + place + stress | speaker) + (1 | item), dat)

mod <- nauf_lmer(sobj$formula, sobj$data)

summary(mod)

This way, we are making use of all of the information in the dataset. We can obtain a more principled statistical test for the spont factor, and also get better information about the other fixed effects and the individual speakers, since the same random effects for speaker apply in both the read speech and spontaneous subsets. We can obtain Type III tests using anova (here with method = "S" to indicate Satterthwaite approximation for the denominator degrees of freedom in the F tests).

anova(mod, method = "S")

We see that stress is significant, and can then get predicted marginal means (often called least-squares means) and pairwise comparisons for its levels using nauf_ref.grid to create a reference grid, and then calling nauf_pmmeans.

rg <- nauf_ref.grid(mod)
nauf_pmmeans(rg, "stress", pairwise = TRUE)

NA values in the fixed effects

The fricatives dataset (Hualde and Prieto, 2014) included in the nauf package contains measures of duration and voicing for intervocalic alveolar fricatives in Spanish and Catalan. Spanish has only one such fricative, /s/ (underlyingly voiceless), which can occur in any position in the word (initial, medial, or final). In Catalan, the situation is much more complicated. At the beginning of a word and in the middle of a word, both /s/ (underlyingly voiceless) and /z/ (underlyingly voiced) can occur (though word-initial /z/ is rare, and does not occur in the dataset). In word-final position, there is no contrast between /s/ and /z/ (labeled /S/, underlyingly neutralized), with the voicing of the fricative determined by the following sound. Because all of the fricatives in the dataset are intervocalic, /S/ ought to be like /z/ (according to traditional descriptions of Catalan), but may be shorter and possibly less voiced. That is, in the dataset, we have the following set of unique values.

summary(fricatives)

dat <- fricatives

u <- unique(dat[, c("lang", "wordpos", "uvoi")])
u <- u[order(u$lang, u$wordpos, u$uvoi), ]
rownames(u) <- NULL
u

This raises a problem for the regression analysis, since underlying voicing uvoi is only contrastive within a subset of the data (specifically, when lang = "Catalan" and wordpos = "Medial"), and everywhere else is uniquely determined by lang and wordpos. Ideally, we would like to be able to have uvoi apply only where it is contrastive, and to be able to include random slopes for uvoi for the Catalan speakers, but not the Spanish speakers. Using traditional methods won't help us here. If we were to take the full interaction of the three factors as a new factor with 7 levels, we can't include slopes. If we run separate regressions on different subsets of the data, then we will have to limit the comparisons we can make, and won't be making use of all of the information the data is providing us. Note also that this issue has nothing to do with the way the data were collected; the nature of the languages and the research questions creates these imbalances. With nauf, we can solve this problem by coding uvoi as NA in the subsets of the data where it is not contrastive. That is we can code the unique possible combinations as:

u$uvoi[!(u$lang == "Catalan" & u$wordpos == "Medial")] <- NA
u

dat$uvoi[!(dat$lang == "Catalan" & dat$wordpos == "Medial")] <- NA

In this way, we are recognizing that word-final Catalan fricatives are limited to one value for uvoi, as are all Spanish fricatives, etc. The meaning of NA in this table is not always the same, so we have to keep track of it, and we will definitely want to include an interaction between lang and wordpos since, for example, "Catalan:Final:NA" means neutralized /S/ while "Spanish:Final:NA" means voiceless /s/. As for being able to include uvoi slopes for Catalan speakers and not Spanish speakers, we can create two new random effects grouping factors, one for each language, where the factor is the speaker identifier or NA based on the language the observation comes from:

dat$ca_speaker <- dat$sp_speaker <- dat$speaker
dat$ca_speaker[dat$lang != "Catalan"] <- NA
dat$sp_speaker[dat$lang != "Spanish"] <- NA

With these NA values, we can then fit a model with nauf_lmer to predict the duration of the fricatives based on lang, wordpos, and uvoi:

sobj <- standardize(dur ~ lang * wordpos + uvoi +
  (1 + wordpos + uvoi | ca_speaker) + (1 + wordpos | sp_speaker), dat)

mod <- nauf_lmer(sobj$formula, sobj$data)

summary(mod)

To understand how nauf treats the NA values in uvoi, we call nauf_contrasts:

nauf_contrasts(mod)

The function returns a list with an element for each of the three factors. Sum contrasts have been applied for all three with named_contr_sum from the standardize package, but for uvoi there is an additional row indicating that a value of 0 is used when it is NA. In this way, the uvoiVoiced coefficient in the summary never contributes to the fitted value for any observation that belongs to a subset where underlying voicing is not contrastive, and its estimate represents half the difference between underlyingly voiced and voiceless Catalan word-medial observations. We can run an anova just as we did in the Cuzco example (though in this case it is not as useful since we are interested in making specific comparisons), and also create a reference grid:

anova(mod, method = "S")

rg <- nauf_ref.grid(mod)

We can now use the nauf_pmmeans function's subset, by, and na_as_level arguments to test hypotheses in the subset of the reference grid where comparisons are valid. For example, if we want to test whether Spanish fricatives are longer than Catalan fricatives, the only just comparison we can make is for word-initial and word-medial /s/. To do this, we provide nauf_pmmeans with a list of valid groups, with each group specified as a named list of unordered factor levels:

nauf_pmmeans(rg, "lang", pairwise = TRUE,
  subset = list(
    list(lang = "Catalan", wordpos = "Initial", uvoi = NA),
    list(lang = "Catalan", wordpos = "Medial", uvoi = "Voiceless"),
    list(lang = "Spanish", wordpos = c("Initial", "Medial"), uvoi = NA)
  )
)

Each of the lists specified in the subset list represents a set of valid combinations of factor levels where language is truly contrastive. Any row in the reference grid which belongs to at least one of the specified groups is kept and the others are ignored. So the Catalan estimate is an average of Catalan:Initial:NA and Catalan:Medial:Voiceless, and the Spanish estimate is an average of Spanish:Initial:NA and Spanish:Medial:NA. To test the effect of word position in the two languages separately, we can make use of the by argument, and specify the groups for each language where word position is truly contrastive:

nauf_pmmeans(rg, c("lang", "wordpos"), pairwise = TRUE, by = "lang",
  subset = list(
    list(lang = "Catalan", wordpos = "Initial", uvoi = NA),
    list(lang = "Catalan", wordpos = "Medial", uvoi = "Voiceless"),
    list(lang = "Spanish", uvoi = NA)
  )
)

In this way we obtain 5 predicted marginal means, and pairwise comparisons within-language. Underlying voicing is only contrastive for Catalan medial fricatives, and so we can call:

nauf_pmmeans(rg, "uvoi", pairwise = TRUE,
  subset = list(lang = "Catalan", wordpos = "Medial")
)

In this call, we are only specifying one group in subset, so we don't need to double-list it (i.e. nauf will understnd list(lang = "Catalan", wordpos = "Medial") as list(list(lang = "Catalan", wordpos = "Medial"))). For Catalan word-final neutralized /S/, we may be interested in making a comparison to see whether it is somewhere in between /s/ and /z/ in terms of duration, or not significantly different from /z/. This case requires the additional argument na_as_level to be specified since we want an estimate for a group where uvoi is NA (the default is that estiamtes are not generated for any group that has NA's in the estimate table):

nauf_pmmeans(rg, c("wordpos", "uvoi"), pairwise = TRUE, na_as_level = "uvoi",
  subset = list(
    list(lang = "Catalan", wordpos = "Medial", uvoi = c("Voiced", "Voiceless")),
    list(lang = "Catalan", wordpos = "Final", uvoi = NA)
  )
)

As a final example, we could test if Spanish word-final /s/ (being shorter than Spanish word-medial /s/ above) is as short as Catalan word-final /S/:

nauf_pmmeans(rg, "lang", pairwise = TRUE,
  subset = list(wordpos = "Final", uvoi = NA)
)

In this way, we are able to use all of the information in the dataset when fitting the model, account for the uncertainty introduced by repeated measures on the subjects, assigning different random effects to the two langauges, and then test any variety of hypotheses we might have about the effects of the factors within the subsets of the data where the comparisons make sense.

Multiple sets of contrasts for a factor

There are two situations in which unordered factors will need more than one set of contrasts: (1) when an unordered factor with NA values interacts with another unordered factor, and some levels are collinear with NA; and (2) when an unordered factor is included as a slope for a random effects grouping factor that has NA values, but only a subset of the levels for the slope factor occur when the grouping factor is applicable. Both of these situations occur when we consider all three dialects in the plosives dataset jointly (here we will look at the voiceless subset):

dat <- subset(plosives, voicing == "Voiceless")

xtabs(~ dialect + spont, dat)

The data for Cuzco and Lima consist of two subsets: read speech and spontaneous speech (with the read speech task being the same for both dialects). The Valladolid data, however, come from a different corpus consisting only of spontaneous speech, with the measurements taken in the same way as for Cuzco and Lima. While it would of course be ideal to have read speech for the Valladolid speakers as well, this doesn't mean that we need to split up the data and run multiple regressions. We can simply code spont as NA for Valladolid, split the speaker random effects by dialect, and only include a spont slope for Cuzco and Lima:

dat$spont[dat$dialect == "Valladolid"] <- NA
dat$c_speaker <- dat$l_speaker <- dat$v_speaker <- dat$speaker
dat$c_speaker[dat$dialect != "Cuzco"] <- NA
dat$l_speaker[dat$dialect != "Lima"] <- NA
dat$v_speaker[dat$dialect != "Valladolid"] <- NA

sobj <- standardize(cdur ~ spont * dialect +
  (1 + spont | c_speaker) + (1 + spont | l_speaker) + (1 | v_speaker) +
  (1 + dialect | item),
  dat)

mod <- nauf_lmer(sobj$formula, sobj$data)

summary(mod)

In this way, speaker effects are accounted for in each dialect, and within the spont = FALSE subset, item effects are additionally accounted for. However, note that for the interaction term spont:dialect, .c2. appears before Cuzco, and there is only one coefficient for the interaction term rather than two as we would normally expect. The same goes for the item slope for dialect. This is because using the main effect contrasts for dialect in the spont:dialect interaction term and item dialect slope would lead to collinearity issues (as explained in detail in the help page for the nauf_contrasts function). The nauf package automatically recognizes when this happens, and creates additional sets of contrasts which it uses only when it needs to. To see how the contrasts are coded, we can call nauf_contrasts:

nauf_contrasts(mod)

Regression functions

There are nauf regression functions for fixed effects regressions which would normally be fit with lm, glm, and glm.nb (from the stats and MASS packages), mixed effects regressions which would normally be fit with lmer, glmer, and glmer.nb (from the lme4 package), and Bayesian versions of these six regression functions which would normally be fit with stan_lm, stan_glm, stan_glm.nb, stan_lmer, stan_glmer, and stan_glmer.nb (from the rstanarm package). In each case, the nauf regression function has the same name but preceded by nauf_ (e.g. nauf_lm, nauf_glmer, nauf_stan_glmer.nb, etc.). For Bayesian regressions, there are methods for the generic functions in the rstantools package, as well as functions nauf_kfold and nauf_launch_shinystan, and nauf_ref.grid and nauf_pmmeans work in the same way as illustrated above, but returning summaries based on computing posterior marginal means at each iteration of the model.

Conclusion

The nauf package allows unordered factors and random effects grouping factors to be used even when they are only applicable within a subset of a dataset. The user only needs to code them as NA when they are not applicable/contrastive, and nauf regression fitting functions take care of the rest. Different random effects structures for the same grouping variable can be fit in different subsets by creating new factors from the grouping factor, and setting them to NA in the appropriate subsets. The nauf_pmmeans function can then be used to test hypotheses conditioning on subsets of the data where a just comparison can be made.

References

If you use the nauf package in a publication, please cite:

Eager, Christopher D. (2017). nauf: Regression with NA Values in Unordered Factors. R package version 1.1.0. https://CRAN.R-project.org/package=nauf

If you analyze the plosives dataset in a publication, please cite:

Eager, Christopher D. (2017). Contrast preservation and constraints on individual phonetic variation. Doctoral thesis. University of Illinois at Urbana-Champaign.

If you analyze the fricatives dataset in a publication, please cite:

Hualde, J. I., & Prieto, P. (2014). Lenition of intervocalic alveolar fricatives in Catalan and Spanish. Phonetica, 71(2), 109-127.


Try the nauf package in your browser

Any scripts or data that you put into this service are public.

nauf documentation built on June 20, 2017, 9:05 a.m.