knitr::opts_chunk$set(fig.path="doby-hojsgaard-fig-", tidy=FALSE,
                      cache=!TRUE,
                      fig.width=5, fig.height=2.5, size="small", out.extra='keepaspectratio')
options("show.signif.stars"=FALSE)
options("digits"=4)
suppressPackageStartupMessages({
library(ggplot2)
library(doBy)
##library(Matrix)
library(lme4)
library(magrittr)
library(tibble)
library(broom)
library(multcomp)
})

Introduction

The \CRANpkg{doBy} package [@doby] grew out of a need to calculate groupwise summary statistics (much in the spirit of \code{PROC SUMMARY} of the SAS system, [@procsummary]). The package first appeared on CRAN, \url{https://cran-r-project.org}, in 2006. The name \pkg{doBy} comes from the need to do some computations on data which is stratified By the value of some variables. Today the package contains many additional utilities. In this paper we focus 1) on the "doing by"-functions and 2) on functions related to linear estimates and contrasts (in particular LS-means).

Related functionality

When it comes to data handling, \pkg{doBy} is nowhere nearly as powerful as more contemporary packages, such as those in the \CRANpkg{tidyverse} eco system, [@tidyverse]. The \code{aggregate} function in base R provides functionality similar to \pkg{doBy}s \code{summaryBy} function. Another package to be mentioned in this connection is \CRANpkg{data.table}, @data.table. On the other hand, \pkg{doBy} is based on classical data structures that are unlikely to undergo sudden changes. There is one exception to this, though: The data handling functions work on tibble`s, from \CRANpkg{tibble} @tibble. In relation to linear estimates, the \CRANpkg{multcomp} package [@multcomp] deserves mention, and the \CRANpkg{lsmeans} package [@lsmeans] provides facilities for computing LS-means.

It can be hypothesized that the data handling functions in \pkg{doBy} remain appealing to a group of users because of their simplicity.

A working dataset - the CO2 data

The CO2 data frame comes from an experiment on the cold tolerance of the grass species Echinochloa crus-galli. Type is a factor with levels Quebec or Mississippi giving the origin of the plant. Treatment is a factor levels nonchilled or chilled. Data is balanced with respect to these two factors. However, illustrated certain points we exclude a few rows of data to make data imbalanced. To limit the amount of output we modify names and levels of variables as follows

data(CO2)
CO2 <- within(CO2, {
    Treat = Treatment; Treatment = NULL
    levels(Treat) = c("nchil", "chil"); levels(Type) = c("Que", "Mis")
})
CO2 <- subset(CO2, Plant %in% c("Qn1", "Qc1", "Mn1", "Mc1"))
CO2 <- CO2[-(1:3),]
xtabs(~Treat+Type, data=CO2)
head(CO2, 4)
toothInt <- summaryBy(uptake ~ Treat + Type, data=CO2)
ggplot(CO2, aes(x = Treat, y = uptake, colour = Type)) +
    geom_boxplot(outlier.shape = 4) +
    geom_point(data = toothInt, aes(y = uptake.mean)) +
    geom_line(data = toothInt, aes(y = uptake.mean, group = Type))

Functions related to groupwise computations

The \code{summaryBy} function

The \code{summaryBy} function is used for calculating quantities like the mean and variance of numerical variables x and y for each combination of two factors A and B. Notice: A functionality similar to \code{summaryBy} is provided by \code{aggregate} from base R, but \code{summaryBy} offers additional features.

myfun1 <- function(x){c(m=mean(x), s=sd(x))}
summaryBy(cbind(conc, uptake, lu=log(uptake)) ~ Plant, data=CO2, FUN=myfun1)

The convention is that variables that do not appear in the dataframe (e.g. \code{log(uptake)}) must be named (here as \code{lu}). Various shortcuts are available, e.g. the following, where left hand side dot refers to all numeric variables while the right hand side dot refers to all factor variables. Writing 1 on the right hand side leads to computing over the entire dataset:

summaryBy(. ~ ., data=CO2, FUN=myfun1)
summaryBy(. ~ 1, data=CO2, FUN=myfun1)

Specifications as formulas and lists

We shall refer to all the functions for groupwise computations etc. as the By-functions. The convention for the By-functions is that a two sided formula like can be written in two ways:

cbind(x, y) ~ A + B
list(c("x", "y"), c("A", "B"))

Some By-functions only take a right hand sided formula as input. Such a formula can also be written in two ways:

~ A + B
c("A", "B")

The list-form / vector-form is especially useful if a function is invoked programatically. Hence the calls to \code{summaryBy} above can also be made as

summaryBy(list(c("conc", "uptake", "lu=log(uptake)"), "Plant"), data=CO2, FUN=myfun1)
summaryBy(list(".", "."), data=CO2, FUN=myfun1)
summaryBy(list(".", "1"), data=CO2, FUN=myfun1)

Using the pipe operator

The \code{summaryBy} function has a counterpart called \code{summary_by}. The difference is that a formula is the first argument to the former function while a dataframe (or a tibble) is the first argument to the latter. The same applies to the other By-functions. This allows for elegant use of the pipe operator %>% from \CRANpkg{magrittr}, [@magrittr]:

CO2 %>% summary_by(cbind(conc, uptake) ~ Plant + Type, FUN=myfun1) -> newdat
newdat

The \code{orderBy} function

Ordering (or sorting) a data frame is possible with the orderBy function. Suppose we want to order the rows of the the CO2 data by increasing values of conc and decreasing value of uptake (within conc):

x1 <- orderBy(~ conc - uptake, data=CO2)
head(x1)

The \code{splitBy} function

Suppose we want to split CO2 into a list of dataframes:

x1 <- splitBy(~ Plant + Type, data=CO2)
x1

The result is a list (with a few additional attributes):

lapply(x1, head, 2)

The \code{subsetBy} function

Suppose we want to select those rows within each treatment for which the uptake is larger than 75% quantile of uptake (within the treatment). This is achieved by:

x2 <- subsetBy(~ Treat, subset=uptake > quantile(uptake, prob=0.75), data=CO2)
head(x2, 4)

The \code{transformBy} function

The \code{transformBy} function is analogous to the \code{transform} function except that it works within groups. For example:

x3 <- transformBy(~ Treat, data=CO2, 
                 minU=min(uptake), maxU=max(uptake), range=diff(range(uptake)))
head(x3, 4)

The \code{lmBy} function

The \code{lmBy} function allows for fitting linear models to different strata of data (the vertical bar is used for defining groupings of data):

m <- lmBy(uptake ~ conc | Treat, data=CO2)
coef(m)

The result is a list with a few additional attributes and the list can be processed further as e.g.

lapply(m, function(z) coef(summary(z)))

Functions related linear estimates and contrasts

A linear function of a $p$--dimensional parameter vector $\beta$ has the form \begin{displaymath} C=L\beta \end{displaymath} where $L$ is a $q\times p$ matrix which we call the Linear Estimate Matrix or simply LE-matrix. The corresponding linear estimate is $\hat C = L \hat \beta$. A linear hypothesis has the form $H_0: L\beta=m$ for some $q$ dimensional vector $m$.

In the following we describe what is essentially simple ways of generating such $L$-matrices.

Computing linear estimates

First we focus on an additive model.

co2.add <- lm(uptake ~ Treat + Type, data=CO2)

Consider computing the estimated uptake for each treatment for plants originating from Mississippi: One option: Construct the LE--matrix $L$ directly and then compute $L\hat\beta$:

L <- matrix(c(1, 0, 1, 
              1, 1, 1), nrow=2, byrow=T); L
beta <- coef(co2.add); beta
L %*% beta

However, this approach does not produce standard errors, confidence intervals etc. Once $L$ has been constructed, such quantities can be constructed using \code{linest} (short for linear estimate) and the older but very similar \code{esticon} function (short for estimate contrast)

c1 <- linest(co2.add, L)
coef(c1)
confint(c1)
c1 <- esticon(co2.add, L)
c1

Another option is to invoke \code{glht} (short for general linear hypothesis) from the \CRANpkg{multcomp} package:

library(multcomp)
mc <- glht(co2.add, linfct=L)
summary(mc)

In \pkg{doBy} there are facilities for computing $L$ automatically and for supplying $L\hat\beta$ with standard errors etc.

L <- LE_matrix(co2.add, effect = "Treat", at=list(Type="Mis")); L

Least-squares means (LS--means)

A related question is: What is the estimated uptake for each treatment if we ignore the type (i.e. origin of the plants)? One option would to fit a linear model without Type as explanatory variable:

co2.0 <- update(co2.add, . ~ . - Type)
L0 <- LE_matrix(co2.0, effect="Treat"); L0
linest(co2.0, L=L0)

An alternative would be to keep the focus on the original model but compute the estimated uptake for each treatment for an average location. That would correspond to giving weight $1/2$ to each of the two locations. However, as one of the parameters is already set to zero to obtain identifiability, we obtain the LE--matrix $L$ as

L1 <- matrix(c(1, 0, 0.5, 
               1, 1, 0.5), nrow=2, byrow=T); L1
linest(co2.add, L=L1)

Such a particular linear estimate is sometimes called a least-squares mean, an LS-mean, a marginal mean or a population mean. If data had been balanced, the LS-mean would be identical to the result obtained after fitting a model without Type.

The virtue of the \CRANpkg{doBy} package is in this connection that $L$ can be generated automatically with:

L1 <- LE_matrix(co2.add, effect="Treat"); L1

Notice: One may obtain the LS-mean directly as:

LSmeans(co2.add, effect="Treat")
## same as
linest(co2.add, L=LE_matrix(co2.add, effect="Treat"))

For a model with interactions, the LS-means are computed as above, but the $L$-matrix is different:

co2.int <- lm(uptake ~ Treat * Type, data=CO2)
LE_matrix(co2.int, effect="Treat")

Using (transformed) covariates

From the examples above it should follows that the key aspect of computing LS-means is the (automatic) generation of the $L$ matrix. Therefore we shall in the following focus on form of the $L$ matrix rather than on the computed LS-means. Covariates are fixed at their average value (unless the at=...-argument is used, see below). For example, conc is fixed at the average value:

co2.lm1 <- lm(uptake ~ conc + Type + Treat, data=CO2)
lsm1 <- LSmeans(co2.lm1, effect="Treat")
lsm1$L
lsm1a <- LSmeans(co2.lm1, effect="Treat", at=list(conc=700))
lsm1a$L

A special issue arises in connection with transformed covariates. Consider:

co2.lm2 <- lm(uptake ~ conc + I(conc^2) + log(conc) + Type + Treat, data=CO2)
lsm2 <- LSmeans(co2.lm2, effect="Treat")
lsm2$L

Above \verb'I(conc^2)' is the the square of the average of conc (which is r mean(CO2$conc)^2) - not the average of the squared values of conc (which is r mean(CO2$conc^2)). Likewise log(conc) is the log of the average of conc (which is r log(mean(CO2$conc))) - not the average of the log of conc (which is r mean(log(CO2$conc))). To make computations based on the average value of the square of conc and the average of the log of conc do

co2.lm3 <- lm(uptake ~ conc + conc2 + log.conc + Type + Treat, 
              data=transform(CO2, conc2=conc^2, log.conc=log(conc)))
lsm3 <- LSmeans(co2.lm3, effect="Treat")
lsm3$L

Thus, if we want to evaluate the LS--means at conc=700 then we can do:

lsm4 <- LSmeans(co2.lm3, effect="Treat", at=list(conc=700, conc2=700^2, log.conc=log(700)))
lsm4$L

Alternative models

The functions esticon, linest, LSmeans etc. are available for a range of model classes. We illustrate a few below: We may decide to treat \verb|Type| as a random effect (here with only two levels). This leads to a linear mixed effects model as implemented in \CRANpkg{lme4}, [@lme4]:

library(lme4)
co2.mix <- lmer(uptake ~ Treat + (1|Type), data=CO2)
LSmeans(co2.mix, effect="Treat")

Notice here that the parameter estimates themselves are similar to those of a linear model (had data been completely balanced, the estimates would have been identical). However, the standard errors of the the estimates are much larger under the mixed model. This is due to Type being treated as a random effect. Notice that the degrees of freedom by default are adjusted using a Kenward--Roger approximation (provided that \CRANpkg{pbkrtest} package [@pbkrtest] is installed). Adjustment of degrees of freedom is controlled with the adjust.df argument.

LS-means are also available in a generalized linear model setting as well as for for generalized estimating equations as implemented in the \CRANpkg{geepack} package, [@geepack]. In both cases the LS--means are on the scale of the linear predictor - not on the scale of the response.

Acknowledgements

Credit is due to Dennis Chabot, Gabor Grothendieck, Paul Murrell, Jim Robison-Cox and Erik Jørgensen for reporting various bugs and making various suggestions to the functionality in the \pkg{doBy} package.

\bibliography{doby-hojsgaard}



hojsgaard/doBy documentation built on May 4, 2024, 5:20 a.m.