In doBy: Groupwise Statistics, LSmeans, Linear Estimates, Utilities

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options("digits"=3)
library(doBy)
library(boot)
#devtools::load_all()

The \doby{} package contains a variety of utility functions. This working document describes some of these functions. The package originally grew out of a need to calculate groupwise summary statistics (much in the spirit of \code{PROC SUMMARY} of the \proglang{SAS} system), but today the package contains many different utilities.

library(doBy)

\section{Data used for illustration} \label{sec:co2data}

The description of the \code{doBy} package is based on the \code{mtcars} dataset.

head(mtcars)
tail(mtcars)

Groupwise computations

summaryBy() and summary_by()

\label{sec:summaryBy}

The \summaryby{} function is used for calculating quantities like ``the mean and variance of numerical variables $x$ and $y$ for each combination of two factors $A$ and $B$''. Notice: A functionality similar to \summaryby\ is provided by \code{aggregate()} from base \R.

myfun1 <- function(x){
    c(m=mean(x), s=sd(x))
}
summaryBy(cbind(mpg, cyl, lh=log(hp)) ~ vs, 
          data=mtcars, FUN=myfun1)

A simpler call is

summaryBy(mpg ~ vs, data=mtcars, FUN=mean)

Instead of formula we may specify a list containing the left hand side and the right hand side of a formula\footnote{This is a feature of \summaryby\ and it does not work with \code{aggregate}.} but that is possible only for variables already in the dataframe:

summaryBy(list(c("mpg", "cyl"), "vs"), 
          data=mtcars, FUN=myfun1)

Inspired by the \pkg{dplyr} package, there is a \verb|summary_by| function which does the samme as \summaryby{} but with the data argument being the first so that one may write

mtcars |> summary_by(cbind(mpg, cyl, lh=log(hp)) ~ vs,
                      FUN=myfun1)

orderBy() and order_by()

Ordering (or sorting) a data frame is possible with the \code{orderBy} function. For example, we order the rows according to \code{gear} and \code{carb} (within \code{gear}):

x1 <- orderBy(~ gear + carb, data=mtcars)
head(x1, 4)
tail(x1, 4)

If we want the ordering to be by decreasing values of one of the variables, we can do

x2 <- orderBy(~ -gear + carb, data=mtcars)

Alternative forms are:

x3 <- orderBy(c("gear", "carb"), data=mtcars)
x4 <- orderBy(c("-gear", "carb"), data=mtcars)
x5 <- mtcars |> order_by(c("gear", "carb"))
x6 <- mtcars |> order_by(~ -gear + carb)

splitBy() and split_by()

Suppose we want to split the \code{airquality} data into a list of dataframes, e.g.\ one dataframe for each month. This can be achieved by:

x <- splitBy(~ Month, data=airquality)
x <- splitBy(~ vs, data=mtcars)
lapply(x, head, 4)
attributes(x)

Alternative forms are:

splitBy("vs", data=mtcars)
mtcars |> split_by(~ vs)

subsetBy() and subset_by()

Suppose we want to select those rows within each month for which the the wind speed is larger than the mean wind speed (within the month). This is achieved by:

x <- subsetBy(~am, subset=mpg > mean(mpg), data=mtcars)
head(x)

Note that the statement \code{Wind > mean(Wind)} is evaluated within each month.

Alternative forms are

x <- subsetBy("am", subset=mpg > mean(mpg), data=mtcars)
x <- mtcars  |> subset_by("vs", subset=mpg > mean(mpg))
x <- mtcars  |> subset_by(~vs, subset=mpg > mean(mpg))

transformBy() and transform_by()

The \code{transformBy} function is analogous to the \code{transform} function except that it works within groups. For example:

head(x)
x <- transformBy(~vs, data=mtcars, 
                 min.mpg=min(mpg), max.mpg=max(mpg))
head(x)

Alternative forms:

x <- transformBy("vs", data=mtcars, 
                 min.mpg=min(mpg), max.mpg=max(mpg))
x <- mtcars |> transform_by("vs",
                             min.mpg=min(mpg), max.mpg=max(mpg))

lapplyBy() and lapply_by()

This \code{lapplyBy} function is a wrapper for first splitting data into a list according to the formula (using splitBy) and then applying a function to each element of the list (using lapply).

lapplyBy(~vs, data=mtcars,
         FUN=function(d) lm(mpg~cyl, data=d)  |> summary()  |> coef())

Miscellaneous utilities

firstobs() and lastobs()

To obtain the indices of the first/last occurences of an item in a vector do:

x <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3)
firstobs(x)
lastobs(x)

The same can be done on variables in a data frame, e.g.

firstobs(~vs, data=mtcars)
lastobs(~vs, data=mtcars)

\subsection{The \code{which.maxn()} and \code{which.minn()} functions} \label{sec:whichmaxn}

The location of the $n$ largest / smallest entries in a numeric vector can be obtained with

x <- c(1:4, 0:5, 11, NA, NA)
which.maxn(x, 3)
which.minn(x, 5)

Subsequences - subSeq()

Find (sub) sequences in a vector:

x <- c(1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 1, 1)
subSeq(x)
subSeq(x, item=1)
subSeq(letters[x])
subSeq(letters[x], item="a")

Recoding values of a vector - recodeVar()

x <- c("dec", "jan", "feb", "mar", "apr", "may")
src1 <- list(c("dec", "jan", "feb"), c("mar", "apr", "may"))
tgt1 <- list("winter", "spring")
recodeVar(x, src=src1, tgt=tgt1)

Renaming columns of a dataframe or matrix - renameCol()

head(renameCol(mtcars, c("vs", "mpg"), c("vs_", "mpg_")))

Time since an event - timeSinceEvent()

Consider the vector

yvar <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)

Imagine that "1" indicates an event of some kind which takes place at a certain time point. By default time points are assumed equidistant but for illustration we define time time variable

tvar <- seq_along(yvar) + c(0.1, 0.2)

Now we find time since event as

tse <- timeSinceEvent(yvar, tvar)
tse

The output reads as follows: \begin{itemize} \item \verb"abs.tse": Absolute time since (nearest) event. \item \verb"sign.tse": Signed time since (nearest) event. \item \verb"ewin": Event window: Gives a symmetric window around each event. \item \verb"run": The value of \verb"run" is set to $1$ when the first event occurs and is increased by $1$ at each subsequent event. \item \verb"tae": Time after event. \item \verb"tbe": Time before event. \end{itemize}

plot(sign.tse ~ tvar, data=tse, type="b")
grid()
rug(tse$tvar[tse$yvar == 1], col="blue",lwd=4)
points(scale(tse$run), col=tse$run, lwd=2)
lines(abs.tse + .2 ~ tvar, data=tse, type="b",col=3)

plot(tae ~ tvar, data=tse, ylim=c(-6,6), type="b")
grid()
lines(tbe ~ tvar, data=tse, type="b", col="red")
rug(tse$tvar[tse$yvar==1], col="blue", lwd=4)
lines(run ~ tvar, data=tse, col="cyan", lwd=2)

plot(ewin ~ tvar, data=tse, ylim=c(1, 4))
rug(tse$tvar[tse$yvar==1], col="blue", lwd=4)
grid()
lines(run ~ tvar, data=tse, col="red")

We may now find times for which time since an event is at most 1 as

tse$tvar[tse$abs <= 1]

Example: Using subSeq() and timeSinceEvent()

Consider the \verb|lynx| data:

lynx <- as.numeric(lynx)
tvar <- 1821:1934
plot(tvar, lynx, type="l")

Suppose we want to estimate the cycle lengths. One way of doing this is as follows:

yyy <- lynx > mean(lynx)
head(yyy)
sss <- subSeq(yyy, TRUE)
sss

plot(tvar, lynx, type="l")
rug(tvar[sss$midpoint], col="blue", lwd=4)

Create the "event vector"

yvar <- rep(0, length(lynx))
yvar[sss$midpoint] <- 1
str(yvar)

tse <- timeSinceEvent(yvar,tvar)
head(tse, 20)

We get two different (not that different) estimates of period lengths:

len1 <- tapply(tse$ewin, tse$ewin, length)
len2 <- tapply(tse$run, tse$run, length)
c(median(len1), median(len2), mean(len1), mean(len2))

We can overlay the cycles as:

tse$lynx <- lynx
tse2 <- na.omit(tse)
plot(lynx ~ tae, data=tse2)

plot(tvar, lynx, type="l", lty=2)
mm <- lm(lynx ~ tae + I(tae^2) + I(tae^3), data=tse2)
lines(fitted(mm) ~ tvar, data=tse2, col="red")

\section{Acknowledgements} \label{discussion}

Credit is due to Dennis Chabot, Gabor Grothendieck, Paul Murrell and Jim Robison-Cox for reporting various bugs and making various suggestions to the functionality in the \doby{} package.

Any scripts or data that you put into this service are public.

doBy documentation built on Oct. 8, 2024, 1:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

doBy
Groupwise Statistics, LSmeans, Linear Estimates, Utilities

In doBy: Groupwise Statistics, LSmeans, Linear Estimates, Utilities

Groupwise computations

summaryBy() and summary_by()

orderBy() and order_by()

splitBy() and split_by()

subsetBy() and subset_by()

transformBy() and transform_by()

lapplyBy() and lapply_by()

Miscellaneous utilities

firstobs() and lastobs()

Subsequences - subSeq()

Recoding values of a vector - recodeVar()

Renaming columns of a dataframe or matrix - renameCol()

Time since an event - timeSinceEvent()

Example: Using subSeq() and timeSinceEvent()

Try the doBy package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

doBy Groupwise Statistics, LSmeans, Linear Estimates, Utilities

In doBy: Groupwise Statistics, LSmeans, Linear Estimates, Utilities

Groupwise computations

summaryBy() and summary_by()

orderBy() and order_by()

splitBy() and split_by()

subsetBy() and subset_by()

transformBy() and transform_by()

lapplyBy() and lapply_by()

Miscellaneous utilities

firstobs() and lastobs()

Subsequences - subSeq()

Recoding values of a vector - recodeVar()

Renaming columns of a dataframe or matrix - renameCol()

Time since an event - timeSinceEvent()

Example: Using subSeq() and timeSinceEvent()

Try the doBy package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

doBy
Groupwise Statistics, LSmeans, Linear Estimates, Utilities