knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options("digits"=3)
library(doBy)
library(boot)
#devtools::load_all()

The \doby{} package contains a variety of utility functions. This working document describes some of these functions. The package originally grew out of a need to calculate groupwise summary statistics (much in the spirit of \code{PROC SUMMARY} of the \proglang{SAS} system), but today the package contains many different utilities.

library(doBy)

\section{Data used for illustration} \label{sec:co2data}

The description of the \code{doBy} package is based on the \code{mtcars} dataset.

head(mtcars)
tail(mtcars)

Groupwise computations

summaryBy() and summary_by()

\label{sec:summaryBy}

The \summaryby{} function is used for calculating quantities like ``the mean and variance of numerical variables $x$ and $y$ for each combination of two factors $A$ and $B$''. Notice: A functionality similar to \summaryby\ is provided by \code{aggregate()} from base \R.

myfun1 <- function(x){
    c(m=mean(x), s=sd(x))
}
summaryBy(cbind(mpg, cyl, lh=log(hp)) ~ vs, 
          data=mtcars, FUN=myfun1)

A simpler call is

summaryBy(mpg ~ vs, data=mtcars, FUN=mean)

Instead of formula we may specify a list containing the left hand side and the right hand side of a formula\footnote{This is a feature of \summaryby\ and it does not work with \code{aggregate}.} but that is possible only for variables already in the dataframe:

summaryBy(list(c("mpg", "cyl"), "vs"), 
          data=mtcars, FUN=myfun1)

Inspired by the \pkg{dplyr} package, there is a \verb|summary_by| function which does the samme as \summaryby{} but with the data argument being the first so that one may write

mtcars |> summary_by(cbind(mpg, cyl, lh=log(hp)) ~ vs,
                      FUN=myfun1)

orderBy() and order_by()

Ordering (or sorting) a data frame is possible with the \code{orderBy} function. For example, we order the rows according to \code{gear} and \code{carb} (within \code{gear}):

x1 <- orderBy(~ gear + carb, data=mtcars)
head(x1, 4)
tail(x1, 4)

If we want the ordering to be by decreasing values of one of the variables, we can do

x2 <- orderBy(~ -gear + carb, data=mtcars)

Alternative forms are:

x3 <- orderBy(c("gear", "carb"), data=mtcars)
x4 <- orderBy(c("-gear", "carb"), data=mtcars)
x5 <- mtcars |> order_by(c("gear", "carb"))
x6 <- mtcars |> order_by(~ -gear + carb)

splitBy() and split_by()

Suppose we want to split the \code{airquality} data into a list of dataframes, e.g.\ one dataframe for each month. This can be achieved by:

x <- splitBy(~ Month, data=airquality)
x <- splitBy(~ vs, data=mtcars)
lapply(x, head, 4)
attributes(x)

Alternative forms are:

splitBy("vs", data=mtcars)
mtcars |> split_by(~ vs)

subsetBy() and subset_by()

Suppose we want to select those rows within each month for which the the wind speed is larger than the mean wind speed (within the month). This is achieved by:

x <- subsetBy(~am, subset=mpg > mean(mpg), data=mtcars)
head(x)

Note that the statement \code{Wind > mean(Wind)} is evaluated within each month.

Alternative forms are

x <- subsetBy("am", subset=mpg > mean(mpg), data=mtcars)
x <- mtcars  |> subset_by("vs", subset=mpg > mean(mpg))
x <- mtcars  |> subset_by(~vs, subset=mpg > mean(mpg))

transformBy() and transform_by()

The \code{transformBy} function is analogous to the \code{transform} function except that it works within groups. For example:

head(x)
x <- transformBy(~vs, data=mtcars, 
                 min.mpg=min(mpg), max.mpg=max(mpg))
head(x)

Alternative forms:

x <- transformBy("vs", data=mtcars, 
                 min.mpg=min(mpg), max.mpg=max(mpg))
x <- mtcars |> transform_by("vs",
                             min.mpg=min(mpg), max.mpg=max(mpg))

lapplyBy() and lapply_by()

This \code{lapplyBy} function is a wrapper for first splitting data into a list according to the formula (using splitBy) and then applying a function to each element of the list (using lapply).

lapplyBy(~vs, data=mtcars,
         FUN=function(d) lm(mpg~cyl, data=d)  |> summary()  |> coef())

Miscellaneous utilities

firstobs() and lastobs()

To obtain the indices of the first/last occurences of an item in a vector do:

x <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3)
firstobs(x)
lastobs(x)

The same can be done on variables in a data frame, e.g.

firstobs(~vs, data=mtcars)
lastobs(~vs, data=mtcars)

\subsection{The \code{which.maxn()} and \code{which.minn()} functions} \label{sec:whichmaxn}

The location of the $n$ largest / smallest entries in a numeric vector can be obtained with

x <- c(1:4, 0:5, 11, NA, NA)
which.maxn(x, 3)
which.minn(x, 5)

Subsequences - subSeq()

Find (sub) sequences in a vector:

x <- c(1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 1, 1)
subSeq(x)
subSeq(x, item=1)
subSeq(letters[x])
subSeq(letters[x], item="a")

Recoding values of a vector - recodeVar()

x <- c("dec", "jan", "feb", "mar", "apr", "may")
src1 <- list(c("dec", "jan", "feb"), c("mar", "apr", "may"))
tgt1 <- list("winter", "spring")
recodeVar(x, src=src1, tgt=tgt1)

Renaming columns of a dataframe or matrix - renameCol()

head(renameCol(mtcars, c("vs", "mpg"), c("vs_", "mpg_")))

Time since an event - timeSinceEvent()

Consider the vector

yvar <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)

Imagine that "1" indicates an event of some kind which takes place at a certain time point. By default time points are assumed equidistant but for illustration we define time time variable

tvar <- seq_along(yvar) + c(0.1, 0.2)

Now we find time since event as

tse <- timeSinceEvent(yvar, tvar)
tse

The output reads as follows: \begin{itemize} \item \verb"abs.tse": Absolute time since (nearest) event. \item \verb"sign.tse": Signed time since (nearest) event. \item \verb"ewin": Event window: Gives a symmetric window around each event. \item \verb"run": The value of \verb"run" is set to $1$ when the first event occurs and is increased by $1$ at each subsequent event. \item \verb"tae": Time after event. \item \verb"tbe": Time before event. \end{itemize}

plot(sign.tse ~ tvar, data=tse, type="b")
grid()
rug(tse$tvar[tse$yvar == 1], col="blue",lwd=4)
points(scale(tse$run), col=tse$run, lwd=2)
lines(abs.tse + .2 ~ tvar, data=tse, type="b",col=3)
plot(tae ~ tvar, data=tse, ylim=c(-6,6), type="b")
grid()
lines(tbe ~ tvar, data=tse, type="b", col="red")
rug(tse$tvar[tse$yvar==1], col="blue", lwd=4)
lines(run ~ tvar, data=tse, col="cyan", lwd=2)
plot(ewin ~ tvar, data=tse, ylim=c(1, 4))
rug(tse$tvar[tse$yvar==1], col="blue", lwd=4)
grid()
lines(run ~ tvar, data=tse, col="red")

We may now find times for which time since an event is at most 1 as

tse$tvar[tse$abs <= 1]

Example: Using subSeq() and timeSinceEvent()

Consider the \verb|lynx| data:

lynx <- as.numeric(lynx)
tvar <- 1821:1934
plot(tvar, lynx, type="l")

Suppose we want to estimate the cycle lengths. One way of doing this is as follows:

yyy <- lynx > mean(lynx)
head(yyy)
sss <- subSeq(yyy, TRUE)
sss
plot(tvar, lynx, type="l")
rug(tvar[sss$midpoint], col="blue", lwd=4)

Create the "event vector"

yvar <- rep(0, length(lynx))
yvar[sss$midpoint] <- 1
str(yvar)
tse <- timeSinceEvent(yvar,tvar)
head(tse, 20)

We get two different (not that different) estimates of period lengths:

len1 <- tapply(tse$ewin, tse$ewin, length)
len2 <- tapply(tse$run, tse$run, length)
c(median(len1), median(len2), mean(len1), mean(len2))

We can overlay the cycles as:

tse$lynx <- lynx
tse2 <- na.omit(tse)
plot(lynx ~ tae, data=tse2)
plot(tvar, lynx, type="l", lty=2)
mm <- lm(lynx ~ tae + I(tae^2) + I(tae^3), data=tse2)
lines(fitted(mm) ~ tvar, data=tse2, col="red")

\section{Acknowledgements} \label{discussion}

Credit is due to Dennis Chabot, Gabor Grothendieck, Paul Murrell and Jim Robison-Cox for reporting various bugs and making various suggestions to the functionality in the \doby{} package.



Try the doBy package in your browser

Any scripts or data that you put into this service are public.

doBy documentation built on Oct. 8, 2024, 1:06 a.m.