knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options("digits"=3) library(doBy) library(boot) #devtools::load_all()
The \doby{} package contains a variety of utility functions. This working document describes some of these functions. The package originally grew out of a need to calculate groupwise summary statistics (much in the spirit of \code{PROC SUMMARY} of the \proglang{SAS} system), but today the package contains many different utilities.
library(doBy)
\section{Data used for illustration} \label{sec:co2data}
The description of the \code{doBy} package is based on the \code{mtcars} dataset.
head(mtcars) tail(mtcars)
\label{sec:summaryBy}
The \summaryby{} function is used for calculating quantities like ``the mean and variance of numerical variables $x$ and $y$ for each combination of two factors $A$ and $B$''. Notice: A functionality similar to \summaryby\ is provided by \code{aggregate()} from base \R.
myfun1 <- function(x){ c(m=mean(x), s=sd(x)) } summaryBy(cbind(mpg, cyl, lh=log(hp)) ~ vs, data=mtcars, FUN=myfun1)
A simpler call is
summaryBy(mpg ~ vs, data=mtcars, FUN=mean)
Instead of formula we may specify a list containing the left hand side and the right hand side of a formula\footnote{This is a feature of \summaryby\ and it does not work with \code{aggregate}.} but that is possible only for variables already in the dataframe:
summaryBy(list(c("mpg", "cyl"), "vs"), data=mtcars, FUN=myfun1)
Inspired by the \pkg{dplyr} package, there is a \verb|summary_by| function which does the samme as \summaryby{} but with the data argument being the first so that one may write
mtcars |> summary_by(cbind(mpg, cyl, lh=log(hp)) ~ vs, FUN=myfun1)
Ordering (or sorting) a data frame is possible with the \code{orderBy} function. For example, we order the rows according to \code{gear} and \code{carb} (within \code{gear}):
x1 <- orderBy(~ gear + carb, data=mtcars) head(x1, 4) tail(x1, 4)
If we want the ordering to be by decreasing values of one of the variables, we can do
x2 <- orderBy(~ -gear + carb, data=mtcars)
Alternative forms are:
x3 <- orderBy(c("gear", "carb"), data=mtcars) x4 <- orderBy(c("-gear", "carb"), data=mtcars) x5 <- mtcars |> order_by(c("gear", "carb")) x6 <- mtcars |> order_by(~ -gear + carb)
Suppose we want to split the \code{airquality} data into a list of dataframes, e.g.\ one dataframe for each month. This can be achieved by:
x <- splitBy(~ Month, data=airquality) x <- splitBy(~ vs, data=mtcars) lapply(x, head, 4) attributes(x)
Alternative forms are:
splitBy("vs", data=mtcars) mtcars |> split_by(~ vs)
Suppose we want to select those rows within each month for which the the wind speed is larger than the mean wind speed (within the month). This is achieved by:
x <- subsetBy(~am, subset=mpg > mean(mpg), data=mtcars) head(x)
Note that the statement \code{Wind > mean(Wind)} is evaluated within each month.
Alternative forms are
x <- subsetBy("am", subset=mpg > mean(mpg), data=mtcars) x <- mtcars |> subset_by("vs", subset=mpg > mean(mpg)) x <- mtcars |> subset_by(~vs, subset=mpg > mean(mpg))
The \code{transformBy} function is analogous to the \code{transform} function except that it works within groups. For example:
head(x) x <- transformBy(~vs, data=mtcars, min.mpg=min(mpg), max.mpg=max(mpg)) head(x)
Alternative forms:
x <- transformBy("vs", data=mtcars, min.mpg=min(mpg), max.mpg=max(mpg)) x <- mtcars |> transform_by("vs", min.mpg=min(mpg), max.mpg=max(mpg))
This \code{lapplyBy} function is a wrapper for first splitting data into a list according to the formula (using splitBy) and then applying a function to each element of the list (using lapply).
lapplyBy(~vs, data=mtcars, FUN=function(d) lm(mpg~cyl, data=d) |> summary() |> coef())
To obtain the indices of the first/last occurences of an item in a vector do:
x <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3) firstobs(x) lastobs(x)
The same can be done on variables in a data frame, e.g.
firstobs(~vs, data=mtcars) lastobs(~vs, data=mtcars)
\subsection{The \code{which.maxn()} and \code{which.minn()} functions} \label{sec:whichmaxn}
The location of the $n$ largest / smallest entries in a numeric vector can be obtained with
x <- c(1:4, 0:5, 11, NA, NA) which.maxn(x, 3) which.minn(x, 5)
Find (sub) sequences in a vector:
x <- c(1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 1, 1) subSeq(x) subSeq(x, item=1) subSeq(letters[x]) subSeq(letters[x], item="a")
x <- c("dec", "jan", "feb", "mar", "apr", "may") src1 <- list(c("dec", "jan", "feb"), c("mar", "apr", "may")) tgt1 <- list("winter", "spring") recodeVar(x, src=src1, tgt=tgt1)
head(renameCol(mtcars, c("vs", "mpg"), c("vs_", "mpg_")))
Consider the vector
yvar <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)
Imagine that "1" indicates an event of some kind which takes place at a certain time point. By default time points are assumed equidistant but for illustration we define time time variable
tvar <- seq_along(yvar) + c(0.1, 0.2)
Now we find time since event as
tse <- timeSinceEvent(yvar, tvar) tse
The output reads as follows: \begin{itemize} \item \verb"abs.tse": Absolute time since (nearest) event. \item \verb"sign.tse": Signed time since (nearest) event. \item \verb"ewin": Event window: Gives a symmetric window around each event. \item \verb"run": The value of \verb"run" is set to $1$ when the first event occurs and is increased by $1$ at each subsequent event. \item \verb"tae": Time after event. \item \verb"tbe": Time before event. \end{itemize}
plot(sign.tse ~ tvar, data=tse, type="b") grid() rug(tse$tvar[tse$yvar == 1], col="blue",lwd=4) points(scale(tse$run), col=tse$run, lwd=2) lines(abs.tse + .2 ~ tvar, data=tse, type="b",col=3)
plot(tae ~ tvar, data=tse, ylim=c(-6,6), type="b") grid() lines(tbe ~ tvar, data=tse, type="b", col="red") rug(tse$tvar[tse$yvar==1], col="blue", lwd=4) lines(run ~ tvar, data=tse, col="cyan", lwd=2)
plot(ewin ~ tvar, data=tse, ylim=c(1, 4)) rug(tse$tvar[tse$yvar==1], col="blue", lwd=4) grid() lines(run ~ tvar, data=tse, col="red")
We may now find times for which time since an event is at most 1 as
tse$tvar[tse$abs <= 1]
Consider the \verb|lynx| data:
lynx <- as.numeric(lynx) tvar <- 1821:1934 plot(tvar, lynx, type="l")
Suppose we want to estimate the cycle lengths. One way of doing this is as follows:
yyy <- lynx > mean(lynx) head(yyy) sss <- subSeq(yyy, TRUE) sss
plot(tvar, lynx, type="l") rug(tvar[sss$midpoint], col="blue", lwd=4)
Create the "event vector"
yvar <- rep(0, length(lynx)) yvar[sss$midpoint] <- 1 str(yvar)
tse <- timeSinceEvent(yvar,tvar) head(tse, 20)
We get two different (not that different) estimates of period lengths:
len1 <- tapply(tse$ewin, tse$ewin, length) len2 <- tapply(tse$run, tse$run, length) c(median(len1), median(len2), mean(len1), mean(len2))
We can overlay the cycles as:
tse$lynx <- lynx tse2 <- na.omit(tse) plot(lynx ~ tae, data=tse2)
plot(tvar, lynx, type="l", lty=2) mm <- lm(lynx ~ tae + I(tae^2) + I(tae^3), data=tse2) lines(fitted(mm) ~ tvar, data=tse2, col="red")
\section{Acknowledgements} \label{discussion}
Credit is due to Dennis Chabot, Gabor Grothendieck, Paul Murrell and Jim Robison-Cox for reporting various bugs and making various suggestions to the functionality in the \doby{} package.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.