This is a general purpose complement to the specialised manipulation functions filter(), select(), mutate(), summarise() and arrange(). You can use do() to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list. This is particularly useful when working with models: you can fit models per group with do() and then flexibly extract components with either another do() or summarise().
.data | a tbl
... | Expressions to apply to each group. If named, results will be stored in a new column. If unnamed, should return a data frame. You can use
.dots | Used to work around non-standard evaluation. See
.chunk_size | The size of each chunk to pull into R. If this number is too big, the process will be slow because R has to allocate and free a lot of memory. If it's too small, it will be slow because of the overhead of talking to the database.
For an empty data frame, the expressions will be evaluated once, even in the presence of a grouping. This makes sure that the format of the resulting data frame is the same for both empty and non-empty input.
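The empty-input behaviour described above can be checked with a small sketch. It assumes dplyr is installed and uses the built-in mtcars data; the names empty, res_empty and res_full are illustrative only and do not come from the original page.

```r
suppressPackageStartupMessages(library(dplyr))

# A grouped data frame with zero rows (illustrative input)
empty <- mtcars[0, ] %>% group_by(cyl)

# The expression is still evaluated, so the result keeps its columns
res_empty <- empty %>% do(head(., 2))

# The same call on non-empty input, for comparison
res_full <- mtcars %>% group_by(cyl) %>% do(head(., 2))

nrow(res_empty)
names(res_empty)
names(res_full)
```

Comparing the two sets of column names is a quick way to see that downstream code can treat empty and non-empty results alike.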
do() always returns a data frame. The first columns in the data frame will be the labels, the others will be computed from ... . Named arguments become list-columns, with one element for each group; unnamed elements must be data frames and labels will be duplicated accordingly.
Groups are preserved for a single unnamed input. This is different to summarise() because do() generally does not reduce the complexity of the data, it just expresses it in a special way. For multiple named inputs, the output is grouped by row with rowwise(). This allows other verbs to work in an intuitive way.
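As a rough illustration of the two return shapes, the sketch below contrasts a named and an unnamed call on mtcars. It assumes dplyr is installed; res_named and res_unnamed are illustrative names, not part of the original page.

```r
suppressPackageStartupMessages(library(dplyr))

by_cyl <- group_by(mtcars, cyl)

# Named argument: one row per group, with the models in a list-column
res_named <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
class(res_named)         # inspect whether the result is row-wise
is.list(res_named$mod)   # the named result is stored as a list

# Single unnamed argument: a data frame that keeps its grouping
res_unnamed <- by_cyl %>% do(head(., 2))
group_vars(res_unnamed)
nrow(res_unnamed)
```

Because the named result is grouped by row, a following summarise() or do() sees one model at a time, which is what makes the extraction step straightforward.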
If you're familiar with plyr, do() with named arguments is basically equivalent to dlply(), and do() with a single unnamed argument is basically equivalent to ldply(). However, instead of storing labels in a separate attribute, the result is always a data frame. This means that summarise() applied to the result of do() can act like ldply().
library(dplyr)

by_cyl <- group_by(mtcars, cyl)
# Unnamed argument: returns a data frame (the first two rows of each group)
do(by_cyl, head(., 2))

# Named argument: fit one model per group, stored in a list-column
models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
models

# Extract components with summarise() or with another do()
summarise(models, rsq = summary(mod)$r.squared)
models %>% do(data.frame(coef = coef(.$mod)))
models %>% do(data.frame(
  var = names(coef(.$mod)),
  coef(summary(.$mod))
))

# Multiple named arguments: one list-column per model
models <- by_cyl %>% do(
  mod_linear = lm(mpg ~ disp, data = .),
  mod_quad = lm(mpg ~ poly(disp, 2), data = .)
)
models
compare <- models %>% do(aov = anova(.$mod_linear, .$mod_quad))
# compare %>% summarise(p.value = aov$`Pr(>F)`)

if (require("nycflights13")) {
  # You can use it to do any arbitrary computation, like fitting a linear
  # model. Let's explore how carrier arrival delays vary with departure time
  carriers <- group_by(flights, carrier)
  group_size(carriers)

  mods <- do(carriers, mod = lm(arr_delay ~ dep_time, data = .))
  mods %>% do(as.data.frame(coef(.$mod)))
  mods %>% summarise(rsq = summary(mod)$r.squared)

  ## Not run:
  # This longer example shows the progress bar in action
  by_dest <- flights %>% group_by(dest) %>% filter(n() > 100)
  library(mgcv)
  by_dest %>% do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
  ## End(Not run)
}