In plyr: Tools for Splitting, Applying and Combining Data

ddply

R Documentation

Split data frame, apply function, and return results in a data frame.

Description

For each subset of a data frame, apply a function, then combine the results into a data frame. To apply a function to each row, use adply with .margins set to 1.
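
As a rough illustration of the row-wise alternative mentioned above, the following sketch calls adply with .margins = 1. The data frame `df` and the column name `total` are hypothetical and exist only for this example.

```r
library(plyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(x = 1:3, y = c(10, 20, 30))

# .margins = 1 splits the data frame by row, so the function
# receives one single-row data frame per call
row_totals <- adply(df, 1, function(row) {
  data.frame(total = row$x + row$y)
})

print(row_totals)
```

For simple column arithmetic like this, a vectorised expression is usually faster; the row-wise form is mainly useful when the function cannot be vectorised.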

Usage

ddply(
  .data,
  .variables,
  .fun = NULL,
  ...,
  .progress = "none",
  .inform = FALSE,
  .drop = TRUE,
  .parallel = FALSE,
  .paropts = NULL
)

Arguments

`.data`	data frame to be processed
`.variables`	variables to split data frame by, as `as.quoted` variables, a formula or character vector
`.fun`	function to apply to each piece
`...`	other arguments passed on to `.fun`
`.progress`	name of the progress bar to use, see `create_progress_bar`
`.inform`	produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging
`.drop`	should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default)
`.parallel`	if `TRUE`, apply function in parallel, using parallel backend provided by foreach
`.paropts`	a list of additional options passed into the `foreach` function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages: use the `.export` and `.packages` arguments to supply them so that all cluster nodes have the correct environment set up for computing.
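
The effect of `.drop` is easiest to see with factor grouping variables whose level combinations do not all occur in the data. The sketch below is only an illustration: the data frame `df` and its columns `g1`, `g2` and `value` are made up for this purpose.

```r
library(plyr)

# Hypothetical data: the combination (g1 = "b", g2 = "y") never occurs
df <- data.frame(
  g1 = factor(c("a", "a", "b")),
  g2 = factor(c("x", "y", "x")),
  value = c(1, 2, 3)
)

# Default (.drop = TRUE): only combinations present in the data are returned
dropped <- ddply(df, .(g1, g2), summarise, n = length(value))

# .drop = FALSE: combinations that do not appear are preserved as well,
# so .fun is also called on an empty piece
kept <- ddply(df, .(g1, g2), summarise, n = length(value), .drop = FALSE)

print(dropped)
print(kept)
```

When using `.drop = FALSE`, `.fun` needs to cope with zero-row pieces, which is why a length-based count is used here.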

Value

A data frame, as described in the output section.

Input

This function splits data frames by variables.
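
The Arguments section lists three ways to give the splitting variables: quoted variables, a formula, or a character vector. A minimal sketch of the three forms, using a hypothetical data frame `df` with a grouping column `g`, might look like this:

```r
library(plyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 5))

# Quoted variables via the '.' function
by_quoted <- ddply(df, .(g), nrow)

# One-sided formula
by_formula <- ddply(df, ~ g, nrow)

# Character vector of column names
by_name <- ddply(df, "g", nrow)

print(by_quoted)
```

The character form is convenient when the grouping columns are only known at run time, for example when they are held in a variable.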

Output

The most unambiguous behaviour is achieved when .fun returns a data frame - in that case pieces will be combined with rbind.fill. If .fun returns an atomic vector of fixed length, the vectors will be combined with rbind and converted to a data frame. Any other values will result in an error.

If there are no results, then this function will return a data frame with zero rows and columns (data.frame()).
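
The two accepted return shapes described in this section can be sketched as follows. The data frame `df` and the names `mean_v`, `lo` and `hi` are hypothetical and chosen only for illustration.

```r
library(plyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 5))

# .fun returns a data frame: the pieces are combined with rbind.fill
as_df <- ddply(df, .(g), function(piece) {
  data.frame(mean_v = mean(piece$v))
})

# .fun returns an atomic vector of fixed length (here, length 2):
# the vectors are bound by row and converted to a data frame
as_vec <- ddply(df, .(g), function(piece) {
  c(lo = min(piece$v), hi = max(piece$v))
})

print(as_df)
print(as_vec)
```

Returning a data frame is generally the safer choice, since column names and types are then stated explicitly by `.fun`.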

References

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.

Examples

# Load plyr, which provides ddply and the baseball dataset
library(plyr)

# Summarize a dataset by two variables
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
  mean = round(mean(age), 2),
  sd = round(sd(age), 2))

# An example using a formula for .variables
ddply(baseball[1:100, ], ~ year, nrow)

# Applying two functions: nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))

# Calculate mean runs batted in for each year
rbi <- ddply(baseball, .(year), summarise,
  mean_rbi = mean(rbi, na.rm = TRUE))
# Plot a line chart of the result
plot(mean_rbi ~ year, type = "l", data = rbi)

# Make new variable career_year based on the
# start year for each player (id)
base2 <- ddply(baseball, .(id), mutate,
  career_year = year - min(year) + 1
)

plyr documentation built on Oct. 2, 2023, 9:07 a.m.