capply: Apply a function within each cluster of multilevel data
In gmonette/spida2: Collection of tools developed for the Summer Programme in Data Analysis 2000-2012

capply

R Documentation

Apply a function within each cluster of multilevel data

Description

Apply a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain variables. In contrast with tapply, capply returns within each cell a vector of the same length as the cell; these vectors are then arranged to match the positions of the corresponding cells in the input.

Usage

capply(x, ...)

## S3 method for class 'formula'
capply(formula, data, FUN, ...)

## Default S3 method:
capply(x, by, FUN, ..., sep = "#^#")

Arguments

`x`	a vector or data frame that provides the first argument of `FUN`
`...`	additional variables to be supplied to `FUN`
`FUN`	a function to be applied to `x` within each cluster. `FUN` can return a single value, or a vector whose length is equal to the number of elements in each cluster.
`by`	If `x` is a vector: a 'factor' of the same length as `x` whose levels identify clusters. If `x` is a data frame, a one-sided formula that identifies the variable(s) within `x` to be used to define clusters.
`formula`	in `capply.formula`, a two-sided formula as in `aggregate.formula`. The left-hand side identifies the variable(s) in `data` to be included in a data frame that is clustered using the variables on the right-hand side of the formula.

Details

capply is very similar to ave in package:stats. They differ in the way they treat missing values in the clustering variables. ave treats missing values as if they were legitimate clustering levels while capply returns a value of NA within any cluster formed by a combination of clustering variable values that includes a value of NA.

capply extends the function of tapply(x, by, FUN)[ tapply(x, by) ]. The function FUN is applied to each cell of x defined by each value of by. The result in each cell is recycled to a vector of the same length as the cell. These vectors are then arranged to match the input x. Thus, if the value returned within each cell is a scalar, the effect of capply(x, by, FUN) is the same as tapply(x, by, FUN)[ tapply(x, by) ]. capply extends this use of tapply by allowing the value returned within each cell to be a vector of the same length as the cell.
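The indexing idiom described above can be tried with base R alone. The following is a minimal sketch on made-up data (the vectors `x` and `by` are illustrative and not taken from the package); it reproduces only the scalar case and does not call capply itself.

```r
# Toy data (hypothetical): five values in two clusters
x  <- c(1, 2, 3, 10, 20)
by <- factor(c("a", "a", "a", "b", "b"))

# One summary value per cell
cell_means <- tapply(x, by, mean)

# With FUN omitted, tapply returns the cell index of each element of x
cell_index <- tapply(x, by)

# Expand the per-cell summaries back to the length of x;
# as.numeric() drops the names and array attributes
expanded <- as.numeric(cell_means[cell_index])
expanded
```

Each element of `expanded` holds the mean of the cluster that the corresponding element of `x` belongs to.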

The capply.formula method allows the use of a two-sided formula of the form x ~ a + b or cbind(x, y) ~ a + b, where the variables on the left-hand side are used to create a data frame that is given as the first argument to FUN. If there is a single variable on the left-hand side, then that variable can be treated as a vector by FUN.

Value

When the result in each cell is a scalar, capply can be used in multilevel analysis to produce 'contextual variables' computed within subgroups of the data and expanded to a constant over the elements of each subgroup.

capply( x , by, FUN , ...) where x is a vector

is equivalent to

unsplit ( lapply ( split ( x , by ), FUN, ...), by )

which has the same effect as

tapply( x, by, FUN, ...) [ tapply( x, by) ]

if FUN returns a vector of length 1.

If FUN returns a vector within a cell, it is recycled to the length of that cell.
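The unsplit - lapply - split form quoted above can be run directly in base R. This is a sketch on made-up data (`x`, `by`, `cluster_mean` and `cluster_rank` are illustrative names), showing one FUN that returns a scalar per cell and one that returns a vector as long as the cell.

```r
# Toy data (hypothetical): five values in two clusters
x  <- c(3, 1, 2, 20, 10)
by <- factor(c("a", "a", "a", "b", "b"))

# Scalar result per cell: the value is recycled over the cell
cluster_mean <- unsplit(lapply(split(x, by), mean), by)

# Vector result per cell: ranks within each cluster,
# placed back in the order of the input
cluster_rank <- unsplit(lapply(split(x, by), rank), by)

cluster_mean
cluster_rank
```

The second call has no tapply-plus-indexing equivalent, because the value in each cell is not a single number.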

When the first argument is a data frame:

capply ( dd, by, FUN, ...)

uses unsplit - lapply - split to apply FUN to each sub data frame. In this case, by can be a formula that is evaluated in 'dd'.

This syntax makes it easy to compute formulas involving more than one variable in 'dd'. An example:

capply( dd, ~gg, function(x) with( x, mean(Var1) / mean(Var2) ) )

where 'Var1' and 'Var2' are numeric variables and 'gg' a grouping factor in data frame 'dd'. Or, using the with function:

capply( dd, ~gg, with , mean(Var1) / mean(Var2) )
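The data frame case can be approximated in base R by splitting the data frame itself. The sketch below uses a small made-up data frame with the same column names as in the text ('dd', 'gg', 'Var1', 'Var2'); the values and the name `ratio` are invented for illustration, and the code stands in for capply rather than reproducing its implementation.

```r
# Toy data frame (hypothetical values)
dd <- data.frame(
  gg   = factor(c("a", "a", "b", "b")),
  Var1 = c(2, 4, 6, 10),
  Var2 = c(1, 1, 2, 2)
)

# Apply a function of several columns to each sub data frame
pieces <- split(dd, dd$gg)
ratios <- lapply(pieces, function(d) with(d, mean(Var1) / mean(Var2)))

# Expand the per-cluster results back to one value per row of dd
ratio <- unsplit(ratios, dd$gg)
ratio
```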

cvar and cvars are intended to create contextual variables in model formulas. If 'x' is numerical, cvar is equivalent to capply(x,id,mean) and cvars is equivalent to capply(x,id,sum).

If x is a factor, cvar generates the equivalent of a model matrix for the factor with indicators replaced by the proportion within each cluster.

dvar is equivalent to x - cvar(x,by) and creates what is commonly known as a version of 'x' that is 'centered within groups' (CWG). It creates the correct matrix for a factor so that the between group interpretation of the effect of cvar(x,by) is that of the 'between group' or 'compositional' effect of the factor.
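For a numeric 'x', the between and within parts described above can be sketched in base R. This assumes only what the text states, namely that cvar behaves like a cluster mean and dvar like the deviation from it; `ave` from package:stats stands in for capply, the data contain no missing values, and `x_between`, `x_within` and `prop_yes` are hypothetical names.

```r
# Toy data (hypothetical): five observations in two clusters
x  <- c(1, 2, 3, 10, 20)
id <- c(1, 1, 1, 2, 2)

# Contextual variable: cluster mean repeated for each member
x_between <- ave(x, id)          # FUN defaults to mean

# Centered-within-groups version of x
x_within <- x - x_between

# For a two-level factor, the analogous contextual variable is the
# within-cluster proportion of one level
minority <- factor(c("Yes", "No", "No", "Yes", "Yes"))
prop_yes <- ave(as.numeric(minority == "Yes"), id)

x_between
x_within
prop_yes
```

By construction, `x_within` sums to zero within every cluster, so it carries no between-cluster information.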

Methods (by class)

capply(formula): method for class 'formula'
capply(default): default method

Note

capply tends to be slow when there are many cells and by is a factor. This may be due to the need to process all factor levels for each cell. Turning by into a numeric or character vector improves speed: e.g. capply( x, as.numeric(by), FUN).

Examples

## Not run: 
     data( hs )
     head( hs )

     # FUN returns a single value
     hs$ses.mean <- capply( hs$ses, hs$school, mean, na.rm = TRUE)
     hs$ses.hetero <- capply ( hs$ses, hs$school, sd , na.rm = TRUE)
     hs.summ <- up( hs, ~school )
     head( hs.summ )   # variables invariant within school

     # FUN returns a vector
     # with 'x' a data frame
     # Note how the 'with' function provides an easy way to supply an
     #   expression as the '...' argument.

     hs$minority.prop <- capply( hs, ~ school, with, mean( Minority == "Yes"))

     # equivalently:

     hs$minority.prop <- capply( hs$Minority == "Yes", hs$school, mean)

     # On very large data frames with many unused columns, the 'data frame'
     # version of 'capply' can be very slow in comparison with the 'vector' version.

     # In contrast with 'tapply', 'FUN' can return a vector, e.g. ranks within groups

     hs$mathach.rank <- capply( hs, ~ school, with , rank(mathach))

     # cvar and dvar in multilevel models

     library( nlme )
     data ( hs )
     fit <- lme( mathach ~ Minority * Sector, hs, random = ~ 1 | school)
     summary ( fit )

     fit.contextual <- lme( mathach ~ (Minority + cvar(Minority, school)) * Sector,
                       hs, random = ~ 1| school)
     summary(fit.contextual) # contextual effect of cvar(Minority)

     fit.compositional <- lme( mathach ~ (dvar(Minority,school) + cvar(Minority, school)) * Sector,
                       hs, random = ~ 1| school)
     summary(fit.compositional) # compositional effect of cvar(Minority)

## End(Not run)

gmonette/spida2 documentation built on June 12, 2025, 9:44 p.m.