summarize: Summary Statistics

View source: R/summarize.R

summarize (R Documentation)

Summary Statistics

Description

Compute summary statistics for multiple variables and/or multiple groups and save them in a data frame.

Usage

summarize(
  formula,
  data,
  repetition = NULL,
  columns = NULL,
  FUN = NULL,
  na.action = stats::na.pass,
  na.rm = FALSE,
  level = 0.95,
  skip.reference = FALSE,
  digits = NULL,
  filter = NULL,
  ...
)

Arguments

formula

[formula] outcome(s) on the left-hand side and grouping variables on the right-hand side. E.g. Y1+Y2 ~ Gender + Gene computes the summary statistics of Y1 and of Y2 for each combination of gender and gene.

data

[data.frame] dataset containing the observations.

repetition

[formula] Specify the structure of the data: the time/repetition variable and the grouping variable, e.g. ~ time|id. Used in the long format to count the number of missing values (i.e. add the number of missing rows) and evaluate the correlation.
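To illustrate what "adding the number of missing rows" can mean in the long format, here is a minimal base-R sketch. It does not call summarize and is not the package's internal code: the toy dataset and the object names (full, dL.full, n.missing) are assumptions made for the illustration.

```r
## Toy long-format data: 3 subjects, 3 planned time points.
## Subject 1 has an explicit NA at time 2;
## subject 2 has no row at all at time 3 (a "missing row").
dL <- data.frame(id   = c(1, 1, 1, 2, 2, 3, 3, 3),
                 time = c(1, 2, 3, 1, 2, 1, 2, 3),
                 Y    = c(0.5, NA, 1.2, 0.3, 0.8, 1.1, 0.9, 1.4))

## Cross every id with every time to obtain the full design ...
full <- expand.grid(id = unique(dL$id), time = unique(dL$time))
## ... and merge the observed rows in; absent rows get Y = NA.
dL.full <- merge(full, dL, by = c("id", "time"), all.x = TRUE)

## Missing values per time point: explicit NAs plus missing rows.
n.missing <- tapply(is.na(dL.full$Y), dL.full$time, sum)
n.missing
```

Without the id/time structure, the absent row of subject 2 could not be detected, which is why the repetition formula is needed for this count.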

columns

[character vector] names of the summary statistics to keep in the output. Can be any of, or a combination of:

  • "observed": number of observations with a measurement.

  • "missing": number of missing observations. When specifying a grouping variable, it will also attempt to count missing rows in the dataset.

  • "pc.missing": percentage of missing observations.

  • "mean", "mean.lower", "mean.upper": mean with its confidence interval.

  • "median", "median.lower", "median.upper": median with its confidence interval.

  • "sd", "sd.lower", "sd.upper": standard deviation around the mean with its confidence interval.

  • "sd0", "sd0.lower", "sd0.upper": standard deviation around 0 with its confidence interval.

  • "skewness": skewness, as the third standardized moment.

  • "kurtosis": kurtosis, as the fourth standardized moment.

  • "q1", "q3", "IQR": 1st and 3rd quartile, interquartile range.

  • "min", "max": minimum and maximum observation.

  • "predict.lower", "predict.upper": prediction interval for normally distributed outcome.

  • "correlation": correlation matrix between the outcomes (when feasible, see the Details section).
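The "skewness" and "kurtosis" entries above are described as the third and fourth standardized moments. A minimal base-R sketch of that definition follows. The helper name std_moment is hypothetical, and the choice of divisor n (rather than n - 1) for the standard deviation is an assumption; the package may normalize differently.

```r
## k-th standardized moment of a numeric vector (hypothetical helper).
std_moment <- function(x, k) {
  x <- x[!is.na(x)]               # drop missing values
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))      # divisor n, not n - 1 (an assumption)
  mean(((x - m) / s)^k)
}

x <- c(-2, -1, 0, 1, 2)           # symmetric toy sample
skew <- std_moment(x, k = 3)      # third standardized moment
kurt <- std_moment(x, k = 4)      # fourth standardized moment
c(skewness = skew, kurtosis = kurt)
```

For a symmetric sample such as this one the skewness is zero; a normal distribution has a kurtosis of 3 under this definition.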

FUN

[function] user-defined function for computing summary statistics. It should take a vector as an argument and output a named single value or a named vector.

na.action

[function] a function which indicates what should happen when the data contain 'NA' values. Passed to the stats::aggregate function.

na.rm

[logical] Should the summary statistics be computed by omitting the missing values?

level

[numeric,0-1] the confidence level of the confidence intervals.

skip.reference

[logical] should the summary statistics for the reference level of categorical variables be omitted?

digits

[integer, >=0] the minimum number of significant digits to be used to display the results. Passed to print.data.frame.

filter

[character] a regular expression passed to grep to filter the columns of the dataset. Relevant when using . to indicate all other variables.
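As a rough illustration of how a regular expression can narrow down the variables that . stands for, the base-R sketch below applies grep to column names. The toy dataset and the object names (rhs, candidates, outcomes) are assumptions, and the exact expansion rule used by summarize may differ.

```r
## Toy wide-format data with outcomes Y1, Y2 and other columns.
d <- data.frame(id = 1:3,
                Y1 = c(0.1, 0.4, 0.3),
                Y2 = c(1.2, 0.9, 1.5),
                X1 = c(0, 1, 0),
                treat = c("A", "B", "A"))

rhs <- "treat"                            # grouping variable of . ~ treat
candidates <- setdiff(names(d), rhs)      # what '.' could expand to (assumption)
outcomes <- grep("Y", candidates, value = TRUE)
outcomes
```

A stricter pattern such as "^Y[0-9]+$" avoids accidentally matching other columns whose name merely contains the letter Y.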

...

additional arguments passed to argument FUN.
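The contract described for FUN (a vector in, a named value or named vector out, with extra arguments forwarded through ...) can be sketched in base R as follows. The function myRange and the toy data are hypothetical, and sapply over split is used here only as a stand-in for the per-group evaluation; it is not how summarize is implemented.

```r
## Hypothetical user-defined statistic: quantile-based range of a vector.
## 'probs' plays the role of an extra argument supplied through '...'.
myRange <- function(x, probs = c(0.05, 0.95)) {
  q <- stats::quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  c(lower = q[1], upper = q[2], width = q[2] - q[1])
}

d <- data.frame(Y = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50),
                G = rep(c("a", "b"), each = 5))

## Evaluate the statistic within each group, forwarding 'probs'.
res <- sapply(split(d$Y, d$G), myRange, probs = c(0.25, 0.75))
res
```

Returning a named vector matters because the names become the labels of the resulting statistics.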

Details

This function is essentially an interface to the stats::aggregate function.
WARNING: it has the same name as a function from the dplyr package. If dplyr is already loaded, call the function with its namespace prefix, i.e. use LMMstar:::summarize.
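Since the function is described as an interface to stats::aggregate, it may help to see what a direct call to aggregate looks like for several outcomes and one grouping variable. This is a sketch on a toy dataset (an assumption), not a reproduction of what summarize does internally.

```r
## Toy data: two outcomes and one binary grouping variable.
d <- data.frame(Y1 = c(1, 2, 3, 4, 5, 6),
                Y2 = c(2, 4, 6, 8, 10, 12),
                X1 = c(0, 0, 0, 1, 1, 1))

## Group means of Y1 and Y2 by X1 with the formula interface.
## na.pass keeps rows with NA so that FUN decides how to handle them.
agg <- stats::aggregate(cbind(Y1, Y2) ~ X1, data = d,
                        FUN = mean, na.action = stats::na.pass)
agg
```

With aggregate alone, each additional statistic requires another call or a vector-valued FUN, which is the bookkeeping that summarize wraps.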

Confidence intervals (CI) and prediction intervals (PI) for the mean are computed via stats::t.test. CIs for the standard deviation are computed using a chi-squared approximation. CIs for the median are computed via asht::medianTest.
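The intervals mentioned above can be sketched with base R as follows. The t-based CI is taken from stats::t.test as stated; the chi-squared CI for the standard deviation and the prediction interval use the usual textbook formulas for normally distributed data, which is an assumption about what the package computes. The toy sample and object names are illustrative only.

```r
x <- c(4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4)   # toy sample
level <- 0.95
alpha <- 1 - level
n <- length(x)

## CI for the mean, as returned by t.test ...
mean.ci <- as.numeric(stats::t.test(x, conf.level = level)$conf.int)
## ... which matches mean +/- t-quantile * standard error.
mean.manual <- mean(x) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)

## Chi-squared CI for the standard deviation (textbook formula, assumption).
sd.ci <- sd(x) * sqrt((n - 1) / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1))

## Prediction interval for one new observation (textbook formula, assumption).
pred.int <- mean(x) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sd(x) * sqrt(1 + 1 / n)

rbind(mean = mean.ci, sd = sd.ci, predict = pred.int)
```

The prediction interval is always wider than the CI for the mean because it accounts for the variability of a new observation in addition to the uncertainty about the mean.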

Value

A data frame containing summary statistics (in columns) for each outcome and value of the grouping variables (rows). It has an attribute "correlation" when it was possible to compute the correlation matrix for each outcome with respect to the grouping variable.

See Also

correlate for correlation matrix.

Examples

#### simulate data (wide format) ####
set.seed(10)
d <- sampleRem(1e2, n.times = 3)
d$treat <-  sample(LETTERS[1:3], NROW(d), replace=TRUE, prob=c(0.3, 0.3, 0.4) )

## add a missing value
d2 <- d
d2[1,"Y2"] <- NA

#### summarize (wide format) ####

## summary statistic (single variable)
summarize(Y1 ~ 1, data = d)
## stratified summary statistic (single variable)
summarize(Y1 ~ X1, data = d2)
## stratified summary statistic (multiple variables)
summarize(Y1+Y2 ~ X1, data = d)
## categorical variable
summarize(treat ~ 1, data = d)
summarize(treat ~ 1, skip.reference = TRUE, data = d)
## aggregate data
summarize( ~ X1 + treat, data = d)
## user defined summary statistic
summarize(Y1 ~ 1, data = d, FUN = quantile)
summarize(Y1 ~ 1, data = d, FUN = quantile, probs = c(0.25,0.75))
## complete case summary statistic
summarize(Y1+Y2 ~ X1, data = d2, na.rm = TRUE)
## shortcut to consider all outcomes with common naming 
summarize(. ~ treat, data = d2, na.rm = TRUE, filter = "Y")

#### summarize (long format) ####
dL <- reshape(d2, idvar = "id", direction = "long",
             v.names = "Y", varying = c("Y1","Y2","Y3"))
summarize(Y ~ time + X1, data = dL, na.rm  = TRUE)

## user defined summary statistic (outlier)
summarize(Y ~ time + X1, data = dL, FUN = function(x){
   c(outlier.down = sum(x<mean(x,na.rm=TRUE)-2*sd(x,na.rm=TRUE), na.rm=TRUE),
     outlier.up = sum(x>mean(x,na.rm=TRUE)+2*sd(x,na.rm=TRUE), na.rm=TRUE))
}, na.rm = TRUE)

## user defined summary statistic (auc)
myAUC <- function(Y,time){approxAUC(x = time, y = Y, from = 1, to = 3)}
myAUC(Y = dL[dL$id==1,"Y"], time = dL[dL$id==1,"time"])
summarize(Y ~ id, data = dL, FUN = myAUC, na.rm = TRUE)

## add correlation (see correlate function)
e.S <- summarize(Y ~ time + X1, data = dL, repetition = ~time|id,
                 na.rm = TRUE, columns = add("correlation"))
e.S

#### summarize (long format, missing lines) ####
## use repetition argument to count missing lines in the number of missing values
dL.NNA <- dL[rowSums(is.na(dL))==0,]
summarize(Y ~ time + X1, data = dL.NNA, repetition =~time|id, na.rm  = TRUE)

bozenne/repeated documentation built on July 16, 2025, 11:16 p.m.