data-transformations: Data Transformations

data-transformations    R Documentation

Data Transformations

Description

collapse provides a set of functions that perform common data transformations efficiently and in a user-friendly way:

  • dapply applies functions to rows or columns of matrices and data frames, preserving the data format.

  • BY is an S3 generic for efficient Split-Apply-Combine computing, similar to dapply.

  • A set of arithmetic operators facilitates row-wise (%rr%, %r+%, %r-%, %r*%, %r/%) and column-wise (%cr%, %c+%, %c-%, %c*%, %c/%) replacing and sweeping operations involving a vector and a matrix or data frame / list. Since v1.7, the operators %+=%, %-=%, %*=% and %/=% do column- and element-wise math by reference, and the function setop can also sweep out rows by reference.

  • (set)TRA is a more advanced S3 generic to efficiently perform (groupwise) replacing and sweeping out of statistics, either by creating a copy of the data or by reference. Supported operations are:

    Integer-id   String-id                  Description
    0            "na" or "replace_na"       replace only missing values
    1            "fill" or "replace_fill"   replace everything
    2            "replace"                  replace data but preserve missing values
    3            "-"                        subtract
    4            "-+"                       subtract group-statistics but add group-frequency weighted average of group statistics
    5            "/"                        divide
    6            "%"                        compute percentages
    7            "+"                        add
    8            "*"                        multiply
    9            "%%"                       modulus
    10           "-%%"                      subtract modulus

    All of collapse's Fast Statistical Functions have a built-in TRA argument for faster access (i.e. you can compute (groupwise) statistics and use them to transform your data with a single function call).

  • fscale/STD is an S3 generic to perform (groupwise and / or weighted) scaling / standardizing of data and is orders of magnitude faster than scale.

  • fwithin/W is an S3 generic to efficiently perform (groupwise and / or weighted) within-transformations / demeaning / centering of data. Similarly fbetween/B computes (groupwise and / or weighted) between-transformations / averages (also a lot faster than ave).

  • fhdwithin/HDW, shorthand for 'higher-dimensional within transform', is an S3 generic to efficiently center data on multiple groups and partial-out linear models (possibly involving many levels of fixed effects and interactions). In other words, fhdwithin/HDW efficiently computes residuals from linear models. Similarly fhdbetween/HDB, shorthand for 'higher-dimensional between transformation', computes the corresponding means or fitted values.

  • flag/L/F, fdiff/D/Dlog and fgrowth/G are S3 generics to compute sequences of lags / leads and suitably lagged and iterated (quasi-, log-) differences and growth rates on time series and panel data. fcumsum flexibly computes (grouped, ordered) cumulative sums. More in Time Series and Panel Series.

  • STD, W, B, HDW, HDB, L, D, Dlog and G are parsimonious wrappers around the f- functions above representing the corresponding transformation 'operators'. They have additional capabilities when applied to data-frames (i.e. variable selection, formula input, auto-renaming and id-variable preservation), and are easier to employ in regression formulas, but are otherwise identical in functionality.
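The functions listed above can be sketched on a small example. This is a minimal illustration, assuming the collapse package is installed and using the built-in mtcars data; the object names (X, g, col_med and so on) are chosen here for illustration only and are not part of the package.

```r
library(collapse)

X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])  # small numeric matrix
g <- mtcars$cyl                                 # grouping vector (4, 6 or 8 cylinders)

# dapply: apply a function to columns (the default, MARGIN = 2) or rows (MARGIN = 1)
col_med <- dapply(X, median)
row_max <- dapply(X, max, MARGIN = 1)

# BY: split-apply-combine, here the mean of each column within each group
grp_mean <- BY(X, g, mean)

# Row-wise sweeping: subtract a vector with one element per column from every row
X_centered <- X %r-% colMeans(X)

# TRA: sweep group statistics out of the data ("-" subtracts them)
X_within <- TRA(X, fmean(X, g), "-", g)

# The same result via the TRA argument of a Fast Statistical Function,
# and via the dedicated within-transformation
X_within2 <- fmean(X, g, TRA = "-")
X_within3 <- fwithin(X, g)

# fbetween replaces each value by its group mean; fscale standardizes within groups
X_between <- fbetween(X, g)
X_scaled  <- fscale(X, g)

# Operator form on a data frame: formula input, column selection and
# automatic renaming of the transformed columns (prefix "W.")
W_df <- W(mtcars, by = ~ cyl, cols = c("mpg", "hp"))
```

In this sketch the within- and between-transformed matrices add back up to the original data, which is a quick way to check a groupwise centering step.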

Table of Functions

Function / S3 Generic                       Methods                                                        Description
dapply                                      No methods, works with matrices and data frames                Apply functions to rows or columns
BY                                          default, matrix, data.frame, grouped_df                        Split-Apply-Combine computing
%(r/c)(r/+/-/*//)%                          No methods, works with matrices and data frames / lists        Row- and column-arithmetic
(set)TRA                                    default, matrix, data.frame, grouped_df                        Replace and sweep out statistics (by reference)
fscale/STD                                  default, matrix, data.frame, pseries, pdata.frame, grouped_df  Scale / standardize data
fwithin/W                                   default, matrix, data.frame, pseries, pdata.frame, grouped_df  Demean / center data
fbetween/B                                  default, matrix, data.frame, pseries, pdata.frame, grouped_df  Compute means / average data
fhdwithin/HDW                               default, matrix, data.frame, pseries, pdata.frame              High-dimensional centering and lm residuals
fhdbetween/HDB                              default, matrix, data.frame, pseries, pdata.frame              High-dimensional averages and lm fitted values
flag/L/F, fdiff/D/Dlog, fgrowth/G, fcumsum  default, matrix, data.frame, pseries, pdata.frame, grouped_df  (Sequences of) lags / leads, differences, growth rates and cumulative sums
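The time series entries in the last table row can be illustrated on a short vector. This is a minimal sketch, assuming the collapse package is installed; the vectors x and id are made up for illustration, and the data are assumed to be already ordered in time within each group.

```r
library(collapse)

x  <- c(2, 4, 8, 16, 32)
id <- c(1, 1, 1, 2, 2)       # hypothetical panel identifier

lag1  <- flag(x)             # lag by one period:      NA  2  4  8 16
lead1 <- flag(x, -1)         # lead by one period:      4  8 16 32 NA
d1    <- fdiff(x)            # first difference:       NA  2  4  8 16
g1    <- fgrowth(x)          # growth rate in percent: NA 100 100 100 100
cs    <- fcumsum(x)          # cumulative sum:          2  6 14 30 62

# Grouped versions restart at the beginning of each group
lag1_g <- flag(x, 1, g = id)     # NA  2  4 NA 16
cs_g   <- fcumsum(x, g = id)     #  2  6 14 16 48

# A sequence of lags returns one column per lag
lags <- flag(x, 1:2)
```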

See Also

Collapse Overview, Fast Statistical Functions, Time Series and Panel Series


collapse documentation built on Nov. 3, 2024, 9:08 a.m.