pivot | R Documentation |
pivot()
is collapse's data reshaping command. It combines longer-, wider-, and recast-pivoting functionality in a single parsimonious API. Notably, it can also accommodate variable labels.
pivot(data, # Summary of Documentation:
ids = NULL, # identifier cols to preserve
values = NULL, # cols containing the data
names = NULL, # name(s) of new col(s) | col(s) containing names
labels = NULL, # name of new labels col | col(s) containing labels
how = "longer", # method: "longer"/"l", "wider"/"w" or "recast"/"r"
na.rm = FALSE, # remove rows missing 'values' in reshaped data
factor = c("names", "labels"), # create new id col(s) as factor variable(s)?
check.dups = FALSE, # detect duplicate 'ids'+'names' combinations
# Only apply if how = "wider" or "recast"
FUN = "last", # aggregation function (internal or external)
FUN.args = NULL, # list of arguments passed to aggregation function
nthreads = .op[["nthreads"]], # minor gains as grouping remains serial
fill = NULL, # value to insert for unbalanced data (default NA/NULL)
drop = TRUE, # drop unused levels (=columns) if 'names' is factor
sort = FALSE, # "ids": sort 'ids' and/or "names": alphabetic casting
# Only applies if how = "wider" with multiple long columns ('values')
transpose = FALSE # "columns": applies t_list() before flattening, and/or
) # "names": sets names nami_colj. default: colj_nami
data |
data frame-like object (list of equal-length columns). | ||||||||||||||||
ids |
identifier columns to keep. Specified using column names, indices, a logical vector or an identifier function e.g. | ||||||||||||||||
values |
columns containing the data to be reshaped. Specified like | ||||||||||||||||
names |
names of columns to generate, or retrieve variable names from:
| ||||||||||||||||
labels |
names of columns to generate, or retrieve variable labels from:
| ||||||||||||||||
how |
character. The pivoting method: one of | ||||||||||||||||
na.rm |
logical. | ||||||||||||||||
factor |
character. Whether to generate new 'names' and/or 'labels' columns as factor variables. This is generally recommended as factors are more memory efficient than character vectors and also faster in subsequent filtering and grouping. Internally, this argument is evaluated as | ||||||||||||||||
check.dups |
logical. | ||||||||||||||||
FUN |
function to aggregate values. At present, only a single function is allowed. Fast Statistical Functions receive vectorized execution. For maximum efficiency, a small set of internal functions is provided: | ||||||||||||||||
FUN.args |
(optional) list of arguments passed to | ||||||||||||||||
nthreads |
integer. if | ||||||||||||||||
fill |
if | ||||||||||||||||
drop |
logical. if | ||||||||||||||||
sort |
if | ||||||||||||||||
transpose |
if |
Pivot wider essentially works as follows: compute g_rows = group(ids)
and also g_cols = group(names)
(using group
if sort = FALSE
). g_rows
gives the row-numbers of the wider data frame and g_cols
the column numbers.
Then, a C function generates a wide data frame and runs through each long column ('values'), assigning each value to the corresponding row and column in the wide frame. In this process FUN
is always applied. The default, "last"
, does nothing at all, i.e., if there are duplicates, some values are overwritten. "first"
works similarly just that the C-loop is executed the other way around. The other hard-coded options count, sum, average, or compare observations on the fly. Missing values are internally skipped for statistical functions as there is no way to distinguish an incoming NA
from an initial NA
- apart from counting occurrences using an internal structure of the same size as the result data frame which is costly and thus not implemented.
When passing an R-function to FUN
, the data is grouped using g_full = group(list(g_rows, g_cols))
, aggregated by groups, and expanded again to full length using TRA
before entering the reshaping algorithm. Thus, this is significantly more expensive than the optimized internal functions. With Fast Statistical Functions the aggregation is vectorized across groups, other functions are applied using BY
- by far the slowest option.
If check.dups = TRUE
, a check of the form fnunique(list(g_rows, g_cols)) < fnrow(data)
is run, and an informative warning is issued if duplicates are found.
Recast pivoting works similarly. In long pivots FUN
is ignored and the check simply amounts to fnunique(ids) < fnrow(data)
.
A reshaped data frame with the same class and attributes (except for 'names'/'row-names') as the input frame.
Leaving either 'ids' or 'values' empty will assign all other columns (except for "variable"
if how = "wider"|"recast"
) to the non-specified argument. It is also possible to leave both empty, e.g. for complete melting if how = "wider"
or data transposition if how = "recast"
(similar to data.table::transpose
but supporting multiple names columns and variable labels). See Examples.
pivot
currently does not support concurrently melting/pivoting longer to multiple columns. See data.table::melt
or pivot_longer
from tidyr or tidytable for an efficient alternative with this feature. It is also possible to achieve this with just a little bit of programming. An example is provided below.
collap
, vec
, rowbind
, unlist2d
, Data Frame Manipulation, Collapse Overview
# -------------------------------- PIVOT LONGER ---------------------------------
# Simple Melting (Reshaping Long)
pivot(mtcars) |> head()
pivot(iris, "Species") |> head()
pivot(iris, values = 1:4) |> head() # Same thing
# Using collapse's datasets
head(wlddev)
pivot(wlddev, 1:8, na.rm = TRUE) |> head()
pivot(wlddev, c("iso3c", "year"), c("PCGDP", "LIFEEX"), na.rm = TRUE) |> head()
head(GGDC10S)
pivot(GGDC10S, 1:5, names = list("Sectorcode", "Value"), na.rm = TRUE) |> head()
# Can also set by name: variable and/or value. Note that 'value' here remains lowercase
pivot(GGDC10S, 1:5, names = list(variable = "Sectorcode"), na.rm = TRUE) |> head()
# Melting including saving labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = TRUE) |> head()
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = "description") |> head()
# Also assigning new labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = list("description",
c("Sector Code", "Sector Description", "Value"))) |> namlab()
# Can leave out value column by providing named vector of labels
pivot(GGDC10S, 1:5, na.rm = TRUE, labels = list("description",
c(variable = "Sector Code", description = "Sector Description"))) |> namlab()
# Now here is a nice example that is explicit and respects the dataset naming conventions
pivot(GGDC10S, ids = 1:5, na.rm = TRUE,
names = list(variable = "Sectorcode",
value = "Value"),
labels = list(name = "Sector",
new = c(Sectorcode = "GGDC10S Sector Code",
Sector = "Long Sector Description",
Value = "Employment or Value Added"))) |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
# Note that pivot() currently does not support melting to multiple columns
# But you can tackle the issue with a bit of programming:
wide <- pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable",
how = "wider", na.rm = TRUE)
head(wide)
library(magrittr)
wide %>% {av(pivot(., 1:2, grep("_VA", names(.))), pivot(gvr(., "_EMP")))} |> head()
wide %>% {av(av(gv(., 1:2), rm_stub(gvr(., "_VA"), "_VA", pre = FALSE)) |>
pivot(1:2, names = list("Sectorcode", "VA"), labels = "Sector"),
EMP = vec(gvr(., "_EMP")))} |> head()
rm(wide)
# -------------------------------- PIVOT WIDER ---------------------------------
iris_long <- pivot(iris, "Species") # Getting a long frame
head(iris_long)
# If 'names'/'values' not supplied, searches for 'variable' and 'value' columns
pivot(iris_long, how = "wider")
# But here the records are not identified by 'Species': thus aggregation with last value:
pivot(iris_long, how = "wider", check = TRUE) # issues a warning
rm(iris_long)
# This works better, these two are inverse operations
wlddev |> pivot(1:8) |> pivot(how = "w") |> head()
# ...but not perfect, we loose labels
namlab(wlddev)
wlddev |> pivot(1:8) |> pivot(how = "w") |> namlab()
# But pivot() supports labels: these are perfect inverse operations
wlddev |> pivot(1:8, labels = "label") |> print(max = 50) |> # Notice the "label" column
pivot(how = "w", labels = "label") |> namlab()
# If the data does not have 'variable'/'value' cols: need to specify 'names'/'values'
# Using a single column:
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w") |> head()
SUM_wide <- pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w", na.rm = TRUE)
head(SUM_wide) # na.rm = TRUE here removes all new rows completely missing data
tail(SUM_wide) # But there may still be NA's, notice the NA in the final row
# We could use fill to set another value
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", how = "w",
na.rm = TRUE, fill = -9999) |> tail()
# This will keep the label of "SUM", unless we supply a column with new labels
namlab(SUM_wide)
# Such a column is not available here, but we could use "Variable" twice
pivot(GGDC10S, c("Country", "Year"), "SUM", "Variable", "Variable", how = "w",
na.rm = TRUE) |> namlab()
# Alternatively, can of course relabel ex-post
SUM_wide |> relabel(VA = "Value Added", EMP = "Employment") |> namlab()
rm(SUM_wide)
# Multiple-column pivots
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE) |> head()
# Here we may prefer a transposed column order
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = "columns") |> head()
# Can also flip the order of names (independently of columns)
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = "names") |> head()
# Can also enable both (complete transposition)
pivot(GGDC10S, c("Country", "Year"), c("AGR", "MAN", "SUM"), "Variable", how = "w",
na.rm = TRUE, transpose = TRUE) |> head() # or tranpose = c("columns", "names")
# Finally, here is a nice, simple way to reshape the entire dataset.
pivot(GGDC10S, values = 6:16, names = "Variable", na.rm = TRUE, how = "w") |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
# -------------------------------- PIVOT RECAST ---------------------------------
# Look at the data again
head(GGDC10S)
# Let's stack the sectors and instead create variable columns
pivot(GGDC10S, .c(Country, Regioncode, Region, Year),
names = list("Variable", "Sectorcode"), how = "r") |> head()
# Same thing (a bit easier)
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"), how = "r") |> head()
# Removing missing values
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"), how = "r",
na.rm = TRUE) |> head()
# Saving Labels
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
labels = list(to = "Sector"), how = "r", na.rm = TRUE) |> head()
# Supplying new labels for generated columns: as complete as it gets
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
labels = list(to = "Sector",
new = c(Sectorcode = "GGDC10S Sector Code",
Sector = "Long Sector Description",
VA = "Value Added",
EMP = "Employment")), how = "r", na.rm = TRUE) |>
namlab(N = TRUE, Nd = TRUE, class = TRUE)
# Now another (slightly unconventional) use case here is data transposition
# Let's get the data for Botswana
BWA <- GGDC10S |> fsubset(Country == "BWA", Variable, Year, AGR:SUM)
head(BWA)
# By supplying no ids or values, we are simply requesting a transpose operation
pivot(BWA, names = list(from = c("Variable", "Year"), to = "Sectorcode"), how = "r")
# Same with labels
pivot(BWA, names = list(from = c("Variable", "Year"), to = "Sectorcode"),
labels = list(to = "Sector"), how = "r")
# For simple cases, data.table::transpose() will be more efficient, but with multiple
# columns to generate names and/or variable labels to be saved/assigned, pivot() is handy
rm(BWA)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.