agg | R Documentation |
Aggregate counts or probabilities from a detailed level of a hierarchical variable to an aggregate level, or scale the detailed-level values so that, aggregated together, they equal the aggregate-level value.
agg(
  dt,
  id_cols,
  value_cols,
  col_stem,
  col_type,
  mapping,
  agg_function = sum,
  missing_dt_severity = "stop",
  present_agg_severity = "stop",
  overlapping_dt_severity = "stop",
  na_value_severity = "stop",
  collapse_interval_cols = FALSE,
  quiet = FALSE
)
scale(
  dt,
  id_cols,
  value_cols,
  col_stem,
  col_type,
  mapping = NULL,
  agg_function = sum,
  missing_dt_severity = "stop",
  overlapping_dt_severity = "stop",
  na_value_severity = "stop",
  collapse_interval_cols = FALSE,
  collapse_missing = FALSE,
  quiet = FALSE
)
dt |
[ |
id_cols |
[ |
value_cols |
[ |
col_stem |
[ |
col_type |
[ |
mapping |
[ |
agg_function |
[ |
missing_dt_severity |
[ |
present_agg_severity |
[ |
overlapping_dt_severity |
[ |
na_value_severity |
[ |
collapse_interval_cols |
[ |
quiet |
[ |
collapse_missing |
[ |
The agg function can be used to aggregate to different levels of a pre-defined hierarchy. For example, for a categorical variable like location you can aggregate from the country level to the global level, and for a numeric 'interval' variable like age you can aggregate from five-year age groups to all ages combined.
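As a rough illustration of what an all-ages aggregate means, the sketch below sums five-year age group counts in base R. It is not the package's implementation; the data frame `counts` and its values are made up for the example.

```r
# Illustration only: base R, hypothetical data, not how agg() is implemented.
# Counts by five-year age group for a single location-year.
counts <- data.frame(
  age_start = seq(0, 90, by = 5),
  age_end = c(seq(5, 90, by = 5), Inf),
  value = 10
)

# The all-ages aggregate spans the full age range and sums the detailed groups.
all_ages <- data.frame(
  age_start = min(counts$age_start),
  age_end = max(counts$age_end),
  value = sum(counts$value)
)
```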
The scale function can be used to scale different levels of hierarchical variables like location, so that the sub-national levels aggregated together equal the national level. Similarly, it can be used to scale a numeric 'interval' variable like age so that the five-year age groups aggregated together equal the all-ages value.
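The idea of scaling can be sketched in base R. The example assumes proportional rescaling (each detailed value multiplied by the same factor), which this page does not spell out; the province names and values are hypothetical.

```r
# Illustration only: proportional rescaling is an assumption, and the
# location names and values are made up.
subnational <- c(province_a = 20, province_b = 30, province_c = 50)
national <- 120

# One factor per parent: the parent value over the sum of its children.
scaling_factor <- national / sum(subnational)
scaled <- subnational * scaling_factor

# After scaling, the children add up to the national value.
children_total <- sum(scaled)
```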
If 'location' is the variable to be aggregated or scaled then col_stem = 'location' and 'location' must be included in id_cols. If 'age' is the variable to be aggregated or scaled then col_stem = 'age' and 'age_start' and 'age_end' must be included in id_cols, since both variables are needed to represent interval variables.
The mapping argument defines how different levels of the hierarchical variable relate to each other. For numeric interval variables the hierarchy can be inferred, while for categorical variables the full hierarchy needs to be provided.
mapping for categorical variables must have columns called 'parent' and 'child' that define how the levels of the variable relate to each other. For example, if aggregating or scaling locations then mapping needs to define how each child location relates to each parent location. It is then assumed that every parent location in the mapping hierarchy needs to be aggregated to.
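A minimal sketch of what a categorical mapping might look like, using a plain data frame and made-up location names. Only the 'parent' and 'child' column names come from the description above; everything else is hypothetical.

```r
# Hypothetical two-level location hierarchy; all names are made up.
mapping <- data.frame(
  child = c("province_a", "province_b", "country_x", "country_y"),
  parent = c("country_x", "country_x", "global", "global"),
  stringsAsFactors = FALSE
)

# Most detailed levels: children that never appear as a parent.
leaves <- setdiff(mapping$child, mapping$parent)

# Levels that would be aggregated to: every distinct parent in the mapping.
aggregates <- unique(mapping$parent)
```

Here 'country_x' is both a child (of 'global') and a parent (of the two provinces), which is how intermediate levels of the hierarchy are expressed.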
mapping for numeric interval variables is only needed when aggregating data, to define exactly which aggregates are needed. It must have columns '{col_stem}_start' and '{col_stem}_end' defining the start and end of each aggregate interval that is needed. There can be an optional 'include_NA' logical column that allows 'NA' col_stem values to be included in the aggregate for certain requested aggregates. When scaling data, mapping should be NULL since the hierarchy can be inferred from the available intervals in dt.
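The shape of an interval mapping can be sketched with a plain data frame. The aggregation loop below is a base R stand-in, not the package's implementation, and it assumes intervals are left-closed and right-open; the data are hypothetical.

```r
# Hypothetical single-year age data for ages 0 to 9.
dt <- data.frame(age_start = 0:9, age_end = 1:10, value = 1)

# Requested aggregates: [0, 5), [5, 10) and [0, 10).
# include_NA = TRUE would ask for rows with NA age to be added to the
# [0, 10) aggregate; this toy data has none, so the column is unused here.
age_mapping <- data.frame(
  age_start = c(0, 5, 0),
  age_end = c(5, 10, 10),
  include_NA = c(FALSE, FALSE, TRUE)
)

# Base R stand-in: sum the detailed intervals nested in each requested one.
age_mapping$value <- vapply(seq_len(nrow(age_mapping)), function(i) {
  inside <- dt$age_start >= age_mapping$age_start[i] &
    dt$age_end <= age_mapping$age_end[i]
  sum(dt$value[inside])
}, numeric(1))
```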
agg and scale work even if dt is not a square dataset, meaning it is okay if different combinations of id_cols have different col_stem values available. For example, if making age aggregates, it is okay if some location-years have 5-year age groups while other location-years have 1-year age groups.
If collapse_interval_cols = TRUE it is okay if the interval variables included in id_cols are not all exactly the same; agg and scale will collapse to the most detailed common intervals (see collapse_common_intervals()) prior to aggregation or scaling. An example of this is when aggregating subnational data to the national level (so col_stem is 'location' and col_type is 'categorical') but each subnational location contains different age groups. agg() and scale() first aggregate to the most detailed common age groups before making location aggregates.
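One way to picture the "most detailed common intervals" is as the intervals formed by the breakpoints that every location shares. The base R sketch below assumes each location's age groups are contiguous and cover the same overall range; it is not how collapse_common_intervals() is implemented, and the breakpoints are made up.

```r
# Hypothetical age breakpoints for two subnational locations.
breaks_a <- c(0, 1, 5, 10, 15)  # groups 0-1, 1-5, 5-10, 10-15
breaks_b <- c(0, 5, 15)         # groups 0-5, 5-15

# Breakpoints present in every location define the common intervals.
common_breaks <- Reduce(intersect, list(breaks_a, breaks_b))

common_intervals <- data.frame(
  age_start = head(common_breaks, -1),
  age_end = tail(common_breaks, -1)
)
```

Location A's 0-1 and 1-5 groups would be combined into 0-5, and its 5-10 and 10-15 groups into 5-15, so that both locations share the same age groups before locations are aggregated.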
The agg and scale functions currently only work when combining counts or probabilities. If the data is in rate-space then you need to convert to count space first, aggregate/scale, and then convert back.
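The round trip through count space can be sketched in base R. The `population` denominators and all values are hypothetical, and the aggregation step is a plain sum standing in for a call to agg.

```r
# Hypothetical rates with matching population denominators.
dt <- data.frame(
  age_start = c(0, 5),
  age_end = c(5, 10),
  rate = c(0.02, 0.01),
  population = c(1000, 3000)
)

# 1. Convert rates to counts.
dt$count <- dt$rate * dt$population

# 2. Aggregate counts and denominators (plain sums as a stand-in for agg).
total_count <- sum(dt$count)
total_population <- sum(dt$population)

# 3. Convert the aggregate back to rate space.
combined_rate <- total_count / total_population

# Summing the rates directly gives a different, incorrect answer.
naive_sum <- sum(dt$rate)
```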
[data.table()] with id_cols and value_cols columns for requested aggregates or with scaled values.
missing_dt_severity:
Check for missing levels of col_stem, the variable being aggregated or scaled over.
stop: throw error (this is the default).
warning or message: throw warning/message and continue with aggregation/scaling for requested aggregations/scalings where expected input data in dt is available.
none: don't throw error or warning, continue with aggregation/scaling for requested aggregations/scalings where expected input data in dt is available.
skip: skip this check and continue with aggregation/scaling.
present_agg_severity (agg only):
Check for requested aggregates in mapping that are already present in dt.
stop: throw error (this is the default).
warning or message: throw warning/message, drop aggregates and continue with aggregation.
none: don't throw error or warning, drop aggregates and continue with aggregation.
skip: skip this check and add to the values already present for the aggregates.
na_value_severity:
Check for 'NA' values in the value_cols.
stop: throw error (this is the default).
warning or message: throw warning/message, drop missing values and continue with aggregation/scaling where possible (this will likely cause another error because of missing_dt_severity; consider setting missing_dt_severity = "skip" for functionality similar to na.rm = TRUE).
none: don't throw error or warning, drop missing values and continue with aggregation/scaling where possible (this will likely cause another error because of missing_dt_severity; consider setting missing_dt_severity = "skip" for functionality similar to na.rm = TRUE).
skip: skip this check and propagate NA values through aggregation/scaling.
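The difference between propagating and dropping missing values can be shown with base R's sum, which is the analogy the na.rm = TRUE comparison above draws on. This is an illustration with made-up values, not a call into agg or scale.

```r
# Hypothetical values for three detailed levels, one of them missing.
values <- c(1, NA, 3)

# Propagating NA (analogous to na_value_severity = "skip"):
# the aggregate itself becomes NA.
propagated <- sum(values)

# Dropping NA before aggregating (analogous to na.rm = TRUE):
# the aggregate covers only the non-missing levels.
dropped <- sum(values[!is.na(values)])
```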
overlapping_dt_severity:
Check for overlapping intervals that prevent collapsing to the most detailed common set of intervals. Or check for overlapping intervals in col_stem when aggregating/scaling.
stop: throw error (this is the default).
warning or message: throw warning/message, drop overlapping intervals and continue with aggregation/scaling where possible (this may cause another error because of missing_dt_severity).
none: don't throw error or warning, drop overlapping intervals and continue with aggregation/scaling where possible (this may cause another error because of missing_dt_severity).
skip: skip this check and continue with aggregation/scaling.
# aggregate count data from present day Iran provinces to historical
# provinces and Iran as a whole
input_dt <- data.table::CJ(location = iran_mapping[!grepl("[0-9]+", child),
                                                   child],
                           year = 2011,
                           value = 1)
output_dt <- agg(dt = input_dt,
                 id_cols = c("location", "year"),
                 value_cols = "value",
                 col_stem = "location",
                 col_type = "categorical",
                 mapping = iran_mapping)
# scale count data from present day Iran provinces to Iran national value
input_dt <- data.table::CJ(location = iran_mapping[!grepl("[0-9]+", child),
                                                   child],
                           year = 2011,
                           value = 1)
input_dt_agg <- data.table::data.table(
  location = "Iran (Islamic Republic of)",
  year = 2011, value = 62
)
input_dt <- rbind(input_dt, input_dt_agg)
output_dt <- scale(dt = input_dt,
                   id_cols = c("location", "year"),
                   value_cols = "value",
                   col_stem = "location",
                   col_type = "categorical",
                   mapping = iran_mapping,
                   collapse_missing = TRUE)
# aggregate age-specific count data
input_dt <- data.table::data.table(year = 2010,
                                   age_start = seq(0, 95, 1),
                                   value1 = 1, value2 = 2)
gen_end(input_dt, id_cols = c("year", "age_start"), col_stem = "age")
age_mapping <- data.table::data.table(age_start = c(0, 15, 85),
                                      age_end = c(5, 60, Inf))
output_dt <- agg(dt = input_dt,
                 id_cols = c("year", "age_start", "age_end"),
                 value_cols = c("value1", "value2"),
                 col_stem = "age",
                 col_type = "interval",
                 mapping = age_mapping)
# scale age-specific probability data