aggregate_to_symbolic: Aggregate Tabular Data to Symbolic Data

View source: R/utilities.R

aggregate_to_symbolic    R Documentation

Description

Aggregate tabular numerical data (n by p) into interval-valued or histogram-valued symbolic data (K by p) based on a grouping mechanism.

Usage

aggregate_to_symbolic(x, type = "int", group_by = "kmeans",
  stratify_var = NULL, K = 5, interval = "range",
  quantile_probs = c(0.05, 0.95), bins = 10, nK = NULL,
  zero_width = c("keep", "remove", "regenerate", "adjust"), epsilon = 1e-07)

Arguments

x

A data.frame with n rows and p columns. May contain non-numeric columns used for grouping or stratification; only numeric columns are aggregated.

type

Output symbolic type: "int" for interval data or "hist" for histogram data.

group_by

Grouping mechanism. One of:

"kmeans"

Partition the data into K groups using k-means clustering.

"hclust"

Partition the data into K groups using hierarchical clustering.

"resampling"

Generate K concepts, each formed by randomly sampling nK observations with replacement.

A column name or column index

Use the specified categorical variable to define groups.

stratify_var

Optional column name or index for a stratification variable. When provided, grouping and aggregation are performed independently within each level. Default is NULL.

K

Number of groups for clustering (group_by = "kmeans" or "hclust") or resampling (group_by = "resampling"). Ignored when group_by is a variable. Default is 5.

interval

Interval construction method when type = "int": "range" uses min/max; "quantile" uses quantiles given by quantile_probs. Default is "range".

quantile_probs

Numeric vector of length 2 giving the lower and upper quantile probabilities for interval = "quantile". Default is c(0.05, 0.95).

bins

Number of histogram bins when type = "hist". Default is 10.

nK

Number of observations to sample per group when group_by = "resampling". If NULL (the default), floor(n / K) is used.

zero_width

How to handle zero-width intervals (lower endpoint equal to upper endpoint, e.g. min == max) produced when type = "int". Such degenerate intervals break downstream tools that divide by interval width (e.g. ggInterval::ggInterval_indexImage()). One of:

"keep"

(default) Return the aggregated output unchanged, including any zero-width intervals. Use check_zero_width_intervals() to screen the result.

"remove"

Drop every concept (row) that contains at least one zero-width interval.

"regenerate"

Re-run the aggregation (re-clustering or re-sampling) until no zero-width interval remains. Only effective for stochastic group_by ("kmeans", "resampling"); for deterministic grouping (a variable or "hclust") the result cannot change, so an error is raised suggesting another option.

"adjust"

Add a small amount epsilon to the upper endpoint of each zero-width interval.

Ignored when type = "hist".

epsilon

Positive amount added to the upper endpoint of each zero-width interval when zero_width = "adjust". Default is 1e-07.
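
The "remove" and "adjust" options for zero_width can be illustrated on plain numeric bounds. This is a minimal base-R sketch on made-up toy values, not the package's internal code; the vectors lower and upper and the variable names are assumptions introduced only for illustration.

```r
# Sketch only: toy interval bounds for three concepts on one variable.
lower <- c(1.0, 2.5, 4.0)
upper <- c(1.8, 2.5, 4.6)   # second concept is degenerate (lower == upper)

# Flag zero-width intervals
zero <- (upper - lower) == 0

# "remove": keep only concepts without a zero-width interval
kept <- which(!zero)

# "adjust": add epsilon to the upper endpoint of each zero-width interval
epsilon <- 1e-07
upper_adj <- ifelse(zero, upper + epsilon, upper)
```

After the adjustment every interval has strictly positive width, so a downstream division by upper_adj - lower is well defined.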

Details

The function aggregates classical tabular data into symbolic data by:

  1. Partitioning observations into groups via group_by (clustering, resampling, or a categorical variable).

  2. Within each group, summarizing each numeric variable as an interval (min/max or quantiles) or a histogram.
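
The two steps above can be sketched with base R alone. This is an illustration under stated assumptions, not the package's implementation: grouping uses stats::kmeans(), the helper interval_by_group() is hypothetical, and the result is a plain numeric matrix rather than a symbolic_tbl.

```r
# Hypothetical helper (not part of dataSDA): summarize each numeric column
# of `x` as a [lower, upper] interval within each group.
interval_by_group <- function(x, groups, probs = NULL) {
  num <- x[, vapply(x, is.numeric, logical(1)), drop = FALSE]
  pieces <- split(num, groups)
  out <- lapply(pieces, function(g) {
    # One row per group: <var>.lower and <var>.upper for every variable
    unlist(lapply(g, function(v) {
      if (is.null(probs)) {
        c(lower = min(v), upper = max(v))          # range interval
      } else {
        q <- quantile(v, probs = probs, names = FALSE)
        c(lower = q[1], upper = q[2])              # quantile interval
      }
    }))
  })
  do.call(rbind, out)
}

set.seed(1)
# Step 1: partition observations (here with k-means, K = 3)
groups <- kmeans(iris[, 1:4], centers = 3)$cluster
# Step 2: summarize each variable per group as a range interval
ranges <- interval_by_group(iris[, 1:4], groups)
# Same grouping, quantile-based intervals
quants <- interval_by_group(iris[, 1:4], groups, probs = c(0.05, 0.95))
```

Each row of ranges or quants corresponds to one group, with a lower and an upper column per numeric variable. Quantile intervals are never wider than the range intervals for the same group.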

When stratify_var is provided, grouping and aggregation are performed within each level of the stratification variable. Label values are prefixed by the stratum name (e.g., "setosa.cluster_1").
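
A rough base-R sketch of stratified grouping is given below. It assumes k-means with two clusters within each level and builds labels in the "stratum.cluster_k" pattern of the example above; the labels actually produced by the package may differ in detail.

```r
# Sketch only: cluster within each level of a stratification variable
# and prefix the group label with the stratum name.
set.seed(1)
strata <- split(iris[, 1:4], iris$Species)
labels <- unlist(lapply(names(strata), function(s) {
  cl <- kmeans(strata[[s]], centers = 2)$cluster
  paste0(s, ".cluster_", cl)
}))
sort(unique(labels))
```

With three strata and two clusters per stratum, six distinct labels result, and no group ever mixes observations from different strata.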

For type = "hist", bin boundaries are computed from the global data range to ensure comparability across groups.
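
The idea of shared bin boundaries can be sketched for a single variable with base R. Equal-width bins are an assumption made here for illustration; the text above only states that boundaries come from the global data range.

```r
# Sketch only: shared bin boundaries from the global range of one variable.
v <- iris$Sepal.Length
bins <- 5
breaks <- seq(min(v), max(v), length.out = bins + 1)

# Per-group relative frequencies computed over the same breaks
freqs <- sapply(split(v, iris$Species), function(g) {
  counts <- hist(g, breaks = breaks, plot = FALSE,
                 include.lowest = TRUE)$counts
  counts / sum(counts)
})
rownames(freqs) <- paste0("bin_", seq_len(bins))
```

Because every group uses the same breaks, row i of freqs refers to the same value range in every column, which is what makes the histograms comparable across groups.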

Non-numeric columns (other than those used for grouping or stratification) are silently excluded from aggregation.

Value

  • For type = "int": a symbolic_tbl (RSDA format) with a label column followed by symbolic_interval columns for each numeric variable (K rows, 1 + p columns).

  • For type = "hist": a MatH object (K rows by p columns of histogram-valued data).

Examples

# Group by a categorical variable -> interval data
res1 <- aggregate_to_symbolic(iris, type = "int", group_by = "Species")
res1

# K-means clustering -> interval data
res2 <- aggregate_to_symbolic(iris[, 1:4], type = "int",
                               group_by = "kmeans", K = 3)

# Quantile-based intervals
res3 <- aggregate_to_symbolic(iris[, 1:4], type = "int",
                               group_by = "kmeans", K = 3,
                               interval = "quantile",
                               quantile_probs = c(0.1, 0.9))

# Resampling -> interval data
set.seed(42)
res4 <- aggregate_to_symbolic(iris[, 1:4], type = "int",
                               group_by = "resampling", K = 5, nK = 30)

# Histogram aggregation
res5 <- aggregate_to_symbolic(iris, type = "hist",
                               group_by = "Species", bins = 5)

# Hierarchical clustering -> interval data
res6 <- aggregate_to_symbolic(iris[, 1:4], type = "int",
                               group_by = "hclust", K = 3)

# Stratified aggregation
res7 <- aggregate_to_symbolic(iris, type = "int",
                               group_by = "kmeans", K = 2,
                               stratify_var = "Species")


dataSDA documentation built on June 12, 2026, 9:06 a.m.