aggregate_to_symbolic: Aggregate Tabular Data to Symbolic Data
In dataSDA: Datasets and Basic Statistics for Symbolic Data Analysis


Description

Aggregate tabular numerical data (n by p) into interval-valued or histogram-valued symbolic data (K by p) based on a grouping mechanism.

Usage

aggregate_to_symbolic(x, type = "int", group_by = "kmeans",
  stratify_var = NULL, K = 5, interval = "range",
  quantile_probs = c(0.05, 0.95), bins = 10, nK = NULL,
  zero_width = c("keep", "remove", "regenerate", "adjust"), epsilon = 1e-07)

Arguments

`x`	A data.frame with n rows and p columns. May contain non-numeric columns used for grouping or stratification; only numeric columns are aggregated.
`type`	Output symbolic type: `"int"` for interval data or `"hist"` for histogram data.
`group_by`	Grouping mechanism. One of: `"kmeans"`, which partitions the data into `K` groups using k-means clustering; `"hclust"`, which partitions the data into `K` groups using hierarchical clustering; `"resampling"`, which generates `K` concepts by randomly sampling `nK` observations with replacement, repeated `K` times; or a column name or column index, which uses the specified categorical variable to define the groups.
`stratify_var`	Optional column name or index for a stratification variable. When provided, grouping and aggregation are performed independently within each level. Default is `NULL`.
`K`	Number of groups for clustering (`group_by = "kmeans"` or `"hclust"`) or resampling (`group_by = "resampling"`). Ignored when `group_by` is a variable. Default is 5.
`interval`	Interval construction method when `type = "int"`: `"range"` uses min/max; `"quantile"` uses quantiles given by `quantile_probs`. Default is `"range"`.
`quantile_probs`	Numeric vector of length 2 giving the lower and upper quantile probabilities for `interval = "quantile"`. Default is `c(0.05, 0.95)`.
`bins`	Number of histogram bins when `type = "hist"`. Default is 10.
`nK`	Number of observations to sample per group when `group_by = "resampling"`. Default is `NULL`, in which case `floor(n / K)` is used.
`zero_width`	How to handle zero-width intervals (`min == max`) produced when `type = "int"`; ignored when `type = "hist"`. Such degenerate intervals break downstream tools that divide by interval width (e.g. `ggInterval::ggInterval_indexImage()`). One of: `"keep"` (default) returns zero-width intervals unchanged, and `check_zero_width_intervals` can be used to screen the result; `"remove"` drops every concept (row) that contains at least one zero-width interval; `"regenerate"` re-runs the aggregation (re-clustering or re-sampling) until no zero-width interval remains, which only works for stochastic `group_by` (`"kmeans"`, `"resampling"`), whereas deterministic grouping (a variable or `"hclust"`) cannot produce a different result and therefore raises an error suggesting another option; `"adjust"` adds a small amount `epsilon` to the upper endpoint of each zero-width interval.
`epsilon`	Positive amount added to the upper endpoint of each zero-width interval when `zero_width = "adjust"`. Default is `1e-07`.
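
The `"remove"` and `"adjust"` options can be pictured with a small base-R sketch. It does not call dataSDA and is not the package's implementation; the toy data frame and its column names (`concept`, `lower`, `upper`) are hypothetical and only stand in for one interval-valued variable.

```r
# Sketch only: toy interval bounds; the column names are hypothetical.
bounds <- data.frame(
  concept = c("a", "b", "c"),
  lower   = c(1.0, 2.5, 4.0),
  upper   = c(3.0, 2.5, 6.0)   # concept "b" has min == max
)
epsilon <- 1e-07

# Flag zero-width intervals.
zero <- bounds$lower == bounds$upper

# Analogue of zero_width = "remove": drop concepts with a zero-width interval.
removed <- bounds[!zero, ]

# Analogue of zero_width = "adjust": widen the upper endpoint by epsilon.
adjusted <- bounds
adjusted$upper[zero] <- adjusted$upper[zero] + epsilon
```

Removing shrinks the number of concepts, while adjusting keeps every concept at the cost of a tiny artificial width.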

Details

The function aggregates classical tabular data into symbolic data in two steps:

1. Partition observations into groups via group_by (clustering, resampling, or a categorical variable).
2. Within each group, summarize each numeric variable as an interval (min/max or quantiles) or a histogram.
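
The two steps above can be sketched in base R. This is only an illustration of the idea under simple assumptions (k-means on the four numeric iris columns, default `quantile()` settings), not the code dataSDA runs, and the object names are made up for the example.

```r
# Sketch only: "group, then summarize" in base R; not dataSDA's implementation.
set.seed(1)
x <- iris[, 1:4]

# Step 1: partition the rows into K groups (k-means here).
K <- 3
groups <- kmeans(x, centers = K)$cluster

# Step 2: summarize each numeric variable within each group.
# Min/max bounds play the role of interval = "range".
range_lower <- aggregate(x, by = list(group = groups), FUN = min)
range_upper <- aggregate(x, by = list(group = groups), FUN = max)

# Two quantiles play the role of interval = "quantile".
q_lower <- aggregate(x, by = list(group = groups), FUN = quantile, probs = 0.05)
q_upper <- aggregate(x, by = list(group = groups), FUN = quantile, probs = 0.95)
```

Each result has one row per group, so pairing the lower and upper tables column by column gives one interval per group and variable. Quantile bounds are never wider than the min/max bounds, which is why they are less sensitive to outliers.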

When stratify_var is provided, grouping and aggregation are performed within each level of the stratification variable. Label values are prefixed by the stratum name (e.g., "setosa.cluster_1").
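
As a rough illustration of stratified grouping, the base-R sketch below clusters within each species separately and builds prefixed labels. The label pattern follows the "setosa.cluster_1" example; everything else (the helper objects and the use of `kmeans()` directly) is an assumption for the sketch rather than the package's code.

```r
# Sketch only: cluster within each stratum and prefix labels by stratum name.
set.seed(1)
K <- 2
strata <- split(iris[, 1:4], iris$Species)

labels <- unlist(lapply(names(strata), function(s) {
  # Cluster only the rows belonging to stratum `s`.
  cl <- kmeans(strata[[s]], centers = K)$cluster
  paste0(s, ".cluster_", cl)
}))

# Distinct concept labels: K per stratum.
sort(unique(labels))
```

With three species and `K = 2`, this yields six distinct labels, two per stratum.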

For type = "hist", bin boundaries are computed from the global data range to ensure comparability across groups.
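
The idea of shared bin boundaries can be sketched in base R for a single variable. Equal-width breaks are an assumption made here for illustration; the source only states that boundaries come from the global data range.

```r
# Sketch only: one set of histogram breaks derived from the global range.
x <- iris$Sepal.Length
groups <- iris$Species
bins <- 5

# Common breaks spanning the full range of the variable (equal width assumed).
breaks <- seq(min(x), max(x), length.out = bins + 1)

# Per-group counts on the common breaks (rows = bins, columns = groups).
counts <- sapply(split(x, groups), function(v) {
  hist(v, breaks = breaks, plot = FALSE)$counts
})

# Convert counts to relative frequencies within each group.
freqs <- sweep(counts, 2, colSums(counts), "/")
```

Because every group uses the same breaks, the rows of `freqs` refer to the same bins across groups and can be compared directly.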

Non-numeric columns (other than those used for grouping or stratification) are silently excluded from aggregation.
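
Since the exclusion is silent, it can help to check beforehand which columns would be affected. The sketch below assumes that "numeric" means `is.numeric()` returns `TRUE`, which is an assumption about the package's test.

```r
# Sketch only: list columns that are not numeric and so would not be aggregated.
num_cols <- vapply(iris, is.numeric, logical(1))
kept <- names(iris)[num_cols]
dropped <- names(iris)[!num_cols]
```

For iris, `dropped` contains only `"Species"`.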

Value

For type = "int": a symbolic_tbl (RSDA format) with a label column followed by symbolic_interval columns for each numeric variable (K rows, 1 + p columns).
For type = "hist": a MatH object (K rows by p columns of histogram-valued data).

Examples

library(dataSDA)

# Group by a categorical variable -> interval data
res1 <- aggregate_to_symbolic(iris, type = "int", group_by = "Species")
res1

# K-means clustering -> interval data (seeded because k-means starts are random)
set.seed(42)
res2 <- aggregate_to_symbolic(iris[, 1:4], type = "int",
                               group_by = "kmeans", K = 3)

# Quantile-based intervals
res3 <- aggregate_to_symbolic(iris[, 1:4], type = "int",
                               group_by = "kmeans", K = 3,
                               interval = "quantile",
                               quantile_probs = c(0.1, 0.9))

# Resampling -> interval data
set.seed(42)
res4 <- aggregate_to_symbolic(iris[, 1:4], type = "int",
                               group_by = "resampling", K = 5, nK = 30)

# Histogram aggregation
res5 <- aggregate_to_symbolic(iris, type = "hist",
                               group_by = "Species", bins = 5)

# Hierarchical clustering -> interval data
res6 <- aggregate_to_symbolic(iris[, 1:4], type = "int",
                               group_by = "hclust", K = 3)

# Stratified aggregation
res7 <- aggregate_to_symbolic(iris, type = "int",
                               group_by = "kmeans", K = 2,
                               stratify_var = "Species")

dataSDA documentation built on June 12, 2026, 9:06 a.m.