search_data: Search Datasets

View source: R/utilities.R

search_data R Documentation

Search Datasets

Description

Search and filter the dataSDA dataset catalog by metadata criteria including sample size, number of variables, subject area, symbolic format, analytical tasks, keywords, and book reference.

Usage

search_data(...)

Arguments

...

Filter expressions. Each argument is a comparison expression evaluated against the dataset metadata. Supported columns:

n

Sample size (numeric). Operators: ==, >, <, >=, <=.

p

Number of variables (numeric). Operators: ==, >, <, >=, <=.

subject

Subject area (character). Case-insensitive partial match with ==. Areas: Agriculture, Automotive, Biology, Biometrics, Botany, Chemistry, Climate, Criminology, Demographics, Digital media, Economics, Education, Energy, Engineering, Environment, Finance, Food science, Forestry, Genomics, Healthcare, Marine biology, Medical, Methodology, Public services, Socioeconomics, Sociology, Sports, Transportation, Zoology.

type

Symbolic format (character). Exact match with ==. Types correspond to the dataset name suffix: "int" (interval), "hist" (histogram), "mix" (mixed), "distr" (distribution), "its" (interval time series), "modal" (modal), "iGAP" (interval in iGAP format).

task

Analytical tasks (character). Case-insensitive partial match with ==. Tasks: Clustering, Classification, Regression, PCA, Descriptive statistics, Discriminant analysis, Visualization, Spatial analysis, Time series, Aggregation.

tag

Keywords (character). Case-insensitive partial match with ==. Use tag == "all" to list all datasets.

book

Book reference short name (character). Case-insensitive partial match with ==. Available books: SDA_2006 (Billard & Diday, 2006), CMD_2020 (Billard & Diday, 2020), SODAS_2008 (Diday & Noirhomme-Fraiture, 2008).

Details

For character columns (subject, type, task, tag, book), the == operator performs a case-insensitive substring match (using grepl). The type column uses short suffix-based labels that match the dataset name suffix (e.g., type == "int" matches all .int datasets).
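
The matching rule for character columns can be illustrated with base R's grepl. This is a minimal sketch of the described behaviour, not the package's internal code; the matches_ci helper and the subjects vector are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: case-insensitive substring match via grepl,
# mirroring the behaviour described for character columns.
matches_ci <- function(pattern, values) {
  grepl(pattern, values, ignore.case = TRUE)
}

# Illustrative values taken from the documented subject areas.
subjects <- c("Biology", "Marine biology", "Finance")

hits <- matches_ci("bio", subjects)
print(subjects[hits])
```

Note that grepl treats the pattern as a regular expression by default, so characters such as "." or "(" in a search string would be interpreted as regex syntax in this sketch.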

For numeric columns (n, p), the comparison operators ==, >, <, >=, and <= are applied with their usual numeric meaning; no partial or approximate matching is performed.

When no arguments are provided, or when tag == "all" is used, all datasets are returned.
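
One way a function could accept unquoted comparison expressions through ... is sketched below. This is an assumption-laden illustration, not the dataSDA implementation: toy_filter and the toy meta table are hypothetical, only numeric columns are handled, and the sketch assumes that multiple filters are combined with logical AND.

```r
# Hypothetical sketch: capture unevaluated filter expressions from `...`
# and evaluate each one against the columns of a metadata table.
toy_filter <- function(meta, ...) {
  exprs <- as.list(substitute(list(...)))[-1]  # drop the leading `list` symbol
  keep <- rep(TRUE, nrow(meta))
  for (e in exprs) {
    # Assumption: every filter must hold (logical AND).
    keep <- keep & as.logical(eval(e, meta, parent.frame()))
  }
  meta[keep, , drop = FALSE]
}

# Toy metadata; the dataset names and sizes are made up for illustration.
meta <- data.frame(
  name = c("toy_a.int", "toy_b.hist", "toy_c.int"),
  n = c(8, 25, 120),
  p = c(4, 6, 12),
  stringsAsFactors = FALSE
)

small <- toy_filter(meta, n >= 20, n <= 100, p < 10)
everything <- toy_filter(meta)  # no filters: all rows are kept
print(small)
```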

Value

A data frame with one row per matching dataset and the following columns: name, n, p, subject, type, task, tag, book.
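
Because the result is an ordinary data frame, it can be post-processed with base R. The sketch below uses a hand-built stand-in with the documented columns rather than real catalog output; every row in res is hypothetical.

```r
# Hypothetical stand-in for a search_data() result; the rows are invented
# and only the column names follow the documentation.
res <- data.frame(
  name = c("toy_b.hist", "toy_a.int", "toy_c.int"),
  n = c(25, 8, 120),
  p = c(6, 4, 12),
  subject = c("Finance", "Biology", "Biology"),
  type = c("hist", "int", "int"),
  task = c("Regression", "Clustering", "Clustering"),
  tag = c("toy", "toy", "toy"),
  book = c("SDA_2006", "SDA_2006", "CMD_2020"),
  stringsAsFactors = FALSE
)

# Order matches by sample size and keep only the size columns.
by_size <- res[order(res$n), c("name", "n", "p")]

# Count matches per symbolic format.
type_counts <- table(res$type)

print(by_size)
print(type_counts)
```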

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.

Billard, L. and Diday, E. (2020). Clustering Methodology for Symbolic Data. Wiley.

Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley.

Examples

# List all datasets
search_data()

# Filter by symbolic format (suffix-based)
search_data(type == "hist")

# Filter by analytical task and size
search_data(task == "Regression", n > 10)

# Filter by book reference
search_data(book == "SDA_2006")

# Combine multiple filters
search_data(type == "int", task == "Clustering", subject == "Biology")

# Filter by size range
search_data(n >= 20, n <= 100, p < 10)


dataSDA documentation built on June 12, 2026, 9:06 a.m.