generate_demo_data: Generate a Demo Dataset with Specified Number of Clusters and...

View source: R/utils.R

generate_demo_data    R Documentation

Generate a Demo Dataset with Specified Number of Clusters and Overlap

Description

Generates a demo dataset with a specified number of subjects and features and an approximate number of clusters. The clusters are kept close enough together to overlap to some degree, which simulates real-world data. The dataset includes an outcome label, demographic information (age and gender), and numeric features in which values are missing with a specified probability.

Usage

generate_demo_data(
  n_subjects = 1000,
  n_features = 200,
  missing_prob = 0.1,
  desired_number_clusters = 3,
  cluster_overlap_sd = 15
)

Arguments

n_subjects

Integer. The number of subjects (rows) to generate. Defaults to 1000.

n_features

Integer. The number of features (columns) to generate. Defaults to 200.

missing_prob

Numeric. The probability of introducing missing values (NA) in the feature columns. Defaults to 0.1.

desired_number_clusters

Integer. The approximate number of clusters to generate in the feature space. Defaults to 3.

cluster_overlap_sd

Numeric. The standard deviation that controls cluster overlap; larger values produce more overlap. Defaults to 15.
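
To give a feel for how a standard deviation translates into overlap, the sketch below considers two one-dimensional Gaussian clusters and computes the share of one cluster's points that fall closer to the other cluster's center. The center gap of 30 and the helper name overlap_share are assumptions made for illustration; they are not taken from the package.

```r
# Two 1-D Gaussian clusters whose centers are 30 units apart
# (illustrative value, not a package default).
center_gap <- 30

# Share of points from one cluster that land past the midpoint,
# i.e. closer to the other cluster's center.
overlap_share <- function(sd) pnorm(-(center_gap / 2) / sd)

overlap_share(5)   # about 0.0013: clusters barely touch
overlap_share(15)  # about 0.159: substantial overlap
```

Under these assumptions, tripling the standard deviation from 5 to 15 raises the overlapping share from roughly 0.1% to roughly 16%.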

Details

The function generates n_features numeric columns from Gaussian clusters that partially overlap, which makes the data more realistic. Missing values (NA) are introduced into each feature column with probability missing_prob.
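
The general idea can be sketched in base R. This is a minimal illustration and not the package's implementation: the helper sketch_clustered_features, the 0 to 100 range for cluster centers, the cell-wise independent missingness, and the column names are all assumptions.

```r
# Hypothetical helper, not part of immunaut: draws Gaussian clusters with
# random centers and a spread set by `overlap_sd`, then adds missing values.
sketch_clustered_features <- function(n_subjects = 100, n_features = 5,
                                      n_clusters = 3, overlap_sd = 15,
                                      missing_prob = 0.1) {
  # Assign each subject to one cluster at random
  cluster_id <- sample(seq_len(n_clusters), n_subjects, replace = TRUE)
  # One center per cluster and feature (the 0-100 range is an assumption)
  centers <- matrix(runif(n_clusters * n_features, min = 0, max = 100),
                    nrow = n_clusters, ncol = n_features)
  # Each value is the subject's cluster center plus Gaussian noise
  noise <- matrix(rnorm(n_subjects * n_features, mean = 0, sd = overlap_sd),
                  nrow = n_subjects, ncol = n_features)
  features <- centers[cluster_id, , drop = FALSE] + noise
  # Set each cell to NA independently with probability missing_prob
  mask <- matrix(runif(n_subjects * n_features) < missing_prob,
                 nrow = n_subjects, ncol = n_features)
  features[mask] <- NA
  out <- as.data.frame(features)
  names(out) <- paste("Feature", seq_len(n_features))  # illustrative names
  out
}

set.seed(42)
sketch <- sketch_clustered_features(n_subjects = 200, n_features = 4)
dim(sketch)          # rows and columns of the simulated feature table
mean(is.na(sketch))  # observed share of missing cells, near missing_prob
```

Larger overlap_sd values spread each cluster further around its center, so neighbouring clusters blend into each other.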

Value

A data frame containing the generated demo dataset, with columns:

  • outcome: A categorical variable with values "low" or "high".

  • age: A numeric variable representing the age of the subject (range 18-90).

  • gender: A categorical variable with values "male" or "female".

  • Feature X: Numeric feature columns with random values and some missing data.

Examples


# Generate a demo dataset with 1000 subjects, 200 features, and 3 clusters
demo_data <- generate_demo_data(n_subjects = 1000, n_features = 200, 
                                desired_number_clusters = 3, 
                                cluster_overlap_sd = 15, missing_prob = 0.1)

# View the first few rows of the dataset
head(demo_data)
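
A quick sanity check after generating the data is to compare the share of missing values per feature column against missing_prob and to confirm the documented ranges. The sketch below runs on a small stand-in data frame with the documented column layout so that it works without the package; the stand-in, its size, and its feature column names are assumptions. With immunaut installed, the same checks could be applied to demo_data instead.

```r
# Stand-in with the documented column layout (assumed, for illustration only)
set.seed(1)
demo_like <- data.frame(
  outcome = sample(c("low", "high"), 50, replace = TRUE),
  age = sample(18:90, 50, replace = TRUE),
  gender = sample(c("male", "female"), 50, replace = TRUE)
)
demo_like[["Feature 1"]] <- ifelse(runif(50) < 0.1, NA, rnorm(50))
demo_like[["Feature 2"]] <- ifelse(runif(50) < 0.1, NA, rnorm(50))

# Treat every column other than the three documented ones as a feature
feature_cols <- setdiff(names(demo_like), c("outcome", "age", "gender"))

# Share of missing values per feature column
na_share <- colMeans(is.na(demo_like[, feature_cols, drop = FALSE]))
na_share

# Outcome balance and age range
table(demo_like$outcome)
range(demo_like$age)
```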



immunaut documentation built on April 12, 2025, 1:22 a.m.