cachar_sample: Synthetic Cancer Risk Factor Study Data
In riskdiff: Risk Difference Estimation with Multiple Link Functions and Inverse Probability of Treatment Weighting

cachar_sample

R Documentation

Synthetic Cancer Risk Factor Study Data

Description

A synthetic dataset inspired by cancer screening and risk factor patterns observed during an opportunistic screening program conducted at the Cachar Cancer Hospital and Research Centre in Northeast India, specifically designed to reflect authentic epidemiological relationships without using real patient data.

Usage

cachar_sample

Format

A data frame with 2,500 rows and 12 variables:

id: Participant identifier (1 to 2500)
age: Age in years (continuous, range 18-84)
sex: Biological sex: "male" or "female"
residence: Residence type: "rural", "urban", or "urban slum"
smoking: Current smoking status: "No" or "Yes"
tobacco_chewing: Current tobacco chewing: "No" or "Yes"
areca_nut: Current areca nut use: "No" or "Yes"
alcohol: Current alcohol use: "No" or "Yes"
abnormal_screen: Binary outcome: 1 = abnormal screening (precancerous lesions or cancer), 0 = normal
head_neck_abnormal: Binary outcome: 1 = head/neck abnormality detected, 0 = normal
age_group: Age categories: "Under 40", "40-60", "Over 60"
tobacco_areca_both: Combined exposure: "Yes" if both tobacco_chewing and areca_nut are "Yes", "No" otherwise

Details

This synthetic dataset was designed to reflect authentic epidemiological patterns observed in Northeast India, particularly the distinctive tobacco and areca nut use patterns of the region. All data points are mathematically generated rather than collected from real individuals.

Key epidemiological features modeled:

Areca nut use: Very high prevalence (~69%) reflecting regional cultural practices
Tobacco chewing: Moderate to high prevalence (~53%), often used with areca nut
Smoking: Lower prevalence (~13%) with strong male predominance
Cancer outcomes: Realistic prevalence (~3.5%) for population-based screening, including both precancerous lesions and invasive cancers
Geographic patterns: Predominantly rural population (~87%)

Synthetic Data Advantages: The synthetic approach preserves authentic statistical relationships while:

Avoiding any privacy or ethical concerns
Ensuring reproducible examples and tests
Providing controlled demonstration scenarios
Maintaining cultural authenticity for educational purposes

Risk Factor Relationships: The data models realistic dose-response relationships between multiple tobacco exposures and cancer outcomes, with particularly strong associations for areca nut use and head/neck abnormalities, reflecting authentic epidemiological patterns from this region.

Note

This synthetic dataset is designed for educational and software demonstration purposes. While the statistical relationships reflect authentic epidemiological patterns, the data should not be used for research conclusions about real populations. The cultural patterns represented (high areca nut use, specific tobacco consumption practices) are authentic to Northeast India.

Source

Synthetic dataset created for the riskdiff package. Inspired by cancer screening patterns observed in Northeast India but contains no real patient data. Statistical relationships designed to reflect authentic epidemiological patterns from this region for educational and methodological purposes.

References

Epidemiological patterns modeled after studies of tobacco use and cancer risk in Northeast India. For research involving actual populations from this region, consult published literature on areca nut and tobacco-related cancer risks in South Asian populations.

Warnakulasuriya S, Trivedy C, Peters TJ (2002). "Areca nut use: an independent risk factor for oral cancer." BMJ, 324(7341), 799-800.

Gupta PC, Ray CS (2004). "Epidemiology of betel quid use." Annals of the Academy of Medicine, Singapore, 33(4 Suppl), 31-36.

Examples

data(cachar_sample)
head(cachar_sample)

# Basic descriptive statistics
table(cachar_sample$areca_nut, cachar_sample$abnormal_screen)

# Regional tobacco use patterns
with(cachar_sample, table(areca_nut, tobacco_chewing))

# Simple risk difference for areca nut and abnormal screening
rd_areca <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut"
)
print(rd_areca)

# Age-adjusted analysis
rd_adjusted <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut",
  adjust_vars = "age"
)
print(rd_adjusted)

# Stratified by sex
rd_stratified <- calc_risk_diff(
  data = cachar_sample,
  outcome = "head_neck_abnormal",
  exposure = "smoking",
  strata = "sex"
)
print(rd_stratified)

# Multiple tobacco exposures comparison
rd_smoking <- calc_risk_diff(cachar_sample, "abnormal_screen", "smoking")
rd_chewing <- calc_risk_diff(cachar_sample, "abnormal_screen", "tobacco_chewing")
rd_areca <- calc_risk_diff(cachar_sample, "abnormal_screen", "areca_nut")

# Compare risk differences
cat("Risk differences for abnormal screening:\n")
cat("Smoking:", sprintf("%.1f%%", rd_smoking$rd * 100), "\n")
cat("Tobacco chewing:", sprintf("%.1f%%", rd_chewing$rd * 100), "\n")
cat("Areca nut:", sprintf("%.1f%%", rd_areca$rd * 100), "\n")

# Create summary table
cat(create_simple_table(rd_areca, "Abnormal Screening Risk by Areca Nut Use"))

riskdiff documentation built on June 30, 2025, 9:07 a.m.