extract_metadata    R Documentation

Extract Sample Metadata and Classify into Grade or Stage Groups

Description

Reads a CSV file containing sample metadata and classifies each sample into grade or stage groups based on a specified scoring column. Supports built-in presets for seven major disease types, fully custom user-defined thresholds, or automatic classification using a normalised Risk Score derived from the data itself.

Usage

extract_metadata(
  file_path,
  column_name,
  disease_type = "auto",
  pattern_col = NULL,
  n_groups = 3,
  group_type = "grade",
  score_min = NULL,
  score_max = NULL,
  thresholds = NULL,
  output_dir = NULL
)

Arguments

file_path

Character. Path to the input CSV file containing sample metadata.

column_name

Character. Name of the column containing the grading or staging score (e.g., Gleason score, Nottingham score, TNM stage).

disease_type

Character. Disease type for built-in preset thresholds. Supported values: "prostate", "breast", "colorectal", "lung", "cervical", "lymphoma", "melanoma". Use "custom" to supply your own thresholds via the thresholds argument. Use "auto" to automatically classify using a normalised Risk Score derived from the data. Default is "auto".

pattern_col

Character or NULL. Only used when disease_type = "prostate". Name of the column containing the primary Gleason pattern (e.g., "pattern1"). When provided, Gleason 7 samples are split by primary pattern: Grade Group 2 (Gleason 3+4=7, primary pattern 3) and Grade Group 3 (Gleason 4+3=7, primary pattern 4). If NULL (default), all Gleason 7 samples are assigned to Grade Group 2.

n_groups

Integer. Number of grade or stage groups to create. Only used when disease_type = "auto". Must be between 2 and 5. Default is 3.

group_type

Character. Type of group labels to generate. Only used when disease_type = "auto". One of "grade", "stage", or "risk". Default is "grade".

grade

Labels as Grade 1, Grade 2, ...

stage

Labels as Stage I, Stage II, ...

risk

Labels as low_risk, intermediate_risk, high_risk, ...

score_min

Numeric or NULL. Minimum possible value of the score. If NULL (default), automatically detected from the data.

score_max

Numeric or NULL. Maximum possible value of the score. If NULL (default), automatically detected from the data.

thresholds

Named list of functions. Required only when disease_type = "custom". Each element must be a function that takes a numeric or character vector and returns a logical vector. The name of each element becomes the grade or stage group label.
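To illustrate the shape that thresholds expects, the following is a minimal sketch of how a named list of predicate functions could be applied to a score vector. The helper apply_thresholds, the sample IDs, and the rule names are hypothetical and are not part of RiskyCNV; the package's internal handling (for example of missing values or overlapping rules) may differ.

```r
# Hypothetical helper, not part of RiskyCNV: applies a named list of
# predicate functions to a score vector and collects sample IDs per label.
apply_thresholds <- function(scores, sample_ids, thresholds) {
  stopifnot(is.list(thresholds), !is.null(names(thresholds)),
            length(scores) == length(sample_ids))
  lapply(thresholds, function(rule) {
    hit <- rule(scores)
    # Each rule must return one logical value per sample.
    stopifnot(is.logical(hit), length(hit) == length(scores))
    sample_ids[which(hit)]  # which() skips NA results
  })
}

# Invented data for illustration only.
scores <- c(6, 7, 9, 8, NA)
ids    <- c("S1", "S2", "S3", "S4", "S5")
rules  <- list(
  "Low"  = function(x) x <= 6,
  "Mid"  = function(x) x == 7 | x == 8,
  "High" = function(x) x > 8
)
groups <- apply_thresholds(scores, ids, rules)
```

In this sketch the element names ("Low", "Mid", "High") become the group labels, and a sample with a missing score is left out of every group.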

output_dir

Character or NULL. Directory to save the output CSV file. If NULL (default), output is saved in the same directory as the input file.

Details

For prostate cancer, an optional pattern_col parameter allows accurate distinction between Grade Group 2 (Gleason 3+4=7) and Grade Group 3 (Gleason 4+3=7) using the primary histological pattern column.

Prostate cancer Grade Group 2 vs Grade Group 3 distinction:

Both Grade Group 2 (Gleason 3+4=7) and Grade Group 3 (Gleason 4+3=7) have the same total Gleason score of 7, making them indistinguishable from the total score alone. The primary histological pattern determines the correct assignment:

  • Primary pattern 3 + secondary pattern 4 → Grade Group 2

  • Primary pattern 4 + secondary pattern 3 → Grade Group 3

Supply the name of the primary pattern column via pattern_col (typically "pattern1") to enable this distinction. If pattern_col is not supplied, all Gleason 7 samples are assigned to Grade Group 2 and a message is shown.
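The rule above can be sketched as a small mapping function. This is an illustration under stated assumptions, not the package's code: gleason_to_grade_group is a hypothetical name, and the cut-offs for Grade Groups 1, 4 and 5 (Gleason 6 or lower, 8, and 9 to 10) follow the ISUP scheme cited in the References rather than anything shown in this help page.

```r
# Hypothetical helper, not the package's implementation: maps a total
# Gleason score (and, optionally, the primary pattern) to Grade Groups 1-5.
gleason_to_grade_group <- function(score, pattern1 = NULL) {
  group <- rep(NA_integer_, length(score))
  group[which(score <= 6)] <- 1L
  group[which(score == 7)] <- 2L   # fallback when the primary pattern is unknown
  group[which(score == 8)] <- 4L
  group[which(score >= 9)] <- 5L
  if (!is.null(pattern1)) {
    # Gleason 4+3=7: primary pattern 4 moves the sample to Grade Group 3.
    group[which(score == 7 & pattern1 == 4)] <- 3L
  }
  group
}

# Invented scores and primary patterns for illustration.
score    <- c(6, 7, 7, 8, 9, 10)
pattern1 <- c(3, 3, 4, 4, 4, 5)
merged   <- gleason_to_grade_group(score)            # Gleason 7 all in group 2
split    <- gleason_to_grade_group(score, pattern1)  # 3+4 vs 4+3 separated
```

Without the primary pattern, both Gleason 7 samples land in Grade Group 2; with it, the 4+3 sample is moved to Grade Group 3.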

Auto mode:

When disease_type = "auto", the function computes a normalised Risk Score for each sample using min-max normalisation:

\text{Risk Score} = \frac{\text{score} - \min(\text{score})}{\max(\text{score}) - \min(\text{score})}

The Risk Score ranges from 0 (lowest risk) to 1 (highest risk). Group boundaries are determined automatically based on distribution skewness:

  • Symmetric distribution (skewness between -0.5 and +0.5): equal-width boundaries

  • Skewed distribution (skewness outside -0.5 to +0.5): quantile-based boundaries
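The auto-mode steps above can be sketched in base R as follows. This is a hedged re-implementation for illustration only: auto_classify and sample_skewness are hypothetical names, the moment-based skewness estimator and the inclusive treatment of the ±0.5 boundary are assumptions, and the package may handle ties, constant scores, or labels differently.

```r
# Moment-based sample skewness; base R has no built-in skewness function.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical sketch of the auto-mode logic described above.
auto_classify <- function(score, n_groups = 3,
                          score_min = min(score, na.rm = TRUE),
                          score_max = max(score, na.rm = TRUE)) {
  # Min-max normalisation to a 0-1 Risk Score.
  risk <- (score - score_min) / (score_max - score_min)
  skew <- sample_skewness(risk)
  probs <- seq(0, 1, length.out = n_groups + 1)
  if (abs(skew) <= 0.5) {
    breaks <- probs                                   # equal-width boundaries
  } else {
    # Quantile-based boundaries; duplicates are dropped, so heavily tied
    # data can yield fewer than n_groups groups in this sketch.
    breaks <- unique(quantile(risk, probs = probs, na.rm = TRUE, names = FALSE))
  }
  labels <- paste("Grade", seq_len(length(breaks) - 1))
  list(risk_score = risk, skewness = skew,
       group = cut(risk, breaks = breaks, labels = labels, include.lowest = TRUE))
}

# Invented, evenly spread scores: skewness is about 0, so equal-width applies.
res <- auto_classify(c(2, 3, 4, 5, 6, 7, 8, 9, 10), n_groups = 3)
```

Because min-max normalisation is a linear rescaling, the skewness of the Risk Score equals that of the raw score, so the choice between the two boundary rules does not depend on score_min or score_max.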

Value

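A named list in which each element corresponds to a grade or stage group and contains the sample IDs belonging to that group. An output CSV file is also written to output_dir (or, if output_dir is NULL, to the directory of the input file).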
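As a brief illustration of working with a return value of this shape, the sketch below uses a mock list rather than a real call. The group labels and sample IDs are invented, and the actual labels produced by the package depend on the chosen preset or mode.

```r
# Mock of the documented return shape (group label -> sample IDs).
# Labels and IDs are invented for illustration.
result <- list(
  "Group A" = c("S01", "S04"),
  "Group B" = c("S02"),
  "Group C" = c("S03", "S05", "S06")
)

group_sizes <- lengths(result)   # number of samples per group

# Reshape to one row per sample, which is convenient for merging
# with other per-sample tables.
long <- stack(result)
names(long) <- c("sample_id", "group")
```

The long format pairs each sample ID with its group label, in the same order as the groups appear in the list.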

References

  • Epstein JI, et al. (2016). The 2014 ISUP Consensus Conference on Gleason Grading. Am J Surg Pathol, 40(2):244-252.

  • Elston CW & Ellis IO. (1991). Pathological prognostic factors in breast cancer. Histopathology, 19(5):403-410.

  • Dukes CE. (1932). The classification of cancer of the rectum. J Pathol Bacteriol, 35:323-332.

  • Goldstraw P, et al. (2016). The IASLC Lung Cancer Staging Project. J Thorac Oncol, 11(1):39-51.

  • Bhatla N, et al. (2019). Revised FIGO staging for carcinoma of the cervix uteri. Int J Gynaecol Obstet, 145(1):129-135.

  • Cheson BD, et al. (2014). The Lugano Classification. J Clin Oncol, 32(27):3059-3068.

  • Breslow A. (1970). Thickness and depth of invasion in the prognosis of cutaneous melanoma. Ann Surg, 172(5):902-908.

Examples

sample_file <- system.file("extdata", "sample_data.csv",
                            package = "RiskyCNV")

# Prostate preset — without pattern column (Grade Group 2 and 3 merged)
result <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  output_dir   = tempdir()
)
print(names(result))

# Prostate preset — with pattern column (Grade Group 2 and 3 distinguished)
result_full <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  pattern_col  = "pattern1",
  output_dir   = tempdir()
)
print(names(result_full))

# Auto mode
result_auto <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 3,
  group_type   = "grade",
  output_dir   = tempdir()
)
print(names(result_auto))

# Custom thresholds
result_custom <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "custom",
  thresholds   = list(
    "Stage I"   = function(x) x <= 6,
    "Stage II"  = function(x) x == 7,
    "Stage III" = function(x) x == 8,
    "Stage IV"  = function(x) x > 8
  ),
  output_dir   = tempdir()
)
print(names(result_custom))


RiskyCNV documentation built on June 5, 2026, 5:07 p.m.