In RiskyCNV: Risk Analysis of Genomic Copy Number Variation

extract_metadata

R Documentation

Extract Sample Metadata and Classify into Grade or Stage Groups

Description

Reads a CSV file containing sample metadata and classifies each sample into grade or stage groups based on a specified scoring column. Supports built-in presets for seven major disease types, fully custom user-defined thresholds, or automatic classification using a normalised Risk Score derived from the data itself.

Usage

extract_metadata(
  file_path,
  column_name,
  disease_type = "auto",
  pattern_col = NULL,
  n_groups = 3,
  group_type = "grade",
  score_min = NULL,
  score_max = NULL,
  thresholds = NULL,
  output_dir = NULL
)

Arguments

`file_path`	Character. Path to the input CSV file containing sample metadata.
`column_name`	Character. Name of the column containing the grading or staging score (e.g., Gleason score, Nottingham score, TNM stage).
`disease_type`	Character. Disease type for built-in preset thresholds. Supported values: `"prostate"`, `"breast"`, `"colorectal"`, `"lung"`, `"cervical"`, `"lymphoma"`, `"melanoma"`. Use `"custom"` to supply your own thresholds via the `thresholds` argument. Use `"auto"` to automatically classify using a normalised Risk Score derived from the data. Default is `"auto"`.
`pattern_col`	Character or NULL. Only used when `disease_type = "prostate"`. Name of the column containing the primary Gleason pattern (typically `"pattern1"`). When provided, Grade Group 2 (Gleason 3+4=7, primary pattern 3) and Grade Group 3 (Gleason 4+3=7, primary pattern 4) are distinguished accurately. If NULL (default), all Gleason 7 samples are assigned to Grade Group 2.
`n_groups`	Integer. Number of grade or stage groups to create. Only used when `disease_type = "auto"`. Must be between 2 and 5. Default is `3`.
`group_type`	Character. Type of group labels to generate. Only used when `disease_type = "auto"`. One of `"grade"`, `"stage"`, or `"risk"`. Default is `"grade"`.
	`"grade"`: labels as Grade 1, Grade 2, ...
	`"stage"`: labels as Stage I, Stage II, ...
	`"risk"`: labels as low_risk, intermediate_risk, high_risk, ...
`score_min`	Numeric or NULL. Minimum possible value of the score. If NULL (default), automatically detected from the data.
`score_max`	Numeric or NULL. Maximum possible value of the score. If NULL (default), automatically detected from the data.
`thresholds`	Named list of functions. Required only when `disease_type = "custom"`. Each element must be a function that takes a numeric or character vector and returns a logical vector. The name of each element becomes the grade or stage group label.
`output_dir`	Character or NULL. Directory to save the output CSV file. If NULL (default), output is saved in the same directory as the input file.
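
The `thresholds` contract (a named list of predicate functions, one per group) can be illustrated with a small base R sketch. The sample IDs, scores, and group names below are made up for illustration, and the way the predicates are applied is an assumption about how such a list might be evaluated, not the package's internal code. The final check is a suggested sanity test that every sample matches exactly one predicate.

```r
# Illustrative only: hypothetical sample IDs and scores.
scores <- c(S1 = 6, S2 = 7, S3 = 8, S4 = 9)

# Each element takes a vector and returns a logical vector;
# the element name becomes the group label.
thresholds <- list(
  "Low"  = function(x) x <= 6,
  "Mid"  = function(x) x == 7,
  "High" = function(x) x >= 8
)

# Apply every predicate and collect the matching sample IDs per group.
groups <- lapply(thresholds, function(is_member) names(scores)[is_member(scores)])

# Sanity check: count how many predicates each sample satisfies.
n_matches <- rowSums(sapply(thresholds, function(is_member) is_member(scores)))
```

If any entry of `n_matches` is 0 or greater than 1, the predicates leave a gap or overlap, which is worth fixing before passing them to `extract_metadata()`.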

Details

For prostate cancer, an optional pattern_col parameter allows accurate distinction between Grade Group 2 (Gleason 3+4=7) and Grade Group 3 (Gleason 4+3=7) using the primary histological pattern column.

Prostate cancer Grade Group 2 vs Grade Group 3 distinction:

Both Grade Group 2 (Gleason 3+4=7) and Grade Group 3 (Gleason 4+3=7) have the same total Gleason score of 7, making them indistinguishable from the total score alone. The primary histological pattern determines the correct assignment:

Primary pattern 3 + secondary pattern 4 → Grade Group 2
Primary pattern 4 + secondary pattern 3 → Grade Group 3

Supply the name of the primary pattern column via pattern_col (typically "pattern1") to enable this distinction. If pattern_col is not supplied, all Gleason 7 samples are assigned to Grade Group 2 and a message is shown.
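
The rule above can be sketched in a few lines of base R. `assign_grade_group()` is a hypothetical helper written for this illustration, not a function exported by RiskyCNV, and the mapping of the remaining totals (6 or lower to Grade Group 1, 8 to Grade Group 4, 9 to 10 to Grade Group 5) is assumed from the ISUP Grade Group scheme rather than taken from the package source.

```r
# Hypothetical helper for illustration; not part of RiskyCNV.
assign_grade_group <- function(gleason, pattern1 = NULL) {
  group <- rep(NA_character_, length(gleason))
  group[which(gleason <= 6)] <- "Grade Group 1"
  group[which(gleason == 7)] <- "Grade Group 2"  # fallback when the primary pattern is unknown
  group[which(gleason == 8)] <- "Grade Group 4"
  group[which(gleason >= 9)] <- "Grade Group 5"
  if (!is.null(pattern1)) {
    # Gleason 4+3=7: primary pattern 4 moves the sample to Grade Group 3.
    group[which(gleason == 7 & pattern1 == 4)] <- "Grade Group 3"
  }
  group
}

gleason  <- c(6, 7, 7, 8, 9)
pattern1 <- c(3, 3, 4, 4, 5)

without_pattern <- assign_grade_group(gleason)            # both Gleason 7 samples in Grade Group 2
with_pattern    <- assign_grade_group(gleason, pattern1)  # 3+4 and 4+3 separated
```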

Auto mode:

When disease_type = "auto", the function computes a normalised Risk Score for each sample using min-max normalisation:

\text{Risk Score} = \frac{\text{score} - \min(\text{score})}{\max(\text{score}) - \min(\text{score})}

The Risk Score ranges from 0 (lowest risk) to 1 (highest risk). Group boundaries are determined automatically based on distribution skewness:

Symmetric distribution (skewness between -0.5 and +0.5): equal-width boundaries
Skewed distribution (skewness outside -0.5 to +0.5): quantile-based boundaries
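
A minimal base R sketch of this auto-mode logic follows. All function names are hypothetical, and several details are assumptions: the moment-based skewness estimator, treating a skewness of exactly ±0.5 as symmetric, and binning with `cut()`. The package may differ on any of these.

```r
# Hypothetical helpers for illustration; none of these are exported by RiskyCNV.

# Min-max normalisation onto the 0-1 Risk Score scale.
risk_score <- function(score, score_min = min(score), score_max = max(score)) {
  (score - score_min) / (score_max - score_min)
}

# Moment-based sample skewness (assumed estimator).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Choose boundaries by skewness, then bin the Risk Score.
classify_auto <- function(score, n_groups = 3) {
  risk  <- risk_score(score)
  probs <- seq(0, 1, length.out = n_groups + 1)
  breaks <- if (abs(skewness(risk)) <= 0.5) {
    probs                          # symmetric: equal-width boundaries
  } else {
    unname(quantile(risk, probs))  # skewed: quantile-based boundaries
  }
  cut(risk, breaks = breaks, labels = paste("Grade", seq_len(n_groups)),
      include.lowest = TRUE)
}

symmetric <- classify_auto(c(6, 7, 8, 9, 10))     # skewness 0, equal-width bins
skewed    <- classify_auto(c(1, 2, 3, 4, 5, 20))  # long right tail, quantile bins
```

With heavily tied scores, quantile-based breaks can coincide, and `cut()` rejects duplicate breaks. This sketch does not handle that case.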

Value

A named list where each element corresponds to a grade or stage group and contains the sample IDs belonging to that group.
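
A list of this shape is easy to summarise or flatten with base R. The `result` object below is a hand-made stand-in with invented sample IDs, used only to show the structure; it is not output produced by the package.

```r
# Hand-made stand-in for a returned list (sample IDs are invented).
result <- list(
  "Grade Group 1" = c("S1", "S4"),
  "Grade Group 2" = "S2",
  "Grade Group 4" = "S3"
)

# Number of samples per group.
group_sizes <- lengths(result)

# Long format: one row per sample, with its group label.
long <- stack(result)
names(long) <- c("sample_id", "group")
```

The long format is convenient for joining group labels back onto other per-sample tables.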

References

Epstein JI, et al. (2016). The 2014 ISUP Consensus Conference on Gleason Grading. Am J Surg Pathol, 40(2):244-252.
Elston CW & Ellis IO. (1991). Pathological prognostic factors in breast cancer. Histopathology, 19(5):403-410.
Dukes CE. (1932). The classification of cancer of the rectum. J Pathol Bacteriol, 35:323-332.
Goldstraw P, et al. (2016). The IASLC Lung Cancer Staging Project. J Thorac Oncol, 11(1):39-51.
Bhatla N, et al. (2019). Revised FIGO staging for carcinoma of the cervix uteri. Int J Gynaecol Obstet, 145(1):129-135.
Cheson BD, et al. (2014). The Lugano Classification. J Clin Oncol, 32(27):3059-3068.
Breslow A. (1970). Thickness and depth of invasion in the prognosis of cutaneous melanoma. Ann Surg, 172(5):902-908.

Examples

sample_file <- system.file("extdata", "sample_data.csv",
                            package = "RiskyCNV")

# Prostate preset without pattern column (all Gleason 7 samples assigned to Grade Group 2)
result <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  output_dir   = tempdir()
)
print(names(result))

# Prostate preset with pattern column (Grade Group 2 and 3 distinguished)
result_full <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  pattern_col  = "pattern1",
  output_dir   = tempdir()
)
print(names(result_full))

# Auto mode
result_auto <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 3,
  group_type   = "grade",
  output_dir   = tempdir()
)
print(names(result_auto))

# Custom thresholds
result_custom <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "custom",
  thresholds   = list(
    "Stage I"   = function(x) x <= 6,
    "Stage II"  = function(x) x == 7,
    "Stage III" = function(x) x == 8,
    "Stage IV"  = function(x) x > 8
  ),
  output_dir   = tempdir()
)
print(names(result_custom))

RiskyCNV documentation built on June 5, 2026, 5:07 p.m.