View source: R/extract_metadata.R
| extract_metadata | R Documentation |
Reads a CSV file containing sample metadata and classifies each sample into grade or stage groups based on a specified scoring column. Supports built-in presets for seven major disease types, fully custom user-defined thresholds, or automatic classification using a normalised Risk Score derived from the data itself.
extract_metadata(
  file_path,
  column_name,
  disease_type = "auto",
  pattern_col = NULL,
  n_groups = 3,
  group_type = "grade",
  score_min = NULL,
  score_max = NULL,
  thresholds = NULL,
  output_dir = NULL
)
file_path |
Character. Path to the input CSV file containing sample metadata. |
column_name |
Character. Name of the column containing the grading or staging score (e.g., Gleason score, Nottingham score, TNM stage). |
disease_type |
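Character. Disease type for built-in preset thresholds (e.g., "prostate"), "auto" (default) for automatic Risk Score classification, or "custom" for user-defined thresholds. |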
pattern_col |
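Character or NULL. Only used when disease_type = "prostate". Name of the column containing the primary histological pattern (typically "pattern1"), used to distinguish Grade Group 2 from Grade Group 3. |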
n_groups |
Integer. Number of grade or stage groups to create. Only used when disease_type = "auto". Default is 3. |
group_type |
Character. Type of group labels to generate. Only used when disease_type = "auto". Default is "grade". |
score_min |
Numeric or NULL. Minimum possible value of the score. If NULL (default), automatically detected from the data. |
score_max |
Numeric or NULL. Maximum possible value of the score. If NULL (default), automatically detected from the data. |
thresholds |
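Named list of functions. Required only when disease_type = "custom". Each name is a group label, and each function takes the score and returns TRUE for samples that belong to that group. |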
output_dir |
Character or NULL. Directory to save the output CSV file. If NULL (default), output is saved in the same directory as the input file. |
For prostate cancer, an optional pattern_col parameter allows
accurate distinction between Grade Group 2 (Gleason 3+4=7) and Grade
Group 3 (Gleason 4+3=7) using the primary histological pattern column.
Prostate cancer Grade Group 2 vs Grade Group 3 distinction:
Both Grade Group 2 (Gleason 3+4=7) and Grade Group 3 (Gleason 4+3=7) have the same total Gleason score of 7, making them indistinguishable from the total score alone. The primary histological pattern determines the correct assignment:
Primary pattern 3 + secondary pattern 4 → Grade Group 2
Primary pattern 4 + secondary pattern 3 → Grade Group 3
Supply the name of the primary pattern column via pattern_col
(typically "pattern1") to enable this distinction. If
pattern_col is not supplied, all Gleason 7 samples are assigned
to Grade Group 2 and a message is shown.
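The rule above can be sketched in a few lines of R. This is an illustrative sketch only, not the package's implementation: the helper name assign_grade_group and the exact group labels are hypothetical, and the mapping of the other scores follows the standard ISUP grouping (Gleason 6 or less, 8, and 9 to 10).

```r
# Hypothetical helper (not part of RiskyCNV): map total Gleason scores to
# ISUP Grade Groups, using the primary pattern to split Gleason 7.
assign_grade_group <- function(gleason, pattern1 = NULL) {
  group <- rep(NA_character_, length(gleason))
  group[which(gleason <= 6)] <- "Grade Group 1"
  # Fallback: without a primary pattern, every Gleason 7 goes to Grade Group 2
  group[which(gleason == 7)] <- "Grade Group 2"
  if (!is.null(pattern1)) {
    # Primary pattern 4 (i.e. 4+3=7) is promoted to Grade Group 3
    group[which(gleason == 7 & pattern1 == 4)] <- "Grade Group 3"
  }
  group[which(gleason == 8)] <- "Grade Group 4"
  group[which(gleason >= 9)] <- "Grade Group 5"
  group
}

with_pattern    <- assign_grade_group(c(6, 7, 7, 8, 9), pattern1 = c(3, 3, 4, 4, 5))
without_pattern <- assign_grade_group(c(7, 7))
```

Using which() keeps missing scores or patterns as NA instead of raising an error during assignment.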
Auto mode:
When disease_type = "auto", the function computes a normalised
Risk Score for each sample using min-max normalisation:
\text{Risk Score} = \frac{\text{score} - \min(\text{score})}{\max(\text{score}) - \min(\text{score})}
The Risk Score ranges from 0 (lowest risk) to 1 (highest risk). Group boundaries are determined automatically based on distribution skewness:
Symmetric distribution (skewness between -0.5 and +0.5): equal-width boundaries
Skewed distribution (skewness outside -0.5 to +0.5): quantile-based boundaries
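A minimal sketch of this procedure, under stated assumptions: the function name auto_classify is hypothetical, skewness is computed here as the moment-based coefficient (the package may use a different estimator), and score_min / score_max are taken from the data. It is meant to illustrate the idea, not to reproduce the package's output.

```r
# Hypothetical sketch of auto mode (not the package's code)
auto_classify <- function(score, n_groups = 3) {
  # Min-max normalisation to a 0-1 Risk Score
  rng  <- range(score, na.rm = TRUE)
  risk <- (score - rng[1]) / (rng[2] - rng[1])

  # Moment-based skewness (assumed estimator)
  centred <- risk - mean(risk, na.rm = TRUE)
  skew <- mean(centred^3, na.rm = TRUE) / mean(centred^2, na.rm = TRUE)^1.5

  probs <- seq(0, 1, length.out = n_groups + 1)
  if (abs(skew) <= 0.5) {
    breaks <- probs                      # equal-width boundaries on [0, 1]
  } else {
    # Quantile-based boundaries; ties can collapse breaks, so fewer than
    # n_groups groups may result for heavily tied data
    breaks <- unique(quantile(risk, probs, na.rm = TRUE, names = FALSE))
  }

  group <- cut(risk, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  list(risk = risk, skewness = skew, group = group)
}

symmetric <- auto_classify(c(2, 4, 6, 8, 10))
skewed    <- auto_classify(c(1, 1, 1, 1, 2, 10))
```

In the symmetric example the boundaries are 0, 1/3, 2/3 and 1; in the skewed example the boundaries come from the quantiles of the Risk Score itself.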
A named list where each element corresponds to a grade or stage group and contains the sample IDs belonging to that group.
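As an illustration of working with a return value of this shape, the sketch below uses a hand-made list with hypothetical group labels and sample IDs (real output depends on the input file and the chosen mode) and turns it into group sizes and a per-sample lookup table using base R only.

```r
# Mock return value; labels and sample IDs are invented for illustration
groups <- list(
  "Grade Group 1" = c("S01", "S02"),
  "Grade Group 2" = c("S03"),
  "Grade Group 4" = c("S04")
)

# Number of samples per group
group_sizes <- lengths(groups)

# Long-format lookup table: one row per sample
lookup <- stack(groups)
names(lookup) <- c("sample_id", "group")
```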
Epstein JI, et al. (2016). The 2014 ISUP Consensus Conference on Gleason Grading. Am J Surg Pathol, 40(2):244-252.
Elston CW & Ellis IO. (1991). Pathological prognostic factors in breast cancer. Histopathology, 19(5):403-410.
Dukes CE. (1932). The classification of cancer of the rectum. J Pathol Bacteriol, 35:323-332.
Goldstraw P, et al. (2016). The IASLC Lung Cancer Staging Project. J Thorac Oncol, 11(1):39-51.
Bhatla N, et al. (2019). Revised FIGO staging for carcinoma of the cervix uteri. Int J Gynaecol Obstet, 145(1):129-135.
Cheson BD, et al. (2014). The Lugano Classification. J Clin Oncol, 32(27):3059-3068.
Breslow A. (1970). Thickness and depth of invasion in the prognosis of cutaneous melanoma. Ann Surg, 172(5):902-908.
sample_file <- system.file("extdata", "sample_data.csv",
                           package = "RiskyCNV")
# Prostate preset without a pattern column
# (all Gleason 7 samples are assigned to Grade Group 2)
result <- extract_metadata(
  file_path = sample_file,
  column_name = "gleason_score",
  disease_type = "prostate",
  output_dir = tempdir()
)
print(names(result))
# Prostate preset with a pattern column (Grade Group 2 and 3 distinguished)
result_full <- extract_metadata(
  file_path = sample_file,
  column_name = "gleason_score",
  disease_type = "prostate",
  pattern_col = "pattern1",
  output_dir = tempdir()
)
print(names(result_full))
# Auto mode
result_auto <- extract_metadata(
  file_path = sample_file,
  column_name = "gleason_score",
  disease_type = "auto",
  n_groups = 3,
  group_type = "grade",
  output_dir = tempdir()
)
print(names(result_auto))
# Custom thresholds
result_custom <- extract_metadata(
  file_path = sample_file,
  column_name = "gleason_score",
  disease_type = "custom",
  thresholds = list(
    "Stage I" = function(x) x <= 6,
    "Stage II" = function(x) x == 7,
    "Stage III" = function(x) x == 8,
    "Stage IV" = function(x) x > 8
  ),
  output_dir = tempdir()
)
print(names(result_custom))
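To make the custom thresholds format more concrete, the sketch below shows one way a named list of predicate functions could be applied to a score vector. The helper apply_thresholds is hypothetical, and the first-match-wins rule for overlapping predicates is an assumption rather than documented behaviour of extract_metadata.

```r
# Hypothetical helper (not part of RiskyCNV): label each score with the
# name of the first predicate that returns TRUE for it.
apply_thresholds <- function(score, thresholds) {
  group <- rep(NA_character_, length(score))
  for (label in names(thresholds)) {
    # Only fill samples that are still unassigned (first match wins)
    hit <- which(thresholds[[label]](score) & is.na(group))
    group[hit] <- label
  }
  group
}

stage_rules <- list(
  "Stage I"   = function(x) x <= 6,
  "Stage II"  = function(x) x == 7,
  "Stage III" = function(x) x == 8,
  "Stage IV"  = function(x) x > 8
)

stages <- apply_thresholds(c(5, 7, 8, 10), stage_rules)
```

Scores that satisfy none of the predicates stay NA, which makes gaps in a threshold set easy to spot.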