classify_risk: Classify Samples into Risk Categories

View source: R/classify_risk.R

classify_risk    R Documentation

Description

Reads a CSV file of sample metadata and assigns each sample to a risk category based on a specified scoring column. Three modes are supported: built-in presets for seven major disease types, user-defined risk boundaries, and automatic classification using a normalised Risk Score derived from the data.

Usage

classify_risk(
  file_path,
  column_name,
  disease_type = "auto",
  n_groups = 3,
  score_min = NULL,
  score_max = NULL,
  risk_groups = NULL,
  output_dir = NULL
)

Arguments

file_path

Character. Path to the input CSV file containing sample metadata.

column_name

Character. Name of the column containing the grading or staging score (e.g., Gleason score, Nottingham score, TNM stage).

disease_type

Character. Disease type for built-in preset risk groupings. Supported values: "prostate", "breast", "colorectal", "lung", "cervical", "lymphoma", "melanoma". Use "custom" to supply your own groupings via the risk_groups argument. Use "auto" to automatically classify using a normalised Risk Score derived from the data. Default is "auto".

n_groups

Integer. Number of risk groups to create. Only used when disease_type = "auto". Must be between 2 and 5. Default is 3. The group labels depend on the value chosen:

2 groups

low_risk, high_risk

3 groups

low_risk, intermediate_risk, high_risk

4 groups

very_low_risk, low_risk, high_risk, very_high_risk

5 groups

very_low_risk, low_risk, intermediate_risk, high_risk, very_high_risk
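The label sets above can be written as a simple lookup. The helper below is a minimal sketch for illustration; risk_labels() is a hypothetical name and is not part of RiskyCNV.

```r
# Hypothetical helper (not part of RiskyCNV): maps n_groups to the label
# sets listed above and rejects values outside 2 to 5.
risk_labels <- function(n_groups) {
  labels <- list(
    `2` = c("low_risk", "high_risk"),
    `3` = c("low_risk", "intermediate_risk", "high_risk"),
    `4` = c("very_low_risk", "low_risk", "high_risk", "very_high_risk"),
    `5` = c("very_low_risk", "low_risk", "intermediate_risk",
            "high_risk", "very_high_risk")
  )
  if (!is.numeric(n_groups) || length(n_groups) != 1 ||
      !(n_groups %in% 2:5)) {
    stop("n_groups must be a single integer between 2 and 5")
  }
  labels[[as.character(n_groups)]]
}

risk_labels(4)
```

Note that the 4-group set has no intermediate_risk label, so code that looks up result$intermediate_risk should check names(result) first.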

score_min

Numeric or NULL. Minimum possible value of the score. If NULL (default), automatically detected from the data.

score_max

Numeric or NULL. Maximum possible value of the score. If NULL (default), automatically detected from the data.

risk_groups

Named list of functions. Required only when disease_type = "custom". Each element must be a function that takes a numeric or character vector and returns a logical vector. The name of each element becomes the risk group label.
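As an illustration of the expected shape, the sketch below defines predicates for a character staging column and checks them before they are passed to classify_risk(). The stage labels ("I" to "IV"), the group names and the checks themselves are assumptions made for this example; the package may validate risk_groups differently, or not at all.

```r
# Illustrative custom groups for a character staging column.
# The stage labels and group names are assumed for this sketch.
stage_groups <- list(
  early_stage = function(x) x %in% c("I", "II"),
  late_stage  = function(x) x %in% c("III", "IV")
)

stages <- c("I", "III", "II", "IV", "I")

# Each predicate should return one TRUE/FALSE per input value
checks <- vapply(
  stage_groups,
  function(f) {
    out <- f(stages)
    is.logical(out) && length(out) == length(stages)
  },
  logical(1)
)
checks

# Each sample should match exactly one group (rows = samples, cols = groups)
hits <- vapply(stage_groups, function(f) f(stages), logical(length(stages)))
rowSums(hits)
```

A row sum of 0 would indicate a sample that no predicate covers, and a row sum above 1 would indicate overlapping groups.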

output_dir

Character or NULL. Directory to save the output CSV file. If NULL (default), output is saved in the same directory as the input file.

Details

When disease_type = "auto", the function computes a normalised Risk Score for each sample using min-max normalisation:

\text{Risk Score} = \frac{\text{score} - \min(\text{score})}{\max(\text{score}) - \min(\text{score})}
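The formula can be sketched in a few lines of R. normalise_score() is a hypothetical helper, not the package's code. It assumes that score_min and score_max, when supplied, replace the observed minimum and maximum in the formula.

```r
# Hypothetical helper mirroring the min-max formula; not the package's code.
normalise_score <- function(score, score_min = NULL, score_max = NULL) {
  # Fall back to the observed range when no bounds are supplied
  if (is.null(score_min)) score_min <- min(score, na.rm = TRUE)
  if (is.null(score_max)) score_max <- max(score, na.rm = TRUE)
  if (score_max <= score_min) {
    stop("score_max must be greater than score_min")
  }
  (score - score_min) / (score_max - score_min)
}

gleason <- c(6, 7, 7, 8, 9, 10)

# Observed range (6 to 10)
normalise_score(gleason)

# Fixed range (2 to 10, the theoretical range of a Gleason score)
normalise_score(gleason, score_min = 2, score_max = 10)
```

Under this assumption, fixing the range keeps Risk Scores comparable across cohorts whose observed minimum and maximum differ.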

The Risk Score ranges from 0 (lowest risk) to 1 (highest risk). Risk group boundaries are then determined automatically:

  • If the score distribution is approximately symmetric (skewness between -0.5 and +0.5), equal-width boundaries are used, dividing the 0-1 range into n_groups equal intervals.

  • If the score distribution is skewed (skewness outside -0.5 to +0.5), quantile-based boundaries are used, ensuring approximately equal numbers of samples per group.

The splitting method chosen is reported via a message. Risk group labels are generated automatically based on n_groups.
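The boundary selection described above might look roughly like the following sketch. Several details are assumptions: the skewness estimator (a plain moment-based one is used here, since base R has none), whether the ±0.5 limits are inclusive, and the helper names skewness() and choose_breaks(), which are hypothetical.

```r
# Moment-based sample skewness; the package may use a different estimator.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical sketch of the boundary choice; risk_score is already in [0, 1].
choose_breaks <- function(risk_score, n_groups = 3) {
  probs <- seq(0, 1, length.out = n_groups + 1)
  if (abs(skewness(risk_score)) <= 0.5) {   # inclusive limit is an assumption
    method <- "equal-width"
    breaks <- probs                          # equal intervals on the 0-1 range
  } else {
    method <- "quantile"
    breaks <- quantile(risk_score, probs = probs, na.rm = TRUE, names = FALSE)
  }
  message("Splitting method: ", method)
  list(method = method, breaks = breaks)
}

symmetric <- seq(0, 1, by = 0.1)
skewed    <- c(0, 0.02, 0.05, 0.05, 0.08, 0.1, 0.12, 0.15, 0.2, 0.6, 1)

sym <- choose_breaks(symmetric)
skw <- choose_breaks(skewed)

# Assign groups with cut(); include.lowest keeps the minimum in the first group
labels <- c("low_risk", "intermediate_risk", "high_risk")
groups <- cut(skewed, breaks = skw$breaks, labels = labels,
              include.lowest = TRUE)
table(groups)
```

With heavily tied scores the quantile breaks can coincide, in which case cut() stops with an error; a real implementation would need to handle that case.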

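Built-in presets follow published risk stratification systems. All presets use the labels low_risk, intermediate_risk and high_risk, except lymphoma, which uses limited and advanced: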

prostate

D'Amico classification (D'Amico et al., 1998): low_risk (<=6), intermediate_risk (7), high_risk (>=8).

breast

Nottingham score bands (reference: Nottingham Prognostic Index, Galea et al., 1992): low_risk (3-5), intermediate_risk (6-7), high_risk (8-9).

colorectal

Dukes-based risk (Dukes, 1932): low_risk (A), intermediate_risk (B/C), high_risk (D).

lung

TNM stage-based (Goldstraw et al., 2016): low_risk (I), intermediate_risk (II/III), high_risk (IV).

cervical

FIGO stage-based (Bhatla et al., 2019): low_risk (I), intermediate_risk (II/III), high_risk (IV).

lymphoma

Ann Arbor/Lugano (Cheson et al., 2014): limited (I/II), advanced (III/IV).

melanoma

Breslow depth (Breslow, 1970): low_risk (<=1.0 mm), intermediate_risk (>1.0 to 4.0 mm), high_risk (>4.0 mm).
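To make the thresholds concrete, two of the presets above are written out below as predicate lists of the kind accepted by risk_groups. This is a sketch only: the predicates are assumptions about how the thresholds translate into code, the lung example assumes the column holds bare stage labels ("I" to "IV") without substages, and assign_group() is a hypothetical helper.

```r
# Illustrative predicates reproducing two presets; not the package's
# internal definitions.
prostate_groups <- list(
  low_risk          = function(x) x <= 6,
  intermediate_risk = function(x) x == 7,
  high_risk         = function(x) x >= 8
)

lung_groups <- list(
  low_risk          = function(x) x == "I",
  intermediate_risk = function(x) x %in% c("II", "III"),
  high_risk         = function(x) x == "IV"
)

# Hypothetical helper: apply a predicate list and return one label per sample
assign_group <- function(score, groups) {
  out <- rep(NA_character_, length(score))
  for (label in names(groups)) {
    out[groups[[label]](score)] <- label
  }
  out
}

prostate_labels <- assign_group(c(6, 7, 9, 8, 5), prostate_groups)
lung_labels     <- assign_group(c("I", "III", "IV", "II"), lung_groups)
prostate_labels
lung_labels
```

Samples that match no predicate stay NA in this sketch, which makes gaps in a custom grouping easy to spot.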

Value

A named list where each element corresponds to a risk group and contains the sample IDs belonging to that group. The number of elements matches the number of risk groups detected or specified. As a side effect, an output CSV file is written to output_dir.
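A list of this shape is easy to summarise or reshape with base R. The example below uses a mock return value with made-up sample IDs, so it runs without the package; the column names sample_id and risk_group are chosen for illustration.

```r
# Mock return value with made-up sample IDs, shaped like the list described
# above (one character vector of sample IDs per risk group).
result <- list(
  low_risk          = c("S01", "S04"),
  intermediate_risk = c("S02", "S05", "S06"),
  high_risk         = c("S03")
)

# Number of samples per group
group_sizes <- lengths(result)
group_sizes

# Long format: one row per sample, with its group label
long <- stack(result)
names(long) <- c("sample_id", "risk_group")
long

# Look up the group of a single sample
s05_group <- as.character(long$risk_group[long$sample_id == "S05"])
s05_group
```

The long format is convenient for merging the risk group back onto the original metadata by sample ID.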

References

  • D'Amico AV, et al. (1998). Biochemical outcome after radical prostatectomy. JAMA, 280(11):969-974.

  • Galea MH, et al. (1992). The Nottingham prognostic index. Breast Cancer Res Treat, 22(3):207-219.

  • Dukes CE. (1932). The classification of cancer of the rectum. J Pathol Bacteriol, 35:323-332.

  • Goldstraw P, et al. (2016). The IASLC Lung Cancer Staging Project. J Thorac Oncol, 11(1):39-51.

  • Bhatla N, et al. (2019). Revised FIGO staging for carcinoma of the cervix uteri. Int J Gynaecol Obstet, 145(1):129-135.

  • Cheson BD, et al. (2014). The Lugano Classification. J Clin Oncol, 32(27):3059-3068.

  • Breslow A. (1970). Thickness and depth of invasion in the prognosis of cutaneous melanoma. Ann Surg, 172(5):902-908.

Examples

# Auto mode - let the function decide risk grouping (any disease)
sample_file <- system.file("extdata", "sample_data.csv",
                            package = "RiskyCNV")
result <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 3,
  output_dir   = tempdir()
)
print(names(result))

# Prostate cancer preset
result_prostate <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  output_dir   = tempdir()
)
print(result_prostate$low_risk)

# Custom risk groups for any disease
# (assumes a local samples.csv with a numeric risk_score column)
result_custom <- classify_risk(
  file_path    = "samples.csv",
  column_name  = "risk_score",
  disease_type = "custom",
  risk_groups  = list(
    "low_risk"  = function(x) x <= 5,
    "high_risk" = function(x) x > 5
  ),
  output_dir   = tempdir()
)



RiskyCNV documentation built on June 5, 2026, 5:07 p.m.