classify_risk: Classify Samples into Risk Categories
In RiskyCNV: Risk Analysis of Genomic Copy Number Variation

Description

Reads a CSV file containing sample metadata and assigns each sample to a risk category based on a specified scoring column. Supports built-in presets for seven major disease types, fully custom user-defined risk boundaries, or automatic classification using a normalised Risk Score derived from the data itself.

Usage

classify_risk(
  file_path,
  column_name,
  disease_type = "auto",
  n_groups = 3,
  score_min = NULL,
  score_max = NULL,
  risk_groups = NULL,
  output_dir = NULL
)

Arguments

`file_path`	Character. Path to the input CSV file containing sample metadata.
`column_name`	Character. Name of the column containing the grading or staging score (e.g., Gleason score, Nottingham score, TNM stage).
`disease_type`	Character. Disease type for built-in preset risk groupings. Supported values: `"prostate"`, `"breast"`, `"colorectal"`, `"lung"`, `"cervical"`, `"lymphoma"`, `"melanoma"`. Use `"custom"` to supply your own groupings via the `risk_groups` argument. Use `"auto"` to automatically classify using a normalised Risk Score derived from the data. Default is `"auto"`.
`n_groups`	Integer. Number of risk groups to create. Only used when `disease_type = "auto"`. Must be between 2 and 5. Default is `3`. Group labels depend on the value:
	2 groups: low_risk, high_risk
	3 groups: low_risk, intermediate_risk, high_risk
	4 groups: very_low_risk, low_risk, high_risk, very_high_risk
	5 groups: very_low_risk, low_risk, intermediate_risk, high_risk, very_high_risk
`score_min`	Numeric or NULL. Minimum possible value of the score. If NULL (default), automatically detected from the data.
`score_max`	Numeric or NULL. Maximum possible value of the score. If NULL (default), automatically detected from the data.
`risk_groups`	Named list of functions. Required only when `disease_type = "custom"`. Each element must be a function that takes a numeric or character vector and returns a logical vector. The name of each element becomes the risk group label.
`output_dir`	Character or NULL. Directory to save the output CSV file. If NULL (default), output is saved in the same directory as the input file.
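
Because each element of `risk_groups` is an independent predicate, nothing in the argument description guarantees that the predicates assign every sample to exactly one group. The sketch below is one way to check a custom list before passing it to `classify_risk`. The helper `check_risk_groups` is hypothetical and not part of RiskyCNV, and how the package itself treats unmatched or doubly matched samples is not documented here.

```r
# Hypothetical helper (not part of RiskyCNV): report scores that match
# no predicate or more than one predicate in a custom risk_groups list.
check_risk_groups <- function(scores, risk_groups) {
  stopifnot(is.list(risk_groups), !is.null(names(risk_groups)))
  # Evaluate every predicate; one logical column per group
  hits <- vapply(risk_groups, function(f) as.logical(f(scores)),
                 logical(length(scores)))
  hits <- matrix(hits, nrow = length(scores),
                 dimnames = list(NULL, names(risk_groups)))
  n_matched <- rowSums(hits)
  list(
    unassigned  = scores[n_matched == 0],  # matched by no group
    overlapping = scores[n_matched > 1]    # matched by several groups
  )
}

groups <- list(
  low_risk  = function(x) x <= 5,
  high_risk = function(x) x > 5
)
check_risk_groups(c(3, 5, 6, 9), groups)
```

Both elements of the returned list are empty when the predicates partition the scores cleanly.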

Details

When `disease_type = "auto"`, the function computes a normalised Risk Score for each sample using min-max normalisation:

\text{Risk Score} = \frac{\text{score} - \min(\text{score})}{\max(\text{score}) - \min(\text{score})}

The Risk Score ranges from 0 (lowest risk) to 1 (highest risk). Risk group boundaries are then determined automatically:

If the score distribution is approximately symmetric (skewness between -0.5 and +0.5), equal-width boundaries are used, dividing the 0-1 range into `n_groups` equal intervals.
If the score distribution is skewed (skewness outside -0.5 to +0.5), quantile-based boundaries are used, ensuring approximately equal numbers of samples per group.

The splitting method chosen is reported via a message. Risk group labels are generated automatically based on `n_groups`.
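
The auto-mode steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper name `sketch_auto_groups` is hypothetical, the skewness estimator (moment-based) is only one of several common definitions, and the treatment of a skewness of exactly 0.5 is a guess.

```r
# Hypothetical sketch of the documented auto mode; not RiskyCNV source code.
sketch_auto_groups <- function(score, n_groups = 3,
                               score_min = NULL, score_max = NULL) {
  stopifnot(is.numeric(score), n_groups >= 2, n_groups <= 5)
  lo <- if (is.null(score_min)) min(score, na.rm = TRUE) else score_min
  hi <- if (is.null(score_max)) max(score, na.rm = TRUE) else score_max
  risk <- (score - lo) / (hi - lo)  # min-max normalisation

  # Moment-based sample skewness (assumed estimator)
  z <- score - mean(score, na.rm = TRUE)
  skew <- mean(z^3, na.rm = TRUE) / mean(z^2, na.rm = TRUE)^1.5

  probs <- seq(0, 1, length.out = n_groups + 1)
  if (abs(skew) <= 0.5) {
    method <- "equal-width"
    breaks <- probs  # equal intervals on the 0-1 range
  } else {
    method <- "quantile"
    # Heavily tied data can give duplicate breaks, which cut() rejects
    breaks <- quantile(risk, probs = probs, na.rm = TRUE, names = FALSE)
  }

  labels <- list(
    c("low_risk", "high_risk"),
    c("low_risk", "intermediate_risk", "high_risk"),
    c("very_low_risk", "low_risk", "high_risk", "very_high_risk"),
    c("very_low_risk", "low_risk", "intermediate_risk",
      "high_risk", "very_high_risk")
  )[[n_groups - 1]]

  group <- cut(risk, breaks = breaks, labels = labels, include.lowest = TRUE)
  message("Splitting method: ", method)
  list(risk_score = risk, group = group, method = method)
}

demo <- sketch_auto_groups(c(2, 4, 5, 6, 8, 10))
table(demo$group)
```

For a strongly right-skewed vector such as `c(1, 2, 3, 4, 5, 6, 7, 50)`, the same sketch switches to quantile boundaries, so the single outlier does not leave the lower groups crowded and the upper groups nearly empty.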

Built-in presets use clinically validated risk stratification systems:

prostate: D'Amico classification (D'Amico et al., 1998): low_risk (<=6), intermediate_risk (7), high_risk (>=8).
breast: Nottingham Prognostic Index (Galea et al., 1992): low_risk (3-5), intermediate_risk (6-7), high_risk (8-9).
colorectal: Dukes-based risk (Dukes, 1932): low_risk (A), intermediate_risk (B/C), high_risk (D).
lung: TNM stage-based (Goldstraw et al., 2016): low_risk (I), intermediate_risk (II/III), high_risk (IV).
cervical: FIGO stage-based (Bhatla et al., 2019): low_risk (I), intermediate_risk (II/III), high_risk (IV).
lymphoma: Ann Arbor/Lugano (Cheson et al., 2014): limited (I/II), advanced (III/IV).
melanoma: Breslow depth (Breslow, 1970): low_risk (<=1.0 mm), intermediate_risk (>1.0-4.0 mm), high_risk (>4.0 mm).
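
Stage-based groupings like those listed above can also be written as a custom `risk_groups` list, which may help when a dataset encodes stages differently from what a preset expects. The sketch below mirrors the lung grouping and assumes the staging column holds plain Roman-numeral strings ("I" to "IV"). That encoding, and the name `lung_like_groups`, are assumptions for illustration. How the built-in presets parse stage values is not described here.

```r
# Assumed encoding: stage strings "I", "II", "III", "IV" with no substages.
lung_like_groups <- list(
  low_risk          = function(x) x %in% "I",
  intermediate_risk = function(x) x %in% c("II", "III"),
  high_risk         = function(x) x %in% "IV"
)

# Hypothetical stage values, used only to show how the predicates behave
stages <- c("I", "III", "IV", "II", "I")
vapply(lung_like_groups, function(f) sum(f(stages)), integer(1))
```

Using `%in%` rather than `==` keeps each predicate returning `FALSE` instead of `NA` for missing stage values.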

Value

A named list where each element corresponds to a risk group and contains the sample IDs belonging to that group. The number of elements matches the number of risk groups detected or specified.
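
A named list of sample IDs is convenient for lookups but is often easier to join with other tables in long form. The sketch below uses a mock list with hypothetical sample IDs, shaped like the return value described above, and converts it with base R only.

```r
# Mock return value; the sample IDs are made up for illustration.
result <- list(
  low_risk          = c("S01", "S04"),
  intermediate_risk = c("S02"),
  high_risk         = c("S03", "S05", "S06")
)

# Number of samples in each risk group
lengths(result)

# Long format: one row per sample with its risk group
long <- stack(result)
names(long) <- c("sample_id", "risk_group")
long
```

`stack()` stores the group as a factor whose levels follow the order of the list, so the low-to-high ordering of the groups is preserved.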

References

D'Amico AV, et al. (1998). Biochemical outcome after radical prostatectomy. JAMA, 280(11):969-974.
Galea MH, et al. (1992). The Nottingham prognostic index. Breast Cancer Res Treat, 22(3):207-219.
Dukes CE. (1932). The classification of cancer of the rectum. J Pathol Bacteriol, 35:323-332.
Goldstraw P, et al. (2016). The IASLC Lung Cancer Staging Project. J Thorac Oncol, 11(1):39-51.
Bhatla N, et al. (2019). Revised FIGO staging for carcinoma of the cervix uteri. Int J Gynaecol Obstet, 145(1):129-135.
Cheson BD, et al. (2014). The Lugano Classification. J Clin Oncol, 32(27):3059-3068.
Breslow A. (1970). Thickness and depth of invasion in the prognosis of cutaneous melanoma. Ann Surg, 172(5):902-908.

Examples

# Auto mode - let the function decide risk grouping (any disease)
sample_file <- system.file("extdata", "sample_data.csv",
                            package = "RiskyCNV")
result <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 3,
  output_dir   = tempdir()
)
print(names(result))

# Prostate cancer preset
result_prostate <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  output_dir   = tempdir()
)
print(result_prostate$low_risk)

# Custom risk groups for any disease
# (requires a file named samples.csv with a numeric risk_score column)
result_custom <- classify_risk(
  file_path    = "samples.csv",
  column_name  = "risk_score",
  disease_type = "custom",
  risk_groups  = list(
    "low_risk"  = function(x) x <= 5,
    "high_risk" = function(x) x > 5
  ),
  output_dir   = tempdir()
)

RiskyCNV documentation built on June 5, 2026, 5:07 p.m.