breast_cancer: Breast Cancer Dataset

breast_cancer    R Documentation

Breast Cancer Dataset

Description

Breast cancer is the most common cancer among women worldwide. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen on an X-ray or felt as lumps in the breast area. The key challenge in detection is classifying tumors as malignant (cancerous) or benign (non-cancerous).

Format

A data frame with 569 rows, 30 covariates, and 1 response variable.

Details

The data set contains the following attributes:

  • ID number

  • Diagnosis (M = malignant, B = benign)

  • Ten real-valued features are computed for each cell nucleus; the mean, standard error, and "worst" (largest) value of each feature give the 30 covariates:

    • radius (mean of distances from center to points on the perimeter)

    • texture (standard deviation of gray-scale values)

    • perimeter

    • area

    • smoothness (local variation in radius lengths)

    • compactness (perimeter^2 / area - 1.0)

    • concavity (severity of concave portions of the contour)

    • concave points (number of concave portions of the contour)

    • symmetry

    • fractal dimension ("coastline approximation" - 1)
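The compactness formula in the list above can be checked on simple shapes. The following is a minimal sketch, not part of the ODRF package: `compactness` is a hypothetical helper, and the circle and square outlines are assumed shapes chosen for illustration. A circle gives the smallest possible value, 4π - 1, so larger values indicate a less round outline.

```r
# Compactness as defined in the feature list: perimeter^2 / area - 1.0.
# `compactness` is a hypothetical helper, not an ODRF function.
compactness <- function(perimeter, area) perimeter^2 / area - 1.0

# Idealised circular outline of radius r (assumed shape, for illustration).
r <- 5
circle <- compactness(2 * pi * r, pi * r^2)

# Square outline of side s: perimeter 4s, area s^2.
s <- 5
square <- compactness(4 * s, s^2)

# The circle evaluates to 4 * pi - 1 and the square to 15,
# independent of r and s.
print(c(circle = circle, square = square))
```

Because the ratio perimeter^2 / area is scale-free, the value depends only on the shape of the outline, not on its size.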

Source

https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset?select=breast-cancer.csv and https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

References

Wolberg WH, Street WN, Mangasarian OL. Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates. Cancer Lett. 1994 Mar 15;77(2-3):163-71.

See Also

body_fat, seeds

Examples

library(ODRF)
data(breast_cancer)
set.seed(221212)
# draw 80 of the 569 rows for training; the rest are used for testing
train <- sample(1:569, 80)
# drop column 1 (the ID number), so diagnosis becomes the first column
train_data <- data.frame(breast_cancer[train, -1])
test_data <- data.frame(breast_cancer[-train, -1])

# oblique decision random forest with 50 trees
forest <- ODRF(diagnosis ~ ., train_data, split = "gini", parallel = FALSE, ntrees = 50)
pred <- predict(forest, test_data[, -1])
# classification error
(mean(pred != test_data[, 1]))

# single oblique decision tree
tree <- ODT(diagnosis ~ ., train_data, split = "gini")
pred <- predict(tree, test_data[, -1])
# classification error
(mean(pred != test_data[, 1]))
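
The classification error computed above treats both kinds of mistake alike. As a minimal sketch using base R only, the two can be tabulated separately with a confusion matrix. The label vectors here are toy values made up for illustration, not results from the data set, and `truth_toy` and `pred_toy` are hypothetical names.

```r
# Toy labels, made up for illustration (M = malignant, B = benign).
truth_toy <- factor(c("M", "B", "B", "M", "B", "M"), levels = c("B", "M"))
pred_toy  <- factor(c("M", "B", "M", "B", "B", "M"), levels = c("B", "M"))

# Rows are true classes, columns are predicted classes.
confusion <- table(truth = truth_toy, predicted = pred_toy)
print(confusion)

# Overall classification error: share of off-diagonal counts.
error <- 1 - sum(diag(confusion)) / sum(confusion)

# Share of malignant cases that were predicted as benign.
missed_malignant <- confusion["M", "B"] / sum(confusion["M", ])

print(c(error = error, missed_malignant = missed_malignant))
```

With real predictions, the same table can be built from the true labels and the predicted labels of the test set.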

ODRF documentation built on May 31, 2023, 8:22 p.m.