Wisconsin Prognostic Breast Cancer Data

Description

Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.

Usage

data("wpbc")

Format

A data frame with 198 observations on the following 34 variables.

status

a factor with levels N (nonrecur) and R (recur)

time

recurrence time (for status == "R") or disease-free time (for status == "N").

mean_radius

radius (mean of distances from center to points on the perimeter) (mean).

mean_texture

texture (standard deviation of gray-scale values) (mean).

mean_perimeter

perimeter (mean).

mean_area

area (mean).

mean_smoothness

smoothness (local variation in radius lengths) (mean).

mean_compactness

compactness (mean).

mean_concavity

concavity (severity of concave portions of the contour) (mean).

mean_concavepoints

concave points (number of concave portions of the contour) (mean).

mean_symmetry

symmetry (mean).

mean_fractaldim

fractal dimension (mean).

SE_radius

radius (mean of distances from center to points on the perimeter) (SE).

SE_texture

texture (standard deviation of gray-scale values) (SE).

SE_perimeter

perimeter (SE).

SE_area

area (SE).

SE_smoothness

smoothness (local variation in radius lengths) (SE).

SE_compactness

compactness (SE).

SE_concavity

concavity (severity of concave portions of the contour) (SE).

SE_concavepoints

concave points (number of concave portions of the contour) (SE).

SE_symmetry

symmetry (SE).

SE_fractaldim

fractal dimension (SE).

worst_radius

radius (mean of distances from center to points on the perimeter) (worst).

worst_texture

texture (standard deviation of gray-scale values) (worst).

worst_perimeter

perimeter (worst).

worst_area

area (worst).

worst_smoothness

smoothness (local variation in radius lengths) (worst).

worst_compactness

compactness (worst).

worst_concavity

concavity (severity of concave portions of the contour) (worst).

worst_concavepoints

concave points (number of concave portions of the contour) (worst).

worst_symmetry

symmetry (worst).

worst_fractaldim

fractal dimension (worst).

tsize

diameter of the excised tumor in centimeters.

pnodes

number of positive axillary lymph nodes observed at time of surgery.

Details

The first 30 features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
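Because the image-derived columns listed under Format share the prefixes mean_, SE_ and worst_, they can be picked out by name. The following is a minimal sketch; the helper fna_columns is hypothetical and not part of TH.data.

```r
# Hypothetical helper (not part of TH.data): return the names of the
# image-derived feature columns, identified by the prefixes used in Format.
fna_columns <- function(d) {
  grep("^(mean|SE|worst)_", names(d), value = TRUE)
}

# Only touch the real data if the package is installed.
if (requireNamespace("TH.data", quietly = TRUE)) {
  data("wpbc", package = "TH.data")
  # Summaries of the nuclear features only, leaving out status, time,
  # tsize and pnodes.
  print(summary(wpbc[, fna_columns(wpbc)]))
}
```

Selecting by prefix rather than by column position keeps the code readable and independent of the column order.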

There are two possible learning problems: predicting status or predicting the time to recur.

1) Predicting field 2 of the original UCI file (outcome, here status): R = recurrent, N = non-recurrent. The dataset should first be filtered to reflect a particular endpoint; e.g., recurrences before 24 months = positive, non-recurrence beyond 24 months = negative. An accuracy of 86.3% was estimated with a previous version of this data.

2) Predicting time to recur (field 3 of the original UCI file, here time, in recurrent records). Estimated mean error of 13.9 months using Recurrence Surface Approximation.
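The endpoint filtering described for the first problem can be sketched as follows. This is only an illustration under stated assumptions: the helper endpoint_subset and its cutoff argument are hypothetical, time is assumed to be recorded in months, and cases that fit neither rule (recurrence at or after the cutoff, or disease-free follow-up that ends at or before it) are simply dropped. The toy rows are made up and are not taken from wpbc.

```r
# Hypothetical helper (not part of TH.data): keep only cases whose outcome
# at `cutoff` months is covered by the two rules, and label them.
endpoint_subset <- function(d, cutoff = 24) {
  pos <- d$status == "R" & d$time < cutoff   # recurred before the cutoff
  neg <- d$status == "N" & d$time > cutoff   # disease-free beyond the cutoff
  out <- d[which(pos | neg), , drop = FALSE]
  out$endpoint <- factor(ifelse(out$status == "R", "positive", "negative"),
                         levels = c("negative", "positive"))
  out
}

# Toy rows, made up for illustration only.
toy <- data.frame(
  status = factor(c("R", "R", "N", "N"), levels = c("N", "R")),
  time   = c(10, 30, 12, 60)
)
res <- endpoint_subset(toy)
print(res)
```

With the real data, the same call would be endpoint_subset(wpbc), after which endpoint can replace status as the response in a classification model.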

The data are originally available from the UCI machine learning repository, see http://www.ics.uci.edu/~mlearn/databases/breast-cancer-wisconsin/.

Source

W. Nick Street, Olvi L. Mangasarian and William H. Wolberg (1995). An inductive learning approach to prognostic prediction. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 522–530, San Francisco, Morgan Kaufmann.

Peter Buehlmann and Torsten Hothorn (2007), Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.

Examples

    data("wpbc", package = "TH.data")

    ### fit logistic regression model 
    coef(glm(status ~ ., data = wpbc[,colnames(wpbc) != "time"],
             family = binomial()))