| gg_vimp | R Documentation |
gg_vimp extracts the variable importance (VIMP) information from an
rfsrc or randomForest
object and reshapes it into a tidy data set.
gg_vimp(object, nvar, ...)
object |
A |
nvar |
argument to control the number of variables included in the output. |
... |
arguments passed to the |
gg_vimp() shows permutation (Breiman-Cutler) variable
importance: the forest permutes a variable's observed values across the
out-of-bag (OOB) cases, runs those perturbed cases down the already-grown
trees, and measures how much the OOB prediction error climbs. That
perturbation is synthetic (the variable's link to the response is broken
on purpose), so a large increase means the variable was carrying genuine
signal; near-zero or negative values mean it added noise or nothing at all.
gg_varpro() takes the opposite route, comparing local
estimators on real observed data through varPro's release rules, with no
permutation and no synthetic features. The two approaches answer "which
variables matter?" by opposite mechanisms, so a variable can rank
differently under each, and that disagreement is itself informative: it
often signals interaction structure or non-monotone effects that one
mechanism surfaces and the other obscures.
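The permutation mechanism described above can be illustrated with a minimal base R sketch. It is not the forest's own computation: lm() stands in for the grown trees and a held-out split stands in for the OOB cases, and the names x_signal, x_noise, mse() and permutation_vimp() are hypothetical, introduced only for illustration.

```r
# Permutation importance sketch (assumptions: lm() replaces the forest,
# a held-out split replaces the OOB cases, all names are hypothetical).
set.seed(42)
n <- 400
dta <- data.frame(
  x_signal = rnorm(n), # drives the response
  x_noise = rnorm(n)   # unrelated to the response
)
dta$y <- 3 * dta$x_signal + rnorm(n, sd = 0.5)

train <- dta[1:300, ]
test <- dta[301:400, ]
fit <- lm(y ~ x_signal + x_noise, data = train)

# Prediction error of an already-fitted model on a data set.
mse <- function(model, newdata) {
  mean((newdata$y - predict(model, newdata = newdata))^2)
}

# Shuffle one variable, keep the fitted model fixed, and report the
# increase in prediction error over the unshuffled data.
permutation_vimp <- function(model, newdata, variable) {
  permuted <- newdata
  permuted[[variable]] <- sample(permuted[[variable]])
  mse(model, permuted) - mse(model, newdata)
}

vimp <- sapply(
  c("x_signal", "x_noise"),
  function(v) permutation_vimp(fit, test, v)
)
print(vimp)
```

Under these assumptions the score for x_signal should be large and the score for x_noise should sit near zero, possibly slightly negative, which is the pattern the positive column is meant to separate.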
For survival forests, VIMP is measured against the ensemble cumulative
hazard function (CHF); the error metric is one minus the concordance index
(C-statistic). Variables with non-positive VIMP are flagged in the
positive column and colored differently by
plot.gg_vimp.
gg_vimp object. A data.frame of VIMP measures, in rank
order, optionally containing class-specific scores and a relative importance
column. When randomForest objects lack stored importance values a
warning is issued and NA placeholders are returned so plots remain
reproducible.
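The shape of the returned data set can be mimicked in base R without fitting a forest. The sketch below uses made-up VIMP scores; the column names vars, vimp and rel_vimp, and the scaling of relative importance by the largest score, are assumptions made for illustration, while positive is the column named in the details above.

```r
# Hypothetical VIMP scores (made-up numbers, not output from a real forest).
# Column names `vars`, `vimp`, `rel_vimp` are assumptions for illustration.
raw <- data.frame(
  vars = c("age", "bili", "sex", "noise"),
  vimp = c(0.020, 0.065, 0.000, -0.004),
  stringsAsFactors = FALSE
)

# Rank order: largest VIMP first.
ranked <- raw[order(raw$vimp, decreasing = TRUE), ]
rownames(ranked) <- NULL

# Relative importance, scaled here by the largest score (an assumption).
ranked$rel_vimp <- ranked$vimp / max(ranked$vimp)

# Flag variables whose VIMP is strictly positive; zero and negative
# scores are the non-positive cases.
ranked$positive <- ranked$vimp > 0

# Keep only the top `nvar` variables, as the nvar argument would.
nvar <- 3
top <- head(ranked, nvar)
print(top)
```

With a real forest, the same inspection would start from the object returned by gg_vimp() rather than from a hand-built data frame.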
Ishwaran H. (2007). Variable importance in binary regression trees and forests, Electronic J. Statist., 1:519-537.
plot.gg_vimp rfsrc
vimp gg_varpro
## ------------------------------------------------------------
## classification example
## ------------------------------------------------------------
## -------- iris data
rfsrc_iris <- randomForestSRC::rfsrc(Species ~ .,
  data = iris,
  importance = TRUE
)
gg_dta <- gg_vimp(rfsrc_iris)
plot(gg_dta)
## ------------------------------------------------------------
## regression example
## ------------------------------------------------------------
## -------- air quality data
rfsrc_airq <- randomForestSRC::rfsrc(Ozone ~ ., airquality,
  importance = TRUE
)
gg_dta <- gg_vimp(rfsrc_airq)
plot(gg_dta)
## -------- Boston data
data(Boston, package = "MASS")
rfsrc_boston <- randomForestSRC::rfsrc(medv ~ ., Boston,
  importance = TRUE
)
gg_dta <- gg_vimp(rfsrc_boston)
plot(gg_dta)
## -------- Boston data (randomForest)
rf_boston <- randomForest::randomForest(medv ~ ., Boston)
gg_dta <- gg_vimp(rf_boston)
plot(gg_dta)
## -------- mtcars data
rfsrc_mtcars <- randomForestSRC::rfsrc(mpg ~ .,
  data = mtcars,
  importance = TRUE
)
gg_dta <- gg_vimp(rfsrc_mtcars)
plot(gg_dta)
## ------------------------------------------------------------
## survival example
## ------------------------------------------------------------
## -------- veteran data
data(veteran, package = "randomForestSRC")
rfsrc_veteran <- randomForestSRC::rfsrc(Surv(time, status) ~ .,
  data = veteran,
  ntree = 100,
  importance = TRUE
)
gg_dta <- gg_vimp(rfsrc_veteran)
plot(gg_dta)
## -------- pbc data
# We need to create this dataset
data(pbc, package = "randomForestSRC")
# Recode 0/1 columns as logical and any remaining column with
# at most five distinct values as a factor
for (ind in seq_len(dim(pbc)[2])) {
  if (!is.factor(pbc[, ind])) {
    if (length(unique(pbc[which(!is.na(pbc[, ind])), ind])) <= 2) {
      if (sum(range(pbc[, ind], na.rm = TRUE) == c(0, 1)) == 2) {
        pbc[, ind] <- as.logical(pbc[, ind])
      }
    }
  } else {
    if (length(unique(pbc[which(!is.na(pbc[, ind])), ind])) <= 2) {
      if (sum(sort(unique(pbc[, ind])) == c(0, 1)) == 2) {
        pbc[, ind] <- as.logical(pbc[, ind])
      }
      if (sum(sort(unique(pbc[, ind])) == c(FALSE, TRUE)) == 2) {
        pbc[, ind] <- as.logical(pbc[, ind])
      }
    }
  }
  # Use the scalar && inside if(): both operands are length one
  if (!is.logical(pbc[, ind]) &&
    length(unique(pbc[which(!is.na(pbc[, ind])), ind])) <= 5) {
    pbc[, ind] <- factor(pbc[, ind])
  }
}
# The age and follow-up time are recorded in days; convert both to years
pbc$age <- pbc$age / 365.24
pbc$years <- pbc$days / 365.24
pbc$days <- NULL
pbc$treatment <- as.numeric(pbc$treatment)
pbc$treatment[which(pbc$treatment == 1)] <- "DPCA"
pbc$treatment[which(pbc$treatment == 2)] <- "placebo"
pbc$treatment <- factor(pbc$treatment)
# Logical indexing avoids the empty result that -which() gives
# when no rows match
dta_train <- pbc[!is.na(pbc$treatment), ]
# Create a test set from the remaining patients
pbc_test <- pbc[is.na(pbc$treatment), ]
# ========
# build the forest:
rfsrc_pbc <- randomForestSRC::rfsrc(
  Surv(years, status) ~ .,
  dta_train,
  nsplit = 10,
  na.action = "na.impute",
  forest = TRUE,
  importance = TRUE,
  save.memory = TRUE
)
gg_dta <- gg_vimp(rfsrc_pbc)
plot(gg_dta)
# Restrict to only the top 10.
gg_dta <- gg_vimp(rfsrc_pbc, nvar = 10)
plot(gg_dta)