file <- "data/CPS1985.rds"
stopifnot(file.exists(file))
data <- readRDS(file)
head(data)
In this exercise we will use the CPS1985 data set, a random sample from the May 1985 US Current Population Survey. The data set provides information on the hourly wage in US dollars of 534 individuals together with 10 additional variables such as education, age and experience.
Simply download the file CPS1985.rds, or get it from the source http://lib.stat.cmu.edu/datasets/CPS_85_Wages, which also provides more detailed information on the data set. We can import/read this file using data <- readRDS(...). The file contains the following information:
wage: wage in US dollars per hour (numeric).
education: education in years (numeric).
experience: potential work experience in years; age - education - 6 (numeric).
age: age in years (numeric).
ethnicity: Caucasian, Hispanic, other (factor).
gender: male or female (factor).
union: Does the individual work on a union job? (factor).

We would like to find out how wage depends on the provided additional attributes.
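
The definition of experience in the list above implies that it is fully determined by age and education, which is worth keeping in mind when interpreting variable importance later. A minimal sketch of that arithmetic, assuming a small made-up data frame `toy` (the values are invented and are not taken from CPS1985):

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(age = c(40, 25, 50), education = c(12, 16, 10))

# Potential work experience as defined above: age - education - 6.
toy$experience <- toy$age - toy$education - 6
toy
```

The same expression could be compared against the experience column of the real data once it is loaded; whether every row matches exactly is not stated in the source.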
Our response variable is the logarithm of the numeric variable wage. As covariates we use the additional variables education, experience, age, ethnicity, gender and union.
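
Because the response is log(wage), anything a model predicts is on the log scale. The sketch below only illustrates how a transformation inside a formula is applied to the response and how exp() maps values back to US dollars; the data frame `toy` and the formula `f_toy` are hypothetical stand-ins, not part of the exercise.

```r
# Hypothetical toy data; not the CPS1985 data.
toy <- data.frame(wage = c(5, 10, 20), education = c(10, 12, 16))

# Same pattern as in the exercise: the log is taken inside the formula.
f_toy <- log(wage) ~ education

# model.frame() evaluates log(wage); model.response() extracts it.
mf <- model.frame(f_toy, data = toy)
y  <- model.response(mf)

y        # response on the log scale
exp(y)   # back on the original dollar scale
```

Note that applying exp() to a predicted mean of log wages does not, in general, give the predicted mean wage, so back-transformed predictions should be read with that caveat.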
Apply the forest-building function cforest to build a forest model as described in the following points:
Import the data set "CPS1985.rds".
Use ranger to build a forest model using the same parameters and compare it to the cforest model, e.g., based on predictions, the RMSE on the testing data or variable importance.

# Model formula: log wage explained by the six covariates
f <- log(wage) ~ education + experience + age + ethnicity + gender + union
library("partykit")
set.seed(4)
cf <- cforest(formula = f, data = data, ntree = 50)
# Predicted log wage for a new worker without a union job
newworker <- data.frame(education = 17, experience = 10, age = 37, ethnicity = "hispanic", gender = "female", union = "no")
predict(cf, newdata = newworker)
# Same worker, but on a union job
newworker2 <- newworker
newworker2$union <- "yes"
predict(cf, newdata = newworker2)
set.seed(4)
trainid <- sample(1:NROW(data), size = 356, replace = FALSE)
train <- data[trainid, ]
test <- data[-trainid, ]
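
The split above draws 356 of the 534 row indices without replacement (two thirds of the sample) and uses the remaining rows for testing. A minimal sketch to check that such a split has the expected sizes and no overlap, assuming a hypothetical one-column data frame `toy` in place of the real data:

```r
set.seed(4)
n <- 534                                 # number of individuals reported for CPS1985
toy <- data.frame(id = seq_len(n))       # hypothetical stand-in for `data`

# Same splitting scheme as above: sample row indices without replacement.
trainid <- sample(seq_len(n), size = 356, replace = FALSE)
train <- toy[trainid, , drop = FALSE]
test  <- toy[-trainid, , drop = FALSE]

c(train = nrow(train), test = nrow(test))  # sizes of the two parts
length(intersect(train$id, test$id))       # number of rows shared by both parts
```

Negative indexing with `-trainid` drops exactly the sampled rows, which is why the two parts cannot share an observation.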
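set.seed(4)
library("ranger")
# Fit both forests on the training data with 50 trees each
rf <- ranger(f, data = train, num.trees = 50)
cf <- cforest(f, data = train, ntree = 50)
# RMSE on the testing data: root of the mean squared error on the log scale
pred_cf <- predict(cf, newdata = test)
rmse_cf <- sqrt(mean((log(test$wage) - pred_cf)^2))
pred_rf <- predict(rf, data = test)$predictions
rmse_rf <- sqrt(mean((log(test$wage) - pred_rf)^2))
# Variable importance; ranger has to be refit with an importance mode
varimp(cf)
rf <- ranger(f, data = train, num.trees = 50, importance = "impurity")
importance(rf)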
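
The RMSE is usually defined as the square root of the mean squared error. A small sketch of that definition, using a hypothetical helper `rmse()` and invented numbers rather than actual model predictions:

```r
# Hypothetical helper: root mean squared error of predictions.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Invented log-wage values, for illustration only.
obs  <- c(2.0, 2.5, 3.0)
pred <- c(2.0, 2.0, 3.5)

rmse(obs, pred)

# Taking the root of the sum instead of the mean scales the result by sqrt(n).
sqrt(sum((obs - pred)^2)) / sqrt(length(obs))
```

Since both models are evaluated on the same test set, the sum-based and mean-based versions rank them identically; only the mean-based value is comparable across test sets of different size.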