Exercise: Wage

file <- "data/CPS1985.rds"
stopifnot(file.exists(file))
data <- readRDS(file)
head(data)

In this exercise we will use the CPS1985 data set, a random sample from the May 1985 US Current Population Survey. The data set provides information on the hourly wage in US dollars of 534 individuals together with 10 additional variables such as education, age and experience.

Download the file CPS1985.rds, or obtain the data from the source http://lib.stat.cmu.edu/datasets/CPS_85_Wages, which also provides more detailed information on the data set.

We can import this file using data <- readRDS(...) and inspect the variables it contains with head(data).

The Tasks

We would like to find out how wage depends on the provided additional attributes. Our response variable is the logarithm of the numeric variable wage. As covariates we use the additional variables education, experience, age, ethnicity, gender and union.

Apply the forest-building function cforest to build a forest model as described in the following points:

f <- log(wage) ~ education + experience + age + ethnicity + gender + union

library("partykit")

set.seed(4)
cf <- cforest(formula = f, data = data, ntree = 50)

newworker <- data.frame(education = 17,
                        experience = 10,
                        age = 37,
                        ethnicity = "hispanic",
                        gender = "female",
                        union = "no")


predict(cf, newdata = newworker)

newworker2 <- newworker
newworker2$union <- "yes"
predict(cf, newdata = newworker2)
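
Because the response is log(wage), the two predictions above are on the log scale. The sketch below shows how such values could be turned back into hourly wages and compared. The numbers in log_pred are made-up placeholders, not output of the fitted forest, and exponentiating a predicted mean of log(wage) gives a typical wage rather than an exact mean wage.

```r
# Hypothetical log-scale predictions; these numbers are made up for
# illustration and are NOT output of the forest fitted above.
log_pred <- c(no_union = 2.10, union = 2.30)

# Back-transform to US dollars per hour.
wage_pred <- exp(log_pred)
round(wage_pred, 2)

# A difference on the log scale corresponds to a ratio of wages.
wage_ratio <- unname(exp(log_pred["union"] - log_pred["no_union"]))
round(wage_ratio, 3)
```

With real predictions, replace log_pred by c(predict(cf, newdata = newworker), predict(cf, newdata = newworker2)).
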
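
# Split the data into a training set (356 rows) and a test set (the rest).
set.seed(4)
trainid <- sample(seq_len(nrow(data)), size = 356, replace = FALSE)
train <- data[trainid, ]
test <- data[-trainid, ]

# Fit a random forest with ranger on the training data.
set.seed(4)
library("ranger")
rf <- ranger(f, data = train, num.trees = 50)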
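
The split above draws 356 of the 534 row indices without replacement and uses the remaining rows as the test set. A minimal sketch on a simulated stand-in data frame (toy is hypothetical, not the CPS1985 data) shows that the two parts have the expected sizes and do not overlap:

```r
set.seed(4)

# Simulated stand-in with the same number of rows as CPS1985.
toy <- data.frame(id = 1:534, x = rnorm(534))

# Draw 356 distinct row indices for training; the rest form the test set.
trainid <- sample(seq_len(nrow(toy)), size = 356, replace = FALSE)
train <- toy[trainid, ]
test <- toy[-trainid, ]

# Sizes of the two parts and the number of rows they share.
split_sizes <- c(train = nrow(train), test = nrow(test))
n_shared <- length(intersect(train$id, test$id))
split_sizes
n_shared
```

Negative indexing with -trainid only works as intended because trainid holds positive row positions; with logical or name-based indices a different complement would be needed.
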

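# Fit the conditional inference forest on the same training data.
cf <- cforest(f, data = train, ntree = 50)

# Root mean squared error on the test set, on the log(wage) scale.
pred_cf <- predict(cf, newdata = test)
rmse_cf <- sqrt(mean((log(test$wage) - pred_cf)^2))

pred_rf <- predict(rf, data = test)$predictions
rmse_rf <- sqrt(mean((log(test$wage) - pred_rf)^2))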
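
The root mean squared error is the square root of the mean squared prediction error, so it does not grow with the size of the test set. A small helper makes this explicit; rmse() is a hypothetical name introduced here and is not part of partykit or ranger, and the numbers are toy values.

```r
# Hypothetical helper: root mean squared error of predictions.
rmse <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  sqrt(mean((observed - predicted)^2))
}

# Toy values: errors are 0.5, 0, 1 and 0.
obs <- c(2.0, 2.5, 3.0, 1.5)
pred <- c(2.5, 2.5, 2.0, 1.5)
rmse(obs, pred)  # sqrt(0.3125), about 0.559
```

Applied to the exercise, this would be rmse(log(test$wage), pred_cf) and rmse(log(test$wage), pred_rf).
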

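# Permutation-based variable importance for the cforest.
varimp(cf)
# Refit ranger with impurity importance; this measure is on a different
# scale than varimp(), so compare rankings rather than raw values.
rf <- ranger(f, data = train, num.trees = 50, importance = "impurity")
importance(rf)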
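
varimp() for a cforest follows the permutation principle: a variable is important if shuffling its values makes the predictions worse. The sketch below illustrates that idea only. It swaps in a linear model on simulated data as a stand-in for the forest and evaluates on the training data for brevity, so it is not what partykit computes internally.

```r
set.seed(1)

# Simulated data: y depends on x1 only; x2 is pure noise by construction.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n, sd = 0.5)

# Stand-in model (an assumption for illustration, not a forest).
fit <- lm(y ~ x1 + x2, data = toy)

# Mean squared error of a model on a data frame with column y.
mse <- function(model, d) mean((d$y - predict(model, newdata = d))^2)
base_mse <- mse(fit, toy)

# Increase in MSE after shuffling one covariate at a time.
perm_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - base_mse
})
perm_importance
```

Here x1 receives a much larger value than x2, matching how the data were generated.
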



partykit documentation built on April 14, 2023, 5:09 p.m.