Predict whether income exceeds \$50K/yr based on census data. Also known as "Adult" dataset. Extraction was done by Barry Becker from the 1994 Census database. Prediction task is to determine whether a person makes over 50K a year. See the data source and description for more information. These data are also used for demonstrating Tensorflow.
The biggest drivers for predicting income over \$50k are: marital status (married is better), education (more is better), and sex (male is better). We will explore the continuous and categorical predictors before building statistical models. Data manipulation is carried out in dplyr
and visualizations are done in ggplot2
and plotly
.
knitr::opts_chunk$set(warning = FALSE, message = FALSE) library(tidyverse) library(plotly)
First we read the raw data, converting missing values from ?
to NA
. We then convert the target variable income_bracket
into a numeric value, create a new column age_buckets
, and remove records with missing values.
train_raw <- read_csv( cloudml::gs_data("gs://rstudio-cloudml-demo-ml/census/data/adult.data"), col_names = c( "age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week", "native_country", "income_bracket" ), na = "?") train <- train_raw %>% mutate(label = ifelse(income_bracket == ">50K" | income_bracket == ">50K.", 1, 0)) %>% mutate(age_buckets = cut(age, c(16, 18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 90))) %>% select(label, gender, native_country, education, education_num, occupation, workclass, marital_status, race, age_buckets) %>% na.omit
Most of the columns in the census data are categorical. We plot a few of the most important columns here. The complete list of categorical columns are:
plot.main.effects <- function(data, x, y){ data %>% mutate_(group = x, metric = y) %>% group_by(group) %>% summarize(percent = 100 * mean(metric)) %>% ggplot(aes(x = reorder(group, percent), percent)) + geom_bar(stat="identity", fill = "lightblue4") + coord_flip() + labs(y = "Percent", x = "") + ggtitle(paste("Percent surveyed with incomes over $50k by", x)) } plot.main.effects(train, "marital_status", "label") plot.main.effects(train, "gender", "label") plot.main.effects(train, "education", "label")
We can compare the distribution of the categorical variables for those who earn more than \$50k and those who earn less. The complete list of categorical variables are:
plot.continuous <- function(data, x, y, alpha = 0.2, ...){ lab <- stringr::str_replace_all(y, "_", " ") %>% stringr::str_to_title(y) data %>% select_(groups = x, y = y) %>% na.omit %>% ggplot(aes(y, fill = groups)) + geom_density(alpha = alpha, ...) + labs(x = lab, y = "") + ggtitle(paste0("Income by ", lab)) } # People who earn more also work more, are better educated, and are older plot.continuous(train_raw, "income_bracket", "age") plot.continuous(train_raw, "income_bracket", "education_num", adjust = 5) plot.continuous(train_raw, "income_bracket", "hours_per_week", adjust = 5)
We can examine some two-way and three-way intearcations with choropleth maps:
train %>% select(education_num, age_buckets, label) %>% group_by(age_buckets, education_num) %>% summarize(percent = 100 * mean(label)) %>% ggplot(aes(education_num, age_buckets, fill = percent)) + geom_tile() + labs(x = "Education", y = "Age") + ggtitle("Percent surveyed with incomes over $50k by age, education") train %>% select(age_buckets, education_num, occupation, label) %>% group_by(age_buckets, education_num, occupation) %>% summarize(percent = 100 * mean(label)) %>% ggplot(aes(education_num, age_buckets, fill = percent)) + geom_tile() + facet_wrap( ~ occupation) + labs(x = "Education", y = "Age") + ggtitle("Percent surveyed with incomes over $50k by age, education, and occupation")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.