In dgrtwo/stacksurveyr: Stack Overflow 2016 Developer Survey Results

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-figures/",
  message = FALSE,
  warning = FALSE
)

library(ggplot2)
theme_set(theme_bw())

2016 Stack Overflow Developer Survey Results

Results of the Stack Overflow Developer Survey, wrapped in a convenient R package for easy analysis.

Install using devtools:

devtools::install_github("dgrtwo/stacksurveyr")

Data

This package shares the survey results as two datasets. First is stack_survey:

library(dplyr)
library(stacksurveyr)
stack_survey

This contains one row for each survey respondent and one column for each question. It follows the format of the released survey dataset at stackoverflow.com/research, with some post-processing to turn questions with a natural order (such as "experience range") into ordered factors.
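As a small illustration of why ordered factors are convenient, here is a base R sketch on made-up data. The level names below are hypothetical and are not taken from stack_survey; check levels() on the real column before relying on any particular label.

```r
# Hypothetical experience ranges; the actual levels in stack_survey may differ
levels_in_order <- c("Less than 1 year", "1 - 2 years", "2 - 5 years")

experience <- factor(
  c("2 - 5 years", "Less than 1 year", "1 - 2 years", "2 - 5 years"),
  levels = levels_in_order,
  ordered = TRUE
)

# Comparisons follow the declared level order rather than alphabetical order
at_least_one_year <- experience >= "1 - 2 years"
at_least_one_year

# Sorting and tabulating also respect the level order
sorted_experience <- sort(experience)
sorted_experience
table(experience)
```

The same kind of comparison can be used inside filter() to keep respondents at or above a given range.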

The package also contains a schema data frame describing each of the columns in stack_survey, including the original text of each question:

stack_schema

Each question has one of three types:

single columns have a single answer on a multiple choice question
multi columns allowed multiple answers, which are delimited by ; in the text
inferred columns are not themselves survey questions, but are processed versions of other answers

Examples: Basic exploration

Many simple questions can be answered with this data, particularly using the dplyr package. For example, we can examine the most common occupations among respondents:

stack_survey %>%
  count(occupation, sort = TRUE)

We can also use group_by and summarize to relate columns to each other, for example to find the occupations with the highest average salary:

salary_by_occupation <- stack_survey %>%
  filter(occupation != "other") %>%
  group_by(occupation) %>%
  summarize(average_salary = mean(salary_midpoint, na.rm = TRUE)) %>%
  arrange(desc(average_salary))

salary_by_occupation

This can be visualized in a bar plot:

library(ggplot2)
library(scales)

salary_by_occupation %>%
  mutate(occupation = reorder(occupation, average_salary)) %>%
  ggplot(aes(occupation, average_salary)) +
  geom_bar(stat = "identity") +
  ylab("Average salary (USD)") +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip()

Examples: Multi-response answers

Several of the questions allow multiple responses; these are marked with type "multi" in the stack_schema data frame (sum(stack_schema$type == "multi") gives the exact count):

stack_schema %>%
  filter(type == "multi")

In these cases, the responses are delimited by ;. For example, see the tech_do column ("Which of the following languages or technologies have you done extensive development with in the last year?"):

stack_survey %>%
  filter(!is.na(tech_do)) %>%
  select(tech_do)

Often, these columns are easier to work with and analyze when they are "unnested" into one user-answer pair per row. The package provides the stack_multi function as a shortcut for that unnesting:

stack_multi("tech_do")
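To make the idea of unnesting concrete, here is a rough sketch on a toy data frame. It is not the package's actual implementation of stack_multi. It assumes the tidyr package is installed, uses made-up respondent IDs and answers, and assumes answers are separated by ; with optional whitespace. The real stack_multi output may contain additional columns.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for stack_survey; the IDs and answers are made up
toy_survey <- tibble(
  respondent_id = c(1, 2, 3),
  tech_do = c("R; Python", "JavaScript", NA)
)

toy_unnested <- toy_survey %>%
  # drop respondents who skipped the question
  filter(!is.na(tech_do)) %>%
  # split on ";" plus any following whitespace, giving one row per answer
  separate_rows(tech_do, sep = ";\\s*") %>%
  rename(answer = tech_do)

toy_unnested
```

Each respondent now appears once per selected technology, which is the shape that count, group_by, and joins work with most naturally.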

For example, we could find the most common answers:

stack_multi("tech_do") %>%
  count(tech = answer, sort = TRUE)

We can join this with the stack_survey dataset using the respondent_id column. For example, we could look at the most common development technologies used by data scientists:

stack_survey %>%
  filter(occupation == "Data scientist") %>%
  inner_join(stack_multi("tech_do"), by = "respondent_id") %>%
  count(answer, sort = TRUE)

Or we could find out the average age and salary of people using each technology, and compare them:

stack_survey %>%
  inner_join(stack_multi("tech_do"), by = "respondent_id") %>%
  group_by(answer) %>%
  # across() replaces the deprecated summarize_each()/funs() idiom
  summarize(across(c(age_midpoint, salary_midpoint),
                   ~ mean(.x, na.rm = TRUE))) %>%
  ggplot(aes(age_midpoint, salary_midpoint)) +
  geom_point() +
  geom_text(aes(label = answer), vjust = 1, hjust = 1) +
  xlab("Average age of people using this technology") +
  ylab("Average salary (USD)") +
  scale_y_continuous(labels = dollar_format())

License

The package, code, and examples are licensed under the GPL-3 license.

The survey data itself (which is contained in the data-raw directory and available online) is made available by Stack Exchange, Inc. under the Open Database License (ODbL). Any rights in individual contents of the database are licensed under the Database Contents License (DbCL).

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

dgrtwo/stacksurveyr documentation built on May 15, 2019, 8:20 a.m.
