knitr::opts_chunk$set( collapse = TRUE, comment = "#>", out.width = "100%", dpi = 300 ) ## so jittered figs don't always appear to be changed set.seed(1)
The is a data package with an excerpt from the Gapminder data.
The main object in this package is the gapminder
data frame or "tibble".
There are other goodies, such as the data in tab delimited form, a larger unfiltered dataset, premade color schemes for the countries and continents, and ISO 3166-1 country codes.
The gapminder
and gapminder_unfiltered
data frames include six variables, (Gapminder.org documentation page):
| variable | meaning | |:------------|:-------------------------| | country | | | continent | | | year | | | lifeExp | life expectancy at birth | | pop | total population | | gdpPercap | per-capita GDP |
Per-capita GDP (Gross domestic product) is given in units of international dollars, "a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time" -- 2005, in this case.
The package contains two main data frames or tibbles:
gapminder
: 12 rows for each country (1952, 1957, ..., 2007). It's a subset of ...gapminder_unfiltered
: more lightly filtered and therefore about twice as many rows.Note: this package exists for the purpose of teaching and making code examples. It is an excerpt of data found in specific spreadsheets on Gapminder.org circa 2010. It is not a definitive source of socioeconomic data and I don't update it. Use other data sources if it's important to have the current best estimate of these statistics.
Install gapminder
from CRAN:
install.packages("gapminder")
Load it and test drive with some data aggregation and plotting:
library(gapminder) library(dplyr) library(ggplot2) aggregate(lifeExp ~ continent, gapminder, median) gapminder %>% filter(year == 2007) %>% group_by(continent) %>% summarise(lifeExp = median(lifeExp)) ggplot(gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot(outlier.colour = "hotpink") + geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1 / 4)
country_colors
and continent_colors
are provided as character vectors where elements are hex colors and the names are countries or continents.
head(country_colors, 4) head(continent_colors)
knitr::include_graphics("../man/figures/gapminder-color-scheme-ggplot2.png")
The color schemes are available as:
continent-colors.tsv
and country-colors.tsv
, which are part of the installed packageProvide country_colors
to scale_color_manual()
like so:
... + scale_color_manual(values = country_colors) + ...
library("ggplot2") ggplot( subset(gapminder, continent != "Oceania"), aes(x = year, y = lifeExp, group = country, color = country) ) + geom_line(linewidth = 0.8, show.legend = FALSE) + facet_wrap(~continent) + scale_color_manual(values = country_colors) + theme_bw() + theme( strip.text = element_text(size = rel(0.8)), axis.text.x = element_text(angle = 45, hjust=1) )
# for convenience, integrate the country colors into the data.frame gap_with_colors <-data.frame( gapminder, cc = I(country_colors[match(gapminder$country, names(country_colors))]) ) # bubble plot, focus just on Africa and Europe in 2007 keepers <- with( gap_with_colors, continent %in% c("Africa", "Europe") & year == 2007 ) plot(lifeExp ~ gdpPercap, gap_with_colors, subset = keepers, log = "x", pch = 21, cex = sqrt(gap_with_colors$pop[keepers] / pi) / 1500, bg = gap_with_colors$cc[keepers] )
The country_codes
data frame provides ISO 3166-1 country codes for all the countries in the gapminder
and gapminder_unfiltered
data frames.
This can be used to practice joining or merging.
library(dplyr) gapminder %>% filter(year == 2007, country %in% c("Kenya", "Peru", "Syria")) %>% select(country, continent) %>% left_join(country_codes)
gapminder
good for?This excerpt has been used in STAT 545, in R-flavored Software Carpentry Workshops and a ggplot2
tutorial. gapminder
is very useful for teaching novices data wrangling and visualization in R.
Description:
r nrow(gapminder)
observations; fills a size niche between iris
(150 rows) and the likes of diamonds
(54K rows)r ncol(gapminder)
variablescountry
a factor with r nlevels(gapminder$country)
levelscontinent
, a factor with r nlevels(gapminder$continent)
levelsyear
: going from 1952 to 2007 in increments of 5 yearspop
: populationgdpPercap
: GDP per capitalifeExp
: life expectancyThere are 12 rows for each country in gapminder
, i.e. complete data for 1952, 1957, ..., 2007.
The two factors provide opportunities to demonstrate factor handling, in aggregation and visualization, for factors with very few and very many levels.
The four quantitative variables are generally quite correlated with each other and these trends have interesting relationships to country
and continent
, so you will find that simple plots and aggregations tell a reasonable story and are not completely boring.
Visualization of the temporal trends in life expectancy, by country, is particularly rewarding, since there are several countries with sharp drops due to political upheaval. This then motivates more systematic investigations via data aggregation to proactively identify all countries whose data exhibits certain properties.
Data cleaning code cannot be clean. It's a sort of sin eater.
— Stat Fact (@StatFact) July 25, 2014
The data-raw
directory contains the Excel spreadsheets downloaded from Gapminder in 2008 and 2009 and all the scripts necessary to create everything in this package, in raw and "compiled notebook" form.
If you want to practice importing from file, various tab delimited files are included:
gapminder.tsv
: the same dataset available via library("gapminder"); gapminder
gapminder-unfiltered.tsv
: the larger dataset available via library("gapminder"); gapminder_unfiltered
.continent-colors.tsv
and country-colors.tsv
: color schemesIn the package's GitHub repo, these delimited files can be found:
inst/extdata/
sub-directoryOnce you've installed the gapminder
package they can be found locally and used like so:
gap_tsv <- system.file("extdata", "gapminder.tsv", package = "gapminder") gap_tsv <- read.delim(gap_tsv) str(gap_tsv) gap_tsv %>% # Bhutan did not make the cut because data for only 8 years :( filter(country == "Bhutan") gap_bigger_tsv <- system.file("extdata", "gapminder-unfiltered.tsv", package = "gapminder") gap_bigger_tsv <- read.delim(gap_bigger_tsv) str(gap_bigger_tsv) gap_bigger_tsv %>% # Bhutan IS here though! :) filter(country == "Bhutan") read.delim( system.file("extdata", "continent-colors.tsv", package = "gapminder") ) head( read.delim( system.file("extdata", "country-colors.tsv", package = "gapminder") ) )
Gapminder's data is released under the Creative Commons Attribution 3.0 Unported license. See their terms of use.
Run this command to get info on how to cite this package. If you've installed gapminder from CRAN, the year will be populated and populated correctly (unlike below).
citation("gapminder")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.