    library(wid)
    library(knitr)
    library(tidyverse)
The World Wealth and Income Database (WID.world) is an extensive source on the historical evolution of the distribution of income and wealth both within and between countries. It relies on the combined effort of an international network of over a hundred researchers covering more than seventy countries from all continents.
Anyone can access and plot the data through the website WID.world. For more advanced users, we provide the R package `wid`, which lets them download the data from WID.world directly into R.\footnote{A similar package for Stata users exists: see \texttt{\url{http://econpapers.repec.org/software/bocbocode/s458357.htm}}.} It exports a single function called `download_wid`. This vignette explains how to use it.
The command `download_wid` has the following arguments:

```r
download_wid(
    indicators,             # Codes corresponding to indicators to retrieve
    areas,                  # Areas (mostly countries) for which to retrieve the indicators
    years,                  # Years for which to retrieve the indicators
    perc,                   # Percentiles (part of the distribution)
    ages,                   # Age groups (adults, all ages, elderly, etc.)
    pop,                    # Population type (individual, households, tax units, etc.)
    metadata,               # Logical: should it fetch metadata too (e.g. sources, etc.)
    verbose,                # Logical: should it display messages showing progress
    include_extrapolations  # Logical: should it include data based on extrapolations/interpolations
)
```

\paragraph{Indicators} The argument `indicators` is a vector of 6-letter codes, each of which corresponds to a given series type for a given income or wealth concept. The first letter corresponds to the type of series. Some of the most common possibilities include:

\begin{tabularx}{\textwidth}{p{5cm}X}
\toprule
one-letter code & description \\
\midrule
\texttt{a} & average \\
\texttt{s} & share \\
\texttt{t} & threshold \\
\texttt{m} & macroeconomic total \\
\texttt{w} & wealth/income ratio \\
\bottomrule
\end{tabularx}

Type `?wid_series_type` to access the complete list. The next five letters correspond to a concept (usually of income or wealth). Some of the most common possibilities include:

\begin{tabularx}{\textwidth}{p{5cm}X}
\toprule
five-letter code & description \\
\midrule
\texttt{ptinc} & pre-tax national income \\
\texttt{pllin} & pre-tax labor income \\
\texttt{pkkin} & pre-tax capital income \\
\texttt{fiinc} & fiscal income \\
\texttt{hweal} & net personal wealth \\
\bottomrule
\end{tabularx}

Type `?wid_concepts` to access the complete list. For example, \texttt{sfiinc} corresponds to the share of fiscal income, and \texttt{ahweal} corresponds to average personal wealth. If you don't specify any indicator, it defaults to `"all"` and downloads all available indicators.
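Because an indicator code is simply the one-letter series type followed by the five-letter concept code, a vector of codes can be assembled programmatically. The sketch below is only an illustration: `make_indicator` is a hypothetical helper, not a function exported by the `wid` package.

```r
# Hypothetical helper (NOT part of the wid package): combine one-letter
# series types with five-letter concept codes into 6-letter indicators.
make_indicator <- function(series_type, concept) {
  stopifnot(all(nchar(series_type) == 1), all(nchar(concept) == 5))
  # outer() builds every type/concept combination; as.vector() flattens it
  as.vector(outer(series_type, concept, paste0))
}

# Shares and averages of pre-tax national income and net personal wealth
codes <- make_indicator(c("s", "a"), c("ptinc", "hweal"))
print(codes)
```

The resulting character vector can then be passed as the `indicators` argument.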
\paragraph{Area codes} All data in WID.world is associated with a given area, which can be a country, a region within a country, an aggregation of countries (e.g. a continent), or even the whole world. The argument `areas` is a vector of codes that specifies the areas for which to retrieve data. Countries and world regions are coded using 2-letter ISO codes. Country subregions are coded as \texttt{XX-YY}, where \texttt{XX} is the country 2-letter code. Type `?wid_area_codes` to access the complete list of area codes. If you don't specify any area, it defaults to `"all"` and downloads data for all available areas.

\paragraph{Years} All data in WID.world corresponds to a year. Some series go as far back as the 1800s. The argument `years` is a vector of integers that specifies those years. If you don't specify any year, it defaults to `"all"` and downloads data for all available years.

\paragraph{Percentiles} The key feature of WID.world is that it provides data on the whole distribution, not just totals and averages. The argument `perc` is a vector of strings that indicates for which part of the distribution the data should be retrieved. For share and average variables, percentiles correspond to percentile ranges and take the form \texttt{pXXpYY}. For example, the top 1% share corresponds to \texttt{p99p100}. The top 10% share excluding the top 1% is \texttt{p90p99}. Thresholds associated with the percentile group \texttt{pXXpYY} correspond to the minimal income or wealth level that gets you into the group. For example, the threshold of the percentile group \texttt{p90p100} or \texttt{p90p91} corresponds to the 90% quantile. Variables with no distributional meaning use the percentile \texttt{p0p100}. See \texttt{\url{http://wid.world/percentiles}} for more details. If you don't specify any percentile, it defaults to `"all"` and downloads data for all available parts of the distribution.

\paragraph{Age groups} Data may only concern the population in a certain age group.
The argument `ages` is a vector of age codes that specifies which age categories to retrieve. Ages are coded using 3-digit codes. Some of the most common possibilities include:

\begin{tabularx}{\textwidth}{p{5cm}X}
\toprule
3-digit code & description \\
\midrule
\texttt{999} & all ages \\
\texttt{992} & adults, including elderly (20+) \\
\texttt{996} & adults, excluding elderly (20-65) \\
\bottomrule
\end{tabularx}

Type `?wid_age_codes` to access the complete list of age codes. If you don't specify any age, it defaults to `"all"` and downloads data for all available age groups.

\paragraph{Population types} The data in WID.world can refer to different types of population (i.e. different statistical units). The argument `pop` is a vector of population codes. They are coded using one-letter codes. Some of the most common possibilities include:

\begin{tabularx}{\textwidth}{p{4cm}X}
\toprule
one-letter code & description \\
\midrule
\texttt{i} & individuals \\
\texttt{t} & tax units \\
\texttt{j} & equal-split adults (i.e. income or wealth divided equally among spouses) \\
\bottomrule
\end{tabularx}

Type `?wid_population_codes` to access the complete list of population types. If you don't specify any code, it defaults to `"all"` and downloads data for all types of population.

\paragraph{Metadata} All data in WID.world is associated with metadata giving, in particular, sources and methodological details. If the argument `metadata` is `TRUE`, the command will download those as well. Default is `FALSE`.

\paragraph{Extrapolations/interpolations} Some of the data on WID.world is the result of interpolations (when data is only available for a few years) or extrapolations (when data is not available for the most recent years) that are based on much more limited information than other data points. We include these interpolations/extrapolations by default as a convenience, and also because these values are used to perform regional aggregations.
Yet we stress that these estimates, especially at the level of individual countries, can be fragile. For many purposes, it can be preferable to exclude these data points. To do so, use the option `include_extrapolations = FALSE`.

\paragraph{Verbose} By default, the command is silent. If you set `verbose = TRUE`, it will output some information on the progress of the request.

# Usage

Although all arguments default to `"all"`, you cannot download the entire database by typing `download_wid()`. The command requires you to specify either some indicators or some areas. To download the entire database, please visit \url{https://wid.world/data/} and choose ``download full dataset''.

If there is no data matching your selection on WID.world (maybe because you specified an indicator or an area that doesn't exist), the command will return `NULL` with a warning.

The command returns a sorted `data.frame` with the following columns: `country`, `variable`, `percentile`, `year` and `value`.

All monetary amounts for countries and country subregions are in constant local currency of the reference year (i.e. the previous year, the database being updated every year around July). Monetary amounts for world regions are in EUR PPP of the reference year. You can access the price index using the indicator \texttt{inyixx}, the PPP exchange rates using \texttt{xlcusp} (USD), \texttt{xlceup} (EUR), \texttt{xlcyup} (CNY), and the market exchange rates using \texttt{xlcusx} (USD), \texttt{xlceux} (EUR), \texttt{xlcyux} (CNY). To check the current reference year, you can look at when the price index is equal to 1.

Shares and wealth/income ratios are given as a fraction of 1. That is, a top 1% share of 20% is given as 0.2. A wealth/income ratio of 300% is given as 3.

# Examples

## Top 1\% income share in the United States, 2010--2015

Here we simply seek the top 1% shares of pre-tax national income in the United States over the period 2010--2015.
The function `download_wid` returns a `data.frame` with the desired data.

```r
data <- download_wid(
    indicators = "sptinc", # Shares of pre-tax national income
    areas = "US",          # In the United States
    years = 2010:2015,     # Time period: 2010-2015
    perc = "p99p100"       # Top 1% only
)
kable(data) # Pretty display of the data.frame
```
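As noted in the Usage section, `download_wid` returns `NULL` with a warning when nothing matches the request, so it can be worth validating the result before using it. The following is a minimal sketch: `check_wid_result` is a hypothetical helper, not part of the `wid` package, and the data frame is made up so that the example runs without a network connection.

```r
# Hypothetical helper (NOT part of the wid package): validate what
# download_wid() returned before using it further.
check_wid_result <- function(x) {
  if (is.null(x)) {
    stop("No data matched the request; check the indicator and area codes.")
  }
  expected <- c("country", "variable", "percentile", "year", "value")
  missing_cols <- setdiff(expected, colnames(x))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(x)
}

# Toy data.frame with made-up labels and values, standing in for a download
toy <- data.frame(
  country = "US", variable = "sptinc", percentile = "p99p100",
  year = 2010:2011, value = c(0.2, 0.2)
)
check_wid_result(toy) # passes silently

# Shares are fractions of 1, so multiply by 100 to express them in percent
toy$value_pct <- 100 * toy$value
```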
If we also request the metadata, the `data.frame` also contains additional columns with extra information.

    data <- download_wid(
        indicators = "sptinc", # Shares of pre-tax national income
        areas = "US",          # In the United States
        years = 2010:2015,     # Time period: 2010-2015
        perc = "p99p100",      # Top 1% only
        metadata = TRUE        # Also request metadata
    )
    colnames(data)

Here, the metadata is the same for all observations because we only requested one variable.
In this example, we still select only one indicator, but we ask for two different percentiles. The function still returns a `data.frame` in “long” format, which makes it easy to plot with `ggplot2`.

    data <- download_wid(
        indicators = "shweal",            # Shares of personal wealth
        areas = "FR",                     # In France
        perc = c("p90p100", "p99p100")    # Top 10% and top 1%
    )

    library(ggplot2)
    library(scales)

    ggplot(data) +
        geom_line(aes(x = year, y = value, color = percentile)) +
        ylab("top share") +
        scale_y_continuous(labels = percent) +
        scale_color_discrete(labels = c("p90p100" = "top 10%", "p99p100" = "top 1%")) +
        ggtitle("Top 1% and top 10% personal wealth shares in France, 1800-2015")

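Because the output is in long format, combining percentile groups takes a small reshaping step. As a minimal sketch, the share of the top 10% excluding the top 1% (the `p90p99` group) can be derived by subtracting the `p99p100` share from the `p90p100` share. The data frame below contains made-up values in place of a real download, and the column names `value_top10`, `value_top1` and `value_next9` are chosen only for this illustration.

```r
# Made-up shares standing in for the output of download_wid()
toy <- data.frame(
  country = "FR",
  percentile = rep(c("p90p100", "p99p100"), each = 2),
  year = rep(c(2000, 2001), times = 2),
  value = c(0.50, 0.52, 0.20, 0.21)
)

# Split the long data by percentile group, then join the groups by year
top10 <- toy[toy$percentile == "p90p100", c("country", "year", "value")]
top1 <- toy[toy$percentile == "p99p100", c("country", "year", "value")]
wide <- merge(top10, top1, by = c("country", "year"),
              suffixes = c("_top10", "_top1"))

# Shares of disjoint groups add up, so the "next 9%" is the difference
wide$value_next9 <- wide$value_top10 - wide$value_top1
print(wide)
```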
We now focus solely on the bottom half of the population (`p0p50`), and look at the average pre-tax national income in three different countries (France, United States and China). Since we are looking at monetary amounts for three different countries, we need to convert them into the same currency using the purchasing power parities in the database.

    # We use the tidyverse to manipulate the data, see http://tidyverse.org
    library(tidyverse)

    # Average incomes data
    data <- download_wid(
        indicators = "aptinc",        # Average pre-tax national income
        areas = c("FR", "CN", "US"),  # France, China and United States
        perc = "p0p50",               # Bottom half of the population
        pop = "j",                    # Equal-split individuals
        years = 1978:2015
    ) %>% rename(value_lcu = value)

    # Purchasing power parities with US dollar
    ppp <- download_wid(
        indicators = "xlcusp",        # US PPP
        areas = c("FR", "CN", "US"),  # France, China and United States
        years = 2016                  # Reference year only
    ) %>% rename(ppp = value) %>% select(-year, -percentile)

    # Convert from local currency to PPP US dollar
    data <- merge(data, ppp, by = "country") %>%
        mutate(value_ppp = value_lcu/ppp)

    ggplot(data) +
        geom_line(aes(x = year, y = value_ppp, color = country)) +
        ylab("2016 $ PPP") +
        scale_color_discrete(labels = c("CN" = "China", "US" = "USA", "FR" = "France")) +
        ggtitle("Bottom 50% pre-tax national income")

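The code above hard-codes 2016 as the reference year. Since the reference year is the one in which the price index (`inyixx`) equals 1, it could instead be looked up from the data. A minimal sketch of that lookup, using a made-up price index series in place of a real download:

```r
# Made-up price index series, standing in for what a request for the
# "inyixx" indicator might return for one country
price_index <- data.frame(
  country = "FR",
  year = 2013:2016,
  value = c(0.97, 0.98, 0.99, 1.00)
)

# Pick the year whose index is closest to 1 (tolerates rounding error)
ref_year <- price_index$year[which.min(abs(price_index$value - 1))]
print(ref_year)
```

The value of `ref_year` could then be passed as the `years` argument when requesting the PPP exchange rates.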
We now plot the evolution of average net national income per adult in France, Germany, the United Kingdom and the United States.

    # Average national income data
    data <- download_wid(
        indicators = "anninc",              # Average net national income
        areas = c("FR", "US", "DE", "GB"),
        ages = 992                          # Adults
    ) %>% rename(value_lcu = value)

    # Purchasing power parities with US dollar
    ppp <- download_wid(
        indicators = "xlcusp",              # US PPP
        areas = c("FR", "US", "DE", "GB"),  # France, United States, Germany and United Kingdom
        years = 2016                        # Reference year only
    ) %>% rename(ppp = value) %>% select(-year, -percentile)

    # Convert from local currency to PPP US dollar
    data <- merge(data, ppp, by = "country") %>%
        mutate(value_ppp = value_lcu/ppp)

    ggplot(data) +
        geom_line(aes(x = year, y = value_ppp, color = country)) +
        scale_y_log10(breaks = c(2e3, 5e3, 1e4, 2e4, 5e4)) +
        ylab("2016 $ PPP") +
        scale_color_discrete(
            labels = c("US" = "USA", "FR" = "France", "DE" = "Germany", "GB" = "UK")
        ) +
        ggtitle("Average net national income per adult")

Another way to see an increase in inequality is to track how the different fractiles of the distribution have evolved since a reference year. In the following graph, you can see that the different percentiles of the US distribution of pre-tax national income had a similar evolution throughout the 1970s, and then started to diverge after 1980.

    data <- download_wid(
        indicators = "tptinc", # Thresholds of pre-tax national income
        areas = "US",          # United States
        perc = c("p10p100", "p50p100", "p90p100", "p99p100", "p99.9p100")
    )

    # Keep the value for 1970 in a separate data.frame
    data1970 <- data %>% filter(year == 1970) %>%
        rename(value1970 = value) %>%
        select(-year)

    # Divide series by the reference year (1970)
    data <- merge(data, data1970, by = c("country", "percentile")) %>%
        mutate(value = 100*value/value1970)

    ggplot(data) +
        geom_line(aes(x = year, y = value, color = percentile)) +
        ylab("1970 = 100") +
        scale_color_discrete(
            labels = c("p10p100" = "P10", "p50p100" = "P50", "p90p100" = "P90",
                       "p99p100" = "P99", "p99.9p100" = "P99.9")
        ) +
        ggtitle("Divergence of pre-tax national income in the United States")

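The rebasing step used above (dividing each series by its value in the reference year) can be checked on a small example. The sketch below wraps it in a hypothetical base R helper, `index_to_base`, and applies it to made-up threshold values; neither the helper nor the numbers come from the `wid` package or from WID.world.

```r
# Hypothetical helper: rescale each percentile's series so that its
# value in the base year equals 100.
index_to_base <- function(df, base_year) {
  base <- df[df$year == base_year, c("country", "percentile", "value")]
  names(base)[names(base) == "value"] <- "base_value"
  out <- merge(df, base, by = c("country", "percentile"))
  out$index <- 100 * out$value / out$base_value
  out[order(out$percentile, out$year), ]
}

# Made-up thresholds for two percentile groups and two years
toy <- data.frame(
  country = "US",
  percentile = rep(c("p50p100", "p99p100"), each = 2),
  year = rep(c(1970, 1980), times = 2),
  value = c(100, 110, 400, 600)
)

indexed <- index_to_base(toy, 1970)
print(indexed[, c("percentile", "year", "index")])
```

In this toy example the upper threshold grows by 50% and the median by 10%, which is the kind of divergence the graph is meant to reveal.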
In some countries, many data points are the result of interpolations or extrapolations. Estimates in most African countries, for instance, are based on surveys that are only conducted every few years, which we interpolate to produce yearly series and perform regional aggregations. Take the inequality series for Mozambique:

    data <- download_wid(
        indicators = "sptinc",                  # Shares of pre-tax national income
        areas = "MZ",                           # Mozambique
        perc = c("p0p50", "p90p100", "p99p100") # Bottom 50%, top 10% and top 1%
    )

    ggplot(data, aes(x = year, y = value, color = percentile)) +
        geom_line() +
        geom_point() +
        ylab("share of income") +
        scale_color_discrete(
            labels = c("p0p50" = "bottom 50%", "p90p100" = "top 10%", "p99p100" = "top 1%")
        ) +
        ggtitle("Pre-tax national income inequality in Mozambique")

The linear interpolation is quite visible. In some contexts, this might be undesirable. To exclude interpolated points, use `include_extrapolations = FALSE`:

    data <- download_wid(
        indicators = "sptinc",                   # Shares of pre-tax national income
        areas = "MZ",                            # Mozambique
        perc = c("p0p50", "p90p100", "p99p100"), # Bottom 50%, top 10% and top 1%
        include_extrapolations = FALSE           # Do not include interpolations
    )

    ggplot(data, aes(x = year, y = value, color = percentile)) +
        geom_line() +
        geom_point() +
        ylab("share of income") +
        scale_color_discrete(
            labels = c("p0p50" = "bottom 50%", "p90p100" = "top 10%", "p99p100" = "top 1%")
        ) +
        ggtitle("Pre-tax national income inequality in Mozambique")

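To see which points were dropped, the two downloads can be compared. The sketch below assumes that every observation missing from the `include_extrapolations = FALSE` download is an interpolated or extrapolated point; it uses made-up data in place of the two requests, and the `interpolated` column is introduced only for this illustration.

```r
# Made-up series standing in for the download with default settings
full <- data.frame(
  percentile = "p90p100",
  year = 2000:2004,
  value = c(0.40, 0.41, 0.42, 0.43, 0.44)
)
# Stand-in for the download with include_extrapolations = FALSE,
# as if only the 2000 and 2004 points were directly observed
observed <- full[full$year %in% c(2000, 2004), ]

# Build a key per observation, then flag rows absent from the filtered data
key <- function(df) paste(df$percentile, df$year)
full$interpolated <- !(key(full) %in% key(observed))
print(full)
```

The flag could then be mapped to a plot aesthetic (point shape, for instance) so that observed and interpolated points are distinguishable on the same graph.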