An R package for seeing what you're missing

options(width = 150)
library(ggplot2)
library(stargazer)
library(coverage)

The coverage package and associated function provides you with a visual, data frame or latex table summary of your time and unit coverage.

This is important for any analysis conducted with row-wise deletion in the presence of missing data, especially if one suspect that patterns of missingness are non-random with respect to variables of interest.

The function supports N-dimensional data by allowing for and summarizing multiple observations per time-unit combination.

Installation instructions:

library(devtools)
install_github("sondreus/coverage")
techdata <- readRDS("3d_example.RDS")

coverage(timevar = "year", unitvar = "country_name",
          data = techdata,
          variable.names = c("upop", "xlrealgdp", "adoption_lvl"),
          output = "visual")

Arguments:

Example

Let's see how this package works through a simple application. We begin by getting some data from the World Bank Development Indicators, using the WDI package (by Vincent Arel-Bundock). Let's get data on GDP per capita, trade in services as a percentage of GDP, adult female literacy rates, agriculture as a percentage of GDP, and finally, number of telephone subscriptions per 1000 people.

library("WDI", quietly = TRUE)
wdi.sample <- WDI(indicator=c("GDPPC" = "NY.GDP.PCAP.KD",
                              "services_gdp" = "BG.GSR.NFSV.GD.ZS",
                              "agriculture_gdp" = "NV.AGR.TOTL.ZS",
                              "telephones" = "IT.TEL.TOTL.P3"),
                              start=1970, end=2012,
                              country="all")

lm.fit <- lm(GDPPC ~ ., data = wdi.sample)

Suppose we next are interested in how well "trade in services as a percentage of GDP" predicts "GDP per capita".

lm.fit <- lm(GDPPC ~ services_gdp + agriculture_gdp + telephones, data = wdi.sample)

So we have some data and a statistically significant relationship. But which country-years is this relationship based on? One option would be to inspect the data manually, which is viable only if the number of units (countries) and time points (years) are both small. And even in such a case, it is still very tidious. Let's instead apply the coverage function:

library("coverage")
 coverage(lm.fit)

Let us also request a data frame summary:

 coverage(fit = lm.fit, output = "data.frame")[1:10, ]

Or a latex table:

l.tab <- coverage(fit = lm.fit, output = "latex.table")

Supplying a fit is not required, and it may be easier to compare the coverage consequences of different model specifications by instead providing the variable names. This is supported in coverage() through the variable.names and data arguments.

Let's use this functionality to visually explore our data:

 coverage(data = wdi.sample,
          variable.names = c("GDPPC",
                             "agriculture_gdp", 
                             "telephones"),
          output = "visual")

 # vs:
 coverage(data = wdi.sample,
          variable.names = c("GDPPC",
          "telephones"),
          output = "visual")

3-Dimensional Data

Suppose next that we have data that may have multiple observations per time and unit combination. For instance, suppose that instead of looking at country-year data, we had country-year-technology data, where data might be missing for specific technologies within a country in a specific year or for covariates at the country-year level.

techdata <- readRDS("3d_example.RDS")

coverage(timevar = "year", unitvar = "country_name",
          data = techdata,
          variable.names = c("upop", "xlrealgdp", "adoption_lvl"))

Special missingness

Not all missingness is equal. Sometimes, data on a given time-unit combination is not available because the combination did not exist. For instance, research subjects in a medical trial may join a study at different times. We often want to distinguish this type of missingness ("subject had not yet joined the trail") from other types of missingness ("subject failed to measure blood-pressure during trail").

coverage() provides a way to do so in its visual output through the "special.NA" argument. Coverage interprets missingness of the variable specified in "special.NA" to indicate that the time-unit combination does not exist, indicating this in the visual output by cells being light-grey.

Looking at our technology data, we can see that many apparently missing data points in fact are "special missing", belonging to countries that did not exist in the year in question. Suppose that we know our "government" variable has no missing data for independent countries but is missing for all other country-years. Then, we can use this as our "special.NA" variable.

coverage(timevar = "year", unitvar = "country_name",
          data = techdata,
          variable.names = c("upop", "xlrealgdp", "adoption_lvl"), output = "visual", special.NA = "government")

Note: If your data has time and unit values corresponding to every and only relevant time and unit combination, you can simply specify one of these as your "special.NA" variable. E.g. special.NA = "year".

Citation:

Solstad, Sondre Ulvund (2018). Coverage: Visualize Panel Data Coverage. https://github.com/sondreus/coverage#coverage



sondreus/coverage documentation built on May 30, 2019, 6:27 a.m.