Home

/

GitHub

/

In sondreus/coverage: Coverage - Unit and Time Coverage of Analysis with Row-wise Deletion

An R package for seeing what you're missing

options(width = 150)
library(ggplot2)
library(stargazer)
library(coverage)

The coverage package and associated function provides you with a visual, data frame or latex table summary of your time and unit coverage.

This is important for any analysis conducted with row-wise deletion in the presence of missing data, especially if one suspect that patterns of missingness are non-random with respect to variables of interest.

The function supports N-dimensional data by allowing for and summarizing multiple observations per time-unit combination.

Installation instructions:

library(devtools)
install_github("sondreus/coverage")

techdata <- readRDS("3d_example.RDS")

coverage(timevar = "year", unitvar = "country_name",
          data = techdata,
          variable.names = c("upop", "xlrealgdp", "adoption_lvl"),
          output = "visual")

An example of analysis coverage for technology-country-year data

Arguments:

fit A fitted object.
timevar Time variable. Defaults to "year".
unitvar Unit variable. Defaults to "country".
data Data to be investigated. If none is supplied, attempts to use same data as "\code{fit}", first by considering its model matrix, and - if variables are missing (such as timevar and unitvar) - by looking for the source data in the global environment. (Optional)
variable.names Variables to be checked for coverage. If none is supplied, defaults to variables used in fit, or if fit not provided, all variables in data. (Optional)
output Desired extra output: "visual" (default), "data.frame", or "latex.table". First depends on ggplot2, last stargazer, both available on CRAN.
special.NA Variable that if missing will indicate "special" missingness. Can be used to distinguish observations with missing data from time-unit combinations which did not exist or were not considered. (Optional)
data.frequency Integer specifying the increments between observations. Defaults to 1.
... Additional arguments passed to ggplot2's theme function, or stargazer, depending on output selected. (Optional)

Example

Let's see how this package works through a simple application. We begin by getting some data from the World Bank Development Indicators, using the WDI package (by Vincent Arel-Bundock). Let's get data on GDP per capita, trade in services as a percentage of GDP, adult female literacy rates, agriculture as a percentage of GDP, and finally, number of telephone subscriptions per 1000 people.

library("WDI", quietly = TRUE)
wdi.sample <- WDI(indicator=c("GDPPC" = "NY.GDP.PCAP.KD",
                              "services_gdp" = "BG.GSR.NFSV.GD.ZS",
                              "agriculture_gdp" = "NV.AGR.TOTL.ZS",
                              "telephones" = "IT.TEL.TOTL.P3"),
                              start=1970, end=2012,
                              country="all")

lm.fit <- lm(GDPPC ~ ., data = wdi.sample)

Suppose we next are interested in how well "trade in services as a percentage of GDP" predicts "GDP per capita".

lm.fit <- lm(GDPPC ~ services_gdp + agriculture_gdp + telephones, data = wdi.sample)

So we have some data and a statistically significant relationship. But which country-years is this relationship based on? One option would be to inspect the data manually, which is viable only if the number of units (countries) and time points (years) are both small. And even in such a case, it is still very tidious. Let's instead apply the coverage function:

library("coverage")
 coverage(lm.fit)

Let us also request a data frame summary:

 coverage(fit = lm.fit, output = "data.frame")[1:10, ]

Or a latex table:

l.tab <- coverage(fit = lm.fit, output = "latex.table")

Supplying a fit is not required, and it may be easier to compare the coverage consequences of different model specifications by instead providing the variable names. This is supported in coverage() through the variable.names and data arguments.

Let's use this functionality to visually explore our data:

 coverage(data = wdi.sample,
          variable.names = c("GDPPC",
                             "agriculture_gdp", 
                             "telephones"),
          output = "visual")

 # vs:
 coverage(data = wdi.sample,
          variable.names = c("GDPPC",
          "telephones"),
          output = "visual")

3-Dimensional Data

Suppose next that we have data that may have multiple observations per time and unit combination. For instance, suppose that instead of looking at country-year data, we had country-year-technology data, where data might be missing for specific technologies within a country in a specific year or for covariates at the country-year level.

techdata <- readRDS("3d_example.RDS")

coverage(timevar = "year", unitvar = "country_name",
          data = techdata,
          variable.names = c("upop", "xlrealgdp", "adoption_lvl"))

Special missingness

Not all missingness is equal. Sometimes, data on a given time-unit combination is not available because the combination did not exist. For instance, research subjects in a medical trial may join a study at different times. We often want to distinguish this type of missingness ("subject had not yet joined the trail") from other types of missingness ("subject failed to measure blood-pressure during trail").

coverage() provides a way to do so in its visual output through the "special.NA" argument. Coverage interprets missingness of the variable specified in "special.NA" to indicate that the time-unit combination does not exist, indicating this in the visual output by cells being light-grey.

Looking at our technology data, we can see that many apparently missing data points in fact are "special missing", belonging to countries that did not exist in the year in question. Suppose that we know our "government" variable has no missing data for independent countries but is missing for all other country-years. Then, we can use this as our "special.NA" variable.

coverage(timevar = "year", unitvar = "country_name",
          data = techdata,
          variable.names = c("upop", "xlrealgdp", "adoption_lvl"), output = "visual", special.NA = "government")

Note: If your data has time and unit values corresponding to every and only relevant time and unit combination, you can simply specify one of these as your "special.NA" variable. E.g. special.NA = "year".

Citation:

Solstad, Sondre Ulvund (2018). Coverage: Visualize Panel Data Coverage. https://github.com/sondreus/coverage#coverage

sondreus/coverage documentation built on May 30, 2019, 6:27 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

sondreus/coverage
Coverage - Unit and Time Coverage of Analysis with Row-wise Deletion

In sondreus/coverage: Coverage - Unit and Time Coverage of Analysis with Row-wise Deletion

An R package for seeing what you're missing

Arguments:

Example

3-Dimensional Data

Special missingness

Citation:

R Package Documentation

Browse R Packages

We want your feedback!

sondreus/coverage Coverage - Unit and Time Coverage of Analysis with Row-wise Deletion

In sondreus/coverage: Coverage - Unit and Time Coverage of Analysis with Row-wise Deletion

An R package for seeing what you're missing

Arguments:

Example

3-Dimensional Data

Special missingness

Citation:

R Package Documentation

Browse R Packages

We want your feedback!

sondreus/coverage
Coverage - Unit and Time Coverage of Analysis with Row-wise Deletion