# Using Amelia

```r
knitr::opts_chunk$set(fig.width = 5, fig.height = 4, fig.align = "center")
options(digits = 4, show.signif.stars = FALSE)
set.seed(12345)
```

## Data

We now demonstrate how to use Amelia using data from @MilKub05, which studies the effect of democracy on trade policy. For the purposes of this user's guide, we will use a subset restricted to nine developing countries in Asia from 1980 to 1999[^freetrade]. This dataset includes the following variables:

| Variable | Description                                         |
|:---------|:----------------------------------------------------|
| year     | year                                                |
| country  | country                                             |
| tariff   | average tariff rates                                |
| polity   | Polity IV score[^polity]                            |
| pop      | total population                                    |
| gdp.pc   | gross domestic product per capita                   |
| intresmi | gross international reserves                        |
| signed   | dummy variable if signed an IMF agreement that year |
| fivop    | measure of financial openness                       |
| usheg    | measure of US hegemony[^hegemony]                   |

These variables correspond to the variables used in the analysis model of @MilKub05 in Table 2.

[^freetrade]: We have artificially added some missingness to these data for presentational purposes. You can access the original data at https://scholar.princeton.edu/hvmilner/data.

[^polity]: The Polity score is a number between -10 and 10 indicating how democratic a country is. A fully autocratic country would be a -10 while a fully democratic country would be a 10.

[^hegemony]: This measure of US hegemony is US imports and exports as a percent of total world imports and exports.

We first load the Amelia package and the data:

```r
library(Amelia)
data(freetrade)
```

We can check the summary statistics of the data to see that there is missingness on many of the variables:

```r
summary(freetrade)
```

In the presence of missing data, most statistical packages use listwise deletion, which removes any row that contains a missing value from the analysis.
Using the base model of @MilKub05 Table 2, we run a simple linear model in R, which uses listwise deletion:

```r
summary(lm(tariff ~ polity + pop + gdp.pc + year + country, data = freetrade))
```

Note that 60 of the 171 original observations are deleted due to missingness. These observations, however, are partially observed, and contain valuable information about the relationships between those variables which are present in the partially completed observations. Multiple imputation will help us retrieve that information and make better, more efficient inferences.

## Multiple Imputation

When performing multiple imputation, the first step is to identify the variables to include in the imputation model. It is crucial to include at least as much information as will be used in the analysis model. That is, any variable that will be in the analysis model should also be in the imputation model. This includes any transformations or interactions of variables that will appear in the analysis model.

In fact, it is often useful to add more information to the imputation model than will be present when the analysis is run. Since imputation is predictive, any variables that would increase predictive power should be included in the model, even if including them in the analysis model would produce bias in estimating a causal effect (such as for post-treatment variables) or collinearity would preclude determining which variable had a relationship with the dependent variable (such as including multiple alternate measures of GDP).

In our case, we include all the variables in freetrade in the imputation model, even though our analysis model focuses on polity, pop and gdp.pc. We are not incorporating time or spatial data yet, but we do so below. To create multiple imputations in Amelia, we can simply run:

```r
a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country")
a.out
```

Note that our example dataset is deliberately small, both in variables and in cross-sectional units.
Typical datasets may often require hundreds, or possibly a couple thousand, steps of the EM algorithm. Long chains should remind the analyst to consider whether transformations of the variables would more closely fit the multivariate normal assumptions of the model (correct but omitted transformations will shorten the number of steps and improve the fit of the imputations), but they do not necessarily denote problems with the imputation model.

The output gives some information about how the algorithm ran. Each of the imputed datasets is now in the list `a.out$imputations`. Thus, we could plot a histogram of the tariff variable from the 3rd imputation:

```r
hist(a.out$imputations[[3]]$tariff, col = "grey", border = "white")
```


## Saving imputed datasets {#sec_saving}

If you need to save your imputed datasets, one direct method is to save the output list from amelia():

```r
save(a.out, file = "imputations.RData")
```
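Alternatively, each imputed dataset can be written to its own file with the write.amelia() function (used again in the analysis section of this guide); the file stem outdata here is just an illustration:

```r
library(Amelia)
data(freetrade)

# a.out as created earlier in this guide
a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country", p2s = 0)

# Writes one file per imputation: outdata1.csv, ..., outdata5.csv
write.amelia(a.out, separate = TRUE, file.stem = "outdata", format = "csv")
```

Each file contains one completed dataset, suitable for loading into other statistical packages.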


## Transformations {#sec:trans}

### Natural log

Heavily skewed variables are often well served by a natural log transformation, one of the transformations Amelia can apply. For instance, the tariff variable is skewed, and taking logs makes its distribution more symmetric:

```r
hist(log(freetrade$tariff), col = "grey", border = "white")
```

### Square root {#sec:sqrt}

Event count data is often heavily skewed and has nonlinear relationships with other variables. One common transformation to tailor the linear model to count data is to take the square roots of the counts. This is a transformation that can be set as an option in Amelia.

### Logistic {#sec:lgstc}

Proportional data is sharply bounded between 0 and 1. A logistic transformation is one possible option in Amelia to make the distribution symmetric and relatively unbounded.

## Identification Variables {#sec:idvars}

Datasets often contain identification variables, such as country names, respondent numbers, or other identification numbers, codes or abbreviations. Sometimes these are text and sometimes these are numeric. Often it is not appropriate to include these variables in the imputation model, but it is useful to have them remain in the imputed datasets. (However, there are models that would include the ID variables in the imputation model, such as fixed effects models for data with repeated observations of the same countries.)

Identification variables which are not to be included in the imputation model can be identified with the argument idvars. These variables will not be used in the imputation model, but will be kept in the imputed datasets. If year and country contained no information except labels, we could omit them from the imputation:

```r
amelia(freetrade, idvars = c("year", "country"))
```

Note that Amelia will return with an error if your dataset contains a factor or character variable that is not marked as a nominal or identification variable. Thus, if we were to omit the factor country from the cs or idvars arguments, we would receive an error:

```r
a.out2 <- amelia(freetrade, idvars = c("year"))
```

In order to conserve memory, it is wise to remove unnecessary variables from a data set before loading it into Amelia.
The only variables you should include in your data when running Amelia are variables you will use in the analysis stage and those variables that will help in the imputation model. While it may be tempting to simply mark unneeded variables as IDs, doing so only serves to waste memory and slow down the imputation procedure.

## Time Series, or Time Series Cross Sectional Data {#sec:tscs}

Many variables that are recorded over time within a cross-sectional unit are observed to vary smoothly over time. In such cases, knowing the observed values of observations close in time to any missing value may enormously aid the imputation of that value. However, the exact pattern may vary over time within any cross-section. There may be periods of growth, stability, or decline, in each of which the observed values would be used in a different fashion to impute missing values. Also, these patterns may vary enormously across different cross-sections, or may exist in some and not others.

Amelia can build a general model of patterns within variables across time by creating a sequence of polynomials of the time index. If, for example, tariffs vary smoothly over time, then we make the modeling assumption that there exists some polynomial that describes the tariff rate in cross-sectional unit $i$ at time $t$ as:

$$\textrm{tariff}_{ti} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \ldots$$

Thus, if we include enough higher-order terms of time, the pattern between observed values of the tariff rate can be estimated. Amelia will create polynomials of time up to the user-defined $k$-th order ($k \leq 3$). We can implement this with the ts and polytime arguments. If we thought that a second-order polynomial would help predict, we could run:

```r
a.out2 <- amelia(freetrade, ts = "year", cs = "country", polytime = 2)
```

With this input, Amelia will add covariates to the model that correspond to time and its polynomials. These covariates will help better predict the missing values.
If cross-sectional units are specified, these polynomials can be interacted with the cross-section unit to allow the patterns over time to vary between cross-sectional units. Unless you strongly believe all units have the same patterns over time in all variables (including the same constant term), this is a reasonable setting. When $k$ is set to 0, this interaction simply results in a model of fixed effects where every unit has a uniquely estimated constant term. Amelia does not smooth the observed data, and only uses this functional form, or one you choose, with all the other variables in the analysis and the uncertainty of the prediction, to impute the missing values.

In order to impute with trends specific to each cross-sectional unit, we can set intercs to TRUE:

```r
a.out.time <- amelia(freetrade, ts = "year", cs = "country", polytime = 1,
                     intercs = TRUE, p2s = 2)
```

Note that attempting to use polytime without the ts argument, or intercs without the cs argument, will result in an error. Using the tscsPlot() function (discussed below), we can see that we have a much better prediction about the missing values when incorporating time than when we omit it:

```r
tscsPlot(a.out, cs = "Malaysia", main = "Malaysia (no time settings)",
         var = "tariff", ylim = c(-10, 60))
tscsPlot(a.out.time, cs = "Malaysia", main = "Malaysia (with time settings)",
         var = "tariff", ylim = c(-10, 60))
```

### Lags and leads {#sec:lags}

An alternative way of handling time-series information is to include lags and leads of certain variables in the imputation model. Lags are variables that take the value of another variable in the previous time period, while leads take the value of another variable in the next time period. Many analysis models use lagged variables to deal with issues of endogeneity, so using leads may seem strange. It is important to remember, however, that imputation models are predictive, not causal.
Thus, since both past and future values of a variable are likely correlated with the present value, both lags and leads should improve the model. If we wanted to include lags and leads of tariffs, for instance, we would simply pass this to the lags and leads arguments:

```r
a.out2 <- amelia(freetrade, ts = "year", cs = "country", lags = "tariff",
                 leads = "tariff")
```

## Including Prior Information

Amelia has a number of methods for setting priors within the imputation model. Two of these are commonly used and discussed below: ridge priors and observational priors.

### Ridge priors for high missingness, small samples, or large correlations {#sec_prior}

When the data to be analyzed contain a high degree of missingness or very strong correlations among the variables, or when the number of observations is only slightly greater than the number of parameters $p(p+3)/2$ (where $p$ is the number of variables), results from your analysis model will be more dependent on the choice of imputation model. This suggests more testing of alternative specifications under Amelia in these cases. This can happen when polynomials of time interacted with the cross section are included in the imputation model. For example, in our data, if we used a polynomial of degree 2 with unit-specific trends and there are 9 countries, it would add $3 \times 9 - 1 = 26$ more variables to the imputation model (dropping one of the fixed effects for identification). When these are added, the EM algorithm can become unstable. You can detect this by inspecting the screen output under p2s = 2 or by observing that the number of iterations per imputation is very divergent.

In these circumstances, we recommend adding a ridge prior, which will help with numerical stability by shrinking the covariances among the variables toward zero without changing the means or variances. This can be done by including the empri argument.
Including this prior as a positive number is roughly equivalent to adding empri artificial observations to the data set with the same means and variances as the existing data but with zero covariances. Thus, increasing the empri setting results in more shrinkage of the covariances, putting more a priori structure on the estimation problem: like many Bayesian methods, it reduces variance in return for an increase in bias that one hopes does not overwhelm the advantages in efficiency.

In general, we suggest keeping the value of this prior relatively small and increasing it only when necessary. A value of 0.5 to 1 percent of the number of observations, $n$, is a reasonable starting point, and is often useful in large datasets to add some numerical stability. For example, in a dataset of two thousand observations, this would translate to a prior value of 10 or 20, respectively. A prior of up to 5 percent is moderate in most applications, and 10 percent is a reasonable upper bound. For our data, it is easy to code up a 1 percent ridge prior:

```r
a.out.time2 <- amelia(freetrade, ts = "year", cs = "country", polytime = 1,
                      intercs = TRUE, p2s = 0, empri = .01 * nrow(freetrade))
a.out.time2
```

### Observation-level priors {#sec:obspri}

Researchers often have additional prior information about missing data values based on previous research, academic consensus, or personal experience. Amelia can incorporate this information to produce vastly improved imputations. The Amelia algorithm allows users to include informative Bayesian priors about individual missing data cells instead of the more general model parameters, many of which have little direct meaning.
The incorporation of priors follows basic Bayesian analysis, where the imputation turns out to be a weighted average of the model-based imputation and the prior mean, with weights that are functions of the relative strength of the data and prior: when the model predicts very well, the imputation will down-weight the prior, and vice versa [@HonKin10].

The priors about individual observations should describe the analyst's belief about the distribution of the missing data cell. This can take the form of either a mean and a standard deviation or a confidence interval. For instance, we might know that 1986 tariff rates in Thailand were around 40%, but we have some uncertainty as to the exact value. Our prior belief about the distribution of the missing data cell, then, centers on 40, with a standard deviation that reflects the amount of uncertainty we have about our prior belief.

To input priors you must build a priors matrix with either four or five columns. Each row of the matrix represents a prior on either one observation or one variable. In any row, the entry in the first column is the row of the observation and the entry in the second column is the column of the observation. In the four-column priors matrix, the third and fourth columns are the mean and standard deviation of the prior distribution of the missing value.

For instance, suppose that we had some expert prior information about tariff rates in Thailand. We know from the data that Thailand is missing tariff rates in many years:

```r
freetrade[freetrade$country == "Thailand", c("year", "country", "tariff")]
```


Suppose that we had expert information that tariff rates were roughly 40% in Thailand between 1986 and 1988 with about a 6% margin of error. This corresponds to a standard deviation of about 3. In order to include this information, we must form the priors matrix:

```r
pr <- matrix(
  c(158, 159, 160, 3, 3, 3, 40, 40, 40, 3, 3, 3),
  nrow = 3, ncol = 4
)
pr
```


The first column of this matrix corresponds to the row numbers of Thailand in these three years, the second column refers to the column number of tariff in the data, and the last two columns refer to the actual prior. Once we have this matrix, we can pass it to amelia():

```r
a.out.pr <- amelia(freetrade, ts = "year", cs = "country", priors = pr)
```


In the five-column matrix, the last three columns describe a confidence range of the data. The columns are a lower bound, an upper bound, and a confidence level between 0 and 1, exclusive. Whichever format you choose, it must be consistent across the entire matrix. We could get roughly the same prior as above by utilizing this method. Our margin of error implies that we would want imputations between 34 and 46, so our matrix would be:

```r
pr.2 <- matrix(
  c(158, 159, 160, 3, 3, 3, 34, 34, 34, 46, 46, 46, 0.95, 0.95, 0.95),
  nrow = 3, ncol = 5
)
pr.2
```


These priors indicate that we are 95% confident that these missing values are in the range 34 to 46.
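The two prior formats are linked by the normal distribution: a 95% confidence range of (34, 46) implies a prior mean at the midpoint and a standard deviation of roughly the half-width divided by 1.96. A quick base-R check of the Thailand numbers above:

```r
lower <- 34
upper <- 46
conf  <- 0.95

mu    <- (lower + upper) / 2        # prior mean: 40
z     <- qnorm(1 - (1 - conf) / 2)  # ~1.96 for a 95% level
sigma <- (upper - lower) / (2 * z)  # prior sd: ~3.06

c(mean = mu, sd = sigma)
```

This recovers approximately the mean of 40 and standard deviation of about 3 used in the four-column matrix.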

If a prior has the value 0 in the first column, this prior will be applied to all missing values in this variable, except for explicitly set priors. Thus, we could set a prior for the entire tariff variable of 20, but still keep the above specific priors with the following code:

```r
pr.3 <- matrix(
  c(158, 159, 160, 0, 3, 3, 3, 3, 40, 40, 40, 20, 3, 3, 3, 5),
  nrow = 4, ncol = 4
)
pr.3
```


## Logical bounds

In some cases, variables in the social sciences have known logical bounds. Proportions must be between 0 and 1 and duration data must be greater than 0, for instance. Many of these logical bounds can be handled by using the correct transformation for that type of variable (see \@ref(sec:trans) for more details on the transformations handled by Amelia). In the occasional case that imputations must satisfy certain logical bounds not handled by these transformations, Amelia can take draws from a truncated normal distribution in order to achieve imputations that satisfy the bounds.

Note, however, that this procedure imposes extremely strong restrictions on the imputations and can lead to lower variances than the imputation model implies. The mean value across all the imputed values of a missing cell is the best guess from the imputation model of that missing value. The variance of the distribution across imputed datasets correctly reflects the uncertainty in that imputation. It is often the mean imputed value that should conform to any known bounds, even if individual imputations are drawn beyond those bounds. The mean imputed value can be checked with the diagnostics presented in the next section. In general, building a more predictive imputation model will lead to better imputations than imposing bounds.

Amelia implements these bounds by rejection sampling. When drawing the imputations from their posterior, we repeatedly resample until we have a draw that satisfies all of the logical constraints. You can set an upper limit on the number of times to resample with the max.resample argument. If, after max.resample draws, the imputations are still outside the bounds, Amelia will set the imputation to the edge of the bounds. Thus, if the bounds were 0 and 100 and all of the draws were negative, Amelia would simply impute 0.
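The resample-then-clamp logic can be sketched in a few lines of base R. This is an illustrative re-implementation (the function name bounded_draw is ours, and this is not Amelia's internal code):

```r
# Sketch of bounded imputation by rejection sampling with clamping.
# draw_fn generates one candidate imputation draw.
bounded_draw <- function(draw_fn, lower, upper, max.resample = 100) {
  for (i in seq_len(max.resample)) {
    x <- draw_fn()
    if (x >= lower && x <= upper) return(x)  # accept an in-bounds draw
  }
  # All max.resample draws were out of bounds: clamp to the nearest edge
  if (x < lower) lower else upper
}

set.seed(1)
# Normal draws centered near the bounds: usually accepted quickly
res <- bounded_draw(function() rnorm(1, mean = 35, sd = 5),
                    lower = 30, upper = 40)
res

# Draws that can never satisfy the bounds get clamped to the edge
bounded_draw(function() -1, lower = 0, upper = 100)
```

The second call mirrors the example in the text: every draw is negative, so after max.resample attempts the imputation is set to the lower bound, 0.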

As an extreme example, suppose that we know for certain that tariff rates had to fall between 30 and 40. This, obviously, is not true, but we can generate imputations from this model. In order to specify these bounds, we need to generate a matrix of bounds to pass to the bounds argument. This matrix will have 3 columns: the first is the column number of the bounded variable, the second is the lower bound, and the third is the upper bound. Thus, to implement our bound on tariff rates (the 3rd column of the dataset), we would create the matrix:

```r
bds <- matrix(c(3, 30, 40), nrow = 1, ncol = 3)
bds
```


which we can pass to the bounds argument to amelia():

```r
a.out.bds <- amelia(freetrade, ts = "year", cs = "country", bounds = bds,
                    max.resample = 1000)
```


The difference in results between the bounded and unbounded models is not obvious from the output, but inspection of the imputed tariff rates for Malaysia shows that there has been a drastic restriction of the imputations to the desired range:

```r
tscsPlot(a.out, cs = "Malaysia", main = "No logical bounds", var = "tariff",
         ylim = c(-10, 60))

tscsPlot(a.out.bds, cs = "Malaysia", main = "Bounded between 30 and 40",
         var = "tariff", ylim = c(-10, 60))
```


Again, analysts should be extremely cautious when using these bounds as they can seriously affect the inferences from the imputation model, as shown in this example. Even when logical bounds exist, we recommend simply imputing variables normally, as the violation of the logical bounds represents part of the true uncertainty of imputation.

## Post-imputation Transformations {#sec_postimptrans}

In many cases, it is useful to create transformations of the imputed variables for use in further analysis. For instance, one may want to create an interaction between two variables or perform a log-transformation on the imputed data. To do this, Amelia includes a transform() function for amelia() output that adds or overwrites variables in each of the imputed datasets. For instance, if we wanted to create a log-transformation of the gdp.pc variable, we could use the following command:

```r
a.out <- transform(a.out, lgdp = log(gdp.pc))
head(a.out$imputations[[1]][, c("country", "year", "gdp.pc", "lgdp")])
```

To create an interaction between two variables, we could simply use:

```r
a.out <- transform(a.out, pol_gdp = polity * gdp.pc)
```

Each transformation is recorded, and the summary() command prints out each transformation that has been performed:

```r
summary(a.out)
```

Note that the updated output is almost exactly the same as the fresh amelia() output. You can pass the transformed output back to amelia() and it will add imputations and update these imputations with the transformations you have performed.

## Analysis Models {#sec_analysis}

Imputation is most often a data processing step, as opposed to a final model in and of itself. To this end, it is easy to pass output from amelia() to other functions. The easiest and most integrated way to run an analysis model is to pass the output to functions from the Zelig package. For example, in @MilKub05, the dependent variable was tariff rates. We can replicate Table 5.1 from their analysis with the original data simply by running:

```r
z5 <- Zelig::zls$new()
z5$zelig(tariff ~ polity + pop + gdp.pc + year + country, data = freetrade)
z5
```

Running the same model with imputed data is almost identical. Simply replace the original data set with the amelia() output:

```r
z5_imp <- Zelig::zls$new()
z5_imp$zelig(tariff ~ polity + pop + gdp.pc + year + country, data = a.out)
z5_imp
```

Zelig is one way to run analysis models on imputed data, but certainly not the only way. The imputations list in the amelia() output contains each of the imputed datasets. Thus, users could simply program a loop over the number of imputations, run the analysis model on each imputed dataset, and combine the results using the rules described in @KinHonJos01 and @Schafer97. Furthermore, users can easily export their imputations using the write.amelia() function as described in \@ref(sec_saving) and use statistical packages other than R for the analysis model.

Amelia also has the ability to combine quantities of interest from arbitrary models using the mi.meld() function. This command takes a matrix with columns for the quantity and its standard error in each of the imputed datasets. It then uses the standard rules for combining multiple imputations to create an overall estimated quantity.

```r
b.out <- NULL
se.out <- NULL
for (i in seq_len(a.out$m)) {
  ols.out <- lm(tariff ~ polity + pop + gdp.pc, data = a.out$imputations[[i]])
  b.out <- rbind(b.out, coef(ols.out))
  se.out <- rbind(se.out, coef(summary(ols.out))[, 2])
}
```

```r
combined.results <- mi.meld(q = b.out, se = se.out)
combined.results
```
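Under the combining rules of @KinHonJos01, the overall estimate is the mean of the estimates across imputations, and the overall variance is the average within-imputation variance plus the between-imputation variance inflated by $(1 + 1/m)$. A self-contained sketch on toy inputs (the function name meld_sketch is ours, illustrating the rule rather than reproducing mi.meld() internals):

```r
# Rubin-style combining of m imputation-specific estimates.
# q: matrix of estimates (one row per imputation), se: matching standard errors.
meld_sketch <- function(q, se) {
  m       <- nrow(q)
  q.bar   <- colMeans(q)           # combined point estimate
  within  <- colMeans(se^2)        # average within-imputation variance
  between <- apply(q, 2, var)      # between-imputation variance
  se.comb <- sqrt(within + (1 + 1 / m) * between)
  list(q.mi = q.bar, se.mi = se.comb)
}

# Toy example: one coefficient estimated in m = 3 imputed datasets
q  <- matrix(c(1.0, 1.2, 0.8), ncol = 1)
se <- matrix(c(0.5, 0.5, 0.5), ncol = 1)
comb <- meld_sketch(q, se)
comb
```

Note that the combined standard error (about 0.55 here) exceeds each per-imputation standard error of 0.5: disagreement across imputations adds to the reported uncertainty.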


In addition to the resources available in R, users can draw on Stata to implement their analysis models. As of version 11, Stata has built-in handling of multiply imputed datasets. In order to utilize this functionality, simply export the "stacked" imputations using the write.amelia() function:

```r
write.amelia(a.out, separate = FALSE, file.stem = "outdata", format = "dta")
```


Once this stacked dataset is open in Stata, you must tell Stata that it is an imputed dataset using the `mi import flong` command:

```stata
mi import flong, m(imp) id(year country) imp(tariff-usheg)
```

The command takes a few options: m designates the imputation variable (set with impvar in write.amelia()), id sets the identifying variables, and imp sets the variables that were imputed (or included in the imputation). The tariff-usheg syntax indicates that Stata should treat the range of variables between tariff and usheg as imputed. Once we have set the dataset as imputed, we can use the built-in mi commands to analyze the data:

```stata
mi estimate: reg tariff polity pop gdp_pc
```

```
Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =        171
                                                  Average RVI     =     1.4114
                                                  Complete DF     =        167
DF adjustment:   Small sample                     DF:     min     =      10.36
                                                          avg     =      18.81
                                                          max     =      37.62
Model F test:       Equal FMI                     F(   2,   10.4) =      15.50
Within VCE type:          OLS                     Prob > F        =     0.0008

------------------------------------------------------------------------------
      tariff |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      polity |  -.2058115   .3911049    -0.53   0.610    -1.072968    .6613452
         pop |   3.21e-08   8.72e-09     3.68   0.004     1.27e-08    5.14e-08
      gdp_pc |  -.0027561    .000644    -4.28   0.000    -.0040602   -.0014519
       _cons |   32.70461   2.660091    12.29   0.000     27.08917    38.32005
------------------------------------------------------------------------------
```


## The amelia class {#sec_out}

The output from the amelia() function is an instance of the S3 class amelia. Instances of the amelia class contain much more than simply the imputed datasets. The mu object of the class contains the posterior draws of the means of the complete data. The covMatrices contains the posterior draws of the covariance matrices of the complete data. Note that these correspond to the variables as they are sent to the EM algorithm. Namely, they refer to the variables after being transformed, centered and scaled.

The iterHist object is a list of m 3-column matrices. Each row of the matrices corresponds to an iteration of the EM algorithm. The first column indicates how many parameters had yet to converge at that iteration. The second column indicates if the EM algorithm made a step that decreased the number of converged parameters. The third column indicates whether the covariance matrix at this iteration was singular. Clearly, the last two columns are meant to indicate when the EM algorithm enters a problematic part of the parameter space.
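To make this concrete, here is a short sketch of inspecting these components, assuming an amelia() run like those earlier in this guide:

```r
library(Amelia)
data(freetrade)

# An amelia run as in the earlier examples (p2s = 0 suppresses screen output)
a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country", p2s = 0)

# Number of EM iterations for each of the m chains
sapply(a.out$iterHist, nrow)

# First few rows of the first chain's iteration history
# (columns: parameters not yet converged, step-back flag, singularity flag)
head(a.out$iterHist[[1]])

# Posterior draw of the mean vector underlying the first imputation
a.out$mu[, 1]
```

Divergent iteration counts across the chains are the symptom flagged in the ridge-prior discussion above, so this kind of inspection is a quick stability check.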
