Introduction to Idealstan

knitr::opts_chunk$set(echo = TRUE,warning=FALSE,fig.align = 'center',fig.width=6, fig.height=5)
require(idealstan)
require(dplyr)
require(ggplot2)

Note: To report bugs with the package, please file an issue on the Github page.

If you use this package, please cite the following:

Kubinec, Robert. "Generalized Ideal Point Models for Time-Varying and Missing-Data Inference". Working Paper.

This package implements IRT (item response theory) ideal point models, which are models designed for situations in which actors make strategic choices that correlate with a unidimensional scale, such as the left-right axis in American politics, disagreement over product design in consumer choice, or psychometric scales in which the direction that items load on the latent scale is unknown. Compared to traditional IRT, ideal point models examine the polarizing influence of a set of items on a set of persons, and has similarities to models based on Euclidean latent spaces, such as multi-dimensional scaling. In fact, this package also implements a version of the latent space model for binary outcomes, which is an alternate formulation of an ideal point model.

The goal of idealstan is to offer a wide variety of ideal point models that can model missing-data, time-varying ideal points, and incorporate a range of outcomes, including binary outcomes, counts, continuous and ordinal responses. In addition, idealstan uses the Stan estimation engine to offer full and variational Bayesian inference for all models so that every model is estimated with uncertainty. Variational inference provides a means to estimate Bayesian models on very large data sets when full Bayesian estimation is impractical. The package also exploits variational inference to automatically identify models instead of requiring users to pre-specify which persons or items in the data to constrain in advance.

The approach to handling missing data in this package is to model directly strategic censoring in observations. This particular version was developed to account for legislatures in which legislators (persons) are strategically absent for votes on bills (items), but it applies to any social choice situation in which actors' failure to respond to an item may be a function of their ideal point. This approach to missing data can be usefully applied to many contexts in which a missing outcome is a function of the person's ideal point (i.e., people will tend to be present in the data when the item is far away or very close to their ideal point). If missingness does not appear to arise as a function of ideal points, the models will still incorporate missing data but will assume it is essentially random.

The package includes the following models:

  1. IRT 2-PL (binary response) ideal point model, no missing-data inflation
  2. IRT 2-PL ideal point model (binary response) with missing- inflation
  3. Ordinal IRT (rating scale) ideal point model no missing-data inflation
  4. Ordinal IRT (rating scale) ideal point model with missing-data inflation
  5. Ordinal IRT (graded response) ideal point model no missing-data inflation
  6. Ordinal IRT (graded response) ideal point model with missing-data inflation
  7. Poisson IRT (Wordfish) ideal point model with no missing data inflation
  8. Poisson IRT (Wordfish) ideal point model with missing-data inflation
  9. Continuous (Gaussian) IRT ideal point model with no missing data
  10. Continuous (Gaussian) IRT ideal point model with missing-data inflation
  11. Positive-Continuous (Log-normal) IRT ideal point model with no missing data
  12. Positive-Continuous (Log-normal) IRT ideal point model with missing-data inflation
  13. Latent Space (binary response) ideal point model with no missing data
  14. Latent Space (binary response) ideal point model with missing-data inflation

In addition, all of these models can be estimated with either time-varying or static ideal points if a column of dates for each item is passed to the model function. This package implements the standard time-varying ideal point model by Martin and Quinn (2002) in which ideal points follow a random walk, i.e., in each time period the ideal points can jump in a random direction. In addition, I have implemented a stationary AR(1) ideal point process for situations in which the random walk model does not appropriately reflect over-time change in ideal points. For more information, please see the vignette about time-varying models.

The package also has extensive plotting functions via ggplot2 for model parameters, particularly the legislator (person) ideal points (ability parameters).

This vignette demonstrates basic usage of the package. I first simulate data using the package, document basic usage and then show an empirical example from the U.S. Senate.

Simulation of Ordinal IRT with Missing Data

To begin with, we can simulate data from an ordinal ideal-point model in which there are three possible responses corresponding to a legislator voting: yes, abstain and no. An additional category is also simulated that indicates whether a legislator shows up to vote or is absent, which traditional IRT models would record as missing data and would drop from the estimation. This package can instead utilize missing data via a hurdle model in which the censoring of the vote/score data is estimated as a function of individual item/bill intercepts and discrimination parameters for the decision to be absent or present. In other words, if the missing data is a reflection of the person's ideal point, such as more conservative legislators refusing to show up to vote, than the model will make use of this missing data to infer additional information about the legislators' ideal points.

The function id_sim_gen() allows you to simulate data from any of the fourteen models currently implemented in idealstan (see previous list). For example, here we sample data from an ordinal graded response model:

ord_ideal_sim <- id_sim_gen(model='ordinal_grm')
knitr::kable(as_data_frame(head(ord_ideal_sim@score_matrix)))

The vote/score matrix has legislators/persons in the rows and bills/items in the columns. In this simulated data, yes votes are recorded as 3, no votes as 1, abstentions as 2, and absences (missing data) as 4.

The function id_estimate will take this processed data and run an IRT ideal point model. To specify the model type, simply include the number of the model in the id_estimate function. The package also includes the ability to incorporate hierarchical (person or item-level) covariates, as discussed below.

The package has options for identification that are similar to other IRT packages in which the IDs of legislators/persons to constrain are specified to the id_estimate function. For example, we can use the true values of the simulated legislators to constrain one legislator with the highest simulated ideal point and one legislator with the lowest ideal point. Each constrained parameter must be fixed to a specific value, preferably at either end of the ideal point spectrum, to identify the model. In particular, two pieces of information are necessary: a value for the high ideal point, and the difference between the high and low points. In this example I pre-specify which parameters to constrain and where to pin them based on the simulated data, but the package can also automatically determine these values, which I show in the next section.

true_legis <- ord_ideal_sim@simul_data$true_person
high_leg <- sort(true_legis,decreasing = TRUE,index.return=TRUE)
low_leg <- sort(true_legis,index.return=TRUE)

ord_ideal_est <- id_estimate(idealdata=ord_ideal_sim,
                             model_type=6,
                             fixtype='constrained',
                             id_diff=high_leg$x[1]-low_leg$x[1],
                             id_diff_high=high_leg$x[1],
                             restrict_ind_high = as.character(high_leg$ix[1]),
                             restrict_ind_low=as.character(low_leg$ix[1]),
                             refresh=500,
                             ncores=2,
                             person_sd=1,
                           nchains=2)

We can then check and see how well the Stan estimation engine was able to capture the "true" values used in the simulation by plotting the true ideal points relative to the estimated ones:

id_plot_legis(ord_ideal_est,show_true = TRUE)

Given the small amount of data used to estimate the model, the imprecision with which the ideal points were recovered is not surprising.

To automatically identify the model, simply change the fixtype option to 'vb_full'. By default, the model will select the highest and lowest ideal points to constrain by running an approximation to the full posterior using Stan's vb function. While this method works well, the exact rotation is not known a priori. To specify a rotation of the ideal points while still letting the model decide the optimal placement of the constrained ideal points, set fixtype to 'vb_partial' and include person IDs as a character value for the high constrained ideal point restrict_ind_high and low restrict_ind_low.

For example, using our simulated data and the default settings ('vb_full'):

ord_ideal_est <- id_estimate(idealdata=ord_ideal_sim,
                             model_type=6,
                             refresh=500,
                             ncores=2,
                           nchains=2,
                           person_sd=1)

We can see from the plot of the Rhats, which is an MCMC convergence diagnostic, that all the Rhats are below 1.1, which is a good (though not perfect) sign that the model is fully identified:

id_plot_rhats(ord_ideal_est)

In general, it is always a good idea to check the Rhats before proceeding with further analysis. If the model is not identified even with using vb_full as fixtype, it is possible to manually change id_diff to a larger value to move the ideal points farther apart. However, it is more likely that there is some issue with the data, such as items that do not provide enough variation in the response to identify individual ideal points. Identification of time-varying ideal point models can be more complicated and is discussed in the accompanying vignette.

Empirical Example: U.S. Senate

This package was developed for datasets that are in a long format, i.e., one row per person/legislator, item/bill, and any other covariate. This data format departs from the traditional way of storing/using IRT/ideal point data, which is more commonly stored in a wide format with legislators/persons in the rows and items/bills in the columns. I have provided a convenience function that can make the necessary transformation for rollcall objects from R package pscl, but otherwise data should be provided in long form.

To demonstrate how the package functions empirically, I include in the package a partial voting record of the 114th Senate (excluding rollcall votes without significant disagreement) from the website www.voteview.com. We can convert this data, which is provided in a long data frame, to an idealdata object suitable for estimation by using the id_make function. Because the intention is to fit a binary outcome model (yes or no votes on bills), I specify ordinal=FALSE to exclude any other outcomes (such as abstentions).

We can first take a look at the raw data:

data('senate114')

knitr::kable(select(head(senate114),1:8))

table(senate114$cast_code)

First, you can see that the data are in long form because there is one row for every vote cast by a legislator, in this case Senator Sessions. The outcome variable is cast_code. This variable is coded as a factor variable with the levels in the order we would want for a binary outcome, i.e, 'No' is before 'Yes'. Each item in the model is a recorded vote in the Senate, the numbers of which are listed in column rollnumber. We also know the dates of the individual items (rollcalls), which we will use in the time-varying ideal point vignette to model time-varying ideal points.

The table shows that there are roughly twice as many yes votes versus no votes, with a small minority of absences. In this case, absences can be counted as missing data.

To create an idealdata object for modeling from our long data frame, we pass it to the id_make function and specify the names of the columns that correspond to person IDs, item IDs, group IDs (i.e., party), time IDs and the outcome, although only person, item and outcome are required for the function to work for a static model.

Because our outcome is discrete and already coded as a factor, we only need to specify the miss_val option to tell id_make which value of the outcome should be considered as missing data. If the outcome was a different type, such as character or numeric, we would also want to specify the low_val and high_val options to signify the ordering of the outcomes, whether for ordinal or binary data (for continuous or unbounded outcomes like the Poisson model, only miss_val needs to be passed to the function). If there is missing data that we want to ignore, i.e. not model in the function but rather remove before estimation, then we can also pass that value of the data as the exclude_level option.

Note that there are many other columns in the senate114 data. These can be used as person or item/bill covariates, as is discussed at the end of the vignette, but are essentially ignored unless specified to the id_make function.

senate_data <- id_make(senate114,outcome = 'cast_code',
                       person_id = 'bioname',
                       item_id = 'rollnumber',
                       group_id= 'party_code',
                       time_id='date',
                       miss_val='Absent')

We can then run a binary IRT ideal point model in which absences on particular bills are treated as a "hurdle" that the legislator must overcome in order to show up to vote. To do so, we specify model_type=2, which signifies a binary IRT model with missing-data (absence) inflation (see list above). In essence, the model is calculating a separate ideal point position for each bill/item that represents the bill's importance to different senators. Only if a bill is relatively salient will a legislator choose to show up and vote.

This same missing-data mechanism also applies more broadly to situations of social choice. Any time that missing data might be a function of ideal points--people with certain ideal points are likely to avoid giving an answer--then these values should be marked as miss_val in the id_make function and an inflated model type should be used.

Because this dataset is relatively large, we will use the use_vb option to use Stan's variational Bayesian inference. This version of the sampler is less accurate and tends to underestimate uncertainty, but it runs much, much faster. Generally speaking, the values of the ideal points are close to the full posterior, and it is useful for exploratory model-building even with small datasets. For very large data, use_vb is the only viable option as estimation times may require days or longer.

Ideal points are not identified without further information about the latent scale, especially its direction: should conservative ideal points be listed as positive or negative? To identify the latent scale, I constrain a conservative senator (John Barrasso) to be positive, and a liberal senator, Elizabeth Warren, to be negative in order to identify the polarity in the model. I have to pass in the names of the legislators as they exist in the IDs present in the senate114 data (column bionames). Rather than specify the exact values to pin the legislators to, I set fixtype to 'vb_partial' so that the function will first fit an un-identified model, and then use those estimates to determine what would be good values to pin these legislators to. Using 'vb_partial' is the recommended way to identify models while using prior information about the direction of the latent scale as it avoids having to fix these ideal points to specific values. It will also create starting values for the ideal points that should speed convergence.

sen_est <- id_estimate(senate_data,
                model_type = 1,
                 use_vb = T,
                fixtype='vb_partial',
                 restrict_ind_high = "WARREN, Elizabeth",
                 restrict_ind_low="BARRASSO, John A.",
            seed=84520)
id_plot_legis(sen_est,person_ci_alpha=0.2) +
  scale_color_manual(values=c(D='blue',R='red',I='green')) +
  ggtitle('Ideal Points of the 114th Senate')

The id_plot function has many other options which are documented in the help files. One notable option, though, is to plot bill midpoints along with the legislator ideal points. The midpoints show the line of equiprobability, i.e., at what ideal point is a legislator with that ideal point indifferent to voting on a bill (or answering an item correctly). To plot a bill midpoint overlay, simply include the character ID of the bill (equivalent to the column name of the bill in the rollcall vote matrix) as the item_plot option:

id_plot_legis(sen_est,person_ci_alpha=0.1,item_plot='94') +
  scale_color_manual(values=c(D='blue',R='red',I='green')) +
  ggtitle('Ideal Points of the 114th Senate with Vote 94 Midpoint')

The 50th bill in the 114 Senate shows very high discrimination: the bill midpoint is right in the middle of the ideal point distribution, with most Democrats voting yes and most Republicans voting no. The two rug lines at the bottom of the plot show the high density posterior interval for the bill midpoint, and as can be seen, the uncertainty only included those legislators near the very center of the distribution.

To look at the bill's absence midpoints, simply change the item_plot_type parameter to the id_plot function:

id_plot(sen_est,person_ci_alpha=0.1,item_plot='225',
        item_plot_type='inflated') + 
  scale_color_manual(values=c(D='blue',R='red',I='green'))

The wide high-posterior density (HPD) interval of the absence midpoint ruglines shows that absences are very uninformative of ideal points for this particular bill (as is common in the Senate where absences are very rare).

Parameter Values

We can obtain summary estimates of all the ideal points and item/bill discrimination/difficulty parameters using the summary function that provides the median value of the parameters in addition to a specified posterior density interval (i.e., 5%-95%). For example, we can extract summaries for the ideal points:

ideal_pts_sum <- summary(sen_est,pars='ideal_pts')
knitr::kable(head(ideal_pts_sum))

Parameter Name is the name of the parameter in the underlying Stan code, which can be useful f you want to peruse the fitted Stan model (and can be accessed as given in the code below). The name of the parameters for ideal points in the Stan model is L_full (as seen in the summary from above).

stan_obj <- sen_est@stan_samples
# show the 
print(stan_obj,pars = c("L_full[1]",
                        'L_full[2]',
                        'L_full[3]'))

If we know the name of the Stan parameter, we can look at the trace plot to see how the quality of the Markov Chain Monte Carlo (MCMC) sampling used to fit the model. A good trace plot shows a bouncy line that is stable around an average value. For more info, see the Stan documentation.

stan_trace(sen_est,par='L_full[1]')

Finally we can also extract all of the posterior iterations to do additional calculations that average over posterior uncertainty by changing the aggregate option in summary. In the following code, I access the individual posterior iterations for the item/bill parameters, including difficulty (average probability of voting yes), discrimination (how strongly the item/bill loads on either end of the ideal point scale) and the midpoints (position where someone with that ideal point would be indifferent to voting yes/no).

item_all <- summary(sen_est,pars='items', aggregate=F)
knitr::kable(head(item_all))

Hierarchical Covariates

Finally, we can also fit a model where we include a covariate that varies by person/legislator. To do so, we need to pass a one-sided formula to the id_make function to prepare the data accordingly. By way of example, we will include a model where we include an interaction between party ID (party_code) and the legislator's age to see if younger/older legislators are more or less conservative. Because this is a static model, the effect of the covariate is averaged over all of the bills in the dataset and all the legislators in the dataset without taking into account the order or time period of the bills.

It is important to note that for static ideal point models, covariates are only defined over the legislators/persons who are not being used as constraints in the model, such as John Barasso and Elizabeth Warren in this model.

senate114$age <- 2018 - senate114$born

senate_data <- id_make(senate114,outcome = 'cast_code',
                       person_id = 'bioname',
                       item_id = 'rollnumber',
                       group_id= 'party_code',
                       time_id='date',
                       person_cov = ~party_code*age,
                       miss_val='Absent')

sen_est_cov <- id_estimate(senate_data,
                model_type = 1,
                 use_vb = T,
                fixtype='vb_full',
                 restrict_ind_high = "BARRASSO, John A.",
                 restrict_ind_low="WARREN, Elizabeth",
            seed=84520)

idealstan implements a form of calculating marginal effects for these hierarchical covariates. Because the ideal point model is fundamentally a measurement model with a latent scale, the marginal effects are calculated separately for each end of the latent scale. To do so, the marginal effect of the covariate on one of the outcomes is calculated first for all items/bills with positive discrimination scores and then for items/bills with negative discrimination scores. The effect of the covariate can then be interpreted as having a conditional effect on either the positive or negatively associated latent scores.

For example, the following plot shows the effect of the age covariate on the probability that a senator votes yes separately for conservative bills (negative discrimination) and liberal bills (positive discrimination). In other words, the two marginal effects are for whether increasing in age is associated with casting a yes vote when the bill is liberal versus when a bill is conservative. While these two probabilities are likely to differ, they are not mirror images of each other as they depend on the model's estimates of how liberal or conservative the bills actually are. The following plot shows these marginal effects of age:

id_plot_cov(sen_est_cov,cov_type='person_cov',
            pred_outcome="Yes") +
  ggtitle("Conditional Marginal Effect of Age on Senator's Vote Choices")

Because of the scale of the plot, it can be difficult to see the values for the age interaction. However, because the plot is a ggplot object, we can restrict the x axis (we put the limits on the y axis to modify the x axis because the plot is flipped):

id_plot_cov(sen_est_cov,cov_type='person_cov',pred_outcome = "Yes") + ylim(c(-0.02,0.02)) +
  ggtitle("Conditional Marginal Effect of Age on Senator's Vote Choices")

Or we can filter which parameters we want to exclude from the plot to focus on a few specific interaction effects:

id_plot_cov(sen_est_cov,cov_type='person_cov',
            filter_cov=c('age','party_codeI:age','party_codeI','(Intercept)'),
            pred_outcome="Yes") + 
  ggtitle("Conditional Marginal Effect of Age on Senator's Vote Choices")

Because these covariates are on the ideal point scale, the meaning of the scale in terms of party ideology must be kept in mind in order to interpret the plot. First, because we only have estimates for the "I" and "R" values of party_code, we know that the Democrats "D" are being used as the reference value. As a consequence, the coefficient for age is equal to the effect of age for Democrats (i.e., when the person is neither Republican or Independent). As a result, the way to interpret these plots is that for both Democrats and Republicans, increasing age is associated with voting for bills that are more conservative if they are Republican or more liberal if they are Democrat. In other words, younger party members show stronger ideological affiliation in terms of conservative or liberal bills (or perhaps party unity). Keeping the direction of the scale in mind is crucial for understanding hierarchical covariates in ideal point models.

Of course, because this model was fit with variational inference (use_vb=T), the results are an approximation. It would be good to run full Bayesian inference to obtain final estimates.

We can also extract the covariate summary values using the summary function:

cov_sum <- summary(sen_est_cov,pars='person_cov')
knitr::kable(cov_sum)


Try the idealstan package in your browser

Any scripts or data that you put into this service are public.

idealstan documentation built on July 10, 2019, 5:05 p.m.