id_estimate | R Documentation |
idealstan
model
This function will take a pre-processed idealdata
vote/score dataframe and
run one of the available IRT/latent space ideal point models on the data using
Stan's MCMC engine.
id_estimate(
idealdata = NULL,
model_type = 2,
inflate_zero = FALSE,
vary_ideal_pts = "none",
keep_param = NULL,
grainsize = 1,
mpi_export = NULL,
use_subset = FALSE,
sample_it = FALSE,
subset_group = NULL,
subset_person = NULL,
sample_size = 20,
nchains = 4,
niters = 1000,
use_vb = FALSE,
ignore_db = NULL,
restrict_ind_high = NULL,
fix_high = 1,
fix_low = (-1),
restrict_ind_low = NULL,
num_restrict_high = 1,
num_restrict_low = 1,
fixtype = "prefix",
const_type = "persons",
id_refresh = 0,
prior_only = FALSE,
warmup = 1000,
ncores = 4,
use_groups = FALSE,
discrim_reg_upb = 1,
discrim_reg_lb = -1,
discrim_miss_upb = 1,
discrim_miss_lb = -1,
discrim_reg_scale = 2,
discrim_reg_shape = 2,
discrim_miss_scale = 2,
discrim_miss_shape = 2,
person_sd = 3,
time_fix_sd = 0.1,
time_var = 10,
spline_knots = NULL,
spline_degree = 2,
ar1_up = 1,
ar1_down = 0,
boundary_prior = NULL,
time_center_cutoff = 50,
restrict_var = FALSE,
sample_stationary = FALSE,
ar_sd = 1,
diff_reg_sd = 3,
diff_miss_sd = 3,
restrict_sd_high = NULL,
restrict_sd_low = NULL,
restrict_N_high = 1000,
restrict_N_low = 1000,
ordbeta_phi_mean = 1,
ordbeta_cut_alpha = c(1, 1, 1),
ordbeta_cut_phi = 0,
gp_sd_par = 0.025,
gp_num_diff = 3,
gp_m_sd_par = 0.3,
gp_min_length = 0,
cmdstan_path_user = NULL,
map_over_id = "persons",
save_files = NULL,
compile_optim = FALSE,
debug = FALSE,
init_pathfinder = TRUE,
debug_mode = 0,
...
)
idealdata |
An object produced by the |
model_type |
An integer reflecting the kind of model to be estimated. See below. |
inflate_zero |
If the outcome is distributed as Poisson (count/unbounded integer),
setting this to
|
vary_ideal_pts |
Default |
keep_param |
A list with logical values for different categories of parameters which
should/should not be kept following estimation. Can be any/all of |
grainsize |
The grainsize parameter for the |
mpi_export |
If |
use_subset |
Whether a subset of the legislators/persons should be used instead of the full response matrix |
sample_it |
Whether or not to use a random subsample of the response matrix. Useful for testing. |
subset_group |
If person/legislative data was included in the |
subset_person |
A list of character values of names of persons/legislators to use to subset if |
sample_size |
If |
nchains |
The number of chains to use in Stan's sampler. Minimum is one. See |
niters |
The number of iterations to run Stan's sampler. Shouldn't be set much lower than 500. See |
use_vb |
Whether or not to use Stan's Pathfinder algorithm instead of full Bayesian inference. Pathfinder is much faster but can be much less accurate. Note that Pathfinder is also used by default to find initial values for full HMC sampling. |
ignore_db |
If there are multiple time periods (particularly when there are
very many time periods), you can pass in a data frame
(or tibble) with one row per person per time period and an indicator column
|
restrict_ind_high |
If |
fix_high |
A vector of length |
fix_low |
A vector of length |
restrict_ind_low |
If |
num_restrict_high |
If using variational inference for identification ( |
num_restrict_low |
If using variational inference for identification ( |
fixtype |
Sets the particular kind of identification used on the model, could be either 'vb_full'
(identification provided exclusively by running a variational identification model with no prior info), or
'prefix' (two indices of ideal points or items to fix are provided to
options |
const_type |
Whether |
id_refresh |
The number of times to report iterations from the variational run used to identify models. Default is 0 (nothing output to console). |
prior_only |
Whether to sample only from the priors instead of the full model with likelihood (the default is the full model). Useful for doing prior predictive checks. |
warmup |
The number of iterations to use to calibrate Stan's sampler on a given model. Shouldn't be less than 100.
See |
ncores |
The number of cores in your computer to use for parallel processing in the Stan engine.
See |
use_groups |
If |
discrim_reg_upb |
Upper bound of the rescaled Beta distribution for observed discrimination parameters (default is +1) |
discrim_reg_lb |
Lower bound of the rescaled Beta distribution for observed discrimination parameters (default is -1). Set to 0 for conventional IRT. |
discrim_miss_upb |
Upper bound of the rescaled Beta distribution for missing discrimination parameters (default is +1) |
discrim_miss_lb |
Lower bound of the rescaled Beta distribution for missing discrimination parameters (default is -1). Set to 0 for conventional IRT. |
discrim_reg_scale |
Set the scale parameter for the rescaled Beta distribution of the discrimination parameters. |
discrim_reg_shape |
Set the shape parameter for the rescaled Beta distribution of the discrimination parameters. |
discrim_miss_scale |
Set the scale parameter for the rescaled Beta distribution of the missingness discrimination parameters. |
discrim_miss_shape |
Set the shape parameter for the rescaled Beta distribution of the missingness discrimination parameters. |
person_sd |
The standard deviation of the Normal distribution prior for persons (all non-constrained person ideal point parameters). Default is weakly informative (3) on the logit scale. |
time_fix_sd |
The variance of the over-time component of the first person/legislator is fixed to this value as a reference. Default is 0.1. |
time_var |
The mean of the exponential distribution for over-time variances for ideal point parameters. Default (10) is weakly informative on the logit scale. |
spline_knots |
Number of knots (essentially, number of points
at which to calculate time-varying ideal points given T time points).
Default is NULL, which means that the spline is equivalent to
polynomial time trend of degree |
spline_degree |
The degree of the spline polynomial. The default is 2 which is a quadratic polynomial. A value of 1 will result in independent knots (essentially pooled across time points T). A higher value will result in wigglier time series. There is no "correct" value but lower values are likely more stable and easier to identify. |
ar1_up |
The upper bound of the AR(1) parameter, default is +1. |
ar1_down |
The lower bound of the AR(1) parameter, default is 0. Set to -1 to allow for inverse responses to time shocks. |
boundary_prior |
If your time series has very low variance (change over time),
you may want to use this option to put a boundary-avoiding inverse gamma prior on
the time series variance parameters if your model has a lot of divergent transitions.
To do so, pass a list with an element called
|
time_center_cutoff |
The number of time points above which the model will employ a centered time series approach for AR(1) and random walk models. Below this number the model will employ a non-centered approach. The default is 50 time points, which is relatively arbitrary and higher values may be better if sampling quality is poor above the threshold. |
restrict_var |
Whether to fix the variance parameter for the first person trajectory. Default is FALSE (usually not necessary). |
sample_stationary |
If |
ar_sd |
If an AR(1) model is used, this defines the prior scale of the Normal distribution. A lower number can help identify the model when there are few time points. |
diff_reg_sd |
Set the prior standard deviation for the bill (item) intercepts for the non-inflated model. |
diff_miss_sd |
Set the prior standard deviation for the bill (item) intercepts for the inflated model. |
restrict_sd_high |
Set the level of tightness for high fixed parameters
(top/positive end of scale).
If NULL, the default, will set to .1 if |
restrict_sd_low |
Set the level of tightness for low fixed parameters
(low/negative end of scale).
If NULL, the default, will set to .1 if |
restrict_N_high |
Set the prior scale for high/positive pinned parameters. Default is 1000 (equivalent to 1,000 observations of the pinned value). Higher values make the pin stronger (for example if there is a lot of data). |
restrict_N_low |
Set the prior shape for low/negative pinned parameters. Default is 1000 (equivalent to 1,000 observations of the pinned value). Higher values make the pin stronger (for example if there is a lot of data). |
ordbeta_phi_mean |
The mean of the prior for phi, the dispersion parameter in the ordered beta distribution. Value of this parameter (default is 1) is given as the mean of the exponential distribution for prior values of phi. |
ordbeta_cut_alpha |
A length 3 vector of positive continuous values for alpha in the induced Dirichlet distribution. This distribution is used for the cutpoints of the ordered beta distribution. Default is c(1, 1, 1), which is uninformative. |
ordbeta_cut_phi |
A value for the phi parameter of the induced Dirichlet distribution used for ordered beta cutpoint priors. Default is 0, which is weakly informative. |
gp_sd_par |
The upper limit on allowed residual variation of the Gaussian process prior. Increasing the limit will permit the GP to more closely follow the time points, resulting in much sharper bends in the function and potentially oscillation. |
gp_num_diff |
The number of time points to use to calculate the length-scale prior that determines the level of smoothness of the GP time process. Increasing this value will result in greater smoothness/autocorrelation over time by selecting a greater number of time points over which to calculate the length-scale prior. |
gp_m_sd_par |
The upper limit of the marginal standard deviation of the GP time process. Decreasing this value will result in smoother fits. |
gp_min_length |
The minimum value of the GP length-scale parameter. This is a hard
lower limit. Increasing this value will force a smoother GP fit. It should always be less than
|
cmdstan_path_user |
Default is NULL, and so will default to whatever is set in
|
map_over_id |
This parameter identifies which ID variable to use to construct the
shards for within-chain parallelization. It defaults to |
save_files |
The location to save CSV files with MCMC draws from |
compile_optim |
Whether to use Stan compile optimization flags (off by default) |
debug |
For debugging purposes, turns off threading to enable more informative error messages from Stan. Also recompiles model objects. |
init_pathfinder |
Whether to generate initial values from the Pathfinder algorithm (see Stan documentation). If FALSE, will generate random start values. |
debug_mode |
Whether to debug code by printing values of log-probability statements to the console. A level of 1 will print log-probability before and after likelihood functions are calculated. A level of 2 will also print out the log probability contributions of priors. Default is 0. |
... |
Additional parameters passed on to Stan's sampling engine. See |
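The bounds and Beta parameters for the discrimination priors described above can be pictured with a few lines of base R. This is only a sketch under stated assumptions: it treats discrim_reg_shape and discrim_reg_scale as the two parameters of a Beta distribution and stretches the draws linearly onto the interval from discrim_reg_lb to discrim_reg_upb. The package's Stan code may parameterize the prior differently, and the helper name rescaled_beta_draws is hypothetical.

```r
# Hypothetical helper (not part of idealstan): draw from a Beta distribution
# and stretch the draws linearly from (0, 1) onto (lb, upb).
# Assumption: "shape" and "scale" are used as the two Beta parameters.
rescaled_beta_draws <- function(n, shape = 2, scale = 2, lb = -1, upb = 1) {
  stopifnot(upb > lb)
  lb + (upb - lb) * rbeta(n, shape1 = shape, shape2 = scale)
}

set.seed(1)
default_draws <- rescaled_beta_draws(1000)          # bounds -1 and +1
irt_draws     <- rescaled_beta_draws(1000, lb = 0)  # bounds 0 and +1

range(default_draws)
range(irt_draws)
```

With the lower bound at 0 every draw is non-negative, which corresponds to the "conventional IRT" setting mentioned for discrim_reg_lb and discrim_miss_lb.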
To run an IRT ideal point model, you must first pre-process your data using the id_make()
function. Be sure to specify the correct options for the
kind of model you are going to run: if you want to run an unbounded outcome (i.e. Poisson or continuous),
the data needs to be processed differently. Also any hierarchical covariates at the person or item level
need to be specified in id_make()
. If they are specified in id_make()
, then all
subsequent models fit by this function will have these covariates.
Note that for static ideal point models, the covariates are only defined for those persons who are not being used as constraints.
As of this version of idealstan
, the following model types are available. Simply pass
the number of the model in the list to the model_type
option to fit the model.
1. IRT 2-PL (binary response) ideal point model with no missing-data inflation
2. IRT 2-PL (binary response) ideal point model with missing-data inflation
3. Ordinal IRT (rating scale) ideal point model with no missing-data inflation
4. Ordinal IRT (rating scale) ideal point model with missing-data inflation
5. Ordinal IRT (graded response) ideal point model with no missing-data inflation
6. Ordinal IRT (graded response) ideal point model with missing-data inflation
7. Poisson IRT (Wordfish) ideal point model with no missing-data inflation
8. Poisson IRT (Wordfish) ideal point model with missing-data inflation
9. Unbounded (Gaussian) IRT ideal point model with no missing-data inflation
10. Unbounded (Gaussian) IRT ideal point model with missing-data inflation
11. Positive-unbounded (Log-normal) IRT ideal point model with no missing-data inflation
12. Positive-unbounded (Log-normal) IRT ideal point model with missing-data inflation
13. Latent Space (binary response) ideal point model with no missing-data inflation
14. Latent Space (binary response) ideal point model with missing-data inflation
15. Ordered Beta (proportion/percentage) with no missing-data inflation
16. Ordered Beta (proportion/percentage) with missing-data inflation
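Because the list alternates between the non-inflated and inflated version of each outcome family, the model_type number can be worked out from a family's position in the list, assuming the list is numbered from 1 in the order shown. The sketch below illustrates that arithmetic. The family labels and the helper model_type_number are hypothetical conveniences and are not part of idealstan.

```r
# Hypothetical labels for the eight outcome families, in list order
families <- c("binary", "ordinal_rating", "ordinal_graded", "poisson",
              "gaussian", "lognormal", "latent_space", "ordered_beta")

# Hypothetical helper: each family occupies two consecutive numbers,
# odd for no missing-data inflation and even for missing-data inflation.
model_type_number <- function(family, inflated = FALSE) {
  idx <- match(family, families)
  if (is.na(idx)) stop("unknown family: ", family)
  2L * idx - 1L + as.integer(inflated)
}

model_type_number("binary", inflated = TRUE)   # 2
model_type_number("ordered_beta")              # 15
```

The resulting integer is what would be passed to the model_type option.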
A fitted idealstan()
object that contains posterior samples of all parameters either via full Bayesian inference
or a variational approximation if use_vb
is set to TRUE
. This object can then be passed to the plotting functions for further analysis.
In addition, each of these models can have time-varying ideal point (person) parameters if
a column of dates is fed to the id_make()
function. If the option vary_ideal_pts
is
set to 'random_walk'
, id_estimate
will estimate a random-walk ideal point model where ideal points
move in a random direction. If vary_ideal_pts
is set to 'AR1'
, a stationary ideal point model
is estimated where ideal points fluctuate around a long-term mean. If vary_ideal_pts
is set to 'GP'
, then a semi-parametric Gaussian process time-series prior will be put
around the ideal points. If vary_ideal_pts
is set to 'splines'
, then the ideal point trajectories will be a basis spline defined by the parameters spline_knots
and spline_degree
.
Please see the package vignette and associated paper for more detail
about these time-varying models.
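To make the difference between the random-walk and AR(1) options concrete, the following base R sketch simulates one trajectory of each kind from the same shocks. It is an illustration only and does not reproduce the package's Stan parameterization; the values of rho, long_run_mean and the shock standard deviation are arbitrary choices made for the example.

```r
set.seed(42)
n_time <- 50
shocks <- rnorm(n_time, sd = 0.1)

# Random walk: each ideal point is the previous one plus a shock,
# so shocks accumulate and the series can drift anywhere.
random_walk <- cumsum(shocks)

# AR(1): only a fraction rho of the previous deviation carries over,
# so the series keeps returning towards long_run_mean.
rho <- 0.5             # illustrative value between ar1_down = 0 and ar1_up = 1
long_run_mean <- 0.3   # illustrative value
ar1 <- numeric(n_time)
ar1[1] <- long_run_mean + shocks[1]
for (t in 2:n_time) {
  ar1[t] <- long_run_mean + rho * (ar1[t - 1] - long_run_mean) + shocks[t]
}

# Compare how far each series wanders
range(random_walk)
range(ar1)
```

Plotting both series against time shows the random walk drifting while the AR(1) series stays near its long-term mean.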
The inflation model used to account for missing data assumes that missingness is a
function of the persons' (legislators')
ideal points. In other words, the model will take into account if people with high or low ideal points
tend to have more/less missing data on a specific item/bill. Missing data should be coded
as NA
when it is passed to the id_make function.
If there isn't any relationship
between missing data and ideal points, then the model assumes that the missingness is ignorable
conditional on each
item, but it will still adjust the results to reflect these ignorable (random) missing
values. The inflation is designed to be general enough to handle a wide array of potential
situations where strategic social choices make missing data important to take into account.
To leave missing data out of the model, simply choose a version of the model in the list above that is non-inflated.
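The idea that missingness can depend on ideal points can be sketched as a two-stage simulation in base R: first decide whether each person is absent on an item, then generate a vote only for those present. This is a rough illustration of the concept, not the likelihood used by the package, and all parameter values and variable names are hypothetical.

```r
set.seed(123)
n_persons <- 200
ideal_pt <- rnorm(n_persons)

# Hypothetical item parameters for a single bill
discrim_vote <- 1.0; diff_vote <- 0.0   # observed-outcome part
discrim_miss <- 1.5; diff_miss <- 1.0   # missingness part

# Stage 1: the probability of being absent rises with the ideal point
p_absent <- plogis(discrim_miss * ideal_pt - diff_miss)
absent <- rbinom(n_persons, 1, p_absent) == 1

# Stage 2: a yes/no vote is only observed for persons who are present
vote <- rbinom(n_persons, 1, plogis(discrim_vote * ideal_pt - diff_vote))
vote[absent] <- NA   # missing data coded as NA

# Average ideal point of present (FALSE) versus absent (TRUE) persons
tapply(ideal_pt, absent, mean)
```

If discrim_miss were set to 0, absence would no longer depend on the ideal point, which corresponds to the ignorable case described above.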
Models can be either fit on the person/legislator IDs or on group-level IDs (as specified to the
id_make
function). If group-level parameters should be fit, set use_groups
to TRUE
.
Covariates are included in the model if they were specified as options to the
id_make()
function. The covariate plots can be accessed with
id_plot_cov()
on a fitted idealstan
model object.
Identifying IRT models is challenging, and ideal point models are still more challenging
because the discrimination parameters are not constrained.
As a result, more care must be taken to obtain estimates that are the same regardless of starting values.
The parameter fixtype
enables you to change the type of identification used. The 'vb_full' option
does not require any further
information from you in order for the model to be fit. In this version of identification,
an unidentified model is run using
variational Bayesian inference (see rstan::vb()
). The function will then select two
persons/legislators or items/bills that end up on either end of the ideal point spectrum,
and pin their ideal points
to those specific values.
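One reason pinning is needed can be checked numerically: when discrimination parameters are free to take either sign, flipping the sign of every ideal point and every discrimination parameter leaves all response probabilities unchanged. The base R sketch below demonstrates this for a generic 2-PL form. The function prob_yes and the simulated values are hypothetical and are not taken from the package.

```r
set.seed(7)
ideal_pt   <- rnorm(5)   # person ideal points
discrim    <- rnorm(3)   # item discrimination, unconstrained in sign
difficulty <- rnorm(3)   # item intercepts

# Pr(yes) for every person-item pair: plogis(x_i * gamma_j - beta_j)
prob_yes <- function(x, gamma, beta) {
  plogis(sweep(outer(x, gamma), 2, beta))   # sweep() subtracts beta by column
}

original  <- prob_yes(ideal_pt, discrim, difficulty)
reflected <- prob_yes(-ideal_pt, -discrim, difficulty)

all.equal(original, reflected)   # TRUE: the data cannot tell the two apart
```

Fixing one person (or item) at a positive value and another at a negative value rules out the reflected solution.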
To control whether persons/legislators or items/bills are constrained,
the const_type
can be set to either "persons"
or
"items"
respectively.
In many situations, it is prudent to select those persons or items
ahead of time to pin to specific values. This allows the analyst to
be more specific about what type of latent dimension is to be
estimated. To do so, the fixtype
option should be set to
"prefix"
. The values of the persons/items to be pinned can be passed
as character values to restrict_ind_high
and
restrict_ind_low
to pin the high/low ends of the latent
scale respectively. Note that these should be the actual data values
passed to the id_make
function. If you don't pass any values,
you will see a prompt asking you to select certain values of persons/items.
The pinned values for persons/items are set by default to +1/-1, though
this can be changed using the fix_high
and
fix_low
options. This pinned range is sufficient to identify
all of the models
implemented in idealstan, though fiddling with some parameters may be
necessary in difficult cases. For time-series models, one of the
person ideal point over-time variances is also fixed to .1, a value that
can be changed using the option time_fix_sd
.
Clinton, J., Jackman, S., & Rivers, D. (2004). The Statistical Analysis of Roll Call Data. The American Political Science Review, 98(2), 355-370. doi:10.1017/S0003055404001194
Bafumi, J., Gelman, A., Park, D., & Kaplan, N. (2005). Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation. Political Analysis, 13(2), 171-187. doi:10.1093/pan/mpi010
Kubinec, R. "Generalized Ideal Point Models for Time-Varying and Missing-Data Inference". Working Paper.
Betancourt, Michael. "Robust Gaussian Processes in Stan". (October 2017). Case Study.
id_make()
for pre-processing data,
id_plot_legis()
for plotting results,
summary()
for obtaining posterior quantiles,
id_post_pred()
for producing predictive replications.
# First we can simulate data for an IRT 2-PL model that is inflated for missing data
library(ggplot2)
library(dplyr)
# This code will take at least a few minutes to run
## Not run:
bin_irt_2pl_abs_sim <- id_sim_gen(model_type = 'binary', inflate = TRUE)
# Now we can put that directly into the id_estimate function
# to get full Bayesian posterior estimates
# We will constrain discrimination parameters
# for identification purposes based on the true simulated values
bin_irt_2pl_abs_est <- id_estimate(bin_irt_2pl_abs_sim,
                                   model_type = 2,
                                   restrict_ind_high =
                                     sort(bin_irt_2pl_abs_sim@simul_data$true_person,
                                          decreasing = TRUE,
                                          index.return = TRUE)$ix[1],
                                   restrict_ind_low =
                                     sort(bin_irt_2pl_abs_sim@simul_data$true_person,
                                          decreasing = FALSE,
                                          index.return = TRUE)$ix[1],
                                   fixtype = 'prefix',
                                   ncores = 2,
                                   nchains = 2)
# We can now see how well the model recovered the true parameters
id_sim_coverage(bin_irt_2pl_abs_est) %>%
  bind_rows(.id = 'Parameter') %>%
  ggplot(aes(y = avg, x = Parameter)) +
  stat_summary(fun.args = list(mult = 1.96)) +
  theme_minimal()
## End(Not run)
# In most cases, we will use pre-existing data
# and we will need to use the id_make function first
# We will use the full rollcall voting data
# from the 114th Senate as a rollcall object
## Not run:
data('senate114')
# Running this model will take at least a few minutes, even with
# variational inference (use_vb=T) turned on
to_idealstan <- id_make(score_data = senate114,
                        outcome = 'cast_code',
                        person_id = 'bioname',
                        item_id = 'rollnumber',
                        group_id = 'party_code',
                        time_id = 'date',
                        high_val = 'Yes',
                        low_val = 'No',
                        miss_val = 'Absent')
sen_est <- id_estimate(to_idealstan,
                       model_type = 2,
                       use_vb = TRUE,
                       fixtype = 'prefix',
                       restrict_ind_high = "BARRASSO, John A.",
                       restrict_ind_low = "WARREN, Elizabeth")
# After running the model, we can plot
# the results of the person/legislator ideal points
id_plot_legis(sen_est)
## End(Not run)