#' Create data to run IRT model
#'
#' To run an IRT model using \code{idealstan}, you must first process your data using the \code{id_make}
#' function.
#'
#' @details This function can accept either a \code{rollcall} data object from package
#' \code{pscl} or
#' a long data frame where one row equals one item-person (bill-legislator)
#' observation with associated outcome. The preferred method is the long data frame
#' as passing a long data frame permits
#' the inclusion of a wide range of covariates in the model, such as person-varying and item-varying
#' (bill-varying) covariates.
#' If a \code{rollcall} object is passed to the function, the \code{rollcall} data is converted
#' to a long data frame with data from the \code{vote.data} matrix used to determine dates for bills.
#' If passing a long data frame, you should specify the names of the
#' columns containing the IDs for persons, items and
#' groups (groups are IDs that may have multiple observations per ID, such as political parties or
#' classes) to the \code{id_make} function, along with the name of the response/outcome.
#' The only required columns are the item/bill ID and the person/legislator ID along with an
#' outcome column.
#'
#' The preferred format for the outcome column for discrete variables (binary or ordinal)
#' is to pass a factor variable with levels in the correct order, i.e., in ascending order.
#' For example, if using legislative data, the levels of the factor should be \code{c('No','Yes')}.
#' If a different kind of variable is passed, such as a character or numeric variable,
#' you should consider specifying \code{low_val},\code{high_val} and \code{middle_val} to
#' determine the correct order of the discrete outcome. Specifying \code{middle_val} is only
#' necessary if you are estimating an ordinal model.
#'
#' If you do not specify a value for \code{miss_val}, then any \code{NA} are assumed to be
#' missing. If you do specify \code{miss_val} and you also have \code{NA} in your data
#' (assuming \code{miss_val} is not \code{NA}), then the function will treat the data
#' coded as \code{miss_val} as missing data that should be modeled and will treat the \code{NA}
#' data as ignorable missing data that will be removed (list-wise deletion) before estimating a
#' model.
#'
#' @section Time-Varying Models
#'
#' To run a time-varying model, you need to include the name of a column with dates (or integers) that is passed
#' to the \code{time_id} option.
#'
#' @section Continuous Outcomes
#'
#' If the outcome is unbounded i.e. a continuous or an unbounded
#' discrete variable like Poisson, simply set \code{unbounded} to \code{TRUE}. You can ignore the
#' options that specify which values should be \code{high_val} or \code{low_val}. You can either specify
#' a particular value as missing using \code{miss_val}, or all
#' missing values (\code{NA}) will be recoded to a specific value out of the range of the outcome to use
#' for modeling the missingness.
#'
#' @section Hierarchical Covariates
#'
#' Covariates can be fit on the person-level ideal point parameters as well as
#' item discrimination parameters for either the inflated (missing) or non-inflated (observed)
#' models. These covariates must be columns that were included with the data fed to the
#' \code{\link{id_make}} function. The covariate relationships are specified as
#' one-sided formulas, i.e. \code{~cov1 + cov2 + cov1*cov2}. To interact covariates with the
#' person-level ideal points you can use \code{~cov1 + person_id + cov1*person_id} and for
#' group-level ideal poins you can use \code{~cov1 + group_id + cov1*group_id} where
#' \code{group_id} or \code{person_id} is the same name as the name of the column
#' for these options that you passed to \code{id_make} (i.e., the names of the columns
#' in the original data).
#'
#' @param score_data A data frame in long form, i.e., one row in the data for each
#' measured score or vote in the data or a \code{rollcall} data object from package \code{pscl}.
#' @param outcome Column name of the outcome in \code{score_data}, default is \code{"outcome"}
#' @param person_id Column name of the person/legislator ID index in \code{score_data},
#' default is \code{'person_id'}. Should be integer, character or factor.
#' @param item_id Column name of the item/bill ID index in \code{score_data},
#' default is \code{'item_id'}. Should be integer, character or factor.
#' @param time_id Column name of the time values in \code{score_data}:
#' optional, default is \code{'time_id'}. Should be a date or date-time class, but can be an integer
#' (i.e., years in whole numbers).
#' @param model_id Column name of the model/response types in the data.
#' Default is \code{"model_id"}. Only necessary if a model with multiple
#' response types (i.e., binary + continuous outcomes). Must be a
#' column with a series
#' of integers matching the model types in \code{\link{id_estimate}}
#' showing which row of the data matches which outcome.
#' @param group_id Optional column name of a person/legislator group IDs (i.e., parties) in \code{score_data}.
#' Optional, default is \code{'group_id'}. Should be integer, character or factor.
#' @param person_cov A one-sided formula that specifies the covariates
#' in \code{score_data} that will be used to hierarchically model the person/legislator ideal points
#' @param item_cov A one-sided formula that specifies the covariates
#' in \code{score_data} that will be used to hierarchically model the
#' item/bill discrimination parameters for the regular model
#' @param item_cov_miss A one-sided formula that specifies the covariates
#' in the dataset that will be used to hierarchically model the item/bill discrimination parameters for the
#' missing data model.
#' @param remove_cov_int Whether to remove constituent terms from hierarchical covariates that
#' interact covariates with IDs like \code{person_id} or \code{item_id}. Set to \code{TRUE} if
#' including these constituent terms would cause multi-collinearity with other terms in the model
#' (such as running a group-level model with a group-level interaction or a person-level model
#' with a person-level interaction).
#' @param simul_data Optionally, data that has been generated by the \code{\link{id_sim_gen}} function.
#' @param unbounded Whether or not the outcome/response is unbounded (i.e., continuous or
#' Poisson). If it is, \code{miss_val}
#' is recoded as the maximum of the outcome + 1.
#' @param exclude_level A vector of any values that should be treated as \code{NA} in the response matrix.
#' Unlike the \code{miss_val} parameter, these values will be dropped from the data before
#' estimation rather than modeled explicitly.
#' @param simulation If \code{TRUE}, simulated values are saved in the \code{idealdata} object for
#' later plotting with the \code{\link{id_plot_sims}} function
#' @return A \code{idealdata} object that can then be used in the \code{\link{id_estimate}} function
#' to fit a model.
#' @export
#' @import dplyr
#' @importFrom tidyr gather spread
#' @import bayesplot
#' @import Rcpp
#' @import methods
#' @importFrom stats dbinom median plogis quantile reorder rexp rlnorm runif sd step rnorm
#' @importFrom utils person
#' @examples
#' # You can either use a pscl rollcall object or a vote/score matrix
#' # where persons/legislators are in the rows
#' # and items/bills are in the columns
#'
#' library(dplyr)
#'
#' # First, using a rollcall object with the 114th Senate's rollcall votes:
#'
#' data('senate114')
#'
#' to_idealstan <- id_make(score_data = senate114,
#' outcome = 'cast_code',
#' person_id = 'bioname',
#' item_id = 'rollnumber',
#' group_id= 'party_code',
#' time_id='date',
#' high_val='Yes',
#' low_val='No',
#' miss_val='Absent')
#'
id_make <- function(score_data=NULL,
outcome_disc='outcome_disc',
outcome_cont='outcome_cont',
person_id='person_id',
item_id='item_id',
time_id='time_id',
group_id='group_id',
model_id='model_id',
ordered_id="ordered_id",
ignore_id="ignore_id",
simul_data=NULL,
person_cov=NULL,
item_cov=NULL,
item_cov_miss=NULL,
remove_cov_int=FALSE,
unbounded=FALSE,
exclude_level=NA,simulation=FALSE) {
# only allow missing values as NA
miss_val <- c(NA,NA)
high_val <- NULL
middle_val <- NULL
low_val <- NULL
# make sure to ungroup it if it's a tidy data frame
if('tbl' %in% class(score_data)) score_data <- ungroup(score_data)
# if object is a rollcall, pre-process data into long form
if('rollcall' %in% class(score_data)) {
score_data <- .prepare_rollcall(score_data,item_id=item_id,
time_id=time_id)
time_id <- score_data$time_id
item_id <- score_data$item_id
score_data <- score_data$score_data
miss_val <- 9
low_val <- 6
high_val <- 1
exclude_level <- c(3,7)
}
# data is already in long form
# save original ID names as strings to filter later
orig_id <- c(item_id,
group_id,
person_id)
# test for input and quote
outcome_disc <- .check_quoted(outcome_disc,quo(outcome_disc))
outcome_cont <- .check_quoted(outcome_cont,quo(outcome_cont))
ordered_id <- .check_quoted(ordered_id,quo(ordered_id))
item_id <- .check_quoted(item_id,quo(item_id))
time_id <- .check_quoted(time_id,quo(time_id))
group_id <- .check_quoted(group_id,quo(group_id))
person_id <- .check_quoted(person_id,quo(person_id))
model_id <- .check_quoted(model_id,quo(model_id))
# rename data
# IDs are made factors now so we can re-arrange indices when need be
score_rename <- select(score_data,
item_id = !!item_id,
person_id = !!person_id) %>%
mutate(item_id=factor(!! quo(item_id)),
person_id=factor(!! quo(person_id)))
# need to test model number one first
test_model <- try(pull(score_data,!!model_id),silent=TRUE)
if(any('try-error' %in% class(test_model))) {
score_rename$model_id <- "missing"
} else {
score_rename$model_id <- test_model
}
# if time or group IDs don't exist, make dummies
test_group <- try(factor(pull(score_data,!!group_id)),silent=TRUE)
test_time <- try(pull(score_data,!!time_id),silent=TRUE)
test_out_disc <- try(factor(pull(score_data,!!outcome_disc)),silent=TRUE)
test_out_cont <- try(pull(score_data,!!outcome_cont),silent=TRUE)
test_ordered <- try(pull(score_data,!!ordered_id),silent=TRUE)
if(any('try-error' %in% class(test_group))) {
score_rename$group_id <- factor("G")
} else {
score_rename$group_id <- test_group
}
if(any('try-error' %in% class(test_time))) {
score_rename$time_id <- 1
} else {
score_rename$time_id <- test_time
}
if(any('try-error' %in% class(test_out_disc))) {
} else {
score_rename$outcome_disc <- test_out_disc
}
if(any('try-error' %in% class(test_out_cont))) {
} else {
score_rename$outcome_cont <- test_out_cont
}
if(any("try_error" %in% class(test_ordered))) {
} else {
score_rename$ordered_id <- test_ordered
}
# now we can make these all quosures again to use in NSE
outcome_disc <- quo(outcome_disc)
outcome_cont <- quo(outcome_cont)
item_id <- quo(item_id)
person_id <- quo(person_id)
time_id <- quo(time_id)
group_id <- quo(group_id)
model_id <- quo(model_id)
ordered_id <- quo(ordered_id)
# see what the max time point is
max_t <- max(as.numeric(factor(pull(score_rename,!!time_id))),na.rm=T)
num_person <- max(as.numeric(factor(pull(score_rename,!!person_id))),na.rm=T)
num_group <- try(max(as.numeric(factor(pull(score_rename,!!group_id)))))
num_item <- max(as.numeric(factor(pull(score_rename,!!item_id))),na.rm=T)
# create data frames for all hierachical parameters
if(!is.null(person_cov)) {
# drop intercept
personm <- model.matrix(person_cov,data=score_data)[,-1,drop=FALSE]
# need to check for missing data and remove any missing from IDs
score_rename <- slice(score_rename,as.numeric(attr(personm,'dimnames')[[1]]))
if(nrow(score_rename)!=nrow(personm)) stop('Covariate matrix and data matrix not the same size even after removing missing data.')
# need to remove any constituent terms for IDs from the model matrix to avoid collinearity with
# model parameters
check_ids <- sapply(unique(score_rename$person_id), function(c) {
check_all <- grepl(x=colnames(personm),
pattern=c)
check_all_colon <- grepl(x=colnames(personm),
pattern=paste0(c,":","|",":",c))
matrix(check_all & !check_all_colon,nrow=length(check_all))
},simplify = F) %>%
do.call(rbind, .) %>%
apply(2,any)
if(remove_cov_int) {
personm <- personm[,!check_ids]
}
score_rename <- bind_cols(score_rename,
as_data_frame(personm))
person_cov <- dimnames(personm)[[2]]
} else {
# make a dummy column if no covariate data
score_rename$personcov0 <- 0
person_cov <- 'personcov0'
}
if(!is.null(item_cov)) {
itemm <- model.matrix(item_cov,data=score_data)[,-1,drop=F]
# need to check for missing data and remove any missing from IDs
score_rename <- slice(score_rename,as.numeric(attr(itemm,'dimnames')[[1]]))
if(nrow(score_rename)!=nrow(itemm)) stop('Covariate matrix and data matrix not the same size even after removing missing data.')
# need to remove any constituent terms for IDs from the model matrix to avoid collinearity with
# model parameters
check_ids <- sapply(unique(score_rename$item_id), function(c) {
check_all <- grepl(x=colnames(itemm),
pattern=c)
check_all_colon <- grepl(x=colnames(itemm),
pattern=paste0(c,":","|",":",c))
matrix(check_all & !check_all_colon,nrow=length(check_all))
},simplify = F) %>%
do.call(rbind, .) %>%
apply(2,any)
if(remove_cov_int) {
itemm <- itemm[,!check_ids]
}
score_rename <- bind_cols(score_rename,
as_data_frame(itemm))
item_cov <- dimnames(itemm)[[2]]
} else {
# make a dummy column if no covariate data
score_rename$itemcov0 <- 0
item_cov <- 'itemcov0'
}
if(!is.null(item_cov_miss)) {
itemmissm <- model.matrix(item_cov_miss,data=score_data)[,-1,drop=FALSE]
# need to check for missing data and remove any missing from IDs
score_rename <- slice(score_rename,as.numeric(attr(itemmissm,'dimnames')[[1]]))
if(nrow(score_rename)!=nrow(itemmissm)) stop('Covariate matrix and data matrix not the same size even after removing missing data.')
# need to remove any constituent terms for IDs from the model matrix to avoid collinearity with
# model parameters
check_ids <- sapply(unique(score_rename$item_id), function(c) {
check_all <- grepl(x=colnames(itemmissm),
pattern=c)
check_all_colon <- grepl(x=colnames(itemmissm),
pattern=paste0(c,":","|",":",c))
matrix(check_all & !check_all_colon,nrow=length(check_all))
},simplify = F) %>%
do.call(rbind, .) %>%
apply(2,any)
if(remove_cov_int) {
itemmissm <- itemmissm[,!check_ids]
}
score_rename <- bind_cols(score_rename,
as_data_frame(itemmissm))
item_cov_miss <- dimnames(itemmissm)[[2]]
} else {
# make a dummy column if no covariate data
score_rename$itemcovmiss0 <- 0
item_cov_miss <- 'itemcovmiss0'
}
# recode score/outcome
if("outcome_disc" %in% names(score_rename)) {
if(is.na(miss_val[1])) {
# make NA a level, then change it
score_rename$outcome_disc <- addNA(score_rename$outcome_disc)
levels(score_rename$outcome_disc)[length(levels(score_rename$outcome_disc))] <- "Missing"
miss_val[1] <- 'Missing'
}
if(!is.null(high_val) && !is.null(low_val) && !is.null(middle_val)) {
score_rename$outcome_disc <- factor(score_rename$outcome_disc,
levels=c(low_val,middle_val,high_val,miss_val[1]),
exclude = exclude_level)
} else if(!is.null(high_val) && !is.null(low_val)) {
score_rename$outcome_disc <- factor(score_rename$outcome_disc,
levels=c(low_val,high_val,miss_val[1]),
exclude = exclude_level)
} else {
# variable does not need to be recoded, only move missing to the end
score_rename$outcome_disc <- factor(score_rename$outcome_disc)
score_rename$outcome_disc <- fct_relevel(score_rename$outcome_disc,as.character(miss_val[1]),after=Inf)
}
# reconvert if continuous values present
if("outcome_cont" %in% names(score_rename)) {
levels(score_rename$outcome_disc) <- c(levels(score_rename$outcome_disc),
"Joint Posterior")
score_rename$outcome_disc[score_rename$model_id>8 & score_rename$model_id<13] <- "Joint Posterior"
}
}
if("outcome_cont" %in% names(score_rename)) {
if("outcome_disc" %in% names(score_rename)) {
max_val <- max(pull(score_rename,!!outcome_cont)[score_rename$outcome_disc=="Joint Posterior"],na.rm=T)
} else {
max_val <- max(pull(score_rename,!!outcome_cont),na.rm=T)
}
# make missing data the highest observed value
if(is.na(miss_val[2])) {
if("outcome_disc" %in% names(score_rename)) {
# only truly missing if discrete outcome also missing
score_rename$outcome_cont[score_rename$outcome_disc=="Joint Posterior" & !is.na(score_rename$outcome_disc)] <- coalesce(score_rename$outcome_cont[score_rename$outcome_disc=="Joint Posterior" & !is.na(score_rename$outcome_disc)],max_val+1L)
# need to set another value for not truly missing values for appended datasets
score_rename$outcome_cont[score_rename$outcome_disc!="Joint Posterior"] <- max_val + 2L
} else {
score_rename <- mutate(score_rename,!! quo_name(outcome_cont) := coalesce(!!outcome_cont,max_val+1L))
}
} else {
if("outcome_disc" %in% names(score_rename)) {
score_rename <- mutate(score_rename,!! quo_name(outcome_cont) := ifelse(!!outcome_cont==miss_val[2] & outcome_disc=="Joint Posterior" &
!is.na(outcome_disc),max_val+1L,!!outcome_cont))
# need to set another value for not truly missing values for appended datasets
score_rename$outcome_cont[score_rename$outcome_disc!="Joint Posterior"] <- max_val + 2L
} else {
score_rename <- mutate(score_rename,!! quo_name(outcome_cont) := ifelse(!!outcome_cont==miss_val[2],max_val+1L,!!outcome_cont))
}
}
miss_val[2] <- max_val + 1L
}
# need to sort by integer vs. continuous
score_rename$discrete <- as.numeric(score_rename$model_id %in% c(1,2,3,4,5,6,7,8,13,14))
score_rename <- arrange(score_rename,desc(discrete),person_id,item_id)
outobj <- new('idealdata',
score_matrix=score_rename,
person_cov=person_cov,
item_cov=item_cov,
item_cov_miss=item_cov_miss,
miss_val=miss_val)
if(simulation==TRUE) {
outobj@simul_data <- simul_data
outobj@simulation <- simulation
}
return(outobj)
}
#' Estimate an \code{idealstan} model
#'
#' This function will take a pre-processed \code{idealdata} vote/score dataframe and
#' run one of the available IRT/latent space ideal point models on the data using
#' Stan's MCMC engine.
#'
#' To run an IRT ideal point model, you must first pre-process your data using the \code{\link{id_make}} function. Be sure to specify the correct options for the
#' kind of model you are going to run: if you want to run an unbounded outcome (i.e. Poisson or continuous),
#' the data needs to be processed differently. Also any hierarchical covariates at the person or item level
#' need to be specified in \code{\link{id_make}}. If they are specified in \code{\link{id_make}}, than all
#' subsequent models fit by this function will have these covariates.
#'
#' \strong{Note that for static ideal point models, the covariates are only defined for those
#' persons who are not being used as constraints.}
#'
#' As of this version of \code{idealstan}, the following model types are available. Simply pass
#' the number of the model in the list to the \code{model_type} option to fit the model.
#'
#' \enumerate{
#' \item IRT 2-PL (binary response) ideal point model, no missing-data inflation
#' \item IRT 2-PL ideal point model (binary response) with missing- inflation
#' \item Ordinal IRT (rating scale) ideal point model no missing-data inflation
#' \item Ordinal IRT (rating scale) ideal point model with missing-data inflation
#' \item Ordinal IRT (graded response) ideal point model no missing-data inflation
#' \item Ordinal IRT (graded response) ideal point model with missing-data inflation
#' \item Poisson IRT (Wordfish) ideal point model with no missing data inflation
#' \item Poisson IRT (Wordfish) ideal point model with missing-data inflation
#' \item unbounded (Gaussian) IRT ideal point model with no missing data
#' \item unbounded (Gaussian) IRT ideal point model with missing-data inflation
#' \item Positive-unbounded (Log-normal) IRT ideal point model with no missing data
#' \item Positive-unbounded (Log-normal) IRT ideal point model with missing-data inflation
#' \item Latent Space (binary response) ideal point model with no missing data
#' \item Latent Space (binary response) ideal point model with missing-data inflation
#' }
#'
#' @section Time-Varying Inferece
#'
#' In addition, each of these models can have time-varying ideal point (person) parameters if
#' a column of dates is fed to the \code{\link{id_make}} function. If the option \code{vary_ideal_pts} is
#' set to \code{'random_walk'}, \code{id_estimate} will estimate a random-walk ideal point model where ideal points
#' move in a random direction. If \code{vary_ideal_pts} is set to \code{'AR1'}, a stationary ideal point model
#' is estimated where ideal points fluctuate around long-term mean. If \code{vary_ideal_pts}
#' is set to \code{'GP'}, then a semi-parametric Gaussian process time-series prior will be put
#' around the ideal points. Please see the package vignette and associated paper for more detail
#' about these time-varying models. Both the \code{'AR1'} and \code{'GP'} models can also
#' accept time-varying covariates.
#'
#' @section Missing Data
#'
#' The inflation model used to account for missing data assumes that missingness is a
#' function of the persons' (legislators')
#' ideal points. In other words,the model will take into account if people with high or low ideal points
#' tend to have more/less missing data on a specific item/bill. Missing data is whatever was
#' passed as \code{miss_val} to the \code{\link{id_make}} function.
#' If there isn't any relationship
#' between missing data and ideal points, then the model assumes that the missingness is ignorable
#' conditional on each
#' item, but it will still adjust the results to reflect these ignorable (random) missing
#' values. The inflation is designed to be general enough to handle a wide array of potential
#' situations where strategic social choices make missing data important to take into account.
#'
#' The missing data is assumed to be any possible value of the outcome. The well-known
#' zero-inflated Poisson model is a special case where missing values are known to be all zeroes.
#' To fit a zero-inflated Poisson model, change \code{inflate_zeroes} to \code{TRUE} and also
#' make sure to set the value for zero as \code{miss_val} in the \code{\link{id_make}} function.
#' This will only work for outcomes that are distributed as Poisson variables (i.e.,
#' unbounded integers or counts).
#'
#' To leave missing data out of the model, simply choose a version of the model in the list above
#' that is non-inflated.
#'
#' Models can be either fit on the person/legislator IDs or on group-level IDs (as specified to the
#' \code{id_make} function). If group-level parameters should be fit, set \code{use_groups} to \code{TRUE}.
#'
#' @section Covariates
#'
#' Covariates are included in the model if they were specified as options to the
#' \code{\link{id_make}} function. The covariate plots can be accessed with
#' \code{\link{id_plot_cov}} on a fitted \code{idealstan} model object.
#'
#' @section Identification:
#' Identifying IRT models is challenging, and ideal point models are still more challenging
#' because the discrimination parameters are not constrained.
#' As a result, more care must be taken to obtain estimates that are the same regardless of starting values.
#' The parameter \code{fixtype} enables you to change the type of identification used. The default, 'vb_full',
#' does not require any further
#' information from you in order for the model to be fit. In this version of identification,
#' an unidentified model is run using
#' variational Bayesian inference (see \code{\link[rstan]{vb}}). The function will then select two
#' persons/legislators or items/bills that end up on either end of the ideal point spectrum,
#' and pin their ideal points
#' to those specific values.
#' To control whether persons/legislator or items/bills are constrained,
#' the \code{const_type} can be set to either \code{"persons"} or
#' \code{"items"} respectively.
#' In many situations, it is prudent to select those persons or items
#' ahead of time to pin to specific values. This allows the analyst to
#' be more specific about what type of latent dimension is to be
#' estimated. To do so, the \code{fixtype} option should be set to
#' \code{"prefix"}. The values of the persons/items to be pinned can be passed
#' as character values to \code{restrict_ind_high} and
#' \code{restrict_ind_low} to pin the high/low ends of the latent
#' scale respectively. Note that these should be the actual data values
#' passed to the \code{id_make} function. If you don't pass any values,
#' you will see a prompt asking you to select certain values of persons/items.
#'
#' The pinned values for persons/items are set by default to +1/-1, though
#' this can be changed using the \code{fix_high} and
#' \code{fix_low} options. This pinned range is sufficient to identify
#' all of the models
#' implemented in idealstan, though fiddling with some parameters may be
#' necessary in difficult cases. For time-series models, one of the
#' person ideal point over-time variances is also fixed to .1, a value that
#' can be changed using the option \code{time_fix_sd}.
#' @param idealdata An object produced by the \code{\link{id_make}}
#' containing a score/vote matrix for use for estimation & plotting
#' @param model_type An integer reflecting the kind of model to be estimated.
#' See below.
#' @param inflate_zero If the outcome is distributed as Poisson (count/unbounded integer),
#' setting this to
#' \code{TRUE} will fit a traditional zero-inflated model.
#' To use correctly, the value for
#' zero must be passed as the \code{miss_val} option to \code{\link{id_make}} before
#' running a model so that zeroes are coded as missing data.
#' @param ignore_db If there are multiple time periods (particularly when there are
#' very many time periods), you can pass in a data frame
#' (or tibble) with one row per person per time period and an indicator column
#' \code{ignore} that is equal to 1 for periods that should be considered in sample
#' and 0 for periods for periods that should be considered out of sample. This is
#' useful for excluding time periods from estimation for persons when they could not
#' be present, i.e. such as before entrance into an organization or following death.
#' If \code{ignore} equals 0, the person's ideal point is estimated as a standard Normal
#' draw rather than an auto-correlated parameter, reducing computational
#' burden substantially.
#' Note that there can only be one pre-sample period of 0s, one in-sample period of 1s,
#' and one post-sample period of 0s. Multiple in-sample periods cannot be interspersed
#' with out of sample periods. The columns must be labeled as \code{person_id},
#' \code{time_id} and \code{ignore} and must match the formatting of the columns
#' fed to the \code{id_make} function.
#' @param keep_param A list with logical values for different categories of paremeters which
#' should/should not be kept following estimation. Can be any/all of \code{person_int} for
#' the person-level intercepts (static ideal points),
#' \code{person_vary} for person-varying ideal points,
#' \code{item} for observed item parameters (discriminations/intercepts),
#' \code{item_miss} for missing item parameters (discriminations/intercepts),
#' and \code{extra} for other parameters (hierarchical covariates, ordinal intercepts, etc.).
#' Takes the form \code{list(person_int=TRUE,person_vary=TRUE,item=TRUE,item_miss=TRUE,extra=TRUE)}.
#' If any are missing in the list, it is assumed that those parameters will be excluded.
#' If \code{NULL} (default), will save all parameters in output.
#' @param grainsize The grainsize parameter for the \code{reduce_sum}
#' function used for within-chain parallelization. The default is 1,
#' which means 1 chunk (item or person) per core. Set to -1. to use
#' @param map_over_id This parameter identifies which ID variable to use to construct the
#' shards for within-chain parallelization. It defaults to \code{"persons"} but can also take
#' a value of \code{"items"}. It is recommended to select whichever variable has more
#' distinct values to improve parallelization.
#' @param vary_ideal_pts Default \code{'none'}. If \code{'random_walk'}, \code{'AR1'} or
#' \code{'GP'}, a
#' time-varying ideal point model will be fit with either a random-walk process, an
#' AR1 process or a Gaussian process. See documentation for more info.
#' @param use_subset Whether a subset of the legislators/persons should be used instead of the full response matrix
#' @param sample_it Whether or not to use a random subsample of the response matrix. Useful for testing.
#' @param subset_group If person/legislative data was included in the \code{\link{id_make}} function, then you can subset by
#' any value in the \code{$group} column of that data if \code{use_subset} is \code{TRUE}.
#' @param subset_person A list of character values of names of persons/legislators to use to subset if \code{use_subset} is
#' \code{TRUE} and person/legislative data was included in the \code{\link{id_make}} function with the required \code{$person.names}
#' column
#' @param sample_size If \code{sample_it} is \code{TRUE}, this value reflects how many legislators/persons will be sampled from
#' the response matrix
#' @param nchains The number of chains to use in Stan's sampler. Minimum is one. See \code{\link[rstan]{stan}} for more info.
#' @param niters The number of iterations to run Stan's sampler. Shouldn't be set much lower than 500. See \code{\link[rstan]{stan}} for more info.
#' @param use_vb Whether or not to use Stan's variational Bayesian inference engine instead of full Bayesian inference. Pros: it's much faster.
#' Cons: it's not quite as accurate. See \code{\link[rstan]{vb}} for more info.
#' @param tol_rel_obj If \code{use_vb} is \code{TRUE}, this parameter sets the stopping rule for the \code{vb} algorithm.
#' It's default is 0.001. A stricter threshold will require the sampler to run longer but may yield a
#' better result in a difficult model with highly correlated parameters. Lowering the threshold should work fine for simpler
#' models.
#' @param warmup The number of iterations to use to calibrate Stan's sampler on a given model. Shouldn't be less than 100.
#' See \code{\link[rstan]{stan}} for more info.
#' @param ncores The number of cores in your computer to use for parallel processing in the Stan engine.
#' See \code{\link[rstan]{stan}} for more info. If \code{within_chain} is set to
#' \code{"threads"}, this parameter will determine the number of threads
#' (independent processes) used for within-chain parallelization.
#' @param fixtype Sets the particular kind of identification used on the model, could be either 'vb_full'
#' (identification provided exclusively by running a variational identification model with no prior info), or
#' 'prefix' (two indices of ideal points or items to fix are provided to
#' options \code{restrict_ind_high} and \code{restrict_ind_low}).
#' See details for more information.
#' @param prior_fit If a previous \code{idealstan} model was fit \emph{with the same} data, then the same
#' identification constraints can be recycled from the prior fit if the \code{idealstan} object is passed
#' to this option. Note that means that all identification options, like \code{restrict_var}, will also
#' be the same
#' @param mpi_export If \code{within_chains="mpi"}, this parameter should refer to the
#' directory where the necessary data and Stan code will be exported to. If missing,
#' an interactive dialogue will prompt the user for a directory.
#' @param id_refresh The number of times to report iterations from the variational run used to
#' identify models. Default is 0 (nothing output to console).
#' @param sample_stationary If \code{TRUE}, the AR(1) coefficients in a time-varying model will be
#' sampled from an unconstrained space and then mapped back to a stationary space. Leaving this \code{TRUE} is
#' slower but will work better when there is limited information to identify a model. If used, the
#' \code{ar_sd} parameter should be increased to 5 to allow for wider sampling in the unconstrained space.
#' @param ar_sd If an AR(1) model is used, this defines the prior scale of the Normal distribution. A lower number
#' can help
#' identify the model when there are few time points.
#' @param use_groups If \code{TRUE}, group parameters from the person/legis data given in \code{\link{id_make}} will be
#' estimated instead of individual parameters.
#' @param const_type Whether \code{"persons"} are the parameters to be
#' fixed for identification (the default) or \code{"items"}. Each of these
#' pinned parameters should be specified to \code{fix_high} and \code{fix_low}
#' if \code{fixtype} equals \code{"prefix"}, otherwise the model will
#' select the parameters to pin to fixed values.
#' @param restrict_ind_high If \code{fixtype} is not "vb_full", the character value or numerical index
#' of a legislator/person or bill/item to pin to a high value (default +1).
#' @param restrict_ind_low If \code{fixtype} is not "vb_full", the character value or numerical index of a
#' legislator/person or bill/item to pin to a low value (default -1).
#' @param fix_high The value of which the high fixed ideal point/item should be
#' fixed to. Default is +1.
#' @param fix_low The value of which the high fixed ideal point/item should be
#' fixed to. Default is -1.
#' @param discrim_reg_sd Set the prior standard deviation of the bimodal prior for the discrimination parameters for the non-inflated model.
#' @param discrim_miss_sd Set the prior standard deviation of the bimodal prior for the discrimination parameters for the inflated model.
#' @param person_sd Set the prior standard deviation for the legislators (persons) parameters
#' @param time_fix_sd The variance of the over-time component of the first person/legislator
#' is fixed to this value as a reference.
#' Default is 0.1.
#' @param boundary_prior If your time series has very low variance (change over time),
#' you may want to use this option to put a boundary-avoiding inverse gamma prior on
#' the time series variance parameters if your model has a lot of divergent transitions.
#' To do so, pass a list with a element called
#' \code{beta} that signifies the rate parameter of the inverse-gamma distribution.
#' For example, try \code{boundary_prior=list(beta=1)}. Increasing the value of \code{beta}
#' will increase the "push" away from zero. Setting it too high will result in
#' time series that exhibit a lot of "wiggle" without much need.
#' @param time_center_cutoff The number of time points above which
#' the model will employ a centered time series approach for AR(1)
#' and random walk models. Below this number the model will employ a
#' non-centered approach. The default is 50 time points, which is
#' relatively arbitrary and higher values may be better if sampling
#' quality is poor above the threshold.
#' @param diff_reg_sd Set the prior standard deviation for the bill (item) intercepts for the non-inflated model.
#' @param diff_miss_sd Set the prior standard deviation for the bill (item) intercepts for the inflated model.
#' @param restrict_sd_high Set the prior standard deviation for pinned parameters. This has a default of
#' 0.01, but could be set lower if the data is really large.
#' @param restrict_sd_low Set the prior standard deviation for pinned parameters. This has a default of
#' 0.01, but could be set lower if the data is really large.
#' @param gp_sd_par The upper limit on allowed residual variation of the Gaussian process
#' prior. Increasing the limit will permit the GP to more closely follow the time points,
#' resulting in much sharper bends in the function and potentially oscillation.
#' @param gp_num_diff The number of time points to use to calculate the length-scale prior
#' that determines the level of smoothness of the GP time process. Increasing this value
#' will result in greater smoothness/autocorrelation over time by selecting a greater number
#' of time points over which to calculate the length-scale prior.
#' @param gp_m_sd_par The upper limit of the marginal standard deviation of the GP time
#' process. Decreasing this value will result in smoother fits.
#' @param gp_min_length The minimum value of the GP length-scale parameter. This is a hard
#' lower limit. Increasing this value will force a smoother GP fit. It should always be less than
#' \code{gp_num_diff}.
#' @param cmdstan_path_user Default is NULL, and so will default to whatever is set in
#' \code{cmdstanr} package. Specify a file path here to use a different \code{cmdtstan}
#' installation.
#' @param gpu Whether a GPU is available to speed computation (primarily for GP
#' time-varying models).
#' @param save_files The location to save CSV files with MCMC draws from \code{cmdstanr}.
#' The default is \code{NULL}, which will use a folder in the package directory.
#' @param pos_discrim Whether all discrimination parameters should be constrained to be
#' positive. If so, the model reduces to a traditional IRT model in which all items
#' positively predict ability. Setting this to \code{TRUE} will also eliminate the need
#' to use other parameters for identification, though the options should still be
#' specified to prevent errors.
#' @param het_var Whether to use a separate variance parameter for each item if using
#' Normal or Log-Normal distributions that have variance parameters. Defaults to TRUE and
#' should be set to FALSE only if all items have a similar variance.
#' @param ... Additional parameters passed on to Stan's sampling engine. See \code{\link[rstan]{stan}} for more information.
#' @return A fitted \code{\link{idealstan}} object that contains posterior samples of all parameters either via full Bayesian inference
#' or a variational approximation if \code{use_vb} is set to \code{TRUE}. This object can then be passed to the plotting functions for further analysis.
#' @seealso \code{\link{id_make}} for pre-processing data,
#' \code{\link{id_plot_legis}} for plotting results,
#' \code{\link{summary}} for obtaining posterior quantiles,
#' \code{\link{posterior_predict}} for producing predictive replications.
#' @examples
#' # First we can simulate data for an IRT 2-PL model that is inflated for missing data
#' library(ggplot2)
#' library(dplyr)
#'
#' # This code will take at least a few minutes to run
#' \dontrun{
#' bin_irt_2pl_abs_sim <- id_sim_gen(model_type='binary',inflate=T)
#'
#' # Now we can put that directly into the id_estimate function
#' # to get full Bayesian posterior estimates
#' # We will constrain discrimination parameters
#' # for identification purposes based on the true simulated values
#'
#' bin_irt_2pl_abs_est <- id_estimate(bin_irt_2pl_abs_sim,
#' model_type=2,
#' restrict_ind_high =
#' sort(bin_irt_2pl_abs_sim@simul_data$true_person,
#' decreasing=TRUE,
#' index=TRUE)$ix[1],
#' restrict_ind_low =
#' sort(bin_irt_2pl_abs_sim@simul_data$true_person,
#' decreasing=FALSE,
#' index=TRUE)$ix[1],
#' fixtype='prefix',
#' ncores=2,
#' nchains=2)
#'
#' # We can now see how well the model recovered the true parameters
#'
#' id_sim_coverage(bin_irt_2pl_abs_est) %>%
#' bind_rows(.id='Parameter') %>%
#' ggplot(aes(y=avg,x=Parameter)) +
#' stat_summary(fun.args=list(mult=1.96)) +
#' theme_minimal()
#' }
#'
#' # In most cases, we will use pre-existing data
#' # and we will need to use the id_make function first
#' # We will use the full rollcall voting data
#' # from the 114th Senate as a rollcall object
#'
#' data('senate114')
#'
#' # Running this model will take at least a few minutes, even with
#' # variational inference (use_vb=T) turned on
#' \dontrun{
#'
#' to_idealstan <- id_make(score_data = senate114,
#' outcome = 'cast_code',
#' person_id = 'bioname',
#' item_id = 'rollnumber',
#' group_id= 'party_code',
#' time_id='date',
#' high_val='Yes',
#' low_val='No',
#' miss_val='Absent')
#'
#' sen_est <- id_estimate(to_idealstan,
#' model_type = 2,
#' use_vb = TRUE,
#' fixtype='prefix',
#' restrict_ind_high = "BARRASSO, John A.",
#' restrict_ind_low = "WARREN, Elizabeth")
#'
#' # After running the model, we can plot
#' # the results of the person/legislator ideal points
#'
#' id_plot_legis(sen_est)
#' }
#'
#' @references \enumerate{
#' \item Clinton, J., Jackman, S., & Rivers, D. (2004). The Statistical Analysis of Roll Call Data. \emph{The American Political Science Review}, 98(2), 355-370. doi:10.1017/S0003055404001194
#' \item Bafumi, J., Gelman, A., Park, D., & Kaplan, N. (2005). Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation. \emph{Political Analysis}, 13(2), 171-187. doi:10.1093/pan/mpi010
#' \item Kubinec, R. "Generalized Ideal Point Models for Time-Varying and Missing-Data Inference". Working Paper.
#' \item Betancourt, Michael. "Robust Gaussian Processes in Stan". (October 2017). Case Study.
#' }
#' @importFrom stats dnorm dpois model.matrix qlogis relevel rpois update
#' @importFrom utils person
#' @import cmdstanr
#' @export
id_estimate <- function(idealdata=NULL,model_type=2,
inflate_zero=FALSE,
vary_ideal_pts='none',
keep_param=NULL,
grainsize=1,
mpi_export=NULL,
use_subset=FALSE,sample_it=FALSE,
subset_group=NULL,subset_person=NULL,sample_size=20,
nchains=4,niters=1000,use_vb=FALSE,
ignore_db=NULL,
restrict_ind_high=NULL,
fix_high=1,
fix_low=(-1),
restrict_ind_low=NULL,
fixtype='prefix',
const_type="persons",
id_refresh=0,
prior_fit=NULL,
warmup=1000,ncores=4,
use_groups=FALSE,
discrim_reg_sd=2,
discrim_miss_sd=2,
person_sd=3,
time_fix_sd=.1,
time_var=10,
ar1_up=1,
ar1_down=0,
boundary_prior=NULL,
time_center_cutoff=50,
restrict_var=FALSE,
sample_stationary=FALSE,
ar_sd=2,
diff_reg_sd=1,
diff_miss_sd=1,
restrict_sd_high=0.01,
restrict_sd_low=0.01,
tol_rel_obj=.001,
gp_sd_par=.025,
gp_num_diff=3,
gp_m_sd_par=0.3,
gp_min_length=0,
cmdstan_path_user=NULL,
gpu=FALSE,
map_over_id="persons",
save_files=NULL,
pos_discrim=FALSE,
het_var=TRUE,
...) {
# check to make sure cmdstanr is working
if(is.null(cmdstanr::cmdstan_version())) {
print("You need to install cmdstan with cmdstanr to compile models. Use the function install_cmdstan() in the cmdstanr package.")
}
if(use_subset==TRUE || sample_it==TRUE) {
idealdata <- subset_ideal(idealdata,use_subset=use_subset,sample_it=sample_it,subset_group=subset_group,
subset_person=subset_person,sample_size=sample_size)
}
# set path if user specifies
if(!is.null(cmdstan_path_user))
set_cmdstan_path(cmdstan_path_user)
if(!file.exists(system.file("stan_files","irt_standard",
package="idealstan"))) {
print("Compiling model. Will take some time as this is the first time package is used.")
print("Have you thought about donating to relief for victims for Yemen's famine?")
print("Check out https://www.unicef.org/emergencies/yemen-crisis for more info.")
}
# stan_code <- system.file("stan_files","irt_standard.stan",
# package="idealstan")
stan_code_map <- system.file("stan_files","irt_standard_map.stan",
package="idealstan")
# stan_code_gpu <- system.file("stan_files","irt_standard_gpu.stan",
# package="idealstan")
idealdata@stanmodel_map <- stan_code_map %>%
cmdstan_model(include_paths=dirname(stan_code_map),
cpp_options = list(stan_threads = TRUE,
STAN_CPP_OPTIMS=TRUE))
# idealdata@stanmodel_gpu <- stan_code_gpu %>%
# cmdstan_model(include_paths=dirname(stan_code_map),
# cpp_options = list(stan_threads = TRUE,
# STAN_CPP_OPTIMS=TRUE,
# STAN_OPENCL=TRUE,
# opencl_platform_id = 0,
# opencl_device_id = 0))
#Using an un-identified model with variational inference, find those parameters that would be most useful for
#constraining/pinning to have an identified model for full Bayesian inference
# change time IDs if non time-varying model is being fit
if(vary_ideal_pts=='none') {
idealdata@score_matrix$time_id <- 1
# make sure that the covariate arrays are only one time point
#idealdata@person_cov <- idealdata@person_cov[1,,,drop=F]
#idealdata@group_cov <- idealdata@group_cov[1,,,drop=F]
}
vary_ideal_pts <- switch(vary_ideal_pts,
none=1,
random_walk=2,
AR1=3,
GP=4)
# number of combined posterior models
mod_count <- length(unique(idealdata@score_matrix$model_id))
if(length(unique(idealdata@score_matrix$ordered_id))>1) {
mod_count <- mod_count + length(unique(idealdata@score_matrix$ordered_id)) - 1
}
# set GP parameters
# check if varying model IDs exist and replace with model type
# if not
if(idealdata@score_matrix$model_id[1]=="missing") {
idealdata@score_matrix$model_id <- model_type
idealdata@score_matrix$discrete <- as.numeric(model_type %in% c(1,2,3,4,5,6,7,8,13,14))
} else {
if(!is.numeric(idealdata@score_matrix$model_id)) {
stop("Column model_id in the data is not a numeric series of
integers. Please pass a numeric value in the id_make function
for model_id based on the available model types.")
}
}
if((!(all(c(restrict_ind_high,restrict_ind_low) %in% unique(idealdata@score_matrix$person_id))) && !use_groups) && const_type=="persons") {
stop("Your restricted persons/items are not in the data.")
} else if(!(all(c(restrict_ind_high,restrict_ind_low) %in% unique(idealdata@score_matrix$item_id))) && const_type=="items") {
stop("Your restricted persons/items are not in the data.")
} else if((!(all(c(restrict_ind_high,restrict_ind_low) %in% unique(idealdata@score_matrix$group_id))) && use_groups) && const_type=="persons") {
stop("Your restricted persons/items are not in the data.")
}
# check if ordinal categories exist in the data if model_id>1
if(any(c(3,4,5,6) %in% idealdata@score_matrix$model_id)) {
if(is.null(idealdata@score_matrix$ordered_id) || !is.numeric(idealdata@score_matrix$ordered_id)) {
stop("If you have an ordered categorical variable in a multi-distribution model, you must include a column in the data with the number of ordered categories for each row in the data.\n
See id_make documentation for more info.")
}
} else {
idealdata@score_matrix$ordered_id <- 0
}
# use either row numbers for person/legislator IDs or use group IDs (static or time-varying)
if(use_groups==T) {
legispoints <- as.numeric(idealdata@score_matrix$group_id)
} else {
legispoints <- as.numeric(idealdata@score_matrix$person_id)
}
billpoints <- as.numeric(idealdata@score_matrix$item_id)
timepoints <- as.numeric(factor(as.numeric(idealdata@score_matrix$time_id)))
modelpoints <- as.integer(idealdata@score_matrix$model_id)
ordered_id <- as.integer(idealdata@score_matrix$ordered_id)
# for gaussian processes, need actual time values
time_ind <- switch(class(idealdata@score_matrix$time_id)[1],
factor=unique(as.numeric(idealdata@score_matrix$time_id)),
Date=unique(as.numeric(idealdata@score_matrix$time_id)),
POSIXct=unique(as.numeric(idealdata@score_matrix$time_id)),
POSIXlt=unique(as.numeric(idealdata@score_matrix$time_id)),
numeric=unique(idealdata@score_matrix$time_id),
integer=unique(idealdata@score_matrix$time_id))
# now need to generate max/min values for empirical length-scale prior in GP
if(gp_min_length>=gp_num_diff[1]) {
stop('The parameter gp_min_length cannot be equal to or greater than gp_num_diff[1].')
}
if(("outcome_cont" %in% names(idealdata@score_matrix)) && ("outcome_disc" %in% names(idealdata@score_matrix))) {
Y_int <- idealdata@score_matrix$outcome_disc
Y_cont <- idealdata@score_matrix$outcome_cont
} else if ("outcome_cont" %in% names(idealdata@score_matrix)) {
Y_int <- array(0)
Y_cont <- idealdata@score_matrix$outcome_cont
} else {
Y_cont <- array(0)
Y_int <- idealdata@score_matrix$outcome_disc
}
#Remove NA values, which should have been coded correctly in the make_idealdata function
# need to have a different way to remove missing values if multiple
# posteriors used
# set values for length of discrete/continuous outcomes
remove_list <- .remove_nas(Y_int,
Y_cont,
discrete=idealdata@score_matrix$discrete,
legispoints,
billpoints,
timepoints,
modelpoints,
ordered_id,
idealdata,
time_ind=as.array(time_ind),
time_proc=vary_ideal_pts,
ar_sd=ar_sd,
gp_sd_par=gp_sd_par,
num_diff=gp_num_diff,
m_sd_par=gp_m_sd_par,
min_length=gp_min_length,
const_type=switch(const_type,
persons=1L,
items=2L),
discrim_reg_sd=discrim_reg_sd,
discrim_miss_sd=discrim_miss_sd,
diff_reg_sd=diff_reg_sd,
diff_miss_sd=diff_miss_sd,
legis_sd=person_sd,
restrict_sd_high=restrict_sd_high,
restrict_sd_low=restrict_sd_low,
restrict_high=idealdata@restrict_ind_high,
restrict_low=idealdata@restrict_ind_low,
fix_high=idealdata@restrict_num_high,
fix_low=idealdata@restrict_num_low)
# need to create new data if map_rect is in operation
# and we have missing values / ragged arrays
out_list <- .make_sum_vals(idealdata@score_matrix,map_over_id,use_groups=use_groups,
remove_nas=remove_list$remove_nas)
sum_vals <- out_list$sum_vals
# need number of shards
S <- nrow(sum_vals)
# check for heterogenous variances
if(het_var) {
num_var <- length(unique(remove_list$billpoints[remove_list$modelpoints %in% c(9,10,11,12)]))
mod_items <- tibble(model_id=remove_list$modelpoints,
item_id=remove_list$billpoints) %>%
distinct
mod_items <- mutate(mod_items,cont=model_id %in% c(9,10,11,12)) %>%
group_by(cont) %>%
mutate(num_var=1:n())
type_het_var <- arrange(mod_items, item_id) %>% pull(num_var)
} else {
num_var <- 1
type_het_var <- rep(num_var, length(unique(billpoints)))
}
# check for boundary priors
if(!is.null(boundary_prior)) {
if(is.null(boundary_prior$beta)) {
stop("If you want to use a boundary-avoiding prior for time-series variance, please pass a list with an element named beta, i.e. list(beta=1).")
}
if(boundary_prior$beta <= 0) {
stop("Boundary prior beta value must be strictly positive (i.e. greater than 0).")
}
inv_gamma_beta <- boundary_prior$beta
} else {
inv_gamma_beta <- 0
}
if(remove_list$N_cont>0) {
Y_cont <- remove_list$Y_cont[out_list$this_data$orig_order]
} else {
Y_cont <- remove_list$Y_cont
}
if(remove_list$N_int>0) {
Y_int <- remove_list$Y_int[out_list$this_data$orig_order]
} else {
Y_int <- remove_list$Y_int
}
if(!is.null(ignore_db)) {
# check for correct columns
if(!all(c("person_id",
"time_id",
"ignore") %in% names(ignore_db))) {
stop("Ignore DB does not have the three necessary columns: time_id, person_id, and ignore (0 or 1).")
}
ignore_db <- mutate(ungroup(ignore_db),
person_id=as.numeric(person_id),
time_id=as.numeric(factor(as.numeric(ignore_db$time_id))))
# filter out time_ids not in main data
ignore_db <- filter(ignore_db, time_id %in% remove_list$timepoints,
person_id %in% remove_list$legispoints) %>%
select(person_id,time_id,ignore) %>%
arrange(person_id,time_id) %>%
as.matrix
} else {
ignore_db <- matrix(nrow=0,ncol=3)
}
this_data <- list(N=remove_list$N,
N_cont=remove_list$N_cont,
N_int=remove_list$N_int,
Y_int=Y_int,
Y_cont=Y_cont,
y_int_miss=remove_list$y_int_miss,
y_cont_miss=remove_list$y_cont_miss,
num_var=num_var,
type_het_var=type_het_var,
S=nrow(sum_vals),
S_type=as.numeric(map_over_id=="persons"),
T=remove_list$max_t,
num_legis=remove_list$num_legis,
num_bills=remove_list$num_bills,
num_ls=remove_list$num_ls,
num_bills_grm=remove_list$num_bills_grm,
ll=remove_list$legispoints[out_list$this_data$orig_order],
bb=remove_list$billpoints[out_list$this_data$orig_order],
mm=remove_list$modelpoints[out_list$this_data$orig_order],
ignore=as.numeric(nrow(ignore_db)>0),
ignore_db=ignore_db,
mod_count=length(unique(remove_list$modelpoints)),
num_fix_high=as.integer(1),
num_fix_low=as.integer(1),
tot_cats=length(remove_list$n_cats_rat),
n_cats_rat=remove_list$n_cats_rat,
n_cats_grm=remove_list$n_cats_grm,
order_cats_rat=remove_list$order_cats_rat[out_list$this_data$orig_order],
order_cats_grm=remove_list$order_cats_grm[out_list$this_data$orig_order],
num_bills_grm=ifelse(any(remove_list$modelpoints %in% c(5,6)),
remove_list$num_bills,0L),
LX=remove_list$LX,
SRX=remove_list$SRX,
SAX=remove_list$SAX,
legis_pred=remove_list$legis_pred[out_list$this_data$orig_order,,drop=FALSE],
srx_pred=remove_list$srx_pred[out_list$this_data$orig_order,,drop=FALSE],
sax_pred=remove_list$sax_pred[out_list$this_data$orig_order,,drop=FALSE],
time=remove_list$timepoints[out_list$this_data$orig_order],
time_proc=vary_ideal_pts,
discrim_reg_sd=discrim_reg_sd,
discrim_abs_sd=discrim_miss_sd,
diff_reg_sd=diff_reg_sd,
diff_abs_sd=diff_miss_sd,
legis_sd=person_sd,
restrict_sd_high=5,
restrict_sd_low=5,
time_sd=time_fix_sd,
time_var_sd=time_var,
ar1_up=ar1_up,
ar1_down=ar1_down,
inv_gamma_beta=inv_gamma_beta,
center_cutoff=as.integer(time_center_cutoff),
restrict_var=restrict_var,
ar_sd=ar_sd,
zeroes=as.numeric(inflate_zero),
time_ind=as.array(time_ind),
time_proc=vary_ideal_pts,
gp_sd_par=gp_sd_par,
m_sd_par=gp_m_sd_par,
min_length=gp_min_length,
id_refresh=id_refresh,
sum_vals=as.matrix(sum_vals),
const_type=switch(const_type,
persons=1L,
items=2L),
restrict_high=1,
restrict_low=2,
fix_high=0,
fix_low=0,
num_diff=gp_num_diff,
pos_discrim=as.integer(pos_discrim),
grainsize=grainsize)
idealdata <- id_model(object=idealdata,fixtype=fixtype,this_data=this_data,
nfix=nfix,restrict_ind_high=restrict_ind_high,
restrict_ind_low=restrict_ind_low,
ncores=ncores,
tol_rel_obj=tol_rel_obj,
use_groups=use_groups,
prior_fit=prior_fit,
fix_high=fix_high,
fix_low=fix_low,
const_type=const_type)
if(("outcome_cont" %in% names(idealdata@score_matrix)) && ("outcome_disc" %in% names(idealdata@score_matrix))) {
Y_int <- idealdata@score_matrix$outcome_disc
Y_cont <- idealdata@score_matrix$outcome_cont
} else if ("outcome_cont" %in% names(idealdata@score_matrix)) {
Y_int <- array(0)
Y_cont <- idealdata@score_matrix$outcome_cont
} else {
Y_cont <- array(0)
Y_int <- idealdata@score_matrix$outcome_disc
}
# need to redo everything post-identification
remove_list <- .remove_nas(Y_int,
Y_cont,
discrete=idealdata@score_matrix$discrete,
legispoints,
billpoints,
timepoints,
modelpoints,
ordered_id,
idealdata,
time_ind=as.array(time_ind),
time_proc=vary_ideal_pts,
ar_sd=ar_sd,
gp_sd_par=gp_sd_par,
m_sd_par=gp_m_sd_par,
min_length=gp_min_length,
const_type=switch(const_type,
persons=1L,
items=2L),
discrim_reg_sd=discrim_reg_sd,
discrim_miss_sd=discrim_miss_sd,
diff_reg_sd=diff_reg_sd,
diff_miss_sd=diff_miss_sd,
legis_sd=person_sd,
restrict_sd_high=restrict_sd_high,
restrict_sd_low=restrict_sd_low,
restrict_high=idealdata@restrict_ind_high,
restrict_low=idealdata@restrict_ind_low,
fix_high=idealdata@restrict_num_high,
fix_low=idealdata@restrict_num_low)
idealdata <- remove_list$idealdata
# need to create new data if map_rect is in operation
# and we have missing values / ragged arrays
out_list <- .make_sum_vals(idealdata@score_matrix,map_over_id,use_groups=use_groups,
remove_nas=remove_list$remove_nas)
sum_vals <- out_list$sum_vals
# make sure data is re-sorted by ID
idealdata@score_matrix <- out_list$this_data
# need number of shards
S <- nrow(sum_vals)
# check for heterogenous variances
if(het_var) {
num_var <- length(unique(remove_list$billpoints[remove_list$modelpoints %in% c(9,10,11,12)]))
mod_items <- tibble(model_id=this_data$mm,
item_id=this_data$bb) %>%
distinct
mod_items <- mutate(mod_items,cont=model_id %in% c(9,10,11,12)) %>%
group_by(cont) %>%
mutate(num_var=1:n())
type_het_var <- arrange(mod_items, item_id) %>% pull(num_var)
} else {
num_var <- 1
type_het_var <- rep(num_var, length(unique(billpoints)))
}
if(remove_list$N_cont>0) {
Y_cont <- remove_list$Y_cont[out_list$this_data$orig_order]
} else {
Y_cont <- remove_list$Y_cont
}
if(remove_list$N_int>0) {
Y_int <- remove_list$Y_int[out_list$this_data$orig_order]
} else {
Y_int <- remove_list$Y_int
}
this_data <- list(N=remove_list$N,
N_cont=remove_list$N_cont,
N_int=remove_list$N_int,
Y_int=Y_int,
Y_cont=Y_cont,
y_int_miss=remove_list$y_int_miss,
y_cont_miss=remove_list$y_cont_miss,
num_var=num_var,
type_het_var=type_het_var,
S=nrow(sum_vals),
S_type=as.numeric(map_over_id=="persons"),
T=remove_list$max_t,
num_legis=remove_list$num_legis,
num_bills=remove_list$num_bills,
num_ls=remove_list$num_ls,
num_bills_grm=remove_list$num_bills_grm,
ll=remove_list$legispoints[out_list$this_data$orig_order],
bb=remove_list$billpoints[out_list$this_data$orig_order],
mm=remove_list$modelpoints[out_list$this_data$orig_order],
ignore=as.numeric(nrow(ignore_db)>0),
ignore_db=ignore_db,
mod_count=length(unique(remove_list$modelpoints)),
num_fix_high=as.integer(1),
num_fix_low=as.integer(1),
tot_cats=length(remove_list$n_cats_rat),
n_cats_rat=remove_list$n_cats_rat,
n_cats_grm=remove_list$n_cats_grm,
order_cats_rat=remove_list$order_cats_rat[out_list$this_data$orig_order],
order_cats_grm=remove_list$order_cats_grm[out_list$this_data$orig_order],
num_bills_grm=ifelse(any(remove_list$modelpoints %in% c(5,6)),
remove_list$num_bills,0L),
LX=remove_list$LX,
SRX=remove_list$SRX,
SAX=remove_list$SAX,
legis_pred=remove_list$legis_pred[out_list$this_data$orig_order,,drop=FALSE],
srx_pred=remove_list$srx_pred[out_list$this_data$orig_order,,drop=FALSE],
sax_pred=remove_list$sax_pred[out_list$this_data$orig_order,,drop=FALSE],
time=remove_list$timepoints[out_list$this_data$orig_order],
time_proc=vary_ideal_pts,
discrim_reg_sd=discrim_reg_sd,
discrim_abs_sd=discrim_miss_sd,
diff_reg_sd=diff_reg_sd,
diff_abs_sd=diff_miss_sd,
legis_sd=person_sd,
restrict_sd_high=restrict_sd_high,
restrict_sd_low=restrict_sd_low,
time_sd=time_fix_sd,
time_var_sd=time_var,
ar1_up=ar1_up,
ar1_down=ar1_down,
inv_gamma_beta=inv_gamma_beta,
center_cutoff=as.integer(time_center_cutoff),
restrict_var=restrict_var,
ar_sd=ar_sd,
zeroes=as.numeric(inflate_zero),
time_ind=as.array(time_ind),
time_proc=vary_ideal_pts,
gp_sd_par=gp_sd_par,
m_sd_par=gp_m_sd_par,
min_length=gp_min_length,
id_refresh=id_refresh,
sum_vals=as.matrix(sum_vals),
const_type=switch(const_type,
persons=1L,
items=2L),
restrict_high=idealdata@restrict_ind_high,
restrict_low=idealdata@restrict_ind_low,
fix_high=idealdata@restrict_num_high,
fix_low=idealdata@restrict_num_low,
num_diff=gp_num_diff,
pos_discrim=as.integer(pos_discrim),
grainsize=grainsize)
# need to save n_cats
idealdata@n_cats_rat <- remove_list$n_cats_rat
idealdata@n_cats_grm <- remove_list$n_cats_grm
idealdata@order_cats_rat <- remove_list$order_cats_rat
idealdata@order_cats_grm <- remove_list$order_cats_grm
outobj <- sample_model(object=idealdata,nchains=nchains,niters=niters,warmup=warmup,ncores=ncores,
this_data=this_data,use_vb=use_vb,
gpu=gpu,
save_files=save_files,
keep_param=keep_param,
tol_rel_obj=tol_rel_obj,within_chain=within_chain,
...)
# deprecate: reduce_sum doesn't work in mpi
# else if(within_chain=="mpi") {
#
# # we can't do mpi natively, so let's export to cmdstan to run on a cluster
#
# if(is.null(mpi_export)) {
# print("Please choose a folder to store the exported data and stan code for MPI.")
# mpi_export <- rstudioapi::selectDirectory(caption="Choose a directory to export code and data for running MPI in cmdstan:")
# }
#
# write_stan_json(this_data,paste0(mpi_export,"/idealstan_mpi_data.json"))
#
# writeLines(idealdata@stanmodel$code(),con=file(paste0(mpi_export,"/idealstan_stan_code.stan")))
#
# saveRDS(idealdata,paste0(mpi_export,"/idealdata_object.rds"))
#
# # save extra necessary parameters
#
# saveRDS(list(vary_ideal_pts=vary_ideal_pts,
# use_groups=use_groups,
# model_type=model_type,
# use_vb=use_vb,
# map_over_id=map_over_id,
# time_fix_sd=time_fix_sd),
# paste0(mpi_export,"/extra_params.rds"))
#
# # need all the chunks
# dir.create(paste0(mpi_export,"/chunks"),showWarnings=FALSE)
# chunks <- system.file("stan_files/chunks",package="idealstan")
# chunks_files <- list.files(chunks,full.names=T)
# file.copy(chunks_files,to=paste0(mpi_export,"/chunks"),overwrite = T)
#
# return("You will need to run the model in cmdstan yourself and load the resulting data back in to R with the id_load_mpi function. See vignette for details.")
#
# }
outobj@model_type <- model_type
outobj@time_proc <- vary_ideal_pts
outobj@use_groups <- use_groups
outobj@map_over_id <- map_over_id
outobj@time_fix_sd <- time_fix_sd
outobj@restrict_var <- restrict_var
# need to recalculate legis points if time series used
if(this_data$T>1 && ((!is.null(keep_param$person_vary) && keep_param$person_vary) || is.null(keep_param))) {
outobj@time_varying <- try(.get_varying(outobj))
}
return(outobj)
}
#' DEPRECATED: Reconstitute an idealstan object after an MPI/cluster run
#'
#' This convenience function takes as input a file location storing the results of a
#' MPI/cluster run.
#'
#' Given the CSV output from cmdstan and the original files exported
#' from the \code{id_estimate} function, this function will create an \code{idealstan}
#' object which can be further analyzed with other \code{idealstan} package helper
#' functions.
#'
#' @param object A fitted \code{idealstan} object (see \code{\link{id_estimate}})
#' @param file_loc A string with the location of the original files exported by
#' \code{id_estimate}
#' @param csv_name A vector of character names of CSV files with posterior estimates from
#' \code{cmdstan}. Should be located in the same place as \code{file_loc}.
#' @importFrom posterior summarize_draws
id_rebuild_mpi <- function(file_loc=NULL,
csv_name=NULL) {
if(is.null(file_loc)) {
file_loc <- rstudioapi::selectDirectory(caption="Choose the directory containing relevant files to rebuild object:")
}
all_csvs <- read_cmdstan_csv(paste0(file_loc,"/",csv_name))
object <- readRDS(paste0(file_loc,"/","idealdata_object.rds"))
extra_params <- readRDS(paste0(file_loc,"/",extra_params.rds))
outobj <- new('idealstan',
score_data=object,
model_code=readLines(paste0(file_loc,"/","idealstan_stan_code.stan")),
stan_samples=all_csvs,
use_vb=extra_params$use_vb)
# add safe summaries
to_sum <- summarize_draws(outobj@stan_samples$post_warmup_draws)
to_sum <- select(to_sum,variable,lower="q95",mean,median,upper="q5",rhat,ess_bulk,ess_tail)
outobj@summary <- to_sum
outobj@mpi <- T
outobj@model_type <- extra_params$model_type
outobj@time_proc <- extra_params$vary_ideal_pts
outobj@use_groups <- extra_params$use_groups
outobj@map_over_id <- extra_params$map_over_id
outobj@time_fix_sd <- extra_params$time_fix_sd
return(outobj)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.