knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE
)
options(digits = 3)
The devtools package can be used to install the TSI package directly from GitHub:
library(devtools)
install_github('mmansolf/TSI')
library(TSI)
library(mice)
First, let's simulate some data to demonstrate true score imputation for classical test theory! Here I'm simulating two samples of the true score x, with a mean difference equal to d:
#########################
# SIMULATION PARAMETERS #
#########################
set.seed(0)
n = 1000      # sample size
ratio = 0.6   # ratio of true score variance to observed score variance; the reliability
d = 0.5       # mean difference
mean_x = 0    # mean of true score
mean_w = 0    # mean of observed score
var_x = 1     # variance of true score
var_w = 1     # variance of observed score
#################
# SIMULATE DATA #
#################
x1 = rnorm(n, mean_x, sqrt(var_x))
e1 = rnorm(n, 0, sqrt(var_x * (1 - ratio) / ratio))
w1 = x1 + e1
x2 = rnorm(n, mean_x + d * sqrt(var_x), sqrt(var_x))
e2 = rnorm(n, 0, sqrt(var_x * (1 - ratio) / ratio))
w2 = x2 + e2
Before imputing, notice that we don't yet have a variable to relate to the true score in an imputation model, only two groups. Let's create a dummy-coded variable y, where 0 corresponds to the first group and 1 corresponds to the second group, so we have a data set to work with:
w = c(w1, w2)
y = c(rep(0, n), rep(1, n))
data_ctt = data.frame(w, y)
head(data_ctt)
You can access these data directly as the data_ctt object in the TSI package.
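If you'd rather not re-run the simulation above, the packaged copy can be loaded directly; this assumes the data set ships in the usual data/ directory of the package:

data("data_ctt", package = "TSI")
head(data_ctt)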
TSI function

I'll present the imputation code, then explain the pieces:
set.seed(0)
mice_data = TSI(data_ctt,
                os_names = "w",
                score_types = "CTT",
                reliability = ratio,
                mean = 0,
                var_ts = 1,
                mice_args = list(m = 5, printFlag = F))
mice_data
For classical test theory, the following are required:
- os_names: Character vector containing the names of the observed scores. Here, there is just one observed score, "w".
- score_types: Character vector of models used to generate each score. Here, the score was generated under classical test theory ("CTT").
- reliability: Numeric vector of reliability values for each observed score. Here, I copied this from the ratio variable defined above.
- mean and var_ts: Numeric vectors of means and variances of the underlying true scores. Here, x was simulated with a mean of 0 and variance of 1.
- mice_args: Additional arguments passed to the mice function. Here, I'm specifying 5 imputations and to not print the trace output during execution.

Observe that a new variable, true_w, was generated as part of this call. This variable contains the imputed true score values. We can quickly look at the first imputation using the complete function in the mice package:
head(complete(mice_data))
This example is one of the examples available when running example("TSI").
If true score imputation is required for multiple scores, all such variables must be specified in os_names. If these variables share imputation characteristics (score_types, reliability, mean, and/or var_ts), a single value can be specified for each, and this input is recycled for all observed scores specified in os_names.
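For illustration, a call with two observed scores sharing these characteristics might look like the following sketch; the data frame data_two and its columns w1 and w2 are hypothetical, so this is not meant to be run as-is:

# Hypothetical: w1 and w2 are observed scores in a data frame data_two;
# the single values below are recycled across both entries of os_names.
mice_data_two = TSI(data_two,
                    os_names = c("w1", "w2"),
                    score_types = "CTT",
                    reliability = 0.6,
                    mean = 0,
                    var_ts = 1,
                    mice_args = list(m = 5, printFlag = F))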
Instead of specifying mean and var_ts, another option is to specify the metric of the true score if it is a z score (mean = 0, SD = 1; metrics = "z"), a T score (mean = 50, SD = 10; metrics = "T"), or a standard score (mean = 100, SD = 15; metrics = "standard"). Here's an example, using the z score metric because the underlying x has a mean of 0 and SD of 1:
set.seed(0)
mice_data = TSI(data_ctt,
                os_names = "w",
                score_types = "CTT",
                reliability = ratio,
                metrics = "z",
                mice_args = list(m = 5, printFlag = F))
mice_data
Once data are generated using TSI, the resulting mids object can be analyzed using multiple imputation techniques. Here, we can use the with functionality of mice to predict the true score from y:
pool(with(mice_data, lm(true_w~y)))
Comparing this result to that from using the observed score w:
pool(with(mice_data, lm(w~y)))
The results are roughly the same because, under classical test theory, measurement error in the outcome does not bias the slope when the predictor (here, y) is measured without error. One statistic that is affected is the standard deviation of the score. To obtain this, we obtain a list of completed data sets mice_data by using the complete function with "all" as the second input,
mice_data = complete(mice_data, "all")
calculate the standard deviation of each column in each imputed data set, stored as sds,
sds = sapply(mice_data, function(d)apply(d, 2, sd))
and calculate the mean standard deviation across imputations:
apply(sds, 1, mean)
The standard deviation of w is inflated compared to that of the underlying true score, while the standard deviation of the imputed true_w is unbiased.
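This inflation is expected from the simulation settings: because w = x + e with independent error, var(w) = var(x) + var(e) = var_x / ratio, so the standard deviation of w should be near sqrt(1 / 0.6), roughly 1.29. A quick check using the parameters defined above:

# Expected SD of the observed score w implied by the simulation parameters
sqrt(var_x / ratio)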
Because true score imputation can be used in the mice package, it can be used in tandem with conventional multiple imputation. To illustrate, I simulated two uncorrelated, normally distributed variables, x and m, and used them to generate a new variable y according to a linear regression model:
set.seed(1)
x = rnorm(n, 0, 1)
m = rnorm(n, 0, 1)
y = 1 + 0.4 * x + 0.6 * m + rnorm(n, 0, 1)
Thus, y is related to x and m by the equation y = 1 + 0.4x + 0.6m + e, where e ~ N(0, 1), in a prototypical multiple regression configuration.
Next, I used calibrated item parameters from the NIH Toolbox emotion battery norming study (Kupst et al., 2015) to simulate item responses on the 10-item Perceived Stress Scale for each of x and y. I then scored these item responses using the same item parameters, yielding a set of T-scores and standard errors (SEs) for each record. These two quantities are provided routinely by the HealthMeasures scoring service, and while the standard errors are more often used in assessing individual change, true score imputation allows them to be used in statistical analysis of data sets with multiple observations.
Lastly, I used a missing completely at random (MCAR) missing data model to set approximately 10% of the values of m and the T-score/SE pairs to missing. This "amputation" was conducted separately for m, for the T-score/SE pairs for x, and for the T-score/SE pairs for y.
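As a rough sketch of this kind of amputation (not the exact code used to create data_eap; the complete data frame dat and its column names are assumed here), each variable or T-score/SE pair gets its own MCAR missingness indicator:

# Hypothetical MCAR amputation: dat is a complete data frame with columns
# Fx, SE.Fx, Fy, SE.Fy, and m; roughly 10% of each is set to missing.
set.seed(2)
miss_m = runif(nrow(dat)) < 0.1
miss_x = runif(nrow(dat)) < 0.1
miss_y = runif(nrow(dat)) < 0.1
dat$m[miss_m] = NA
dat$Fx[miss_x] = NA
dat$SE.Fx[miss_x] = NA  # T-score and SE for x go missing together
dat$Fy[miss_y] = NA
dat$SE.Fy[miss_y] = NA  # T-score and SE for y go missing together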
We can access the simulated data using the data_eap data set in the TSI package.
head(data_eap)
TSI function

The code for imputation with two EAP-scored variables is nearly identical to that above for CTT:
set.seed(0)
mice_data = TSI(data_eap,
                os_names = c("Fx", "Fy"),
                se_names = c("SE.Fx", "SE.Fy"),
                metrics = "T",
                score_types = "EAP",
                separated = T,
                ts_names = c("Tx", "Ty"),
                mice_args = list(m = 5, maxit = 5, printFlag = F))
mice_data
head(complete(mice_data))
A few differences:
- The imputed true score variables are named Tx and Ty. These names were specified using the optional ts_names input to TSI, which works as follows:
  - If ts_names is not specified, the prefix "true_" is prepended to each of os_names to produce the names of the resulting imputed true score variables.
  - If ts_names is specified, these names are used instead, with each element of ts_names denoting the name of the imputed true score variable to be generated from the respective variable in os_names.
- A single value of metrics is recycled across os_names because both Fx and Fy are T scores.
- se_names, which contains the names of the standard error variables corresponding to each observed score in os_names, must be specified for EAP scores and must have the same length as os_names.
- The T-score metric was used, consistent with how the Perceived Stress Scale is scored in the NIH Toolbox.
- score_types contains "EAP" because the NIH Toolbox scores the Perceived Stress Scale using EAP scoring.
- separated specifies whether true score imputation uses an average standard error (separated = F), which runs faster but doesn't account for differential measurement error of the observed scores for each respondent, or separate standard errors for each value of each observed score (separated = T), which runs slower but accounts for differential measurement error.

Note that although there are two variables measured with error (Fx and Fy), only one value is provided for metrics, score_types, and separated, resulting in these settings being applied to all variables measured with error.
Once data are generated using TSI, the resulting mids object can once again be analyzed using multiple imputation techniques. For reference, estimating this model using the true scores, with no missing data, yields:
lm(y~x + m)
Using the observed T scores yields:
pool(with(mice_data, lm(Fy~Fx + m)))
Observe that the regression coefficient associated with Fx, the T-score for x, can be compared directly to that from the true, complete data above, because both T scores are on the same metric. In contrast, the m variable was not linearly transformed, and therefore its estimated coefficient should be divided by 10, because the SD of T scores is 10. The estimate can be placed back on the original metric with round(pool(with(mice_data, lm(Fy~Fx + m)))$pooled[3, "estimate"] / 10, 3). Both regression coefficients are biased.
Finally, using true score imputation yields:
pool(with(mice_data, lm(Ty~Tx + m)))
Bias has been substantially reduced, and the fractions of missing information fmi now represent both information lost to missingness and information lost to measurement error.
This example is one of the examples available when running example("TSI").
Using the mice function directly with method = "truescore"

The TSI function generates a call to the mice function, but mice can also be called directly to perform true score imputation. The core of the TSI package is a function named mice.impute.truescore, which can be called from the mice package by setting method = "truescore" for each variable to be imputed using this method. These calls can get complicated, so for most purposes we hope that the TSI function provides a convenient means to conduct true score imputation, but when an analyst needs to modify this call, they can do so by working with mice directly.
One convenience provided by the TSI function is that it creates the true score variables automatically. When using mice directly, one must first add empty (all NA values) variables to the data set corresponding to the true scores:
data_ctt_2 = data_ctt
data_ctt_2$true_w = NA
Then, the mice function is called with method = "truescore":
set.seed(0)
mice_data = mice(data_ctt_2,
                 m = 5,
                 blocks = list("true_w"),
                 method = "truescore",
                 calibration = list(os_name = "w",
                                    score_type = "CTT",
                                    reliability = ratio,
                                    mean_ts = mean_x,
                                    var_ts = var_x),
                 predictorMatrix = matrix(c(1, 1, 0), 1, 3, byrow = T),
                 printFlag = F,
                 remove.constant = F)
mice_data
A few notes:
- The variable to be imputed, true_w, is listed in the blocks argument, which contains a list of vectors with variable names per block.
- Each block has a method listed under the method argument; here, we're only using true score imputation, so this is simply "truescore".
- remove.constant = F is needed to prevent mice from automatically discarding the true score variable true_w for containing only NA values.
- The imputation model for each block is specified through the predictorMatrix, where each row corresponds to a block and each column indicates whether the corresponding variable in the data set is included (1) or not (0) in the imputation model.
For more information on these inputs, see the documentation for the mice function at ?mice. Lastly, the true score model itself is specified through calibration. The elements of this list for classical test theory must be the following, in any order, and all must be named as follows:

- os_name (character vector of length one): Name of the observed score variable in the data set.
- score_type (character vector of length one): "CTT" for classical test theory.
- reliability (numeric scalar or vector with length equal to nrow(data)): An estimate of the reliability of the observed scores as measures of the true score. If a scalar is provided, the same reliability is assumed for all values of the observed score.
- mean_ts and var_ts (numeric scalars): Mean and variance of the true scores, respectively, specified as in the TSI function. Note that specifying metrics as a shortcut is not available when using mice directly.

The resulting mice_data object is the same, imputation error notwithstanding, as that created in the classical test theory example above:
pool(with(mice_data, lm(w~y)))
pool(with(mice_data, lm(true_w~y)))
mice_data = complete(mice_data, "all")
sds = sapply(mice_data, function(d) apply(d, 2, sd))
apply(sds, 1, mean)
For EAP scores, we once again need to add empty true score variables to the data set manually:
data_eap_2 = data_eap
data_eap_2$Tx = NA #nolint
data_eap_2$Ty = NA #nolint
Then, we need a more complicated call to mice:
set.seed(0)
mice_data = mice(data_eap_2,
                 m = 5,
                 maxit = 5,
                 method = c("pmm", "pmm", "pmm", "pmm", "pmm", "truescore", "truescore"),
                 blocks = list(Fx = "Fx", Fy = "Fy", SE.Fx = "SE.Fx", SE.Fy = "SE.Fy",
                               m = "m", Tx = "Tx", Ty = "Ty"),
                 blots = list(Tx = list(calibration = list(os_name = "Fx",
                                                           se_name = "SE.Fx",
                                                           score_type = "EAP",
                                                           mean = 50,
                                                           var_ts = 100,
                                                           separated = T)),
                              Ty = list(calibration = list(os_name = "Fy",
                                                           se_name = "SE.Fy",
                                                           score_type = "EAP",
                                                           mean = 50,
                                                           var_ts = 100,
                                                           separated = T))),
                 predictorMatrix = matrix(c(0, 1, 1, 1, 1, 0, 0,
                                            1, 0, 1, 1, 1, 0, 0,
                                            1, 1, 0, 1, 1, 0, 0,
                                            1, 1, 1, 0, 1, 0, 0,
                                            1, 1, 1, 1, 0, 0, 0,
                                            1, 1, 1, 1, 1, 0, 0,
                                            1, 1, 1, 1, 1, 0, 0), 7, 7, byrow = T),
                 printFlag = F,
                 remove.constant = F)
mice_data
A few extra things to note:
- A block, and a corresponding method and row of predictorMatrix, must be specified for each variable because all variables have missing data.
- blots are needed for the two true scores, with different calibration inputs for each.
- The predictorMatrix argument must be specified in order to avoid using the observed scores for x and y to predict the true scores on y and x, respectively. Based on (unpublished) simulation results, it seems the best way to specify the predictor matrix for use in mice is for true scores to be predicted from all observed variables but not to predict other missing data from the imputed true scores (see the sketch below). This is the default behavior when the TSI function is used, and we recommend, unless further research identifies otherwise, that the same be done when using this function to interact with mice directly.

These examples are both available when running example("mice.impute.truescore").
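To illustrate the recommended structure, here is a sketch that builds the same 7 x 7 predictor matrix with mice::make.predictorMatrix instead of typing it out by hand; it assumes the data_eap_2 object and the true score names Tx and Ty used above:

# Start from the default matrix (1 everywhere off the diagonal), then make sure
# true scores are predicted from all observed variables but never serve as predictors.
pred = mice::make.predictorMatrix(data_eap_2)
ts_vars = c("Tx", "Ty")
pred[ts_vars, ] = 1   # true scores are predicted from everything...
pred[, ts_vars] = 0   # ...but imputed true scores predict nothing, including each other
pred

This should reproduce, up to row and column ordering, the matrix specified manually in the call above.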