```
library(knitr)
set.seed(1)
is_CRAN <- !identical(Sys.getenv("NOT_CRAN"), "true")
if (!is_CRAN) {
  options(mc.cores = parallel::detectCores())
} else {
  knitr::opts_chunk$set(eval = FALSE)
  knitr::knit_hooks$set(evaluate.inline = function(x, envir) x)
}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache.lazy = FALSE  # https://github.com/yihui/knitr/issues/572
)
options(digits=4)
options(scipen=2)
```

The **pcFactorStan** package for **R** provides convenience functions and pre-programmed **Stan** models related to analysis of paired comparison data. Its purpose is to make fitting models using Stan easy and easy to understand. **pcFactorStan** relies on the **rstan** package, which should be installed first.
See here for instructions on installing **rstan**.

One situation where a factor model might be useful is when there are people that play in tournaments of more than one game. For example, the computer player AlphaZero (Silver et al. 2018) has trained to play chess, shogi, and Go. We can take the tournament match outcome data for each of these games and find rankings among the players. We may also suspect that there is a latent board game skill that accounts for some proportion of the variance in the per-board game rankings. This proportion can be recovered by the factor model.

Our goal may be to fit a factor model, but it is necessary to build up the model step-by-step. There are essentially three models: 'unidim', 'correlation', and 'factor'. 'unidim' analyzes a single item. 'correlation' is suitable for two or more items. Once you have vetted your items with the 'unidim' and 'correlation' models, then you can try the 'factor' model. There is also a special model 'unidim_adapt'. Except for this model, the other models require scaling constants. To find appropriate scaling constants, we will fit 'unidim_adapt' to each item separately.

The R code below first loads **rstan** and **pcFactorStan**.
We load **loo** for extra diagnostics, and
**qgraph** and **ggplot2** for visualization.

```
library(rstan)
library(pcFactorStan)
library(loo)
library(qgraph)
library(ggplot2)
library(Matrix)
```

Next we take a peek at the data.

```
head(phyActFlowPropensity)
```

```
kable(head(phyActFlowPropensity))
```

These data consist of paired comparisons of 87 physical activities on 16 flow-related facets. Participants submitted two activities using free-form input. These activities were substituted into item templates. For example, Item *predict* consisted of the prompt, "How predictable is the action?" with response options:

- `A1` is much more predictable than `A2`.
- `A1` is somewhat more predictable than `A2`.
- Both offer roughly equal predictability.
- `A2` is somewhat more predictable than `A1`.
- `A2` is much more predictable than `A1`.

If the participant selected 'golf' and 'running' for activities then 'golf' was substituted into `A1` and 'running' into `A2`. Duly prepared, the item was presented and the participant asked to select the most plausible statement.

A *somewhat more* response is scored 1 or -1
and *much more* scored 2 or -2.
A tie (i.e. *roughly equal*) is scored as zero.
A negative value indicates > (greater than) and a positive
value indicates < (less than).
For example, if `A1` is *golf*, `A2` is *running*, and
the observed response is 2 then the endorsement
is "*golf* is much less predictable than *running*."

We will need to analyze each item separately before
we analyze them together. Therefore, we will start
with Item *skill*.
Data must be fed into **Stan** in a partially digested form. The next block of code demonstrates how a suitable data list may be constructed using the `prepData()` function. This function automatically determines the
number of threshold parameters based on the
range observed in your data.
One thing it does not do is pick a `varCorrection` factor. The `varCorrection` determines the degree of adaptation in the model. Usually some choice between 2.0 and 4.0 will obtain optimal results.

```
dl <- prepData(phyActFlowPropensity[,c(paste0('pa',1:2), 'skill')])
dl$varCorrection <- 5.0
```

Next we fit the model using the `pcStan()` function, which is a wrapper for `stan()` from **rstan**. We also choose the number of chains.
As is customary **Stan** procedure, the first half of each chain is used to estimate the
sampler's mass matrix (i.e. warm up) and excluded from inference.

```
fit1 <- pcStan("unidim_adapt", data=dl)
```

A variety of diagnostics are available to check whether the sampler ran into trouble.

```
check_hmc_diagnostics(fit1)
```

Everything looks good, but there are a few more things to check. We want $\widehat R$ < 1.015 and effective sample size greater than 100 times the number of chains (Vehtari et al., 2019).

```
allPars <- summary(fit1, probs=c())$summary
print(min(allPars[,'n_eff']))
print(max(allPars[,'Rhat']))
```
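
If you want to apply these targets programmatically, a minimal helper along the following lines works (this is a sketch, not part of **pcFactorStan**; the default of four chains is an assumption),

```
# Sketch: check the Vehtari et al. (2019) rules of thumb for a stanfit object.
checkTargets <- function(fit, nChains = 4) {
  sry <- summary(fit, probs = c())$summary
  c(n_eff_ok = min(sry[, 'n_eff']) > 100 * nChains,
    Rhat_ok  = max(sry[, 'Rhat']) < 1.015)
}
checkTargets(fit1)
```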

Again, everything looks good. If the target values were not reached then we would sample the model again with more iterations. Time for a plot,

```
theta <- summary(fit1, pars=c("theta"), probs=c())$summary[,'mean']

ggplot(data.frame(x=theta, activity=dl$nameInfo$pa, y=0.47)) +
  geom_point(aes(x=x), y=0) +
  geom_text(aes(label=activity, x=x, y=y), angle=85, hjust=0, size=2,
            position = position_jitter(width = 0, height = 0.4)) +
  ylim(0,1) +
  theme(legend.position="none",
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())
```

Intuitively, this seems like a fairly reasonable ranking for skill. As pretty as the plot is, the main reason that we fit this model was to find a scaling constant to produce a score variance close to 1.0,

```
s50 <- summary(fit1, pars=c("scale"), probs=c(.5))$summary[,'50%']
print(s50)
```

```
rm(fit1)  # free up some memory
```

We use the median instead of the mean because `scale` is not likely to have a symmetric marginal posterior distribution.
We obtained `r s50`, but that value is just for one item.
We have to perform the same procedure for every item.
Wow, that would be really tedious ... if we did not have a function to do it for us!
Fortunately, `calibrateItems` takes care of it and produces a table
of the pertinent data,

```
result <- calibrateItems(phyActFlowPropensity, iter=1000L)
```

```
print(result)
```

```
kable(result)
```

Items *goal1* and *feedback1* are prone to failure.
This happens when there is no clear ranking between objects.
For example, if we observe that `A<B`, `B<C`, and `C<A` then the
only sensible interpretation is that `A=B=C`, which can only have
close to zero variance.
We exclude these two items with the smallest `scale`.
I requested `iter=1000L` to demonstrate how `calibrateItems` will resample the model until `n_eff` is large enough and `Rhat` is small enough.
As demonstrated in the *iter* column, some items needed more than 1000 samples to converge.
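
To see the failure mode in miniature, consider a purely hypothetical data set with a circular ranking (the item name *fun* and the activities are made up). Fitting 'unidim_adapt' to data like this pushes the latent worths toward equality and the `scale` toward zero,

```
# Hypothetical data: A beats B, B beats C, and C beats A on item 'fun'.
# Negative scores favor pa1, so each row records a 'much more' win for pa1.
cycleDat <- data.frame(pa1 = c('A', 'B', 'C'),
                       pa2 = c('B', 'C', 'A'),
                       fun = c(-2, -2, -2))
```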

Next we will fit the correlation model. We exclude parameters
that start with the prefix `raw`. These parameters are needed
by the model, but should not be interpreted.
```
pafp <- phyActFlowPropensity
excl <- match(c('goal1','feedback1'), colnames(pafp))
pafp <- pafp[,-excl]
dl <- prepData(pafp)
dl$scale <- result[match(dl$nameInfo$item, result$item), 'scale']
```

```
fit2 <- pcStan("correlation", data=dl, include=FALSE,
               pars=c('rawTheta', 'rawThetaCorChol'))
```

```
check_hmc_diagnostics(fit2)

allPars <- summary(fit2, probs=0.5)$summary
print(min(allPars[,'n_eff']))
print(max(allPars[,'Rhat']))
```

The HMC diagnostics look good, but ... oh dear!
Something is wrong with the `n_eff` and $\widehat R$.
Let us look more carefully,

```
head(allPars[order(allPars[,'sd']),])
```

Ah ha! It looks like all the entries of the correlation matrix are reported, including the entries that are not stochastic but are fixed to constant values. We need to filter those out to get sensible results.

```
allPars <- allPars[allPars[,'sd'] > 1e-6,]
print(min(allPars[,'n_eff']))
print(max(allPars[,'Rhat']))
```

Ah, much better. Now we can inspect the correlation matrix. There are
many ways to visualize a correlation matrix. One of my favorite
ways is to plot it using the **qgraph** package,

```
corItemNames <- dl$nameInfo$item
tc <- summary(fit2, pars=c("thetaCor"), probs=c(.1,.5,.9))$summary
tcor <- matrix(tc, length(corItemNames), length(corItemNames))
tcor[sign(tc[,'10%']) != sign(tc[,'90%'])] <- 0  # delete faint edges
dimnames(tcor) <- list(corItemNames, corItemNames)
tcor <- nearPD(tcor, corr=TRUE)$mat

qgraph(tcor, layout = "spring", graph = "cor", labels=colnames(tcor),
       legend.cex = 0.3, cut = 0.3, maximum = 1, minimum = 0,
       esize = 20, vsize = 7, repulsion = 0.8, negDashed=TRUE,
       theme="colorblind")
```

Based on this plot and theoretical considerations,
I decided to exclude *spont*, *control*, *evaluated*, and *waiting*
from the factor model.
A detailed rationale for why these items, and not others, are excluded
will be presented in a forthcoming article.
For now, let us focus on the mechanics of data analysis.
Here are item response curves,

```
df <- responseCurve(dl, fit2,
  item=setdiff(dl$nameInfo$item, c('spont','control','evaluated','waiting')),
  responseNames=c("much more", "somewhat more", 'equal', "somewhat less", "much less"))

ggplot(df) +
  geom_line(aes(x=worthDiff, y=prob, color=response, linetype=response,
                group=responseSample), size=.2, alpha=.2) +
  xlab("difference in latent worths") + ylab("probability") +
  ylim(0,1) + facet_wrap(~item, scales="free_x") +
  guides(color=guide_legend(override.aes=list(alpha = 1, size=1)))
```

We plot response curves from the correlation model and not the factor model
because the factor model is expected to report slightly
inflated discrimination estimates.
These response curves are a function of the `thresholds`, `scale`,
and `alpha` parameters.
The ability of an item to discriminate amongst objects is
partitioned into the `scale` and `alpha` parameters.
Most of the information is accounted for by the `scale`
parameter and the `alpha` parameter should always
be estimated near 1.75.
The distribution of objects is always standardized to a
variance near 1.0; the `scale` parameter zooms in on the x-axis
to account for the ability to make finer and finer distinctions among objects.
Notice the variation in x-axis among the plots above.
A detailed description of the item
response model can be found in the man page for `responseCurve`.

I will fit model 'factor_ll' instead of 'factor' so I can use the
**loo** package to look for outliers.
We also need to take care that the data `pafp` matches, one-to-one,
the data seen by Stan so we can map back from the model
to the data. Hence, we update `pafp` using the usual data cleaning sequence
of `filterGraph` and `normalizeData` and pass the result to `prepCleanData`.

Up until version 1.0.2, only a single factor model was available. As of 1.1.0, the factor model supports an arbitrary number of factors and arbitrary factor-to-item structure. In this example, we will stay with the simplest factor model, a single factor that predicts all items.

```
pafp <- pafp[,c(paste0('pa',1:2),
                setdiff(corItemNames, c('spont','control','evaluated','waiting')))]
pafp <- normalizeData(filterGraph(pafp))
dl <- prepCleanData(pafp)
dl <- prepSingleFactorModel(dl)
dl$scale <- result[match(dl$nameInfo$item, result$item), 'scale']
```

```
rm(fit2)  # free up some memory
```

```
fit3 <- pcStan("factor1_ll", data=dl, include=FALSE,
               pars=c('rawUnique', 'rawUniqueTheta', 'rawPerComponentVar',
                      'rawFactor', 'rawLoadings', 'rawFactorProp',
                      'rawThreshold', 'rawPathProp', 'rawCumTh'))
```

To check the fit diagnostics, we have to take care to
examine only the parameters of interest. The factor model outputs
many parameters that should not be interpreted (those that start
with the prefix `raw`).

```
check_hmc_diagnostics(fit3)

interest <- c("threshold", "alpha", "pathProp", "factor", "residualItemCor", "lp__")
allPars <- summary(fit3, pars=interest)$summary
allPars <- allPars[allPars[,'sd'] > 1e-6,]
print(min(allPars[,'n_eff']))
print(max(allPars[,'Rhat']))
```

Looks good!

Let us see which data are the most unexpected by the model.
We create a `loo` object and inspect the summary output.

```
options(mc.cores=1)  # otherwise loo consumes too much RAM
kThreshold <- 0.1
l1 <- toLoo(fit3)
print(l1)
```

The Pareto $k$ estimates are particularly noisy
due to the many activities with a small sample size.
Sometimes all $k<0.5$ and sometimes not.
We can look at `p_loo`,
the effective number of parameters. In well-behaved cases,
`p_loo` is less than the sample size and the number of parameters.
This looks good. There are `r dl$NITEMS * dl$NPA` parameters
just for the unique scores.

To connect $k$ statistics with observations,
we pass the `loo` object to `outlierTable` and
use a threshold of `r kThreshold` instead of 0.5 to ensure
that we get enough lines.
Activities with small sample sizes are retained by `filterGraph` if
they connect other activities because they contribute information to
the model. When we look at outliers, we can limit ourselves to
activities with a sample size of at least 11.

```
pa11 <- levels(filterGraph(pafp, minDifferent=11L)$pa1)
ot <- outlierTable(dl, l1, kThreshold)
ot <- subset(ot, pa1 %in% pa11 & pa2 %in% pa11)
```

```
print(ot[1:6,])
```

```
kable(ot[1:6,], row.names=TRUE)
```

```
xx <- which(ot[,'pa1'] == 'mountain biking' & ot[,'pa2'] == 'climbing' &
            ot[,'item'] == 'predict' & ot[,'pick'] == -2)
```

```
if (length(xx) == 0) {
  xx <- 1
  warning("Can't find outlier")
}
kable(ot[xx,,drop=FALSE], row.names=TRUE)
```

We will take a closer look at row `r rownames(ot)[xx]`.
What does a `pick` of `r ot[xx,'pick']` mean? `Pick` numbers are converted
to response categories by adding the number of thresholds plus one.
There are two thresholds (*much* and *somewhat*) so
3 + `r ot[xx,'pick']` = `r 3+ot[xx,'pick']`.
Looking back at our item response curve plot,
the legend gives the response category order from top (1) to bottom (5).
The first response category is *much more*.
Putting it all together we obtain an endorsement of
"*`r ot[xx,'pa1']` is much more predictable than `r ot[xx,'pa2']`*."
Specifically, what about that assertion is unexpected?
We can examine how other participants have responded,

```
pafp[pafp$pa1 == ot[xx,'pa1'] & pafp$pa2 == ot[xx,'pa2'],
     c('pa1','pa2', as.character(ot[xx,'item']))]
```

Hm, both participants agreed. Let us look a little deeper to understand why this response was unexpected.

```
loc <- sapply(ot[xx,c('pa1','pa2','item')], unfactor)
exam <- summary(fit3,
                pars=paste0("theta[", loc[paste0('pa',1:2)], ",", loc['item'], "]"))$summary
rownames(exam) <- c(as.character(ot[xx,'pa1']), as.character(ot[xx,'pa2']))
```

```
#exam <- data.frame(mean=c(0,0), '2.5%'=c(0,0), '97.5%'=c(0,0))
kable(exam)
```

Here we find that `r ot[xx,'pa1']` was estimated `r exam[1,'mean'] - exam[2,'mean']` units more
predictable than `r ot[xx,'pa2']`. I guess this difference was expected
to be larger.
What sample sizes are associated with these activities?

```
sum(c(pafp$pa1 == ot[xx,'pa1'], pafp$pa2 == ot[xx,'pa1']))
sum(c(pafp$pa1 == ot[xx,'pa2'], pafp$pa2 == ot[xx,'pa2']))
```

Hm, the predictability 95% uncertainty interval for `r ot[xx,'pa2']`
is from `r exam[2,'2.5%']` to `r exam[2,'97.5%']`.
So there is little information.
We could continue our investigation by looking at which
responses justified these *predict* estimates.
However, let us move on and
plot the marginal posterior distributions of the factor proportions.
Typical jargon is *factor loadings*, but *proportion* is
preferable since the scale is arbitrary and standardized.

```
pi <- parInterval(fit3, 'pathProp', dl$nameInfo$item, label='item')
pi <- pi[order(abs(pi$M)),]

ggplot(pi) +
  geom_vline(xintercept=0, color="green") +
  geom_jitter(data=parDistributionFor(fit3, pi), aes(value, item),
              height = 0.35, alpha=.05) +
  geom_segment(aes(y=item, yend=item, x=L, xend=U), color="yellow", alpha=.5) +
  geom_point(aes(x=M, y=item), color="red", size=1) +
  theme(axis.title.y=element_blank())
```

Finally, we can plot the factor scores.

```
pick <- paste0('factor[', match(pa11, dl$nameInfo$pa), ',1]')
pi <- parInterval(fit3, pick, pa11, label='activity')
pi <- pi[order(pi$M),]

ggplot(pi) +
  geom_vline(xintercept=0, color="green") +
  geom_jitter(data=parDistributionFor(fit3, pi, samples=200), aes(value, activity),
              height = 0.35, alpha=.05) +
  geom_segment(aes(y=activity, yend=activity, x=L, xend=U), color="yellow", alpha=.5) +
  geom_point(aes(x=M, y=activity), color="red", size=1) +
  theme(axis.title.y=element_blank())
```

If this factor model is a good fit to the data then the residual item activity scores should be uncorrelated. Let us examine the residual item correlation matrix.

```
m <- matrix(apply(expand.grid(r=1:dl$NITEMS, c=1:dl$NITEMS), 1,
                  function(x) paste0("residualItemCor[", x['r'], ",", x['c'], "]")),
            dl$NITEMS, dl$NITEMS)
n <- matrix(apply(expand.grid(r=dl$nameInfo$item, c=dl$nameInfo$item), 1,
                  function(x) paste0(x['r'], ":", x['c'])),
            dl$NITEMS, dl$NITEMS)
pi <- parInterval(fit3, m[lower.tri(m)], n[lower.tri(n)], label='cor')
pi <- pi[abs(pi$M) > .08,]
pi <- pi[order(-abs(pi$M)),]

ggplot(pi) +
  geom_vline(xintercept=0, color="green") +
  geom_jitter(data=parDistributionFor(fit3, pi, samples=800), aes(value, cor),
              height = 0.35, alpha=.05) +
  geom_segment(aes(y=cor, yend=cor, x=L, xend=U), color="yellow", alpha=.5) +
  geom_point(aes(x=M, y=cor), color="red", size=1) +
  theme(axis.title.y=element_blank())
```

Many survey measures are going to exhibit
some faint correlations of this nature.
Residual correlations can suggest items
that could benefit from refinement. Item *chatter* is involved in
relatively high residual correlations. Thought
might be given to splitting this item or rewording it.

And there you have it. If you have not done so already, go find a dojo and commence study of martial arts!

If you read through the **Stan** models included with this package, you will find some
variables prefixed with `raw`. These are special variables
internal to the model. In particular, you should not try
to evaluate the $\widehat R$ or effective sample size
of `raw` parameters. These parameters are best excluded
from the sampling output.

Latent worths are estimated by `theta` parameters. `theta` is always
standard normally distributed.

Thresholds are parameterized as a proportion with distribution
`Beta(1.1, 2.0)`. The shape of this prior is fairly arbitrary.
`Uniform(0,1)` also works in many cases. There is usually plenty of
information available to estimate thresholds accurately.
To convert from a proportion to threshold units, the following
formula is employed, `rawThreshold * (max(theta) - min(theta))`.
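
For example, here is that conversion with made-up numbers (the values below are purely illustrative),

```
# Illustration only: convert threshold proportions to threshold units.
theta <- c(-1.6, -0.4, 0.2, 1.1)   # hypothetical latent worths
rawThreshold <- c(0.12, 0.30)      # hypothetical proportions in (0,1)
rawThreshold * (max(theta) - min(theta))
#> [1] 0.324 0.810
```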

This model is fairly robust; priors are unlikely to need tweaking.
The 'unidim_adapt' model has a `varCorrection` constant
that is used to calibrate the `scale`. For all other models,
the per-item `scale` must be supplied as data.

All models except 'unidim_adapt' estimate the item discrimination
parameter `alpha`. A `normal(1.749, alphaScalePrior)` prior is used
with `alphaScalePrior` set to 0.2 by default. `alpha` must be positive
so the normal distribution is truncated at zero. The distribution is
centered at 1.749 because this allows the logistic to approximate the
standard normal cumulative distribution (Savalei, 2006). We need to
estimate `alpha` because `scale` is entered as a constant and we need
to account for the stochastic uncertainty in the item's ability to
discriminate objects.
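
To see why 1.749 is a reasonable center, compare the scaled logistic with the standard normal cumulative distribution; across the whole range the two curves differ by less than about 0.02,

```
# The logistic with slope 1.749 closely tracks the standard normal CDF
# (Savalei, 2006).
x <- seq(-4, 4, 0.01)
max(abs(plogis(1.749 * x) - pnorm(x)))
```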

The correlation matrix uses an `lkj_corr(corLKJPrior)` prior with
`corLKJPrior` set to 2.5 by default. It may be necessary to increase
the prior if divergences are observed.

`factor` scores are standard normally distributed.
`pathProp` is shaped by two priors that act on different parts
of the distribution. `rawLoadings` are
distributed `beta(propShape, propShape)` with `propShape` set to 4.0
by default. `rawLoadings` has an indirect influence on `pathProp`. The
quantity `2*rawLoadings-1` is used to scale the factor scores, but
`pathProp` is computed based on Equation 3 of Gelman et al. (in
press).
`pathProp` is a signed proportion bounded between -1 and 1.
`pathProp` is additionally constrained by the prior
`normal(logit(0.5 + pathProp/2), pathScalePrior)`
where `pathScalePrior` is set to 1.2 by default. This
prevents extreme factor proportions (i.e. |pathProp| > .95). The purpose
of `propShape` is to nudge `rawLoadings` toward zero. It may be
necessary to increase `propShape` if divergences are observed.

If you have more than one factor then `Psi` is available
to estimate correlations among factors.
The prior on entries of `Psi` is `normal(logit(0.5 + Psi/2), psiScalePrior)`.
It may be necessary to reduce `psiScalePrior` toward zero if
factors are highly correlated.

The idea of putting a prior on `pathProp` was inspired by Gelman (2019, Aug 23).

Gelman, A. (2019, Aug 23). Yes, you can include prior information on quantities of interest, not just on parameters in your model [Blog post]. Retrieved from https://statmodeling.stat.columbia.edu/2019/08/23/yes-you-can-include-prior-information-on-quantities-of-interest-not-just-on-parameters-in-your-model/.

Gelman, A., Goodrich, B., Gabry, J., & Vehtari, A. (in press). R-squared for Bayesian regression models. *The American Statistician.* \doi{10.1080/00031305.2018.1549100}

Savalei, V. (2006). Logistic approximation to the normal: The KL rationale.
*Psychometrika, 71*(4), 763–767. \doi{10.1007/s11336-004-1237-y}

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., ... & Lillicrap, T. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. *Science, 362*(6419), 1140–1144.

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P. C. (2019). Rank-normalization, folding, and localization: An improved $\widehat R$ for assessing convergence of MCMC. *arXiv preprint* arXiv:1903.08008.

```
sessionInfo()
```
