Compute FIA for MPTs

Description

R-port of the function to compute FIA for MPT models by Wu, Myung, and Batchelder (2010a, 2010b). This function is essentially a copy of the original Matlab code to R (with significant parts moved to C++ and allowing for multicore functionality). Also, the order of input arguments is more R-like.

Usage

1
2
bmpt.fia(s, parameters, category, N, ineq0 = NULL, Sample = 2e+05, 
        multicore = FALSE, split = NULL, mConst = NULL)

Arguments

s

see Details

parameters

see Details

category

see Details

N

see Details

ineq0

see Details

Sample

see Details

multicore

logical. Should fitting be distributed across several cores? Requires snowfall and initialized cluster. See also below.

split

NULL (the default) or integer specifying in how many separate calls to the C++ workhorse the integrant should be calculated. See below.

mConst

A constant which is added in the Monte Carlo integration to avoid numerical underflows and is later subtracted (after appropriate transformation). Should be a power of 2 to avoid unnecessary numerical imprecision.

Details

The following is the original description by Wu, Myung, & Batchelder (2010a) for their Matlab function. All changes to the original document are in squared brackets []:

This function computes the FIA complexity measure, C_FIA, using a Monte Carlo numerical integration algorithm. When inequality is present, sampling from the restricted parameter space is performed by rejection algorithm.

[...] [see References for References]

The following symbols are used in the body of the function:
S denotes number of parameters.
C denotes the number of categories.
M denotes the number of leaves in the tree.

The first input argument s is related to the string representation of the BMPT model. It can be obtained by replacing all categories in the string by the capital letter C and all branching probabilities by the lower case letter p.

The second input argument parameters is a row vector that assigns parameters or constants to the p's in the string s. Its length should be the same as the number of p's in s, and its elements correspond to the p's according to their order in s. Positive integer elements in parameters assign parameters to the corresponding p's, with the same integer denoting the same parameter. Constants are assigned to the p's using the negation of their values.

The [third] input argument category is a 1 by M vector assigning categories to the C's in the string ‘s’ in the same way parameters assigns branching probabilities, except that only positive consecutive integers from 1 to J, the total number of categories, are allowed.

The [fourth] input argument N specifies the total sample size.

The [fifth] input argument ineq0 assigns inequality constraints imposed on the parameters. It is a matrix with two columns. Each element denotes a parameter coded in the same way as in parameters. For each row, the parameter on the left column is constrained to be smaller than that on the right column. The number of rows is determined by the total number of simple inequality constraints of the form theta_1 < theta_2 in the model. [Default is NULL corresponding to no inequality restrictions.]

The last input argument ‘Sample’ specifies the number of random samples to be drawn in the Monte Carlo algorithm. [Default is 200000.]

[For returned values see Value]

It should be noted that ‘lnconst’ can be computed analytically free of Monte Carlo error on a case by case basis described below. For this reason, the users can calculate C_FIA [see Wu, Myung & Batchelder, 2010a; Equation 7] by adding (S/2)*ln*(N/(2*pi)), lnInt and their hand-calculated lnconst to minimize the Monte Carlo errors. [In our experience this error is rather low and negligible.]

A sequence of inequalities theta_1 < theta_2 < ... < theta_k reduces the parameter space to its 1/k!, so in this case lnconst should be -ln * (k!). In general, any combination of inequality constraints specifies a union of subsets of the parameter space, each satisfying some sequence of inequalities. For example, the subspace defined by theta_1 < theta_2 and theta_3 < theta_2 is a union of two subspaces, one satisfying theta_1 < theta_3 < theta_2 and the other theta_3 < theta_1 < theta_2, so the proportion is given by 2 * (1/3!) = 1/3.

A coding example:
Suppose that for model 1HTM-5c of source monitoring [see Wu et al., 2010a] , the sample sizes of source A, source B and new items are 300, 300 and 400, respectively and the inequality constraint of d_1 < d_2 is imposed. In this case, the six input arguments should be specified as follows:
s = 'ppppCpCCppCCCppCpCCppCCCppCCC';
parameters = c(-.6,-.5,1,2,5,4,5,1,3,5,4,5,4,5); [adapted for R]
ineq0 = matrix(c(2,3), 1,2); [adapted for R]
category = c(1,1,2,1,2,3,5,4,5,4,5,6,7,8,9); [adapted for R]
N = 1000;

Another coding example:
For the pair-clustering model in Batchelder and Riefer (1999, Figure 1), suppose in a pair-clustering experiment there are 300 pairs of words and 100 singletons, the six input arguments should be specified as follows:
s = 'pppCCppCCpCCpCC'; parameters = c(-.75,1,2,3,3,3,3); [adapted for R]
ineq0 = NULL; [adapted for R]
category = c(1,4,2,3,3,4,5,6); [adapted for R]
N = 400;

[For more examples, see Examples]

Since MPTinR version 1.1.3 the Monte Carlo integration is performed in C++ using RcppEigen. With the default arguments, one instance of the C++ workhorse is called. To call multiple instances of the C++ workhorse, you can use the split argument (which can be useful to replicate results obtained with multicore = TRUE as described below). Note, that each time before calling the C++ code, the seed is set (the set of random seeds are generated before calling the function for the first time).

Multicore functionality is achieved via snowfall which needs to be loaded and a cluster initialized via sfInit when setting multicore = TRUE. When split = NULL (the default), the Samples will be evenly distributed on the different cores (using sfClusterSplit), so that only one instance of the underlying C++ workhorse is called on each core. Setting split to non-NULL will produce as many instances (distributed across cores). Note that in order to obtain comparable results (as snowfall uses load balancing), the random seed is set (at each core) before calling each instance of the C++ workhorse. This allows to replicate results obtained via multicore in a non-multicore environment when seting split appropriately (and set.seed beforehands).

Value

[A named vector:]

The first output argument CFIA gives the FIA complexity value of the model.

The second [and third] output argument CI gives the Monte Carlo confidence interval of CFIA. [CI.l, gives the lower, CI.u, the upper bound of the interval].

The [fourth] output argument lnInt gives the log integral term in C_FIA [see Wu, Myung & Batchelder, 2010a; Equation 7] for models without inequality constraints. When inequality constraints are present, lnInt does not take into account the change in the normalizing constant in the proposal distribution and must be adjusted with the output argument lnconst.

The [fifth and sixth] output argument [CI.lnint] gives the Monte Carlo confidence interval of lnInt. [.l = lower & .u = upper bound of the CI]

When inequality constraints are present, the [seventh] output argument lnconst serves as an adjustment of ‘lnInt’. It estimates the logarithm of the proportion of parameter space [0,1]^S that satisfies those inequality constraints, and the log integral term is given by lnInt+lnconst.

The next [two] output argument [CI.lnconst] give the Monte Carlo confidence interval of ‘lnconst’. [.l = lower & .u = upper bound of the CI]

Note

The R version of the code should now (after moving the code to RcppEigen) be considerably faster than the Matlab version of this code.

Author(s)

The original Matlab code was written by Hao Wu, Jay I. Myung, and William H. Batchelder.
This code was ported to R by Henrik Singmann and David Kellen. RcppEigen was added by Henrik Singmann and Christian Mueller. Multicore functionality was added by Henrik Singmann.

References

Wu, H., Myung, J.I., & Batchelder, W.H. (2010a). Minimum description length model selection of multinomial processing tree models. Psychonomic Bulletin & Review, 17, 275-286.

Wu, H., Myung, J.I., & Batchelder, W.H. (2010b). On the minimum description length complexity of multinomial processing trees. Journal of Mathematical Psychology, 54, 291-303.

See Also

fit.mpt for the main function of MPTinR.
get.mpt.fia for a convenient wrapper of this function.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
## Not run: 
# The following example is the code for the first example in Wu, Myung & Batchelder (2010a, pp. 280)
# The result should be something like: CFIA = 12.61... or 12.62..., CI = 12.61... - 12.62....
# Executing this command can take a while.

bmpt.fia(s = "ppppCpCCppCCCppCpCCppCCCppCCC", 
	parameters = c(-0.5, -0.5, 3, 2, 5, 1, 5, 4, 2, 5, 1, 5, 1, 5), 
	category = c(1,1,2,1,2,3,5,4,5,4,5,6,7,8,9), 
  N = 1000, ineq0 = matrix(c(4,3),1,2))

bmpt.fia(s = "ppppCpCCppCCCppCpCCppCCCppCCC", 
	parameters = c(-0.5, -0.5, 3, 2, 5, 1, 5, 4, 2, 5, 1, 5, 1, 5), 
	category = c(1,1,2,1,2,3,5,4,5,4,5,6,7,8,9), 
  N = 1000, ineq0 = matrix(c(4,3),1,2), mConst = 2L^8)

## End(Not run)