R-port of the function to compute FIA for MPT models by Wu, Myung, and Batchelder (2010a, 2010b). This function is essentially a copy of the original Matlab code to R (with significant parts moved to C++ and allowing for multicore functionality). Also, the order of input arguments is more R-like.
logical. Should fitting be distributed across several cores? Requires snowfall and initialized cluster. See also below.
A constant which is added in the Monte Carlo integration to avoid numerical underflows and is later subtracted (after appropriate transformation). Should be a power of 2 to avoid unnecessary numerical imprecision.
The following is the original description by Wu, Myung, & Batchelder (2010a) for their Matlab function. All changes to the original document are in squared brackets :
This function computes the FIA complexity measure, C_FIA, using a Monte Carlo numerical integration algorithm. When inequality is present, sampling from the restricted parameter space is performed by rejection algorithm.
[...] [see References for References]
The following symbols are used in the body of the function:
S denotes number of parameters.
C denotes the number of categories.
M denotes the number of leaves in the tree.
The first input argument
s is related to the string representation of the BMPT model. It can be obtained by replacing all categories in the string by the capital letter C and all branching probabilities by the lower case letter p.
The second input argument
parameters is a row vector that assigns parameters or constants to the p's in the string
s. Its length should be the same as the number of p's in
s, and its elements correspond to the p's according to their order in
s. Positive integer elements in
parameters assign parameters to the corresponding p's, with the same integer denoting the same parameter. Constants are assigned to the p's using the negation of their values.
The [third] input argument
category is a 1 by M vector assigning categories to the C's in the string ‘s’ in the same way
parameters assigns branching probabilities, except that only positive consecutive integers from 1 to
J, the total number of categories, are allowed.
The [fourth] input argument
N specifies the total sample size.
The [fifth] input argument
ineq0 assigns inequality constraints imposed on the parameters. It is a matrix with two columns. Each element denotes a parameter coded in the same way as in
parameters. For each row, the parameter on the left column is constrained to be smaller than that on the right column. The number of rows is determined by the total number of simple inequality constraints of the form
theta_1 < theta_2 in the model. [Default is
NULL corresponding to no inequality restrictions.]
The last input argument ‘Sample’ specifies the number of random samples to be drawn in the Monte Carlo algorithm. [Default is 200000.]
[For returned values see Value]
It should be noted that ‘lnconst’ can be computed analytically free of Monte Carlo error on a case by case basis described below. For this reason, the users can calculate
C_FIA [see Wu, Myung & Batchelder, 2010a; Equation 7] by adding
lnInt and their hand-calculated
lnconst to minimize the Monte Carlo errors. [In our experience this error is rather low and negligible.]
A sequence of inequalities
theta_1 < theta_2 < ... < theta_k reduces the parameter space to its
1/k!, so in this case
lnconst should be
-ln * (k!). In general, any combination of inequality constraints specifies a union of subsets of the parameter space, each satisfying some sequence of inequalities. For example, the subspace defined by
theta_1 < theta_2 and
theta_3 < theta_2 is a union of two subspaces, one satisfying
theta_1 < theta_3 < theta_2 and the other
theta_3 < theta_1 < theta_2, so the proportion is given by
2 * (1/3!) = 1/3.
A coding example:
Suppose that for model 1HTM-5c of source monitoring [see Wu et al., 2010a] , the sample sizes of source A, source B and new items are 300, 300 and 400, respectively and the inequality constraint of
d_1 < d_2 is imposed. In this case, the six input arguments should be specified as follows:
s = 'ppppCpCCppCCCppCpCCppCCCppCCC';
parameters = c(-.6,-.5,1,2,5,4,5,1,3,5,4,5,4,5); [adapted for R]
ineq0 = matrix(c(2,3), 1,2); [adapted for R]
category = c(1,1,2,1,2,3,5,4,5,4,5,6,7,8,9); [adapted for R]
N = 1000;
Another coding example:
For the pair-clustering model in Batchelder and Riefer (1999, Figure 1), suppose in a pair-clustering experiment there are 300 pairs of words and 100 singletons, the six input arguments should be specified as follows:
s = 'pppCCppCCpCCpCC'; parameters = c(-.75,1,2,3,3,3,3); [adapted for R]
ineq0 = NULL; [adapted for R]
category = c(1,4,2,3,3,4,5,6); [adapted for R]
N = 400;
[For more examples, see Examples]
Since MPTinR version 1.1.3 the Monte Carlo integration is performed in C++ using RcppEigen. With the default arguments, one instance of the C++ workhorse is called. To call multiple instances of the C++ workhorse, you can use the
split argument (which can be useful to replicate results obtained with
multicore = TRUE as described below). Note, that each time before calling the C++ code, the seed is set (the set of random seeds are generated before calling the function for the first time).
Multicore functionality is achieved via snowfall which needs to be loaded and a cluster initialized via
sfInit when setting
multicore = TRUE. When
split = NULL (the default), the
Samples will be evenly distributed on the different cores (using
sfClusterSplit), so that only one instance of the underlying C++ workhorse is called on each core. Setting
split to non-
NULL will produce as many instances (distributed across cores). Note that in order to obtain comparable results (as snowfall uses load balancing), the random seed is set (at each core) before calling each instance of the C++ workhorse. This allows to replicate results obtained via multicore in a non-multicore environment when seting
split appropriately (and
[A named vector:]
The first output argument
CFIA gives the FIA complexity value of the model.
The second [and third] output argument
CI gives the Monte Carlo confidence interval of
CI.l, gives the lower,
CI.u, the upper bound of the interval].
The [fourth] output argument
lnInt gives the log integral term in
C_FIA [see Wu, Myung & Batchelder, 2010a; Equation 7] for models without inequality constraints. When inequality constraints are present,
lnInt does not take into account the change in the normalizing constant in the proposal distribution and must be adjusted with the output argument
The [fifth and sixth] output argument [
CI.lnint] gives the Monte Carlo confidence interval of
lnInt. [.l = lower & .u = upper bound of the CI]
When inequality constraints are present, the [seventh] output argument
lnconst serves as an adjustment of ‘lnInt’. It estimates the logarithm of the proportion of parameter space [0,1]^S that satisfies those inequality constraints, and the log integral term is given by lnInt+lnconst.
The next [two] output argument [
CI.lnconst] give the Monte Carlo confidence interval of ‘lnconst’. [.l = lower & .u = upper bound of the CI]
The R version of the code should now (after moving the code to RcppEigen) be considerably faster than the Matlab version of this code.
The original Matlab code was written by Hao Wu, Jay I. Myung, and William H. Batchelder.
This code was ported to R by Henrik Singmann and David Kellen. RcppEigen was added by Henrik Singmann and Christian Mueller. Multicore functionality was added by Henrik Singmann.
Wu, H., Myung, J.I., & Batchelder, W.H. (2010a). Minimum description length model selection of multinomial processing tree models. Psychonomic Bulletin & Review, 17, 275-286.
Wu, H., Myung, J.I., & Batchelder, W.H. (2010b). On the minimum description length complexity of multinomial processing trees. Journal of Mathematical Psychology, 54, 291-303.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## Not run: # The following example is the code for the first example in Wu, Myung & Batchelder (2010a, pp. 280) # The result should be something like: CFIA = 12.61... or 12.62..., CI = 12.61... - 12.62.... # Executing this command can take a while. bmpt.fia(s = "ppppCpCCppCCCppCpCCppCCCppCCC", parameters = c(-0.5, -0.5, 3, 2, 5, 1, 5, 4, 2, 5, 1, 5, 1, 5), category = c(1,1,2,1,2,3,5,4,5,4,5,6,7,8,9), N = 1000, ineq0 = matrix(c(4,3),1,2)) bmpt.fia(s = "ppppCpCCppCCCppCpCCppCCCppCCC", parameters = c(-0.5, -0.5, 3, 2, 5, 1, 5, 4, 2, 5, 1, 5, 1, 5), category = c(1,1,2,1,2,3,5,4,5,4,5,6,7,8,9), N = 1000, ineq0 = matrix(c(4,3),1,2), mConst = 2L^8) ## End(Not run)