handling_NAs: Discrete probability masses and NA/NaN/Inf in distributions...

handling_NAsR Documentation

Discrete probability masses and NA/NaN/Inf in distributions of summary statistics.


This explains the use of the boundaries attribute of observed statistics to handle (1) values of the summary statistics that can occur with some probability mass; (2) special values (NA/NaN/Inf) in distributions of summary statistics. This further explains why Infusion handles special values by removing affected distributions unless the boundaries attribute is used.


Special values may be encountered in an analysis. For example, trying to estimate a regression coefficient when the predictor variable is constant may return a NaN. Since functions such as refine automatically add simulated distributions, this problem must be automatically handled by the user's simulation function or by the package functions, rather than by user's tinkering with the Infusion procedures.

The user must consider what s-he would do if actual data also included NA/NaN/Inf values. If such data would not be subject to a statistical analysis, then the simulation procedure must reflect that, otherwise the analysis will be biased. The processing of reference tables by Infusion functions applies na.omit() on the tables so any line containing NA's will be removed. The drawbacks are that the number of informative simulations is reduced and that inference will be difficult if the data-generating parameters were indeed prone to induce data that would not be subject to statistical analysis. Thus, it may be necessary to simulate alternative data until no special values are obtained and the target size of the simulated distribution is reached. One solution is for the user to write a simulation function that calls itself recursively until a valid summary statistic is produced. Care is then needed to avoid infinite recursion (which might well indicate unlikely parameter values).

Alternatively, if one considers that special values are informative about parameters (in the above example of a regression coefficient, if a constant predictor variable says something about the parameters), then NA/NaN/Inf must be replaced by a (fixed) dummy numerical value which is flagged to be distinctly handled, using the boundaries attribute of the observed summary statistics. The simulation function should return statistic foo=-1 (say) instead of foo=NaN, and one should then set attr(<observed>,"boundaries") <- c(foo=-1).

The boundary attribute is also useful to handle all values of the summary statistics that can occur with some probability mass. For example if the estimate est_p of a probability takes values 0 or 1 with positive probability, one should set attr(<observed>,"boundaries") <- c(p_est=0,p_est=1).

Infusion documentation built on Sept. 29, 2022, 1:05 a.m.