loglinear (R Documentation)
Description

loglinear() fits a hierarchical log-linear model to a dataset (typically a contingency table) and performs an exact conditional test on the fitted model using various distance metrics. The exact test, which is a goodness-of-fit test, is performed via Monte Carlo sampling from the conditional distribution of the table given the sufficient statistics of the model. In short, inference is drawn by comparing the statistic of the observed table (be it an unnormalized log-likelihood, a Pearson chi-squared value, or another metric) to those of the sampled tables. The proportion of sampled tables with values equal to or more extreme than that of the observed table is the resulting p-value.
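As a rough illustration of that computation (a minimal sketch with hypothetical names, not the package's internal code), the Monte Carlo p-value is simply the proportion of sampled statistics at least as extreme as the observed one:

# sketch: `samp_stats` stands in for the statistics of the MCMC-sampled
# tables and `obs_stat` for the statistic of the observed table
samp_stats <- rchisq(10000, df = 1)  # hypothetical sampled statistics
obs_stat   <- 3.2                    # hypothetical observed statistic
mean(samp_stats >= obs_stat)         # Monte Carlo exact p-value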
Usage

loglinear(
  model,
  data,
  init = tab2vec(data),
  iter = 10000,
  burn = 1000,
  thin = 10,
  engine = c("C++", "R"),
  method = c("ipf", "mcmc"),
  moves,
  ...
)
Arguments

model: hierarchical log-linear model specification.

data: the data, typically as a table, though other formats are accepted.

init: the initialization of the chain; by default, the observed table.

iter: number of chain iterations.

burn: burn-in.

thin: thinning.

engine: whether the C++ or the R implementation should be used.

method: should the expected table (exp) be fit using iterative proportional fitting (via loglin()) or as the average of the MCMC steps?

moves: the Markov moves for the MCMC (as columns of a matrix).

...: further arguments.
Details

In many ways, loglinear() is like stats::loglin() or MASS::loglm(); however, there are a few key differences.
The first difference is methodological. The tests conducted with loglinear() are exact conditional tests based on the conditional distribution of the data given the sufficient statistics for the model. In other words, they are analogues of Fisher's exact test for generic log-linear models. These tests are made possible by advances in algebraic statistics; see the first three references below.
The second difference between loglinear() and loglin() or loglm() is that inference is made through Monte Carlo simulation. In particular, loglinear() leverages Markov moves to sample from the conditional distribution of the data given its sufficient statistics. If the software 4ti2 is installed on your machine, you can use markov() (or let loglinear() use markov() for you) to generate a Markov basis to use for the Markov moves. This basis is guaranteed to produce an MCMC routine that converges to the conditional distribution of interest. Since loglinear() conducts inference with MCMC, the usual MCMC concerns apply; in particular, burn-in and mixing (autocorrelation of samples) should be addressed. The examples illustrate some of these topics. The result is a p-value generated by Monte Carlo simulation. Its standard error is provided (computed as in the standard CLT confidence interval) to give a sense of the Monte Carlo error.
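Since the Monte Carlo p-value is a sample proportion, its CLT-based standard error has a closed form; here is a minimal sketch under hypothetical values (not the package's internal code):

p_hat <- 0.032  # hypothetical Monte Carlo p-value estimate
n     <- 10000  # hypothetical number of retained MCMC samples
sqrt(p_hat * (1 - p_hat) / n)  # CLT-based standard error of the estimate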
A third way that loglinear() differs from loglin() or loglm() is in generalizing the kinds of tests performed. While those functions allow for asymptotic unconditional testing using Pearson's X^2 test and the likelihood ratio test, loglinear() gives several test statistics: Pearson's X^2, the likelihood ratio G^2, Freeman-Tukey, Cressie-Read (lambda = 2/3), and Neyman's modified X^2; see the last reference below. In other words, to compute the exact p-value, iter = 1e4 samples are drawn from the conditional distribution of the table given the sufficient statistics, and then the proportion of tables that have X^2, G^2, etc. values greater than or equal to that of the observed table is the p-value for the (conditional) exact test. A similar, and perhaps preferable, approach simply adds up the probabilities of the tables that have probabilities less than or equal to that of the observed table; this is the first line of output in loglinear() and does not use a test statistic.
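For reference, here are the five statistics in their textbook forms (see the Read and Cressie reference; the package's internal definitions may differ in detail), sketched in R with hypothetical observed and expected tables:

O <- c(43, 9, 44, 4)           # hypothetical observed cell counts
E <- c(40.3, 11.7, 46.7, 6.3)  # hypothetical expected (fitted) counts
X2 <- sum((O - E)^2 / E)                    # Pearson
G2 <- 2 * sum(O * log(O / E))               # likelihood ratio (all O > 0 here)
FT <- 4 * sum((sqrt(O) - sqrt(E))^2)        # Freeman-Tukey
CR <- (9/5) * sum(O * ((O / E)^(2/3) - 1))  # Cressie-Read, lambda = 2/3
NM <- sum((O - E)^2 / O)                    # Neyman modified chi-squared
c(X2 = X2, G2 = G2, FT = FT, CR = CR, NM = NM)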
Some authors (see the Agresti references below) suggest that for discrete problems a "mid p-value" is preferable to the traditional p-value, and that when presented it should be interpreted in the same way. If the p-value is defined to be P(samps >= obs), the mid p-value is defined to be P(samps > obs) + P(samps == obs)/2. The mid p-value is computed for each test.
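A minimal sketch of the two definitions side by side, using a hypothetical vector of sampled statistics:

samps <- c(1, 2, 2, 3, 5, 5, 5, 8)  # hypothetical sampled statistics
obs   <- 5                          # hypothetical observed statistic
mean(samps >= obs)                          # traditional p-value
mean(samps > obs) + mean(samps == obs) / 2  # mid p-value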
Value

loglinear() returns a list containing the following named elements:

steps: an integer matrix whose columns represent individual samples from the MCMC.

moves: the moves used for the proposal distribution in the MCMC, computed with 4ti2 (note that only the positive moves are given).

accept_prob: the empirical transition probability of the moves, including the thinned moves.

param: the fitted parameters of the log-linear model.

df: parameters per term in the model.

quality: model selection statistics AIC, AICc, and BIC.

residuals: the (unstandardized) Pearson residuals, (O - E) / sqrt(E).

call: the call.

obs: the contingency table given.

exp: the fitted contingency table as an integer array.

A: the sufficient statistics computing matrix (from Tmaker).

p.value: the exact p-values of individual tests, accurate to Monte Carlo error; these are computed as the proportion of samples with statistics equal to or larger than the observed statistic.

mid.p.value: the mid p-values; see Agresti pp. 20–21.

statistic: the Pearson chi-squared (X2), likelihood ratio (G2), Freeman-Tukey (FT), Cressie-Read (CR), and Neyman modified chi-squared (NM) statistics computed for the table given.

sampsStats: the statistics computed for each MCMC sample.

cells: the number of cells in the table.

method: the method used to estimate the table.
Author(s)

David Kahle
References

Diaconis, P. and B. Sturmfels (1998). Algebraic Algorithms for Sampling from Conditional Distributions. The Annals of Statistics 26(1), pp. 363–397.

Drton, M., B. Sturmfels, and S. Sullivant (2009). Lectures on Algebraic Statistics. Basel: Birkhauser Verlag AG.

Aoki, S., H. Hara, and A. Takemura (2012). Markov Bases in Algebraic Statistics. Springer.

Agresti, A. (2002). Categorical Data Analysis, 2nd ed. John Wiley & Sons.

Agresti, A. (1992). A Survey of Exact Inference for Contingency Tables. Statistical Science 7(1), pp. 131–153.

Read, T. and Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag.
See Also

stats::loglin(), MASS::loglm(), metropolis()
Examples

## Not run: requires LattE and 4ti2
## handedness introductory example
############################################################
data(handy); handy
(out <- loglinear(~ Gender + Handedness, data = handy))
# you can also specify the same model using variable indices...
(out <- loglinear(~ 1 + 2, data = handy))
# ... or as a list of facets given by indices
(out <- loglinear(list(1, 2), data = handy))
# ... or as a list of facets given by name
(out <- loglinear(list("Gender", "Handedness"), data = handy))
# ... and even via a pre-computed configuration matrix
# this method does come with somewhat reduced output
A <- hmat(c(2, 2), 1:2)
(out <- loglinear(A, data = handy))
# loglinear performs the same tasks as loglin and loglm,
# but loglinear gives the exact test p-values and more goodness-of-fit statistics
stats::loglin(handy, list(1, 2))
MASS::loglm(~ Gender + Handedness, data = handy)
# loglm is just a wrapper of loglin
# we can check loglinear's output with
fisher.test(handy)$p.value
out$p.value
# comparisons between loglinear, stats::loglin, and MASS::loglm
############################################################
(loglinearFit <- loglinear(~ Gender + Handedness, data = handy))
(loglinFit <- stats::loglin(handy, list(1, 2), fit = TRUE, param = TRUE))
(loglmFit <- MASS::loglm(~ Gender + Handedness, data = handy))
# the expected table given the sufficient statistics can be computed
# via two methods: iterative proportional fitting, or the mcmc itself:
loglinearFit$exp # ipf
loglinear(~ Gender + Handedness, data = handy, method = "mcmc")$exp
loglinFit$fit # the equivalent in loglin; this is used by default in loglinear
# the parameter values of the loglinear model can be accessed
loglinearFit$param
loglinFit$param
# the p-value for the goodness-of-fit of the overall model is available as well:
# loglinear gives the exact conditional p-value
# (conditional on the sufficient statistics)
# the five numbers correspond to the probability of observing a table that is
# "more weird" than the observed table, where "more weird" is determined
# by having a larger X2 value (or G2, FT, CR, or NM)
loglinearFit$p.value
# in this case (a 2x2 table with the independence model), we can check that
# the above p-values are correct up to Monte Carlo error
fisher.test(handy)$p.value
# loglin gives the p-values using the unconditional asymptotic distribution:
# note that the two are quite different in this case, although the conclusion
# is the same
c(
  "X2" = pchisq(loglinFit$pearson, df = loglinFit$df, lower.tail = FALSE),
  "G2" = pchisq(loglinFit$lrt, df = loglinFit$df, lower.tail = FALSE)
)
# mid p-values are available as well:
loglinearFit$mid.p.value # the mid (exact conditional) p-value is also available
# the test statistics based on the observed table and the expected
# table under the model are available
loglinearFit$statistic
c(X2 = loglinFit$pearson, G2 = loglinFit$lrt) # loglin only gives X2 and G2
# note that PR is the unnormalized log-probability
# the markov moves used for the proposal distribution of the metropolis-hastings
# algorithm are returned. the proposal distribution is uniform on +/-
# the moves added to the current table
loglinearFit$moves
# they are more easily understood as tables
vec2tab(loglinearFit$moves, dim(handy))
# notice that the marginals stay fixed:
handy + vec2tab(loglinearFit$moves, dim(handy))
# these were computed as the markov basis of the integer matrix
# (for different models, different schemes may be employed)
loglinearFit$A
markov(loglinearFit$A)
loglinearFit$moves
# the moves are also sometimes written in tableau form (LAS p.13)
tableau(loglinearFit$moves, dim(handy))
# that's +1 in the table elements [1,1] and [2,2]
# and -1 in the elements [1,2] and [2,1]
# the acceptance probability of the MCMC is retained
loglinearFit$accept_prob
# various model assessment measures are also available
loglinearFit$quality
# the number of independent parameters per term are in df
loglinearFit$df
# as an added help, you may find the visuals in vcd useful:
library(vcd)
mosaic(~ Gender + Handedness, data = handy, shade = TRUE, legend = TRUE)
# note mosaic's use of the asymptotic X^2 test
## politics example - with computing the exact p value by hand
############################################################
data(politics); politics
(out <- loglinear(~ Personality + Party, data = politics))
loglinFit <- stats::loglin(politics, as.list(1:2), fit = TRUE, param = TRUE)
out$p.value
# exact without monte-carlo error
sum(dhyper(c(0:3,6:9), 10, 10, 9))
fisher.test(politics)$p.value
round(dhyper(0:9, 10, 10, 9), 4)
# we can sample from the hypergeometric distribution on the fiber using
# rhyper
rhyper(100, 10, 10, 9)
# comparisons:
out$exp
loglinFit$fit
out$param
loglinFit$param
out$p.value # exact
c(
  X2 = pchisq(loglinFit$pearson, df = loglinFit$df, lower.tail = FALSE),
  G2 = pchisq(loglinFit$lrt, df = loglinFit$df, lower.tail = FALSE)
) # asymptotic approximation
fisher.test(politics)$p.value # the exact conditional p-value
out$statistic # accurate to monte carlo error
c(X2 = loglinFit$pearson, G2 = loglinFit$lrt)
vcd::mosaic(~ Personality + Party, data = politics, shade = TRUE, legend = TRUE)
# alternative model specifications:
loglinear(~ Personality + Party, data = politics)
loglinear(~ 1 + 2, data = politics)
loglinear(list(1, 2), data = politics)
loglinear(list("Personality", "Party"), data = politics)
## eyeHairColor from the Diaconis and Sturmfels reference
############################################################
data(HairEyeColor)
eyeHairColor <- margin.table(HairEyeColor, 2:1)
out <- loglinear(~ Eye + Hair, data = eyeHairColor)
# the default fisher.test doesn't work even with workspace = 2E9
# (with over 4.5Gb in memory) because it is trying to enumerate the fiber.
#fisher.test(eyeHairColor, workspace = 2E9)
# it can, however, compute Monte Carlo p-values for RxC tables, like loglinear
fisher.test(eyeHairColor, simulate.p.value = TRUE, B = 1e6)
out$p.value
# you can see the markov moves used in the mcmc in tableau notation
tableau(out$moves, dim(eyeHairColor))
# library(vcd)
# mosaic(~ Eye + Hair, data = HairEyeColor, shade = TRUE, legend = TRUE)
## abortion preference example from the
## Diaconis and Sturmfels reference pp. 379--381
## a no 3-way interaction model
############################################################
data(abortion); abortion
(loglinearFit <- loglinear(
  subsets(1:3, 2), data = abortion,
  iter = 10000, burn = 50000, thin = 50
))
loglinFit <- loglin(abortion, subsets(1:3, 2), fit = TRUE, param = TRUE)
vec2tab(rowMeans(loglinearFit$steps), dim(abortion)) # cf. p. 380
loglinFit$fit
all.equal(loglinearFit$param, loglinFit$param)
qqplot(rchisq(1055, df = 8), loglinearFit$sampsStats$X2s)
curve(1*x, from = 0, to = 30, add = TRUE, col = "red")
( nMoves <- 2*ncol(loglinearFit$moves) ) # DS uses 110
# (the markov basis is larger than it needs to be)
## loglin no three-way interaction model example
############################################################
# the help page for loglin() fits the no three-way interaction model on HairEyeColor,
# finds a .66196 p-value using the asymptotic distribution, and concludes
# a good fit:
data(HairEyeColor)
loglinearFit <- loglinear(subsets(1:3, 2), data = HairEyeColor)
loglinFit <- loglin(HairEyeColor, subsets(1:3, 2), fit = TRUE, param = TRUE)
# p values
loglinearFit$p.value
pchisq(loglinFit$lrt, loglinFit$df, lower.tail = FALSE) # see ?loglin
# test statistics
loglinearFit$statistic
c(X2 = loglinFit$pearson, G2 = loglinFit$lrt)
# fits (estimated tables)
loglinearFit$obs
round(loglinearFit$exp, 1)
round(loglinFit$fit, 1)
# checking the autocorrelation of mcmc
acf(loglinearFit$sampsStats$PRs)
# poor mixing is a known limitation of markov basis strategies
# one strategy is to try to thin the mcmc
loglinearFit <- loglinear(subsets(1:3, 2), data = HairEyeColor, thin = 100)
acf(loglinearFit$sampsStats$PRs) # got it! (overkill, actually)
# the slight differences in loglinFit$fit and loglinearFit$exp (both done with ipf from loglin)
# are due to differences in variable order:
loglin(HairEyeColor, subsets(1:3, 2), fit = TRUE)$fit
loglin(HairEyeColor, subsets(1:3, 2)[c(1,3,2)], fit = TRUE)$fit
# let's look at a few model moves
vec2tab(loglinearFit$moves[,1], dim(HairEyeColor))
vec2tab(loglinearFit$moves[,50], dim(HairEyeColor))
-vec2tab(loglinearFit$moves[,50], dim(HairEyeColor))
# they contribute 0 to the marginals of the table
# (the sufficient statistics of the model)
exampleMove <- loglinearFit$moves[,50]
vec2tab(exampleMove, dim(HairEyeColor))
loglinearFit$A %*% exampleMove
# two tables with same sufficient statistics
HairEyeColor
HairEyeColor + vec2tab(exampleMove, dim(HairEyeColor))
# here are the sufficient statistics:
loglinearFit$A %*% tab2vec(HairEyeColor)
loglinearFit$A %*% tab2vec(HairEyeColor + vec2tab(exampleMove, dim(HairEyeColor)))
## a table with positive marginals but no MLE for
## the no-three way interaction model
############################################################
data(haberman); haberman
mod <- loglinear(subsets(1:3, 2), data = haberman)
loglinFit <- loglin(haberman, subsets(1:3, 2), param = TRUE, fit = TRUE)
loglinFit$fit
loglinFit$param
c(X2 = loglinFit$pearson, G2 = loglinFit$lrt)
loglinearFit <- loglinear(subsets(1:3, 2), data = haberman, method = "mcmc")
loglinearFit$exp
loglinearFit$param
loglinearFit$statistic
A <- hmat(rep(2, 3), subsets(1:3, 2))
count_tables(haberman, A) # there's only one table in the fiber!
## an example from agresti, p.322
############################################################
data(drugs); drugs
ftable(aperm(drugs, c(3, 1, 2))) # = table 8.3
out <- loglinear(~ Alcohol + Cigarette + Marijuana, data = drugs)
matrix(round(aperm(out$exp, c(2,1,3)), 1), byrow = FALSE)
loglin(drugs, as.list(1:3), fit = TRUE)$fit
loglin(drugs, as.list(1:3), param = TRUE)$param
# # the saturated model issues a warning from markov, but works:
# out <- loglinear(~ Alcohol * Cigarette * Marijuana, data = drugs)
# matrix(round(aperm(out$exp, c(2,1,3)), 1), byrow = FALSE) # = the data
ftable(aperm(out$exp, c(3,1,2)))
stats <- loglin(drugs, as.list(1:3), fit = TRUE, param = TRUE)
## considered via glm
mod <- glm(
  Freq ~ Alcohol + Cigarette + Marijuana,
  data = as.data.frame(drugs), family = poisson
)
summary(mod)
matrix(round(mod$fitted.values[c(1,3,2,4,5,7,6,8)],1))
mod <- glm(
  Freq ~ Alcohol * Cigarette + Marijuana,
  data = as.data.frame(drugs), family = poisson
)
summary(mod)
matrix(round(mod$fitted.values[c(1,3,2,4,5,7,6,8)],1))
mod <- glm(
  Freq ~ Alcohol * Cigarette * Marijuana,
  data = as.data.frame(drugs), family = poisson
)
summary(mod)
matrix(round(mod$fitted.values[c(1,3,2,4,5,7,6,8)],1))
## End(Not run)