library(knitr) opts_chunk$set( tidy=FALSE, size="small" )
#require( gRbase ) prettyVersion <- packageDescription("gRbase")$Version prettyDate <- format(Sys.Date())
library(gRbase) options("width"=100, "digits"=4) options(useFancyQuotes="UTF-8") knitr::opts_chunk$set( collapse = TRUE, comment = "#>", chk = "markup" ) chk = "markup" options("digits"=3)
This note describes some operations on arrays in R. These operations have been implemented to facilitate implementation of graphical models and Bayesian networks in R.
The documentation of R states the following about arrays: An array in R can have one, two or more dimensions. It is simply a vector which is stored with additional attributes giving the dimensions (attribute "dim") and optionally names for those dimensions (attribute "dimnames"). A two-dimensional array is the same thing as a matrix. One-dimensional arrays often look like vectors, but may be handled differently by some functions.
Arrays appear for example in connection with cross classified data. The array hec below is an excerpt of the HairEyeColor array in R:
hec <- c(32, 53, 11, 50, 10, 25, 36, 66, 9, 34, 5, 29) dim(hec) <- c(2, 3, 2) dimnames(hec) <- list(Hair = c("Black", "Brown"), Eye = c("Brown", "Blue", "Hazel"), Sex = c("Male", "Female")) hec
Above, hec is an array because it has a dim attribute. Moreover, \code{hec} also has a dimnames attribute naming the levels of each dimension. Notice that each dimension is given a name.
Printing arrays can take up a lot of space. Alternative views on an array can be obtained with ftable() or by converting the array to a dataframe with as.data.frame.table(). We shall do so in the following.
##flat <- function(x) {ftable(x, row.vars=1)} flat <- function(x, n=4) {as.data.frame.table(x) |> head(n)} hec |> flat()
An array with named dimensions is in this package called a named array. The functionality described below relies heavily on arrays having named dimensions. A check for an object being a named array is provided by is_named_array()
is_named_array(hec)
Another way is to use tab_new() from gRbase. This function is flexible wrt the input; for example:
dn <- list(Hair=c("Black", "Brown"), Eye=~Brown:Blue:Hazel, Sex=~Male:Female) counts <- c(32, 53, 11, 50, 10, 25, 36, 66, 9, 34, 5, 29) z3 <- tab_new(~Hair:Eye:Sex, levels=dn, value=counts) z4 <- tab_new(c("Hair", "Eye", "Sex"), levels=dn, values=counts)
Default dimnames are generated with
z5 <- tab_new(~Hair:Eye:Sex, levels=c(2, 3, 2), values = counts) dimnames(z5) |> str()
Using tab_new, arrays can be normalized to sum to one in two ways: 1) Normalization can be over the first variable for each configuration of all other variables and 2) over all configurations. For example:
z6 <- tab_new(~Hair:Eye:Sex, levels=c(2, 3, 2), values=counts, normalize="first") z6 |> flat()
In the following we shall denote the dimnames (or variables) of the array \code{hec} by $H$, $E$ and $S$ and we let $(h,e,s)$ denote a configuration of these variables. The contingency table above shall be denoted by $T_{HES}$ and we shall refer to the $(h,e,s)$-entry of $T_{HES}$ as $T_{HES}(h,e,s)$.
Normalize an array with \rr{tab_normalize()} Entries of an array can be normalized to sum to one in two ways: 1) Normalization can be over the first variable for each configuration of all other variables and 2) over all configurations. For example:
tab_normalize(z5, "first") |> flat()
We can subset arrays (this will also be called ``slicing'') in different ways. Notice that the result is not necessarily an array. Slicing can be done using standard R code or using \rr{tab_slice}. The virtue of tab_slice() comes from the flexibility when specifying the slice:
The following leads from the original $2\times 3 \times 2$ array to a $2 \times 2$ array by cutting away the Sex=Male and Eye=Brown slice of the array:
tab_slice(hec, slice=list(Eye=c("Blue", "Hazel"), Sex="Female")) ## Notice: levels can be written as numerics ## tab_slice(hec, slice=list(Eye=2:3, Sex="Female"))
We may also regard the result above as a $2 \times 2 \times 1$ array:
tab_slice(hec, slice=list(Eye=c("Blue", "Hazel"), Sex="Female"), drop=FALSE)
If slicing leads to a one dimensional array, the output will by default not be an array but a vector (without a dim attribute). However, the result can be forced to be a 1-dimensional array:
## A vector: t1 <- tab_slice(hec, slice=list(Hair=1, Sex="Female")); t1 ## A 1-dimensional array: t2 <- tab_slice(hec, slice=list(Hair=1, Sex="Female"), as.array=TRUE); t2 ## A higher dimensional array (in which some dimensions only have one level) t3 <- tab_slice(hec, slice=list(Hair=1, Sex="Female"), drop=FALSE); t3
The difference between the last two forms can be clarified:
t2 |> flat() t3 |> flat()
Collapsing: The $HE$--marginal array $T_{HE}$ of $T_{HES}$ is the array with values \begin{displaymath} T_{HE}(h,e) = \sum_s T_{HES}(h,e,s) \end{displaymath} Inflating: The ``opposite'' operation is to extend an array. For example, we can extend $T_{HE}$ to have a third dimension, e.g.\ \code{Sex}. That is \begin{displaymath} \tilde T_{SHE}(s,h,e) = T_{HE}(h,e) \end{displaymath} so $\tilde T_{SHE}(s,h,e)$ is constant as a function of $s$.
With gRbase we can collapse arrays with:
he <- tab_marg(hec, c("Hair", "Eye")) he
## Alternatives tab_marg(hec, ~Hair:Eye) tab_marg(hec, c(1, 2)) hec %a_% ~Hair:Eye
Notice that collapsing is a projection in the sense that applying the operation again does not change anything (except possibly changing the order of variables):
he1 <- tab_marg(hec, c("Hair", "Eye")) he2 <- tab_marg(he1, c("Eye", "Hair")) tab_equal(he1, he2)
Expand an array by adding additional dimensions with tab_expand():
extra.dim <- list(Sex=c("Male", "Female")) tab_expand(he, extra.dim)
## Alternatives he %a^% extra.dim
Notice that expanding and collapsing brings us back to where we started:
(he %a^% extra.dim) %a_% c("Hair", "Eye")
A reorganization of the table can be made with tab_perm() (similar to aperm()), but tab_perm() allows for a formula and for variable abbreviation:
tab_perm(hec, ~Eye:Sex:Hair) |> flat()
Alternative forms (the first two also works for \code{aperm}):
tab_perm(hec, c("Eye", "Sex", "Hair")) tab_perm(hec, c(2, 3, 1)) tab_perm(hec, ~Ey:Se:Ha) tab_perm(hec, c("Ey", "Se", "Ha"))
Two arrays are defined to be identical 1) if they have the same dimnames and 2) if, possibly after a permutation, all values are identical (up to a small numerical difference):
hec2 <- tab_perm(hec, 3:1) tab_equal(hec, hec2)
## Alternative hec %a==% hec2
We can align one array according to the ordering of another:
hec2 <- tab_perm(hec, 3:1) tab_align(hec2, hec)
## Alternative: tab_align(hec2, dimnames(hec))
The product of two arrays $T_{HE}$ and $T_{HS}$ is defined to be the array $\tilde T_{HES}$ with entries $$ \tilde T_{HES}(h,e,s)= T_{HE}(h,e) + T_{HS}(h,s) $$
The sum, difference and quotient is defined similarly: This is done with tab_rod(), tab_add(), tab_diff() and tab_div():
hs <- tab_marg(hec, ~Hair:Eye) tab_mult(he, hs)
Available operations:
tab_add(he, hs) tab_subt(he, hs) tab_mult(he, hs) tab_div(he, hs) tab_div0(he, hs) ## Convention 0/0 = 0
Shortcuts:
## Alternative he %a+% hs he %a-% hs he %a*% hs he %a/% hs he %a/0% hs ## Convention 0/0 = 0
Multiplication and addition of (a list of) multiple arrays is accomplished with tab_prod() and tab_sum() (much like prod() and sum()):
es <- tab_marg(hec, ~Eye:Sex) tab_sum(he, hs, es) ## tab_sum(list(he, hs, es))
Lists of arrays are processed with
tab_list_add(list(he, hs, es)) tab_list_mult(list(he, hs, es))
If an array consists of non--negative numbers then it may be regarded as an (unnormalized) discrete multivariate density. With this view, the following examples should be self explanatory:
tabDist(hec, marg=~Hair:Eye) tabDist(hec, cond=~Sex) tabDist(hec, marg=~Hair, cond=~Sex)
Multiply values in a slice by some number and all other values by another number:
tab_slice_mult(es, list(Sex="Female"), val=10, comp=0)
A classical example of a Bayesian network is the ``sprinkler example'', see e.g.\ (https://en.wikipedia.org/wiki/Bayesian_network): \begin{quote} \em Suppose that there are two events which could cause grass to be wet: either the sprinkler is on or it is raining. Also, suppose that the rain has a direct effect on the use of the sprinkler (namely that when it rains, the sprinkler is usually not turned on). Then the situation can be modeled with a Bayesian network. \end{quote}
We specify conditional probabilities $p(r)$, $p(s|r)$ and $p(w|s,r)$ as follows (notice that the vertical conditioning bar ($|$) is indicated by the horizontal underscore:
yn <- c("y","n") lev <- list(rain=yn, sprinkler=yn, wet=yn) r <- tab_new(~rain, levels=lev, values=c(.2, .8)) s_r <- tab_new(~sprinkler:rain, levels = lev, values = c(.01, .99, .4, .6)) w_sr <- tab_new( ~wet:sprinkler:rain, levels=lev, values=c(.99, .01, .8, .2, .9, .1, 0, 1)) r s_r |> flat() w_sr |> flat()
The joint distribution $p(r,s,w)=p(r)p(s|r)p(w|s,r)$ can be obtained with tab_prod():
joint <- tab_prod(r, s_r, w_sr); joint |> flat()
What is the probability that it rains given that the grass is wet? We find $p(r,w)=\sum_s p(r,s,w)$ and then $p(r|w)=p(r,w)/p(w)$. Can be done in various ways: with tabDist():
tabDist(joint, marg=~rain, cond=~wet)
## Alternative: rw <- tab_marg(joint, ~rain + wet) tab_div(rw, tab_marg(rw, ~wet)) ## or rw %a/% (rw %a_% ~wet)
## Alternative: x <- tab_slice_mult(rw, slice=list(wet="y")); x tabDist(x, marg=~rain)
We consider the $3$--way \code{lizard} data from \grbase:
data(lizard, package="gRbase") lizard |> flat()
Consider the two factor log--linear model for the \verb'lizard' data. Under the model the expected counts have the form $$ \log m(d,h,s)= a_1(d,h)+a_2(d,s)+a_3(h,s) $$ If we let $n(d,h,s)$ denote the observed counts, the likelihood equations are: Find $m(d,h,s)$ such that \begin{aligned} m(d,h)=n(d,h), \quad m(d,s)=n(d,s), \quad m(h,s)=n(h,s) \end{aligned} where $m(d,h)=\sum_s m(d,h.s)$ etc. The updates are as follows: For the first term we have
$$ m(d,h,s) \leftarrow m(d,h,s) \frac{n(d,h)}{m(d,h)} % , \mbox{ where } % m(d,h) = \sum_s m(d,h,s) $$
After iterating the updates will not change and we will have equality: $ m(d,h,s) = m(d,h,s) \frac{n(d,h)}{m(d,h)}$ and summing over $s$ shows that the equation $m(d,h)=n(d,h)$ is satisfied.
A rudimentary implementation of iterative proportional scaling for log--linear models is straight forward:
myips <- function(indata, glist){ fit <- indata fit[] <- 1 ## List of sufficient marginal tables md <- lapply(glist, function(g) tab_marg(indata, g)) n_iter <- 4 n_generators <- length(glist) for (i in 1:n_iter){ for (j in 1:n_generators){ mf <- tab_marg(fit, glist[[j]]) # adj <- tab_div( md[[ j ]], mf) # fit <- tab_mult( fit, adj ) ## or adj <- md[[ j ]] %a/% mf fit <- fit %a*% adj } } pearson <- sum((fit - indata)^2 / fit) list(pearson=pearson, fit=fit) } glist <- list(c("species", "diam"),c("species", "height"),c("diam", "height")) fm1 <- myips(lizard, glist) fm1$pearson fm1$fit |> flat() fm2 <- loglin(lizard, glist, fit=T) fm2$pearson fm2$fit |> flat()
For e.g.\ a $2\times 3 \times 2$ array, the entries are such that the first variable varies fastest so the ordering of the cells are $(1,1,1)$, $(2,1,1)$, $(1,2,1)$, $(2,2,1)$, $(1,3,1)$ and so on. To find the value of such a cell, say, $(j,k,l)$ in the array (which is really just a vector), the cell is mapped into an entry of a vector.
For example, cell $(2,3,1)$ (Hair=Brown, Eye=Hazel, Sex=Male) must be mapped to entry $4$ in
hec
c(hec)
For illustration we do:
cell2name <- function(cell, dimnames) { unlist(lapply(1:length(cell), function(m){ dimnames[[m]][cell[m]] }))} cell2name(c(2,3,1), dimnames(hec))
\subsection{\code{cell2entry()}, \code{entry2cell()} and \code{next_cell()} }
The map from a cell to the corresponding entry is provided by \rr{cell2entry()}. The reverse operation, going from an entry to a cell (which is much less needed) is provided by \rr {entry2cell()}.
cell2entry(c(2,3,1), dim=c(2, 3, 2)) entry2cell(6, dim=c(2, 3, 2))
Given a cell, say $i=(2,3,1)$ in a $2\times 3\times 2$ array we often want to find the next cell in the table following the convention that the first factor varies fastest, that is $(1,1,2)$. This is provided by next_cell().
next_cell(c(2,3,1), dim=c(2, 3, 2))
Given that we look at cells for which for which the index in dimension $2$ is at level $3$ (that is Eye=Hazel), i.e.\ cells of the form $(j,3,l)$. Given such a cell, what is then the next cell that also satisfies this constraint. This is provided by next_cell_slice().
next_cell_slice(c(1,3,1), slice_marg=2, dim=c( 2, 3, 2 )) next_cell_slice(c(2,3,1), slice_marg=2, dim=c( 2, 3, 2 ))
Given that in dimension $2$ we look at level $3$. We want to find entries for the cells of the form $(j,3,l)$.\footnote{FIXME:slicecell and sliceset should be renamed}
slice2entry(slice_cell=3, slice_marg=2, dim=c( 2, 3, 2 ))
To verify that we indeed get the right cells:
r <- slice2entry(slice_cell=3, slice_marg=2, dim=c( 2, 3, 2 )) lapply(lapply(r, entry2cell, c( 2, 3, 2 )), cell2name, dimnames(hec))
Using the operations above we can obtain the combinations of the factors as a matrix:
head( fact_grid( c(2, 3, 2) ), 6 )
A similar dataframe can also be obtained with the standard R function \code{expand.grid} (but \code{factGrid} is faster)
head( expand.grid(list(1:2, 1:3, 1:2)), 6 )
\appendix
Slicing using standard R code can be done as follows:
hec[, 2:3, ] |> flat() ## A 2 x 2 x 2 array hec[1, , 1] ## A vector hec[1, , 1, drop=FALSE] ## A 1 x 3 x 1 array
Programmatically we can do the above as
do.call("[", c(list(hec), list(TRUE, 2:3, TRUE))) |> flat() do.call("[", c(list(hec), list(1, TRUE, 1))) do.call("[", c(list(hec), list(1, TRUE, 1), drop=FALSE))
\grbase\ provides two alterntives for each of these three cases above:
tab_slice_prim(hec, slice=list(TRUE, 2:3, TRUE)) |> flat() tab_slice(hec, slice=list(c(2, 3)), margin=2) |> flat() tab_slice_prim(hec, slice=list(1, TRUE, 1)) tab_slice(hec, slice=list(1, 1), margin=c(1, 3)) tab_slice_prim(hec, slice=list(1, TRUE, 1), drop=FALSE) tab_slice(hec, slice=list(1, 1), margin=c(1, 3), drop=FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.