Description Usage Arguments Value Note Author(s) See Also Examples

This function splits a matrix or dataframe into two sparse matrices: an incidence and an index matrix. The incidence matrix links the observations (rows) to all possible values that occur in the original matrix. The index matrix links the values to the attributes (columns). This encoding allows for highly efficient calculations on nominal data.

1 2 3 | ```
splitTable(data,
attributes = colnames(data), observations = rownames(data),
name.binder = ":", split = NULL)
``` |

`data` |
a matrix (or data.frame) with observations as rows and nominal attributes as columns. Numerical values in the data will be interpreted as classes (i.e. as nominal data, aka categorical data). |

`attributes, observations` |
The row names and column names of the data will by default be extracted from the input matrix. However, in special situations they can be added separately. Note that names of the attributes (‘column names’) are needed for the production of unique value names. In case of absent column names, new column names of the form ‘X1’ are automatically generated. |

`name.binder` |
Character string to be added between attribute names and value names. Defaults to colon ‘:’. |

`split` |
Character string to split values in each cell of the table, e.g. a comma or semicolon. |

A list containing the various row and column names, and the two sparse pattern matrices of format `ngCMatrix`

:

`attributes` |
vector of attribute names |

`values` |
vector of unique value names |

`observations` |
vector of observation names |

`OV` |
sparse pattern matrix with observations as rows (O) and values as columns (V) |

`AV` |
sparse pattern matrix with attributes as rows (A) and values as columns (V) |

Input of data as a matrix or as a data.frame might lead to different ordering of the values because collation differs per locale (see the discussion at `ttMatrix`

, which does the heavy lifting here).

The term ‘attribute’ is used in instead of the more common term ‘variable’ to allow for the capital A to uniquely identify attributes and V to identify values.

Michael Cysouw

More methods to use such split tables can be found at `sim.nominal`

.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | ```
# start with a simple example from the MASS library
# compare the original data with the encoding as sparse matrices
library(MASS)
farms
splitTable(farms)
# As a more involved example, consider the WALS data included in this package
# Transforming the reasonably large WALS data.frame \code{wals$data} is fast
# (2566 observations, 131 attributes, 630 unique values)
# The function `str' gives a useful summary of the result of the splitting
data(wals)
system.time(W <- splitTable(wals$data))
str(W)
# Some basic use examples on the complete WALS data.
# The OV-matrix can be used to quickly count the number of similarities
# between all pairs of observations. Note that with the large amount of missing values
# the resulting numbers are not really meaningfull. Some normalisation is necessary.
system.time( O <- tcrossprod(W$OV*1) )
O[1:10,1:10]
# The number of comparisons available for each pair of attributes
system.time( N <- crossprod(tcrossprod(W$OV*1, W$AV*1)) )
N[1:10,1:10]
# compute the number of available datapoints per observation (language) in WALS
# once the sparse matrices W are computed, such calculations are much quicker than 'apply'
system.time( avail1 <- rowSums(W$OV) )
system.time( avail2 <- apply(wals$data,1,function(x){sum(!is.na(x))}))
names(avail2) <- NULL
all.equal(avail1, avail2)
# Very unequal availability of data over languages in WALS
hist(avail1)
``` |

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.