splitTable: Construct sparse matrices from a nominal matrix/dataframe

Description Usage Arguments Value Note Author(s) See Also Examples

View source: R/split.R

Description

This function splits a matrix or dataframe into two sparse matrices: an incidence and an index matrix. The incidence matrix links the observations (rows) to all possible values that occur in the original matrix. The index matrix links the values to the attributes (columns). This encoding allows for highly efficient calculations on nominal data.

Usage

1
2
3
splitTable(data, 
    attributes = colnames(data), observations = rownames(data),
		name.binder = ":", split = NULL)

Arguments

data

a matrix (or data.frame) with observations as rows and nominal attributes as columns. Numerical values in the data will be interpreted as classes (i.e. as nominal data, aka categorical data).

attributes, observations

The row names and column names of the data will by default be extracted from the input matrix. However, in special situations they can be added separately. Note that names of the attributes (‘column names’) are needed for the production of unique value names. In case of absent column names, new column names of the form ‘X1’ are automatically generated.

name.binder

Character string to be added between attribute names and value names. Defaults to colon ‘:’.

split

Character string to split values in each cell of the table, e.g. a comma or semicolon.

Value

A list containing the various row and column names, and the two sparse pattern matrices of format ngCMatrix:

attributes

vector of attribute names

values

vector of unique value names

observations

vector of observation names

OV

sparse pattern matrix with observations as rows (O) and values as columns (V)

AV

sparse pattern matrix with attributes as rows (A) and values as columns (V)

Note

Input of data as a matrix or as a data.frame might lead to different ordering of the values because collation differs per locale (see the discussion at ttMatrix, which does the heavy lifting here).

The term ‘attribute’ is used in instead of the more common term ‘variable’ to allow for the capital A to uniquely identify attributes and V to identify values.

Author(s)

Michael Cysouw

See Also

More methods to use such split tables can be found at sim.nominal.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# start with a simple example from the MASS library
# compare the original data with the encoding as sparse matrices
library(MASS)
farms
splitTable(farms)

# As a more involved example, consider the WALS data included in this package
# Transforming the reasonably large WALS data.frame \code{wals$data} is fast
# (2566 observations, 131 attributes, 630 unique values)
# The function `str' gives a useful summary of the result of the splitting
data(wals)
system.time(W <- splitTable(wals$data))
str(W) 

# Some basic use examples on the complete WALS data.
# The OV-matrix can be used to quickly count the number of similarities 
# between all pairs of observations. Note that with the large amount of missing values
# the resulting numbers are not really meaningfull. Some normalisation is necessary.
system.time( O <- tcrossprod(W$OV*1) )
O[1:10,1:10]

# The number of comparisons available for each pair of attributes
system.time( N <- crossprod(tcrossprod(W$OV*1, W$AV*1)) )
N[1:10,1:10]

# compute the number of available datapoints per observation (language) in WALS
# once the sparse matrices W are computed, such calculations are much quicker than 'apply'
system.time( avail1 <- rowSums(W$OV) )
system.time( avail2 <- apply(wals$data,1,function(x){sum(!is.na(x))}))
names(avail2) <- NULL
all.equal(avail1, avail2)

# Very unequal availability of data over languages in WALS
hist(avail1)

cysouw/qlcMatrix documentation built on April 22, 2018, 4:59 a.m.