In psolymos/mefa4: Multivariate Data Handling with S4 Classes and Sparse Matrices

mefa4 is a reimplementation of the S3 object classes found in the mefa R package. The new S4 class "Mefa" has all the consistency checks that S3 classes can not have, and most importantly, it stores the cross-tabuted results as a compact sparse matrix (S4 object class "dgCMatrix" of the Matrix package). The use of sparse matrices speed up computations, and reduces object sizes considerably. This vignette introduces the main functions, classes and methods of the package mefa4.

Introduction

The aim of the mefa and mefa4 packages are to help in storing cross tabulated ecological data tables (community data) together with attributes for rows (samples) and columns (species, taxa). This allows that one can easily subset the relational data object without separately manipulating 2--3 pieces of R objects. By doing so, the chances of errors are reduced.

As ecological data sets are increasing in size, it is necessary to find more efficient ways of data storage and manipulation. To this end, it was in the air for some time to redesign the mefa package and take advantages of sparse matrices from the Matrix package. This is done at the costs of some old functionalities not being available for S4 classes at the time being. Here I give an overview so the user can decide how to use the parallel availability of old S3 and newer S4 classes.

Comparisons

S3 and S4 classes

The S3 classes defined in mefa were stcs and mefa. stcs is a data frame with several attributes:

library(mefa)
x <- data.frame(
    sample = paste("Sample", c(1,1,2,2,3,4), sep="."),
    species = c(paste("Species", c(1,1,1,2,3), sep="."),  "zero.pseudo"),
    count = c(1,2,10,3,4,0),
    segment = letters[c(6,13,6,13,6,6)],
    stringsAsFactors = TRUE)
s <- stcs(x)
attributes(s)

These attributes ensure that the cross-tab made by the function mefa() creates a proper cross-tab by eliminating the column that is only a placeholder for empty samples, etc.:

samp <- data.frame(samples=levels(x$sample), var1=1:2,
    stringsAsFactors = TRUE)
taxa <- data.frame(specnames=levels(x$species), var2=c("b","a"),
    stringsAsFactors = TRUE)
rownames(samp) <- samp$samples
rownames(taxa) <- taxa$specnames
(m <- mefa(s, samp, taxa))
m$xtab

The stcs step is almost redundant, and inefficient relative to the stats::xtabs function with sparse = TRUE. This function is adapted to some extent, so it can subset the cross-tabulated results before returning the value (rdrop and cdrop arguments, that is available as the Xtab function in the mefa4 package). This takes a formula, and can be applied directly on a data frame. The formula can have a left-hand side, or the left-hand side can be missing. The right-hand side can contain 2--3 factors, and the result will be a sparse matrix or a list of sparse matrices, respectively:

library(mefa4)
x0  <- Xtab(~ sample + species, x)
x1  <- Xtab(count ~ sample + species, x)
x11 <- Xtab(count ~ sample + species + segment, x)

Dropping some rows/columns can be done in several ways. A logical statement implies that all empty rows/columns are dropped, but indices (numeric or character) can also be used:

x2  <- Xtab(count ~ sample + species, x, cdrop=FALSE, rdrop=TRUE)
x21 <- Xtab(count ~ sample + species, x, cdrop=TRUE, rdrop=FALSE)
(x22 <- Xtab(count ~ sample + species, x, cdrop="zero.pseudo"))

The results here are sparse matrices in compact mode, this means that redundant indices are only kept once, so it is more compact than a long formatted database representation stored in an stcs object or in the original data frame, or a triplet representation of a sparse matrix. See vignettes in the Matrix package for more details on S4 sparse matrix classes.

The S4 class "Mefa" is defined in the mefa4 package. It can be created by the Mefa() function, and the result has 4 slots:

(x3 <- Mefa(x1, samp, taxa))

The xtab slot stores the cross-tab in sparse matrix format. The samp slot stores the row attributes for xtab as data frame or can be NULL. The taxa slot stores the column attributes for xtab as data frame or can be NULL. Validity checks are done to ensure proper object classes to be used and matching dimnames. The option that a column in the attribute tables can be specified to find matching names is not available in the new implementation. Corresponding rownames of the data frames has to match dimnames of xtab. The join slot can be "left" (all rows/columns in the xtab are kept, matching attributes are selected, non-matching attributes are excluded, and missing attributes are filled up with NA) or "inner" (only the intersection of corresponding dimnames are used to form the return value).

The call in Mefa() can take any matrix or sparse matrix as argument, but it will be stored in a sparse mode. Here we use a matrix as input, and samp has missing values ("left" join is used by default):

(x4 <- Mefa(as.matrix(x1), samp[1:2,]))

The effect of "inner" join is as follows:

(x5 <- Mefa(x2, samp, taxa, join="inner"))
(x51 <- Mefa(x2, samp[1:2,], taxa, join="inner"))

A "Mefa" object with only xtab can also be defined:

(x6 <- Mefa(x1))

The equivalent of the melt method of the mefa package is the Melt function in mefa4. It can be used to reverse the side effects of the cross-tabulation, thus making a data frame from a matrix, sparse matrix, list of sparse matrices, a mefa or a Mefa object:

Melt(x1)
Melt(x11)

The structure of the S3 and S4 classes are very similar, and even the accessor methods (xtab(), samp(), taxa(), segm()) work properly on both types. The S4 class does not have a slot for a call, and there is no segm element/slot either. This means that a "Mefa" object cannot have 3 dimensions, only 2. Xtab can create 3-dimensional sparse array-like objects (list of sparse matrices of the same dimensions), but there is no formal S4 class that can handle sparse matrix lists as part of a "Mefa" object. The as.mefa method can convert such a list of sparse matrices into an S3 "mefa" object with segments.

Coercions back and forth

Coercion methods are defined in both the mefa and mefa4 packages to ensure that S3 and S4 objects are interchangeable:

as.stcs(x1)
as.mefa(x1)
as.stcs(x3)
a <- as.mefa(x3)
xtab(a)
samp(a)
taxa(a)
segm(a)
segm(x3)
as.Mefa(a)
as.Xtab(a)
s <- melt(a)
as.Xtab(s)
as.Mefa(s)
melt(x1)
melt(x3)

Subsetting and replacement

Accessing and replacing parts of the "Mefa" object is conveniently done by methods xtab, samp, and taxa (the segm S3 method only returns the xtab slot of an S4 "Mefa" object):

xtab(x3)
x1[3,1] <- 999
xtab(x3) <- x1
xtab(x3)

Attribute tables can be set to NULL, or replaced:

samp(x3)
samp(x3) <- NULL
samp(x3)
samp(x3) <- samp[1:3,]
samp(x3)

taxa(x3)
taxa(x3) <- NULL
taxa(x3)
taxa(x3) <- taxa[1:3,]
taxa(x3)

Replacing parts of these attribute tables can be done as

samp(x3)[1,]
samp(x3)[1,2] <- 3
samp(x3)[1,]

Subsetting the whole "Mefa" object is done via the [ method:

x3[3:2, 1:2]
x3[3:2, ]
x3[ ,1:2]

Methods for S4 classes

Simple methods are provided for convenience:

dim(x5)
dimnames(x5)
dn <- list(paste("S", 1:dim(x5)[1], sep=""),
    paste("SPP", 1:dim(x5)[2], sep=""))
dimnames(x5) <- dn
dimnames(x5)[[1]] <- paste("S", 1:dim(x5)[1], sep="_")
dimnames(x5)[[2]] <- paste("SPP", 1:dim(x5)[2], sep="_")
t(x5)

Utilities

Grouping rows and columns

The aggregate method was defined for S3 mefa objects. Its equivalent (although it cannot sum the cells simultaneously for rows and columns, but it was done in 2 subsequent steps anyway) is the groupSums method. The MARGIN argument indicates if rows (MARGIN = 1) or columns (MARGIN = 2) are to be added together:

groupSums(as.matrix(x2), 1, c(1,1,2))
groupSums(as.matrix(x2), 2, c(1,1,2,2))
groupSums(x2, 1, c(1,1,2))
groupSums(x2, 2, c(1,1,2,2))
groupSums(x5, 1, c(1,1,2))
groupSums(x5, 2, c(1,1,2,2))

A simple extension of this is the groupMeans method:

groupMeans(as.matrix(x2), 1, c(1,1,2))
groupMeans(as.matrix(x2), 2, c(1,1,2,2))
groupMeans(x2, 1, c(1,1,2))
groupMeans(x2, 2, c(1,1,2,2))
groupMeans(x5, 1, c(1,1,2))
groupMeans(x5, 2, c(1,1,2,2))

Combining objects

mbind can be used to combine 2 matrices (dense or sparse). The 2 input objects are combined in a left join manner, which means that all the elements in the first object are retained, and only non-overlapping elements in the second object are used. Elements of the returning object that are not part of either objects (outer set) are filled up with value provided as fill argument.

x=matrix(1:4,2,2)
rownames(x) <- c("a", "b")
colnames(x) <- c("A", "B")
y=matrix(11:14,2,2)
rownames(y) <- c("b", "c")
colnames(y) <- c("B", "C")
mbind(x, y)
mbind(x, y, fill=0)
mbind(as(x, "sparseMatrix"), as(y, "sparseMatrix"))

"Mefa" objects can be combined in a similar way, where attribute tables are combined in a left join fashion (S3 "mefa" objects have to be coerced by the as.Mefa method beforehand -- this is so because the S3 class does not allow NA values in $xtab, and it is safer to avoid unnecessary complications):

sampx <- data.frame(x1=1:2, x2=2:1,
    stringsAsFactors = TRUE)
rownames(sampx) <- rownames(x)
sampy <- data.frame(x1=3:4, x3=10:11,
    stringsAsFactors = TRUE)
rownames(sampy) <- rownames(y)
taxay <- data.frame(x1=1:2, x2=2:1,
    stringsAsFactors = TRUE)
rownames(taxay) <- colnames(y)
taxax <- NULL
mbind(Mefa(x, sampx), Mefa(y, sampy, taxay))

Performance comparisons

We compare the performance of the mefa and mefa4 packages. We are using a long formatted raw data file from the Alberta Biodiversity Monitoring Institute database (available at https://www.abmi.ca):

library(mefa)
library(mefa4)
data(abmibirds)

This is the processing with mefa and S3 object classes (we are storing the results and processing times):

b3 <- abmibirds
b3 <- b3[!(b3$Scientific.Name %in% c("VNA", "DNC", "PNA")),]
levels(b3$Scientific.Name)[levels(b3$Scientific.Name) 
    %in% c("NONE", "SNI")] <- "zero.pseudo"
b3$Counts <- ifelse(b3$Scientific.Name == "zero.pseudo", 0, 1)
b3$Label <- with(b3, paste(ABMI.Site, Year, 
    Point.Count.Station, sep="_"))
x3 <- b3[!duplicated(b3$Label), c("Label", 
    "ABMI.Site", "Year", "Field.Date", 
    "Point.Count.Station", "Wind.Conditions", "Precipitation")]
rownames(x3) <- x3$Label
z3 <- b3[!duplicated(b3$Scientific.Name), c("Common.Name",
    "Scientific.Name", "Taxonomic.Resolution", 
    "Unique.Taxonomic.Identification.Number")]
rownames(z3) <- z3$Scientific.Name
z3 <- z3[z3$Scientific.Name != "zero.pseudo",]
t31 <- system.time(s3 <- suppressWarnings(stcs(b3[,
    c("Label","Scientific.Name","Counts")])))
t32 <- system.time(m30 <- mefa(s3))
t33 <- system.time(m31 <- mefa(s3, x3, z3))
y30 <- m30$xtab
t34 <- system.time(m32 <- mefa(y30, x3, z3))
m32

The equivalent processing with mefa4 and S4 object classes:

b4 <- abmibirds
b4$Label <- with(b4, paste(ABMI.Site, Year, 
    Point.Count.Station, sep="_"))
x4 <- b4[!duplicated(b4$Label), c("Label", "ABMI.Site", 
    "Year", "Field.Date", "Point.Count.Station",
    "Wind.Conditions", "Precipitation")]
rownames(x4) <- x4$Label
z4 <- b4[!duplicated(b4$Scientific.Name), c("Common.Name",
    "Scientific.Name", "Taxonomic.Resolution", 
    "Unique.Taxonomic.Identification.Number")]
rownames(z4) <- z4$Scientific.Name
t41 <- system.time(s4 <- Xtab(~ Label + Scientific.Name, 
    b4, cdrop = c("NONE", "SNI"), 
    subset = !(b4$Scientific.Name %in% c("VNA", "DNC", "PNA")), 
    drop.unused.levels = TRUE))
t42 <- system.time(m40 <- Mefa(s4))
t43 <- system.time(m41 <- Mefa(s4, x4, z4))
y40 <- as.matrix(m40@xtab)
t44 <- system.time(m42 <- Mefa(y40, x4, z4))
m42
sum(m42@xtab)

Let us compare object sizes and processing times, stars indicate similar S3 (*=3) and S4 (*=4) objects:

res <- cbind("SIZE, *=3"=c("b*"=object.size(b3),
    "s*"=object.size(s3),
    "y*0"=object.size(y30),
    "m*0"=object.size(m30),
    "m*1"=object.size(m31),
    "m*2"=object.size(m32)),
"SIZE, *=4"=c("b*"=object.size(b4),
    "s*"=object.size(s4),
    "y*0"=object.size(y40),
    "m*0"=object.size(m40),
    "m*1"=object.size(m41),
    "m*2"=object.size(m42)),
"TIME, *=3"=c("b*"=NA,
    "s*"=t31[3],
    "y*0"=NA,
    "m*0"=t32[3],
    "m*1"=t33[3],
    "m*2"=t34[3]),
"TIME, *=4"=c("b*"=NA,
    "s*"=t41[3],
    "y*0"=NA,
    "m*0"=t42[3],
    "m*1"=t43[3],
    "m*2"=t44[3]))
(res <- cbind(res, "SIZE"=res[,2]/res[,1], "TIME"=res[,4]/res[,3]))

The compressed sparse matrix representation is r round(100*res[2,5], 1)% of the stcs object in size. "Mefa" object sizes are maximum of r round(100*max(res[5:6,5]),1)% of their S3 representatives. Processing time speed-up is enormous with sparse matrices (r round(100*mean(res[4:5,6]),1)%), and still quite high by standard matrices (r round(100*res[6,6],1)%).

Check that objects are the same:

stopifnot(identical(dim(y30), dim(y40)))
stopifnot(identical(setdiff(rownames(y30), rownames(y40)), character(0)))
stopifnot(identical(setdiff(rownames(y40), rownames(y30)), character(0)))
stopifnot(identical(setdiff(colnames(y30), colnames(y40)), character(0)))
stopifnot(identical(setdiff(colnames(y40), colnames(y30)), character(0)))

The aggregation also improved quite a bit with sparse matrices:

system.time(xx3 <- aggregate(m31, "ABMI.Site"))
system.time(xx4 <- groupSums(m41, 1, m41@samp$ABMI.Site))

Conclusions

The redesign of the old S3 classes into S4 ones resulted in large savings in computing time and object sizes. Old features are still available due to the free conversion between the two implementations.

psolymos/mefa4 documentation built on Sept. 25, 2024, 4:30 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com