mefa4 is a reimplementation of the S3 object classes found in the mefa R package. The new S4 class
"Mefa"
has all the consistency checks that S3 classes can not have, and most importantly, it stores the cross-tabuted results as a compact sparse matrix (S4 object class"dgCMatrix"
of the Matrix package). The use of sparse matrices speed up computations, and reduces object sizes considerably. This vignette introduces the main functions, classes and methods of the package mefa4.
The aim of the mefa and mefa4 packages are to help in storing cross tabulated ecological data tables (community data) together with attributes for rows (samples) and columns (species, taxa). This allows that one can easily subset the relational data object without separately manipulating 2--3 pieces of R objects. By doing so, the chances of errors are reduced.
As ecological data sets are increasing in size, it is
necessary to find more efficient ways of
data storage and manipulation. To this end, it was in
the air for some time to redesign
the mefa
package and take advantages of sparse matrices
from the Matrix package.
This is done at the costs of some old functionalities not being
available for S4 classes at the time being.
Here I give an overview so the user can decide how to use the
parallel availability of old S3 and newer S4 classes.
The S3 classes defined in mefa were stcs
and mefa
.
stcs
is a data frame with several attributes:
library(mefa) x <- data.frame( sample = paste("Sample", c(1,1,2,2,3,4), sep="."), species = c(paste("Species", c(1,1,1,2,3), sep="."), "zero.pseudo"), count = c(1,2,10,3,4,0), segment = letters[c(6,13,6,13,6,6)], stringsAsFactors = TRUE) s <- stcs(x) attributes(s)
These attributes ensure that the cross-tab made by the function
mefa()
creates a proper
cross-tab by eliminating the column that is only a placeholder
for empty samples, etc.:
samp <- data.frame(samples=levels(x$sample), var1=1:2, stringsAsFactors = TRUE) taxa <- data.frame(specnames=levels(x$species), var2=c("b","a"), stringsAsFactors = TRUE) rownames(samp) <- samp$samples rownames(taxa) <- taxa$specnames (m <- mefa(s, samp, taxa)) m$xtab
The stcs
step is almost redundant, and inefficient
relative to the stats::xtabs
function with sparse = TRUE
. This function is
adapted to some extent, so it can subset
the cross-tabulated results before returning the
value (rdrop
and cdrop
arguments,
that is available as the Xtab
function in the
mefa4 package). This takes a formula,
and can be applied directly on a data frame.
The formula can have a left-hand side, or the left-hand side can be
missing. The right-hand side can contain 2--3 factors,
and the result will be a sparse matrix
or a list of sparse matrices, respectively:
library(mefa4) x0 <- Xtab(~ sample + species, x) x1 <- Xtab(count ~ sample + species, x) x11 <- Xtab(count ~ sample + species + segment, x)
Dropping some rows/columns can be done in several ways. A logical statement implies that all empty rows/columns are dropped, but indices (numeric or character) can also be used:
x2 <- Xtab(count ~ sample + species, x, cdrop=FALSE, rdrop=TRUE) x21 <- Xtab(count ~ sample + species, x, cdrop=TRUE, rdrop=FALSE) (x22 <- Xtab(count ~ sample + species, x, cdrop="zero.pseudo"))
The results here are sparse matrices in compact mode, this means that
redundant indices are only kept once, so it is more compact than
a long formatted database representation stored in an stcs
object or in the original data frame, or a triplet representation
of a sparse matrix. See vignettes in the
Matrix package for more details on S4 sparse matrix classes.
The S4 class "Mefa"
is defined in the mefa4 package.
It can be created by the Mefa()
function, and the result has 4 slots:
(x3 <- Mefa(x1, samp, taxa))
The xtab
slot stores the cross-tab in sparse matrix format.
The samp
slot stores the row attributes for xtab
as data frame or can be NULL
.
The taxa
slot stores the column attributes for xtab
as data frame or can be NULL
. Validity checks are done to ensure
proper object classes to be used and matching dimnames.
The option that a column in the attribute tables can be specified
to find matching names is not available in the new implementation.
Corresponding rownames of the data frames has to match dimnames of xtab
.
The join
slot can be "left"
(all rows/columns in the
xtab
are kept, matching attributes are selected, non-matching
attributes are excluded, and missing attributes are filled up with NA
)
or "inner"
(only the intersection of corresponding dimnames are used to
form the return value).
The call in Mefa()
can take any matrix or sparse matrix as argument,
but it will be stored in a sparse mode. Here we use a matrix as input,
and samp
has missing values ("left"
join is used by default):
(x4 <- Mefa(as.matrix(x1), samp[1:2,]))
The effect of "inner"
join is as follows:
(x5 <- Mefa(x2, samp, taxa, join="inner")) (x51 <- Mefa(x2, samp[1:2,], taxa, join="inner"))
A "Mefa"
object with only xtab
can also be defined:
(x6 <- Mefa(x1))
The equivalent of the melt
method of the mefa package is
the Melt
function in mefa4. It can be used to reverse
the side effects of the cross-tabulation, thus making a data frame
from a matrix, sparse matrix, list of sparse matrices,
a mefa
or a Mefa
object:
Melt(x1) Melt(x11)
The structure of the S3 and S4 classes are very similar, and even the
accessor methods (xtab()
, samp()
, taxa()
, segm()
)
work properly on both types. The S4 class does not have a slot for
a call, and there is no segm
element/slot either. This means that
a "Mefa"
object cannot have 3 dimensions, only 2. Xtab
can create
3-dimensional sparse array-like objects (list of sparse matrices of the same dimensions),
but there is no formal S4 class that can handle sparse matrix lists as part of
a "Mefa"
object.
The as.mefa
method can convert such a list of sparse matrices into an S3
"mefa"
object with segments.
Coercion methods are defined in both the mefa and mefa4 packages to ensure that S3 and S4 objects are interchangeable:
as.stcs(x1) as.mefa(x1) as.stcs(x3) a <- as.mefa(x3) xtab(a) samp(a) taxa(a) segm(a) segm(x3) as.Mefa(a) as.Xtab(a) s <- melt(a) as.Xtab(s) as.Mefa(s) melt(x1) melt(x3)
Accessing and replacing parts of the "Mefa"
object
is conveniently done by methods xtab
, samp
,
and taxa
(the segm
S3 method only returns the
xtab
slot of an S4 "Mefa"
object):
xtab(x3) x1[3,1] <- 999 xtab(x3) <- x1 xtab(x3)
Attribute tables can be set to NULL
, or replaced:
samp(x3) samp(x3) <- NULL samp(x3) samp(x3) <- samp[1:3,] samp(x3)
taxa(x3) taxa(x3) <- NULL taxa(x3) taxa(x3) <- taxa[1:3,] taxa(x3)
Replacing parts of these attribute tables can be done as
samp(x3)[1,] samp(x3)[1,2] <- 3 samp(x3)[1,]
Subsetting the whole "Mefa"
object is done via the [
method:
x3[3:2, 1:2] x3[3:2, ] x3[ ,1:2]
Simple methods are provided for convenience:
dim(x5) dimnames(x5) dn <- list(paste("S", 1:dim(x5)[1], sep=""), paste("SPP", 1:dim(x5)[2], sep="")) dimnames(x5) <- dn dimnames(x5)[[1]] <- paste("S", 1:dim(x5)[1], sep="_") dimnames(x5)[[2]] <- paste("SPP", 1:dim(x5)[2], sep="_") t(x5)
The aggregate
method was defined for
S3 mefa
objects. Its equivalent (although it cannot
sum the cells simultaneously for rows and columns, but
it was done in 2 subsequent steps anyway) is the
groupSums
method. The MARGIN
argument
indicates if rows (MARGIN = 1
) or columns
(MARGIN = 2
) are to be added together:
groupSums(as.matrix(x2), 1, c(1,1,2)) groupSums(as.matrix(x2), 2, c(1,1,2,2)) groupSums(x2, 1, c(1,1,2)) groupSums(x2, 2, c(1,1,2,2)) groupSums(x5, 1, c(1,1,2)) groupSums(x5, 2, c(1,1,2,2))
A simple extension of this is the groupMeans
method:
groupMeans(as.matrix(x2), 1, c(1,1,2)) groupMeans(as.matrix(x2), 2, c(1,1,2,2)) groupMeans(x2, 1, c(1,1,2)) groupMeans(x2, 2, c(1,1,2,2)) groupMeans(x5, 1, c(1,1,2)) groupMeans(x5, 2, c(1,1,2,2))
mbind
can be used to combine 2 matrices (dense or sparse).
The 2 input objects are combined in a left join manner, which means
that all the elements in the first object are retained,
and only non-overlapping elements in the second object are used.
Elements of the returning object that are not part of either objects (outer set)
are filled up with value provided as fill
argument.
x=matrix(1:4,2,2) rownames(x) <- c("a", "b") colnames(x) <- c("A", "B") y=matrix(11:14,2,2) rownames(y) <- c("b", "c") colnames(y) <- c("B", "C") mbind(x, y) mbind(x, y, fill=0) mbind(as(x, "sparseMatrix"), as(y, "sparseMatrix"))
"Mefa"
objects can be combined in a similar way, where
attribute tables are combined in a left join fashion
(S3 "mefa"
objects have to be coerced
by the as.Mefa
method beforehand -- this is so
because the S3 class does not allow NA
values in
$xtab
, and it is safer to avoid unnecessary complications):
sampx <- data.frame(x1=1:2, x2=2:1, stringsAsFactors = TRUE) rownames(sampx) <- rownames(x) sampy <- data.frame(x1=3:4, x3=10:11, stringsAsFactors = TRUE) rownames(sampy) <- rownames(y) taxay <- data.frame(x1=1:2, x2=2:1, stringsAsFactors = TRUE) rownames(taxay) <- colnames(y) taxax <- NULL mbind(Mefa(x, sampx), Mefa(y, sampy, taxay))
We compare the performance of the mefa and mefa4 packages. We are using a long formatted raw data file from the Alberta Biodiversity Monitoring Institute database (available at https://www.abmi.ca):
library(mefa) library(mefa4) data(abmibirds)
This is the processing with mefa and S3 object classes (we are storing the results and processing times):
b3 <- abmibirds b3 <- b3[!(b3$Scientific.Name %in% c("VNA", "DNC", "PNA")),] levels(b3$Scientific.Name)[levels(b3$Scientific.Name) %in% c("NONE", "SNI")] <- "zero.pseudo" b3$Counts <- ifelse(b3$Scientific.Name == "zero.pseudo", 0, 1) b3$Label <- with(b3, paste(ABMI.Site, Year, Point.Count.Station, sep="_")) x3 <- b3[!duplicated(b3$Label), c("Label", "ABMI.Site", "Year", "Field.Date", "Point.Count.Station", "Wind.Conditions", "Precipitation")] rownames(x3) <- x3$Label z3 <- b3[!duplicated(b3$Scientific.Name), c("Common.Name", "Scientific.Name", "Taxonomic.Resolution", "Unique.Taxonomic.Identification.Number")] rownames(z3) <- z3$Scientific.Name z3 <- z3[z3$Scientific.Name != "zero.pseudo",] t31 <- system.time(s3 <- suppressWarnings(stcs(b3[, c("Label","Scientific.Name","Counts")]))) t32 <- system.time(m30 <- mefa(s3)) t33 <- system.time(m31 <- mefa(s3, x3, z3)) y30 <- m30$xtab t34 <- system.time(m32 <- mefa(y30, x3, z3)) m32
The equivalent processing with mefa4 and S4 object classes:
b4 <- abmibirds b4$Label <- with(b4, paste(ABMI.Site, Year, Point.Count.Station, sep="_")) x4 <- b4[!duplicated(b4$Label), c("Label", "ABMI.Site", "Year", "Field.Date", "Point.Count.Station", "Wind.Conditions", "Precipitation")] rownames(x4) <- x4$Label z4 <- b4[!duplicated(b4$Scientific.Name), c("Common.Name", "Scientific.Name", "Taxonomic.Resolution", "Unique.Taxonomic.Identification.Number")] rownames(z4) <- z4$Scientific.Name t41 <- system.time(s4 <- Xtab(~ Label + Scientific.Name, b4, cdrop = c("NONE", "SNI"), subset = !(b4$Scientific.Name %in% c("VNA", "DNC", "PNA")), drop.unused.levels = TRUE)) t42 <- system.time(m40 <- Mefa(s4)) t43 <- system.time(m41 <- Mefa(s4, x4, z4)) y40 <- as.matrix(m40@xtab) t44 <- system.time(m42 <- Mefa(y40, x4, z4)) m42 sum(m42@xtab)
Let us compare object sizes and processing times, stars indicate
similar S3 (*=3
) and S4 (*=4
) objects:
res <- cbind("SIZE, *=3"=c("b*"=object.size(b3), "s*"=object.size(s3), "y*0"=object.size(y30), "m*0"=object.size(m30), "m*1"=object.size(m31), "m*2"=object.size(m32)), "SIZE, *=4"=c("b*"=object.size(b4), "s*"=object.size(s4), "y*0"=object.size(y40), "m*0"=object.size(m40), "m*1"=object.size(m41), "m*2"=object.size(m42)), "TIME, *=3"=c("b*"=NA, "s*"=t31[3], "y*0"=NA, "m*0"=t32[3], "m*1"=t33[3], "m*2"=t34[3]), "TIME, *=4"=c("b*"=NA, "s*"=t41[3], "y*0"=NA, "m*0"=t42[3], "m*1"=t43[3], "m*2"=t44[3])) (res <- cbind(res, "SIZE"=res[,2]/res[,1], "TIME"=res[,4]/res[,3]))
The compressed sparse matrix representation is r round(100*res[2,5], 1)
% of the stcs
object in size.
"Mefa"
object sizes are maximum of r round(100*max(res[5:6,5]),1)
% of their S3 representatives.
Processing time speed-up is enormous with sparse matrices (r round(100*mean(res[4:5,6]),1)
%), and still
quite high by standard matrices (r round(100*res[6,6],1)
%).
Check that objects are the same:
stopifnot(identical(dim(y30), dim(y40))) stopifnot(identical(setdiff(rownames(y30), rownames(y40)), character(0))) stopifnot(identical(setdiff(rownames(y40), rownames(y30)), character(0))) stopifnot(identical(setdiff(colnames(y30), colnames(y40)), character(0))) stopifnot(identical(setdiff(colnames(y40), colnames(y30)), character(0)))
The aggregation also improved quite a bit with sparse matrices:
system.time(xx3 <- aggregate(m31, "ABMI.Site")) system.time(xx4 <- groupSums(m41, 1, m41@samp$ABMI.Site))
The redesign of the old S3 classes into S4 ones resulted in large savings in computing time and object sizes. Old features are still available due to the free conversion between the two implementations.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.