| group.knowns | R Documentation |
Group a TRAMPknowns object so that knowns
with similar TRFLP patterns and knowns that share the same species
name “group” together. In general, this function will be called
automatically whenever appropriate (e.g. when loading a data set or
adding new knowns). Please see Details to understand why this
function is necessary, and how it works.
The main reason for manually calling group.knowns is to change
the default values of the arguments; if you call group.knowns
on a TRAMPknowns object, then any subsequent automatic call to
group.knowns will use any arguments you passed in the
manual group.knowns call (e.g. after doing
group.knowns(x, cut.height=20), all future groupings will use
cut.height=20).
group.knowns(x, ...)
## S3 method for class 'TRAMPknowns'
group.knowns(x, dist.method, hclust.method, cut.height, ...)
## S3 method for class 'TRAMP'
group.knowns(x, ...)
| x | A TRAMPknowns or TRAMP object. |
| dist.method | Distance method used in calculating similarity between different knowns (see dist). |
| hclust.method | Clustering method used in generating clusters from the similarity matrix (see hclust). |
| cut.height | Passed to cutree. |
| ... | Arguments passed to further methods. |
group.knowns groups together knowns in a
TRAMPknowns object based on two criteria: (1) TRFLP
profiles that are very similar across shared enzyme/primer
combinations (based on clustering) and (2) TRFLP profiles that belong
to the same species (i.e. share a common species column in the
info data.frame of x; see TRAMPknowns for
more information). This is to solve three issues in TRFLP analysis:
The TRFLP profile of a single species can have variation in
peak sizes due to DNA sequence variation. By including multiple
collections of each species, variation in TRFLP profiles can be
accounted for. If a TRAMPknowns object contains
multiple collections of a species, these will be aggregated by
group.knowns. This aggregation is essential for community
analysis, as leaving individual collections will artificially
inflate the number of “present species” when running
TRAMP.
Some authors have taken an alternative approach by using a larger
tolerance in matching peaks between samples and knowns (effectively
increasing accept.error in TRAMP) to account
for within-species variation. This is not recommended, as it
dramatically increases the risk of incorrect matches.
Distinctly different TRFLP profiles may occur within a species
(or in some cases within an individual); see Avis et al. (2006).
group.knowns looks at the species column of the
info data.frame of x and joins any knowns with
identical species values as a group.
This can also be used where multiple profiles are present in an
individual.
Different species may share a similar TRFLP profile and
therefore be indistinguishable using TRFLP. If these patterns are
not grouped, two species will be recorded as present wherever either
is present. group.knowns prevents this by joining knowns with
“very similar” TRFLP patterns as a group. Ideally, these
problematic groups can be resolved by increasing the number of
enzyme/primer pairs in the data.
Group names are generated by concatenating all unique (sorted) species names together, separated by commas.
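The naming rule can be sketched in base R as follows. This is an illustration only, not the package's internal code: the helper name make_group_name, the species names, and the exact separator (a comma followed by a space) are assumptions.

```r
# Hypothetical helper: build a group name from the species names of the
# knowns that belong to one group.
make_group_name <- function(species) {
  # Drop duplicates, sort alphabetically, then join with commas
  paste(sort(unique(species)), collapse = ", ")
}

# Two collections of one (made-up) species plus one of another
make_group_name(c("Russula sp.", "Amanita muscaria", "Russula sp."))
```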
To determine if knowns are “similar enough” to form a group, we
use R's clustering tools: dist, hclust
and cutree. First, we generate a distance matrix of the
knowns profiles using dist, and using method
dist.method (see the example below; this is very similar to what
TRAMP does, and dist.method should be specified
accordingly). We then generate clusters using hclust,
and using method hclust.method, and “cut” the tree at
cut.height using cutree.
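The three steps described above can be tried on a toy matrix using only base R. This is a minimal sketch, not the package's code: the fragment sizes are made up, and "complete" is used as the clustering method purely for illustration (it is not necessarily the default used by group.knowns).

```r
# Toy profile matrix: rows are knowns, columns are enzyme/primer
# combinations. The fragment sizes are invented for illustration.
profiles <- rbind(
  known1 = c(100, 250, 400),
  known2 = c(101, 251, 399),
  known3 = c(180, 300, 520)
)

d  <- dist(profiles, method = "maximum")  # plays the role of dist.method
hc <- hclust(d, method = "complete")      # plays the role of hclust.method
clusters <- cutree(hc, h = 2.5)           # plays the role of cut.height
clusters
```

Here known1 and known2 differ by at most 1 in any column, so they fall below the cut height and share a cluster, while known3 stays on its own.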
Knowns are grouped together iteratively, so that all groups that share a common cluster are merged, as are all knowns that share a common species name. In certain cases this may chain together seemingly unrelated groups.
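The chaining effect can be seen in a small base R sketch. The helper merge_groups below is hypothetical and is not how the package implements grouping; it simply repeats "merge by cluster, then merge by species" until the group labels stop changing.

```r
# Hypothetical helper: combine a cluster-based grouping and a
# species-based grouping until the result is stable.
merge_groups <- function(cluster, species) {
  group <- seq_along(cluster)            # every known starts on its own
  repeat {
    old <- group
    # Knowns in the same cluster take the smallest group id among them ...
    group <- ave(group, cluster, FUN = min)
    # ... and so do knowns with the same species name.
    group <- ave(group, species, FUN = min)
    if (identical(group, old)) break
  }
  match(group, unique(group))            # relabel groups as 1, 2, ...
}

cluster <- c(1, 1, 2, 3)
species <- c("A", "B", "B", "C")
merge_groups(cluster, species)
```

In this made-up input, the first known (species A) and the third known (cluster 2) share neither a cluster nor a species name, yet they end up in one group because the second known links them.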
Because group.knowns is generic, it can be run on either a
TRAMPknowns or a TRAMP object. When run
on a TRAMP object, it updates the TRAMPknowns object
(stored as x$knowns), so that subsequent calls to
plot.TRAMPknowns or summary.TRAMPknowns
(for example) will use the new grouping parameters.
Parameters set by group.knowns are retained as part of the
object, so that when adding additional knowns (add.known
and combine), or when subsetting a knowns database (see
[.TRAMPknowns,
aka TRAMPindexing), the same grouping parameters will be
used.
For group.knowns.TRAMPknowns, a new TRAMPknowns object.
The cluster.pars element will have been updated with new
parameters, if any were specified.
For group.knowns.TRAMP, a new TRAMP object, with an
updated knowns element. Note that the original
TRAMPknowns object (i.e. the one from which the TRAMP
object was constructed) will not be modified.
Warning about missing data: NA values are fine in general, but they
cannot be everywhere. Where NA values occur in certain combinations
(for example, when two knowns have no enzyme/primer combination in
which both have data), NAs will be present in the final distance
matrix, and hclust cannot then be used to generate the clusters.
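The failure mode can be reproduced in base R with a made-up matrix. This is only a sketch of the underlying dist and hclust behaviour, not of group.knowns itself.

```r
# Invented profiles: known1 and known3 have no column in which both
# have data, so their distance cannot be computed.
profiles <- rbind(
  known1 = c(100, 250,  NA),
  known2 = c(101, 251, 399),
  known3 = c( NA,  NA, 520)
)

d <- dist(profiles, method = "maximum")
anyNA(d)    # does the distance matrix contain NA?

# Capture the hclust() failure instead of stopping the script
res <- tryCatch(hclust(d), error = function(e) e)
inherits(res, "error")
```

The pair known1/known2 is unaffected by the single NA in known1, which is why scattered NA values are usually harmless.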
Avis PG, Dickie IA, Mueller GM 2006. A ‘dirty’ business: testing the limitations of terminal restriction fragment length polymorphism (TRFLP) analysis of soil fungi. Molecular Ecology 15: 873-882.
TRAMPknowns, which describes the TRAMPknowns
object.
build.knowns, which attempts to generate a knowns
database from a TRAMPsamples data set.
plot.TRAMPknowns, which graphically displays the
relationships between knowns.
data(demo.knowns)
data(demo.samples)
demo.knowns <- group.knowns(demo.knowns, cut.height=2.5)
plot(demo.knowns)
## Increasing cut.height makes groups more inclusive:
plot(group.knowns(demo.knowns, cut.height=100))
res <- TRAMP(demo.samples, demo.knowns)
m1.ungrouped <- summary(res)
m1.grouped <- summary(res, group=TRUE)
ncol(m1.grouped) # 94 groups
res2 <- group.knowns(res, cut.height=100)
m2.ungrouped <- summary(res2)
m2.grouped <- summary(res2, group=TRUE)
ncol(m2.grouped) # Now only 38 groups
## group.knowns results in the same distance matrix as produced by
## TRAMP, therefore using the same method (e.g. method="maximum") is
## important. The example below shows how the matrix produced by
## dist(summary(x)) (as calculated by group.knowns) is the same as that
## produced by TRAMP:
f <- function(x, method="maximum") {
  ## Create a pseudo-samples object from our knowns
  y <- x
  y$data$height <- 1
  names(y$info)[names(y$info) == "knowns.pk"] <- "sample.pk"
  names(y$data)[names(y$data) == "knowns.fk"] <- "sample.fk"
  class(y) <- "TRAMPsamples"
  ## Run TRAMP, clean up and return
  ## (If method != "maximum", rescale the error to match that
  ## generated by dist()).
  z <- TRAMP(y, x, method=method)
  if ( method != "maximum" ) z$error <- z$error * z$n
  names(dimnames(z$error)) <- NULL
  z
}
g <- function(x, method="maximum")
  as.matrix(dist(summary(x), method=method))
all.equal(f(demo.knowns, "maximum")$error, g(demo.knowns, "maximum"))
all.equal(f(demo.knowns, "euclidian")$error, g(demo.knowns, "euclidian"))
all.equal(f(demo.knowns, "manhattan")$error, g(demo.knowns, "manhattan"))
## However, TRAMP is over 100 times slower in this special case.
system.time(f(demo.knowns))
system.time(g(demo.knowns))