ClusterSystems
Takes a list of biological systems as input and re-clusters
their genes according to an input variable of interest. The set overlap
between the output clusters and the clusters defined by input system
membership is then measured.
clusterSystems(systems, distances = NULL, customDistanceFn = NULL,
  dataSources = NULL, combineMatrices, plotVennDiagrams = TRUE, k)
systems
List of input systems that should be
re-clustered. Each element is formatted according to the output of
distances
Character vector indicating the choice of data to be used
to measure distance between input genes. Vector containing one or more of
customDistanceFn
Optional list of functions. Any custom functions
that are to be used to calculate pairwise distances between genes. Default
is NULL. If this parameter is not NULL, then
dataSources
Optional list of input data appropriate to be used in
conjunction with
combineMatrices
Length 1 character vector indicating the method by which
to combine distance matrices if more than one choice of distance data is provided.
If only one type of distance data is provided, then this parameter is
ignored. One of
plotVennDiagrams
Logical flag indicating whether Venn diagrams representing the set overlap of the output should be printed when the function is called. Default is TRUE.
k
Optional integer. The number of clusters. Default is
The input systems are clustered according to the specified variable(s) of interest using PAM (partitioning around medoids) clustering. The similarity between input and output sets is measured using the Jaccard index of set overlap. P values for the observed Jaccard indexes are calculated by measuring the Jaccard index of 1000 clusters randomly sampled from the input genes.
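The overlap scoring described above can be sketched in base R. This is a minimal illustration and not the package's implementation: the helper names jaccard_index and empirical_p, the toy gene symbols, and the decision to count ties as at least as extreme are all assumptions made for this example.

```r
# Jaccard index of two gene sets: size of intersection / size of union.
jaccard_index <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(0)  # assumption: two empty sets score 0
  length(intersect(a, b)) / union_size
}

# Empirical P value: the fraction of random gene sets (same size as the
# cluster, drawn without replacement from all input genes) whose Jaccard
# index with the system is at least as large as the observed one.
# Counting ties (>=) is an assumption of this sketch.
empirical_p <- function(system, cluster, all_genes, n_perm = 1000) {
  observed <- jaccard_index(system, cluster)
  null_scores <- replicate(
    n_perm,
    jaccard_index(system, sample(all_genes, length(cluster)))
  )
  mean(null_scores >= observed)
}

set.seed(1)
all_genes <- paste0("G", 1:50)  # hypothetical gene symbols
system    <- all_genes[1:10]    # hypothetical input system
cluster   <- all_genes[3:12]    # hypothetical output cluster

observed_jaccard <- jaccard_index(system, cluster)  # 8 shared / 12 in union
p_value <- empirical_p(system, cluster, all_genes)
```

The best match for a system would then be the cluster with the highest Jaccard index against that system.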
A named list of length 4. Elements of the list include:
$Clusters
A named vector of integers from 1 to k. Names of the
elements represent the genes belonging to that cluster.
$Best_matches
A named vector of length k representing the cluster
with the best set overlap for each input system.
$Jaccard_indexes
The Jaccard indexes measuring the similarity
between each system and the cluster which is its best match.
$P_values
P values representing the probability of having a
Jaccard index greater than those observed by choosing a random cluster
of the same size from the set of input genes.
There are several different built-in methods to compute the distance between a pair of genes.
"expression_profile"
uses GEO data to compute the similarity
between 2 genes as the absolute value of the Spearman correlation
between their expression profiles. Distance is then taken as 1 - similarity.
"transcription_factor"
uses GTRD data to compute the number of
shared transcription factors that bind upstream of the 2 genes divided by the
total number of transcription factors that bind upstream of either gene.
In effect, this is the Jaccard index of the sets of transcription factors of the
2 genes. Distance is then calculated as 1 - similarity.
"network_jaccard"
uses STRING network data to calculate the Jaccard
similarity between the immediate neighbours of the 2 genes. Distance is
then calculated as 1 - similarity.
"network_distance"
uses STRING network data and calculates the
shortest path between the 2 genes.
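The similarity-to-distance conversions described above can be illustrated with a short base R sketch. The function names expr_distance and jaccard_distance and the toy inputs are hypothetical; they are not the package's own expr_dist, tf_dist, or jaccard_dist, whose signatures are not shown on this page.

```r
# Expression-profile distance: 1 - |Spearman correlation| of two profiles.
expr_distance <- function(profile1, profile2) {
  1 - abs(cor(profile1, profile2, method = "spearman"))
}

# Jaccard distance between two sets, e.g. the transcription factors bound
# upstream of each gene, or each gene's immediate network neighbours.
jaccard_distance <- function(set1, set2) {
  all_items <- union(set1, set2)
  if (length(all_items) == 0) return(1)  # assumption: no data means maximal distance
  1 - length(intersect(set1, set2)) / length(all_items)
}

# Toy expression profiles: perfectly anti-correlated, so |rho| = 1.
profile_a <- c(1, 2, 3, 4, 5)
profile_b <- c(10, 8, 6, 4, 2)
d_expr <- expr_distance(profile_a, profile_b)

# Toy transcription factor sets: 2 shared out of 4 in total.
d_tf <- jaccard_distance(c("TF1", "TF2", "TF3"), c("TF2", "TF3", "TF4"))
```

Because the absolute value of the correlation is used, strongly anti-correlated genes are treated as close, just like strongly correlated ones.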
If more than one distance data type is provided, the various distance metrics will be combined into one distance matrix according to one of the following methods:
'sum'
indicates the distances between a pair of genes will be
defined as the sum of the distances according to each metric.
'product'
indicates that the distance between genes is defined as
the product of their pairwise distances. This method more strongly
penalizes genes that are distant by more than one metric.
'maximum' or 'minimum'
indicate that the distance between
2 genes should be the maximum or minimum, respectively, of the distances
measured by each metric.
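The four combination methods listed above can be sketched as element-wise operations on distance matrices that share the same gene order. The helper combine_distances and the toy matrices are assumptions for illustration only; the package may combine matrices differently internally.

```r
# Combine a list of distance matrices element-wise.
# All matrices are assumed to have identical dimensions and gene order.
combine_distances <- function(mats,
                              method = c("sum", "product", "maximum", "minimum")) {
  method <- match.arg(method)
  switch(method,
    sum     = Reduce(`+`, mats),
    product = Reduce(`*`, mats),
    maximum = Reduce(pmax, mats),
    minimum = Reduce(pmin, mats)
  )
}

genes <- c("A", "B", "C")  # hypothetical gene symbols
d1 <- matrix(c(0,   0.2, 0.9,
               0.2, 0,   0.5,
               0.9, 0.5, 0),
             nrow = 3, dimnames = list(genes, genes))
d2 <- matrix(c(0,   0.4, 0.1,
               0.4, 0,   0.5,
               0.1, 0.5, 0),
             nrow = 3, dimnames = list(genes, genes))

d_sum <- combine_distances(list(d1, d2), "sum")      # A-B: 0.2 + 0.4
d_max <- combine_distances(list(d1, d2), "maximum")  # A-C: max(0.9, 0.1)
d_min <- combine_distances(list(d1, d2), "minimum")  # A-C: min(0.9, 0.1)
```

If the metrics live on different scales (for example a shortest-path length versus a value between 0 and 1), rescaling each matrix before combining may be worth considering.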
In addition to the several distance metrics which are built in to the function, the user has the option of defining a new distance metric to measure pairwise similarity between genes. If this is the case, an appropriate source of data (a mapping from HGNC symbols to the data of interest) should be supplied that can be used as input for each custom distance function.
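As a rough sketch of what a custom metric and its data source might look like, the example below pairs a named list (symbol to data) with a distance function. The signature function(gene1, gene2, data) is an assumption: this page does not state the signature that clusterSystems expects, so the package's expr_dist, jaccard_dist, and tf_dist should be consulted for the actual convention. The gene symbols and annotation terms are placeholders.

```r
# Hypothetical data source: a mapping from gene symbols to annotation terms.
annotation_source <- list(
  GENE_A = c("term1", "term2"),
  GENE_B = c("term1"),
  GENE_C = c("term2", "term3")
)

# Hypothetical custom distance: Jaccard distance between annotation sets.
# The (gene1, gene2, data) signature is assumed for this sketch.
annotation_dist <- function(gene1, gene2, data) {
  a <- data[[gene1]]
  b <- data[[gene2]]
  all_terms <- union(a, b)
  if (length(all_terms) == 0) return(1)  # assumption: no data means maximal distance
  1 - length(intersect(a, b)) / length(all_terms)
}

# Build the full pairwise distance matrix for the genes in the data source.
genes <- names(annotation_source)
dist_matrix <- sapply(genes, function(g1) {
  sapply(genes, function(g2) annotation_dist(g1, g2, annotation_source))
})
```

A function of this kind should return 0 for identical genes and be symmetric in its two gene arguments, so that the resulting matrix is a valid input for clustering.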
Rachel Silverstein (aut)
For examples of distance functions that can be used to create
custom distance matrices, see expr_dist, jaccard_dist, and tf_dist.
## Not run:
myDB <- fetchData("SysDB")
rootSysIDs <- SyDBgetRootSysIDs(myDB)
sys_names <- names(rootSysIDs)
systems <- SyDBgetSysSymbols(myDB, sys_names)

# Cluster all of the systems in the database according to
# the sum of transcription factor distance and Jaccard network distance:
clusterSystems(systems,
               distances = c("transcription_factor", "network_jaccard"),
               combineMatrices = 'sum')

# Cluster all of the systems using expression profile distance
# but format it like a custom distance function
GEO <- fetchData("GEOprofiles")
clusterSystems(systems,
               distances = NULL,
               customDistanceFn = list(expr_dist),
               dataSources = list(GEO))
## End(Not run)