fpSim: Fingerprint Search
In girke-lab/ChemmineR-git-svn-bridge: Cheminformatics Toolkit for R

Search function for fingerprints, such as PubChem or atom pair fingerprints. Enables structure similarity comparisons, searching and clustering.

fpSim(x, y, sorted=TRUE, method="Tanimoto",
      addone=1, cutoff=0, top="all", alpha=1, beta=1,
      parameters=NULL, scoreType="similarity")

`x`	Query molecule of class `numeric`, `FP` or `FPset` (of length one) containing binary fingerprint data. Both `x` and `y` need to have the same number of bits and should contain the same type of fingerprints.
`y`	Subject molecule(s) of class `numeric`, `matrix`, `FP` or `FPset` containing binary fingerprint data.
`sorted`	Logical. If `TRUE` (default), results are returned sorted by score; if `FALSE`, they are returned unsorted.
`method`	Similarity coefficient to return. The predefined similarity measures are "Tanimoto" (default), "Euclidean", "Tversky" and "Dice". Alternatively, one can pass any custom similarity function that takes the arguments a, b, c and d. For instance, one can define `myfct <- function(a, b, c, d) c/(alpha*a + beta*b + c)` and then pass `method=myfct`. The variable 'c' is the number of "on-bits" common to both compounds, 'd' is the number of "off-bits" common to both compounds, and 'a' and 'b' are the numbers of "on-bits" that are unique to one or the other compound, respectively. The predefined methods run a C++ version of this function, which is about twice as fast as the R version. When a custom similarity function is given, however, `fpSim` falls back to the R version.
`addone`	Value to add to the numerator and denominator of the similarity coefficient to avoid division by zero when fingerprint(s) contain only "off-bits" (zeros). Note: if `addone > 0`, then fingerprints with no "on-bits" will receive the highest similarity score. Typically, this occurs only with extremely small molecules.
`cutoff`	Restricts results to hits above a similarity cutoff value; the default `cutoff=0` returns results for all compounds in `y`.
`top`	Restricts the number of subject molecules to return; the default `top="all"` returns results for all compounds in `y` above the `cutoff` value.
`alpha`	Only used when `method="Tversky"`. Specifies the weighting variable 'alpha' of the Tversky index: c/(alpha*a + beta*b + c)
`beta`	Only used when `method="Tversky"`. Specifies the weighting variable 'beta' of the Tversky index.
`parameters`	Parameters for computing Z-scores, E-values, and p-values. Pass this data if you want these scores returned. This data can be generated with the `genParameters` function.
`scoreType`	If the `parameters` argument is used, this argument specifies which type of score the `cutoff` and `sorted` arguments are applied to. It should be one of "similarity" (default), "zscore", "evalue", or "pvalue".
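
The counts a, b, c and d described for `method` can be illustrated in base R without ChemmineR. The sketch below uses two made-up 8-bit fingerprints; the helper names (`fp_counts`, `tanimoto`, `dice`, `tversky`) are hypothetical, and the way `addone` enters the Tanimoto formula is an assumption based on the argument description above, not a copy of the package's internal code. It is also assumed that 'a' counts the bits unique to `x` and 'b' those unique to `y`.

```r
# Toy binary fingerprints (hypothetical data, not taken from sdfsample)
x <- c(1, 1, 0, 1, 0, 0, 1, 0)
y <- c(1, 0, 0, 1, 1, 0, 1, 0)

# Count the four bit categories used by the similarity coefficients
fp_counts <- function(x, y) {
  list(a = sum(x == 1 & y == 0),  # on-bits unique to x (assumed orientation)
       b = sum(x == 0 & y == 1),  # on-bits unique to y
       c = sum(x == 1 & y == 1),  # on-bits shared by both
       d = sum(x == 0 & y == 0))  # off-bits shared by both
}

# Coefficients written with the (a, b, c, d) signature expected for `method`
tanimoto <- function(a, b, c, d, addone = 0) (c + addone) / (a + b + c + addone)
dice     <- function(a, b, c, d) 2 * c / (2 * c + a + b)
tversky  <- function(a, b, c, d, alpha = 1, beta = 1) c / (alpha * a + beta * b + c)

n <- fp_counts(x, y)
tani <- do.call(tanimoto, n)
dic  <- do.call(dice, n)
tver <- do.call(tversky, c(n, alpha = 0.5, beta = 1))

# Two all-zero fingerprints: with addone = 1 the score becomes 1 instead of 0/0
empty <- do.call(tanimoto, c(fp_counts(rep(0, 8), rep(0, 8)), addone = 1))
```

With `alpha = 1` and `beta = 1` the Tversky index in this sketch reduces to the Tanimoto coefficient, which is why those are natural defaults.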

Returns a numeric vector with similarity coefficients as values and compound identifiers as names.
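
To make the interplay of `sorted`, `cutoff` and `top` more concrete, here is a rough base R sketch that applies the same kind of filtering to a named score vector. The function `filter_hits` and the compound names are hypothetical, and treating the cutoff as inclusive (`>=`) is an assumption inferred from the statement that `cutoff=0` returns all compounds; the actual `fpSim` implementation may differ in detail.

```r
# Hypothetical similarity scores, named by compound identifier
scores <- c(cmpA = 0.35, cmpB = 1.00, cmpC = 0.62, cmpD = 0.48, cmpE = 0.10)

# Keep hits at or above the cutoff, optionally sort, optionally truncate
filter_hits <- function(scores, sorted = TRUE, cutoff = 0, top = "all") {
  hits <- scores[scores >= cutoff]                    # assumed inclusive cutoff
  if (sorted) hits <- sort(hits, decreasing = TRUE)   # best hits first
  if (!identical(top, "all")) hits <- head(hits, top) # limit number of hits
  hits
}

best  <- filter_hits(scores, cutoff = 0.4, top = 2)
every <- filter_hits(scores, sorted = FALSE)
```

The result keeps the shape described above: a numeric vector whose names are the compound identifiers.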

Thomas Girke, Kevin Horan

Tanimoto similarity coefficient: Tanimoto TT (1957) IBM Internal Report, 17th Nov. See also Jaccard P (1901) Bulletin de la Société Vaudoise des Sciences Naturelles 37, 241-272.

PubChem fingerprint specification: ftp://ftp.ncbi.nih.gov/pubchem/specifications/pubchem_fingerprints.txt

Functions: fp2bit

## Load PubChem SDFset sample
data(sdfsample); sdfset <- sdfsample
cid(sdfset) <- sdfid(sdfset)

## Convert base 64 encoded fingerprints to character vector or binary matrix
fpset <- fp2bit(sdfset)

## Alternatively, one can use atom pair fingerprints 
## Not run: 
fpset <- desc2fp(sdf2ap(sdfset))

## End(Not run)

## Pairwise compound structure comparisons
fpSim(x=fpset[1], y=fpset[2], method="Tanimoto")

## Structure similarity searching: x is query and y is fingerprint database  
fpSim(x=fpset[1], y=fpset) 

## Controlling the output
fpSim(x=fpset[1], y=fpset, method="Tversky", cutoff=0.4, top=4, alpha=0.5, beta=1) 

## Use custom similarity function
myfct <- function(a, b, c, d) c/(a+b+c+d)
fpSim(x=fpset[1], y=fpset, method=myfct) 

## Compute fingerprint-based Tanimoto similarity matrix 
simMA <- sapply(cid(fpset), function(x) fpSim(x=fpset[x], fpset, sorted=FALSE)) 

## Hierarchical clustering with simMA as input
hc <- hclust(as.dist(1-simMA), method="single")

## Plot hierarchical clustering tree
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)
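
The clustering steps above convert similarities to distances with `1 - simMA` before calling `hclust`. The following self-contained sketch repeats that idea on a small made-up similarity matrix (the object `toySim`, its values and the height threshold are all hypothetical) and then uses base R's `cutree` to turn the tree into discrete cluster assignments, which is one possible follow-up and not part of the original example.

```r
# Hypothetical symmetric similarity matrix for four compounds
ids <- c("A", "B", "C", "D")
toySim <- matrix(c(1.0, 0.8, 0.2, 0.1,
                   0.8, 1.0, 0.3, 0.2,
                   0.2, 0.3, 1.0, 0.7,
                   0.1, 0.2, 0.7, 1.0),
                 nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Similarity -> distance, then single-linkage hierarchical clustering
toyHc <- hclust(as.dist(1 - toySim), method = "single")

# Cut the tree at distance 0.5, i.e. group compounds with similarity > 0.5
clusters <- cutree(toyHc, h = 0.5)
```

Here `clusters` is a named integer vector that maps each compound identifier to a cluster number.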