fpSim: Fingerprint Search

View source: R/sim.R

Description

Search function for fingerprints, such as PubChem or atom pair fingerprints. Enables structure similarity comparisons, searching and clustering.

Usage

fpSim(x, y, sorted=TRUE, method="Tanimoto",
      addone=1, cutoff=0, top="all", alpha=1, beta=1,
      parameters=NULL, scoreType="similarity")

Arguments

x

Query molecule of class numeric, FP or FPset (of length one) containing binary fingerprint data. Both x and y need to have the same number of bits and should contain the same type of fingerprints.

y

Subject molecule(s) of class numeric, matrix, FP or FPset containing binary fingerprint data.

sorted

Logical. If TRUE (default), results are returned sorted by score; if FALSE, they are returned unsorted, in the order of y.

method

Similarity coefficient to return. The predefined similarity measures are "Tanimoto" (default), "Euclidean", "Tversky" and "Dice". Alternatively, any custom similarity function with the arguments a, b, c and d can be passed. For instance, one can define "myfct <- function(a, b, c, d) c/(alpha*a + beta*b + c)", where alpha and beta are variables that must already be defined, and then pass method=myfct. The variable 'c' is the number of "on-bits" shared by both compounds, 'd' is the number of "off-bits" shared by both compounds, and 'a' and 'b' are the numbers of "on-bits" that are unique to one or the other compound, respectively.

The predefined methods run a C++ version of this function, which is about twice as fast as the R version. When a custom similarity function is given, however, fpSim falls back to the R version.
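To make the a, b, c, d convention concrete, the following base R sketch counts the four quantities for two small binary vectors and derives the Tanimoto coefficient from them. The helper bitCounts and the toy fingerprints are hypothetical and not part of ChemmineR. Treating 'a' as the on-bits unique to the first vector is an assumption, and no addone correction is applied.

```r
# Hypothetical helper (not part of ChemmineR): count a, b, c, d
# for two binary fingerprints of equal length.
bitCounts <- function(x, y) {
  stopifnot(length(x) == length(y))
  c(a = sum(x == 1 & y == 0),  # on-bits unique to x (assumed assignment)
    b = sum(x == 0 & y == 1),  # on-bits unique to y (assumed assignment)
    c = sum(x == 1 & y == 1),  # on-bits shared by both
    d = sum(x == 0 & y == 0))  # off-bits shared by both
}

# Toy 8-bit fingerprints, for illustration only
fp1 <- c(1, 1, 0, 1, 0, 0, 1, 0)
fp2 <- c(1, 0, 0, 1, 1, 0, 0, 0)

n <- bitCounts(fp1, fp2)

# Tanimoto coefficient without addone: c / (a + b + c)
tanimoto <- unname(n["c"] / (n["a"] + n["b"] + n["c"]))
print(n)
print(tanimoto)
```

A custom function passed as method receives counts of this kind rather than the raw bit vectors, which is why it must accept exactly the arguments a, b, c and d.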

addone

Value to add to the numerator and denominator of the similarity coefficient to avoid division by zero when fingerprint(s) contain only "off-bits" (zeros). Note: if addone > 0, fingerprints with no "on-bits" will receive the highest similarity score. Typically, this occurs only with extremely small molecules.
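The effect of addone can be sketched in base R as follows. The function tanimotoAddone is hypothetical, and the exact form (c + addone)/(a + b + c + addone) is an assumption inferred from the description above rather than taken from the package source.

```r
# Hypothetical illustration (not ChemmineR code): Tanimoto coefficient with
# a constant added to numerator and denominator. The exact form is assumed.
tanimotoAddone <- function(a, b, c, addone = 1) {
  (c + addone) / (a + b + c + addone)
}

# Two fingerprints without any on-bits: a = b = c = 0
emptyNoCorrection <- tanimotoAddone(0, 0, 0, addone = 0)  # 0/0 is NaN in R
emptyCorrected    <- tanimotoAddone(0, 0, 0, addone = 1)  # 1/1

# Ordinary case with a = 2, b = 1, c = 2
plainScore     <- tanimotoAddone(2, 1, 2, addone = 0)     # 2/5
correctedScore <- tanimotoAddone(2, 1, 2, addone = 1)     # 3/6

print(c(emptyNoCorrection, emptyCorrected, plainScore, correctedScore))
```

Under this assumed form, a pair of empty fingerprints scores 1, which matches the note that such fingerprints receive the highest similarity score when addone > 0.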

cutoff

Restricts results to hits above a similarity cutoff value; the default cutoff=0 returns results for all compounds in y.

top

Restricts the number of subject molecules to return; the default top="all" returns results for all compounds in y above the cutoff value.
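How cutoff and top interact can be illustrated with a small base R sketch that operates on a named score vector. The function filterHits, the compound names and the scores are hypothetical. Applying the cutoff before top and treating the cutoff as inclusive are assumptions, not documented behavior.

```r
# Hypothetical named score vector (identifiers and values are made up)
scores <- c(cmpA = 0.31, cmpB = 1.00, cmpC = 0.45, cmpD = 0.12, cmpE = 0.58)

# Illustrative filter, not ChemmineR's internal code
filterHits <- function(scores, cutoff = 0, top = "all") {
  hits <- sort(scores, decreasing = TRUE)  # best hits first
  hits <- hits[hits >= cutoff]             # assumed inclusive threshold
  if (!identical(top, "all")) {
    hits <- hits[seq_len(min(top, length(hits)))]  # keep at most 'top' hits
  }
  hits
}

aboveCutoff <- filterHits(scores, cutoff = 0.4)
bestTwo     <- filterHits(scores, cutoff = 0.4, top = 2)
print(aboveCutoff)
print(bestTwo)
```

In this sketch top never returns hits that fall below the cutoff, so a query with few similar compounds can return fewer than top results.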

alpha

Only used when method="Tversky". Specifies the weighting variable 'alpha' of the Tversky index: c/(alpha*a + beta*b + c)

beta

Only used when method="Tversky". Specifies the weighting variable 'beta' of the Tversky index.
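The role of the two weights can be sketched in base R using the Tversky formula given above. The function tversky and the toy counts are hypothetical and ignore addone. With alpha = beta = 1 the index reduces to the Tanimoto coefficient, and with alpha = beta = 0.5 it equals the standard Dice coefficient 2c/(a + b + 2c).

```r
# Tversky index as stated above: c / (alpha*a + beta*b + c)
# Hypothetical stand-alone version, without the addone correction.
tversky <- function(a, b, c, alpha = 1, beta = 1) {
  c / (alpha * a + beta * b + c)
}

# Toy counts: 2 on-bits unique to one compound, 1 unique to the other, 2 shared
nA <- 2; nB <- 1; nC <- 2

asTanimoto <- tversky(nA, nB, nC)                           # 2 / (2 + 1 + 2)
asDice     <- tversky(nA, nB, nC, alpha = 0.5, beta = 0.5)  # 2 / (1 + 0.5 + 2)
oneSided   <- tversky(nA, nB, nC, alpha = 1, beta = 0)      # 2 / (2 + 0 + 2)

print(c(asTanimoto, asDice, oneSided))
```

Setting one weight to zero makes the index asymmetric: on-bits unique to the corresponding compound no longer lower the score, which can be useful for substructure-like searches.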

parameters

Parameters for computing Z-scores, E-values, and p-values. Pass this data if you want these scores returned. This data can be generated with the genParameters function.

scoreType

If the parameters argument is used, this argument specifies which type of score the cutoff and sorted arguments are applied to. It should be one of "similarity" (default), "zscore", "evalue", or "pvalue".

Value

Returns a numeric vector with similarity coefficients as values and compound identifiers as names.

Author(s)

Thomas Girke, Kevin Horan

References

Tanimoto similarity coefficient: Tanimoto TT (1957) IBM Internal Report, 17th Nov; see also Jaccard P (1901) Bulletin de la Société Vaudoise des Sciences Naturelles 37, 241-272.

PubChem fingerprint specification: ftp://ftp.ncbi.nih.gov/pubchem/specifications/pubchem_fingerprints.txt

See Also

Functions: fp2bit

Examples

## Load PubChem SDFset sample
data(sdfsample); sdfset <- sdfsample
cid(sdfset) <- sdfid(sdfset)

## Convert base 64 encoded fingerprints to character vector or binary matrix
fpset <- fp2bit(sdfset)

## Alternatively, one can use atom pair fingerprints 
## Not run: 
fpset <- desc2fp(sdf2ap(sdfset))

## End(Not run)

## Pairwise compound structure comparisons
fpSim(x=fpset[1], y=fpset[2], method="Tanimoto")

## Structure similarity searching: x is query and y is fingerprint database  
fpSim(x=fpset[1], y=fpset) 

## Controlling the output
fpSim(x=fpset[1], y=fpset, method="Tversky", cutoff=0.4, top=4, alpha=0.5, beta=1) 

## Use custom similarity function
myfct <- function(a, b, c, d) c/(a+b+c+d)
fpSim(x=fpset[1], y=fpset, method=myfct) 

## Compute fingerprint-based Tanimoto similarity matrix 
simMA <- sapply(cid(fpset), function(x) fpSim(x=fpset[x], fpset, sorted=FALSE)) 

## Hierarchical clustering with simMA as input
hc <- hclust(as.dist(1-simMA), method="single")

## Plot hierarchical clustering tree
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE) 

Example output

   650002 
0.3947368 
   650001    650094    650004    650085    650077    650079    650092    650074 
1.0000000 0.5375000 0.5000000 0.5000000 0.4861111 0.4583333 0.4555556 0.4444444 
   650104    650102    650082    650072    650054    650011    650016    650087 
0.4444444 0.4342105 0.4324324 0.4285714 0.4235294 0.4193548 0.4133333 0.4117647 
   650033    650048    650002    650039    650091    650032    650089    650056 
0.4000000 0.3977273 0.3947368 0.3846154 0.3815789 0.3775510 0.3766234 0.3750000 
   650049    650067    650050    650090    650024    650020    650070    650097 
0.3733333 0.3658537 0.3648649 0.3647059 0.3636364 0.3552632 0.3500000 0.3456790 
   650069    650046    650015    650075    650003    650071    650098    650012 
0.3406593 0.3404255 0.3333333 0.3333333 0.3250000 0.3235294 0.3235294 0.3205128 
   650023    650096    650058    650026    650005    650065    650066    650041 
0.3194444 0.3194444 0.3157895 0.3132530 0.3076923 0.3058824 0.3058824 0.3000000 
   650080    650044    650009    650013    650068    650078    650099    650061 
0.2987013 0.2962963 0.2941176 0.2916667 0.2891566 0.2857143 0.2857143 0.2835821 
   650062    650007    650093    650105    650086    650019    650040    650052 
0.2835821 0.2763158 0.2763158 0.2745098 0.2739726 0.2727273 0.2727273 0.2727273 
   650034    650028    650021    650031    650103    650038    650059    650060 
0.2656250 0.2637363 0.2631579 0.2575758 0.2571429 0.2535211 0.2535211 0.2535211 
   650008    650083    650035    650037    650076    650063    650064    650017 
0.2525253 0.2500000 0.2465753 0.2465753 0.2391304 0.2361111 0.2361111 0.2352941 
   650073    650100    650022    650029    650106    650030    650010    650045 
0.2325581 0.2314815 0.2266667 0.2222222 0.2195122 0.2093023 0.2054795 0.2051282 
   650043    650025    650095    650006    650027    650036    650042    650081 
0.2000000 0.1971831 0.1940299 0.1911765 0.1875000 0.1857143 0.1830986 0.1794872 
   650101    650014    650047    650088 
0.1704545 0.1470588 0.1408451 0.0468750 
   650001    650034    650088    650031 
1.0000000 1.0000000 1.0000000 0.8947368 
     650001      650094      650072      650092      650085      650011 
0.061523438 0.041015625 0.040039062 0.039062500 0.038085938 0.037109375 
     650032      650074      650054      650004      650077      650048 
0.035156250 0.034179688 0.034179688 0.034179688 0.033203125 0.033203125 
     650102      650079      650104      650082      650046      650090 
0.031250000 0.031250000 0.030273438 0.030273438 0.030273438 0.029296875 
     650069      650016      650099      650067      650056      650039 
0.029296875 0.029296875 0.028320312 0.028320312 0.028320312 0.028320312 
     650033      650015      650002      650091      650089      650105 
0.028320312 0.028320312 0.028320312 0.027343750 0.027343750 0.026367188 
     650097      650087      650070      650049      650024      650075 
0.026367188 0.026367188 0.026367188 0.026367188 0.026367188 0.025390625 
     650050      650020      650066      650065      650026      650003 
0.025390625 0.025390625 0.024414062 0.024414062 0.024414062 0.024414062 
     650100      650012      650008      650068      650058      650044 
0.023437500 0.023437500 0.023437500 0.022460938 0.022460938 0.022460938 
     650040      650028      650005      650096      650080      650023 
0.022460938 0.022460938 0.022460938 0.021484375 0.021484375 0.021484375 
     650098      650076      650071      650093      650083      650052 
0.020507812 0.020507812 0.020507812 0.019531250 0.019531250 0.019531250 
     650041      650019      650013      650007      650086      650078 
0.019531250 0.019531250 0.019531250 0.019531250 0.018554688 0.018554688 
     650073      650021      650009      650062      650061      650106 
0.018554688 0.018554688 0.018554688 0.017578125 0.017578125 0.016601562 
     650103      650060      650059      650038      650037      650035 
0.016601562 0.016601562 0.016601562 0.016601562 0.016601562 0.016601562 
     650030      650064      650063      650034      650031      650022 
0.016601562 0.015625000 0.015625000 0.015625000 0.015625000 0.015625000 
     650045      650029      650017      650101      650043      650027 
0.014648438 0.014648438 0.014648438 0.013671875 0.013671875 0.013671875 
     650010      650081      650025      650095      650042      650036 
0.013671875 0.012695312 0.012695312 0.011718750 0.011718750 0.011718750 
     650006      650047      650014      650088 
0.011718750 0.008789062 0.008789062 0.001953125 

ChemmineR documentation built on Feb. 28, 2021, 2:02 a.m.