eiPerformanceTest: Test the performance of LSH search
In eiR: Accelerated similarity searching of small molecules

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/core.R

Tests the performance of embedding and LSH.

1 2	eiPerformanceTest(runId,distance=getDefaultDist(descriptorType), conn=defaultConn(dir),dir=".",K=200, searchK=-1)

`runId`	The id number identifying a particular set of settings for a database. This is generally the number returned by `eiMakeDb`. If your coming from an older version of eiR, you should not use this value instead of specifying `r`, `d`, and `descriptorType`.
`distance`	The distance function to be used to compute the distance between two descriptors. A default function is provided for "ap" and "fp" descriptors.
`conn`	Database connection to use.
`dir`	The directory where the "data" directory lives. Defaults to the current directory.
`K`	Number of search results to use for LSH performance test.
`searchK`	Tunable Annoy LSH parameter. A larger value will give more accurate results, but will take longer time to return. See Annoy page for details. https://github.com/spotify/annoy

This function can be used to tune the two Annoy LSH parameters, numTrees, and searchK.

NumTrees is provided to the eiMakeDb function and affects the build time and the index size. A larger value will produce more accurate results, but use more disk space.

SearchK is given to the eiQuery function, or to this function. A larger value will give more accurate results, but will require more time to run.

This function will perform two different tests. The first test is how well the embedding is working. When the eiMakeDb function is run, you can specify the number of test samples to use for this test. If not specified, it will default to 10% of the data set size. During this test, we take each sample and compute its distance to every other compund in the dataset using both the given descriptor distance function (e.g., "AP" or "fingerprint"), as well as the euclidean distance computed on the embedded version. We then measure how similar the resulting ranks of these lists are using Rank Based Overlap (Webber,2010) (http://www.williamwebber.com/research/papers/wmz10_tois.pdf). The similarity for each sample is output in a file called 'embedding.performance' in the work directory. Each line corresponds to one sample.

The second test compares the rankings produced using the descriptor distance function, to the rankings produced by the final output of the LSH search, for each sample query. Again, rank based overlap (RBO) is used to compare the rankings. The results are output in the same format as for the fist test, in a file called 'indexed.performance'.

RBO is a similarity measure that produces a value in the range of [0,1]. Values closer to 0 are very dissimilar, while values closer to 1 are more similar.

Returns the results of the indexing test. Each element of the resulting vector is the RBO similarity for the coresponding query. Creates files in dir/run-r-d.

Kevin Horan

eiInit eiMakeDb eiQuery

   library(snow)

   r<- 50
   d<- 40

   #initialize 
   data(sdfsample)
   dir=file.path(tempdir(),"perf")
   dir.create(dir)
   eiInit(sdfsample,dir=dir,skipPriorities=TRUE)

   #create compound db
   runId = eiMakeDb(r,d,numSamples=20,dir=dir,
      cl=makeCluster(1,type="SOCK",outfile=""))

   eiPerformanceTest(runId,dir=dir,K=22)