eiQuery: Perform a query on an embedded database
In girke-lab/eiR: Accelerated similarity searching of small molecules

View source: R/core.R

eiQuery

R Documentation

Perform a query on an embedded database

Description

Finds similar compounds for each query.

Usage

	eiQuery(runId,queries,format="sdf",
		dir=".",distance=getDefaultDist(descriptorType),conn=defaultConn(dir),
		asSimilarity=FALSE, K=200, searchK=-1,lshData=NULL,
		mainIds = readIddb(conn,file.path(dir, Main)))

Arguments

`runId`	The id number identifying a particular set of settings for a database. This is generally the number returned by `eiMakeDb`. If your coming from an older version of eiR, you should not use this value instead of specifying `r`, `d`,`refIddb`, and `descriptorType`.
`queries`	This can be either an SDFset, or a file containg 1 or more query compounds.
`format`	The format in which the queries are given. Valid values are: "sdf" when `queries` is either a filename of an sdf file, or and SDFset object; "compound_id" when `queries` is a list of id numbers; and "name", when `queries` is a list of compound names, as returned by `cid(apset)`.
`dir`	The directory where the "data" directory lives. Defaults to the current directory.
`distance`	The distance function to be used to compute the distance between two descriptors. A default function is provided for "ap" and "fp" descriptors. The Tanimoto function is used by default.
`conn`	Database connection to use.
`asSimilarity`	If true, return similarity values instead of distance values. This only works in the given distance function returns values between 0 and 1. This is true for the default atom pair and finger print distance functions.
`K`	The number of results to return.
`searchK`	Tunable Annoy LSH parameter. A larger value will give more accurate results, but will take longer time to return. The default value of -1 will allow the value to chosen automatically, which will set a value of numTrees * (approximate number of nearest neighbors). See Annoy page for details. https://github.com/spotify/annoy
`lshData`	DEPRECATED. This is no longer used.
`mainIds`	A vector of all id numbers in the current database. This is mainly provided as an option here to avoid having to re-read the id list multiple times when executing several queries. If not supplied it will read it in itself.

Details

This function identifies the database by the r, d, and refIddb parameters. The queries can be given in a few different formats, see the queries parameter for details. The LSH algorithm is used to quickly identify compounds similar to the queries. This function must use a distance function rather than a similarity function. However, if the distance function given returns values between 0 and 1, then the asSimilarity parameter may be used to return similarity values rather than distance values.

Value

Returns a data frame with columns 'query', 'target', 'target_ids', and 'distance'. 'query' and 'target' are the compound names and distance is the distance between them, as computed by the given distance function.'target_ids' is the compound id of the target. Query namess are repeated for each matching target found. If asSimilarity is true then instead of a "distance" column there will be a "similarity" column.

Author(s)

Kevin Horan

Examples


   library(snow)
   r<- 50
   d<- 40

   #initialize
   data(sdfsample)
   dir=file.path(tempdir(),"query")
   dir.create(dir)
   eiInit(sdfsample,dir=dir,skipPriorities=TRUE)

   #create compound db
   runId=eiMakeDb(r,d,numSamples=20,dir=dir,
      cl=makeCluster(1,type="SOCK",outfile=""))

   #find compounds similar two each query
   results = eiQuery(runId,sdfsample[1:2],K=15,dir=dir)

girke-lab/eiR documentation built on April 19, 2023, 12:52 p.m.