The primary method to locate data in the DataONE network is to use the web based search tool located at https://search.dataone.org.
Data can also be discovered programmatically from R using the dataone R package method query
. Both the web page and the R method
access the same
underlying search mechanism, the Apache Foundation Solr search engine that runs on the
DataONE Coordinating Node. The DataONE Solr index, similar to a catalog or database, contains
information for every dataset that is available from the DataONE network.
The same search mechanism is available on DataONE Member Nodes. However, the search index on a member node only contains dataset information for datasets that are contained on that member node.
Information about the Solr search engine can be obtained at Solr Resources.
Additional information about searching DataONE can be viewed at Content Discovery.
query
MethodSome familiarity with Solr is helpful when using the query
method
effectively, however basic queries can be very powerful and the examples in this document can provide a starting point. As an alternative to
composing queries using Solr syntax, a simpler search mechanism is available with the query
method that uses name, value pairs (See the section: *A Simplified Search").
Additional information about the query
method can be obtained from the R help facility, e.g. ?query
from the R command line.
The following example queries DataONE and returns values in an R list, with each value converted from the Solr result to the appropriate R data type:
library(dataone) cn <- CNode("PROD") queryParamList <- list(q="id:doi*", rows="5", fq="abstract:carbon", fl="id,title,dateUploaded,abstract,datasource,size") result <- query(cn, solrQuery=queryParamList, as="list")
The 'solrQuery' argument takes a list of query parameters that are sent to the DataONE Solr search engine and control how the
search is performed. The name of each list element is a Solr keyword which is combined with the list element value to create
each Solr query term. All these terms are combined to create the complete Solr query. The queryParamlist
in the example
above will be used to construct this Solr query:
?q=id:doi*&rows=5,&fq=abstract:carbon&fl=id,title,dateUploaded,abstract,datasource,size
The q=id:doi*
is the main query term and specifies that all DataONE objects that have an id
field
value beginning with doi*
should be returned.
The &fq=abstract:carbon
term is a filter query
, which filters the results so that only results with an abstract
field containing
the word carbon
will be returned.
The &fl=id,title,dateUploaded,abstract,datasource,size
term is a field
specifier, so only those fields in the list will be
included in the result set.
The &rows=5
term specifies that a maximum of five results will be returned.
The result
object contains all the data values found and returned from the query. Each element of the returned list contains information
for one dataset. Each returned attribute for a dataset can be accessed with the appropriate element name, for example, to
access the title information of the first dataset returned, use the R statement:
result[[1]]$title
To print out selected information for all returned values, use:
ids <- lapply(result, function(x) { message(sprintf("id: %s", x$id)) message(sprintf("origin member node: %s", x$datasource)) message(sprintf("title: %s", x$title)) message(sprintf("date uploaded: %s", x$dateUploaded)) x$id })
The complete list of possible searchable values stored for a dataset can be viewed using getQueryEngineDescription():
cn <- CNode() getQueryEngineDescription(cn, "solr")
However, the values available for a particular dataset may be a subset of these, depending on the metadata provided when the dataset is uploaded to DataONE.
The DataONE Coordinating Node (CN) contains metadata about datasets from all Member Nodes (MN) in the network. As the above example shows, sending a query to the CN may find matching datasets located on potentially any Member Node in the network.
To return results as a data frame, use the as
parameter as shown below. In addition all values will be stored as character
, because parse=FALSE
is specified:
cn <- CNode("PROD") result <- query(cn, queryParamList, as="data.frame", parse=FALSE)
The result
is a data frame with each row containing information for one dataset. To print all returned DataONE identifiers from the result:
result[,'id']
A search may be performed by just specifying query fields and values using the searchTerms parameter. For example, to search for datasets that mention 'kelp' in the abstract and have an attribute description that contains the word 'biomass', the following query could be used:
cn <- CNode("PROD") mn <- getMNode(cn, "urn:node:KNB") mySearchTerms <- list(abstract="kelp", attribute="biomass") result <- query(cn, searchTerms=mySearchTerms, as="data.frame")
Using the searchTerms parameter causes the query method to construct a Solr query based on the list, that is passed on to the DataONE node specified.
The names in the searchTerm list are the query field names available from the Solr query engine being used. These names can be determined using the getQueryEngineDescription function.
If it is known which Member Node holds the data of interest, then a search can be limited to just that MN by sending the search directly to that MN instead of the CN. For example, if the dataset of interest is on The Knowledge Network for Biocomplexity (KNB) Member Node, then the search is performed with the statements:
# Query the data holdings on a member node cn <- CNode("PROD") mn <- getMNode(cn, "urn:node:KNB") queryParams <- list(q="abstract:habitat", fl="id,title,abstract") result <- query(mn, queryParams, as="data.frame", parse=FALSE) # Choose the first matchin PID pid <- result[1,'id']
An alternative way to locate datasets on a particular node is to send the query to the CN but limit the returned results to data holdings that originated from the specific member node by using Solr filter query parameter (&fq
) can be used
for the datasource
field:
cn <- CNode("PROD") queryParams <- 'q=id:*&fl=id,title&fq=datasource:"urn:node:KNB"&rows=5' result <- query(cn, queryParams, as="data.frame", parse=FALSE) result[,'id']
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.