The first thing to do is to establish a connection with the Finder server by specifying its IP address and optionally the port. In our case, 10.49.4.6.
```r
library(finderquery)

con <- finder_connect("10.49.4.6")
```
This connection object is passed to queries so they know which server to run against.
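If your Finder instance listens on a non-default port, it can be supplied when connecting. The snippet below is only a sketch: it assumes `finder_connect()` accepts a `port` argument, and 8983 is used purely as an illustration.

```r
# Hypothetical: connect on an explicit port (argument name and value assumed)
con <- finder_connect("10.49.4.6", port = 8983)
```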
Note that if you are working in an R console on the .18 or .19 machines, you may need to run the following for the Solr server to be reachable:
```r
Sys.unsetenv("NO_PROXY")
Sys.unsetenv("HTTPS_PROXY")
Sys.unsetenv("HTTP_PROXY")
Sys.unsetenv("https_proxy")
Sys.unsetenv("http_proxy")
```
Two major types of queries are currently supported by this package:

- `query_fetch()`
- `query_facet()`
After a query is initiated, it can be built upon by piping various operations:

- `filter_*()`: filter on specified values of a field (`*`), for both fetch and facet queries
- `select_fields()`: specify fields to select in the returned documents, only for fetch queries
- `sort_by()`: specify the field by which to sort the returned documents, only for fetch queries
- `facet_by()`: specify a field to facet by, only for facet queries
- `facet_date_range()`: specify a date range facet, only for facet queries

Fetch queries simply retrieve documents based on filtering criteria.
A fetch query is initialized using `query_fetch()`, which takes as its primary argument the connection object. Additional arguments include:
- `max`: The maximum number of documents to return. The default is 10. If set to `-1` (or anything less than 0), all documents that match the query parameters will be returned. Often it is good to run a query with the default value of `max`, or even `max = 0`, to get a feel for how many documents are in the query before deciding to pull the full set of documents. Even when `max = 0` is set, the query result will include an indication of the total number of documents in the query. An example of this is shown below.
- `format`: One of "list", "xml", or "file". The default is "list", in which case the documents are read into memory and converted to a more R-convenient list format using `xml_to_list()`. If "xml", the documents will be read into R and returned as an xml2 "xml_document" object. If "file", the xml file(s) will simply be downloaded and the path to the directory containing these files will be returned. Note that if `max = -1` and `format` is not "file", it will be forced to "file" (and a temporary directory will be used if no path is specified), since the number of documents returned could potentially be very large. After pulling all the documents, if the number of documents is small enough to read into memory in a manageable way (currently set at <= 100k documents), the original `format` specification will be honored.
- `path`: If `format = "file"`, the path where the downloaded xml files will be stored can be specified. If not specified, a temporary directory will be used. The specified path should ideally be empty.
- `size`: The number of documents to pull in each batch of pagination (see the note below on pagination). The default is 10,000, which is the maximum allowed by Solr.

Note: Fetch queries automatically take care of pagination using cursors to retrieve potentially very large sets of documents. The pagination limit is 10k documents, so iterative queries are run to fetch these in batches and piece them together upon retrieval.
Initiating an "empty" query is as easy as the following:
```r
qry <- query_fetch(con)
```
The query can be built upon by passing the `qry` object to other functions, such as the `filter_*()` functions, to further refine the query.
At any point when constructing a query, the query object can be passed to either of:

- `get_query()`: prints out a Solr API call that can be inspected or pasted into a web browser to retrieve the results
- `run()`: runs the query and returns the documents in the specified format
To see our simple query:
```r
get_query(qry)
```
To run it:
```r
res <- run(qry)
```
As stated earlier, by default the returned object is an R-friendly list. The structure is one list element per document:
```r
str(res, 1)
#> List of 10
#>  $ :List of 13
#>   ..- attr(*, "id")= chr "ecuadorenvivo-031d3f5dd16162472d0c4726887f69c1"
#>  $ :List of 14
#>   ..- attr(*, "id")= chr "ndr-128df0ba5b42818a59251419a7e17f64"
#>  $ :List of 15
#>   ..- attr(*, "id")= chr "ndr-138d21d990ff5b41d1f9247b0d1dea97"
#>  $ :List of 15
#>   ..- attr(*, "id")= chr "ndr-927a20c1c73d32a0c5f49b14064e2e4f"
#>  $ :List of 13
#>   ..- attr(*, "id")= chr "ndr-9b95c471e05bd7783c9253fc8e3637b9"
#>  $ :List of 12
#>   ..- attr(*, "id")= chr "ndr-1c2a88c3fae62353e26b862a4706a0a3"
#>  $ :List of 14
#>   ..- attr(*, "id")= chr "ndr-299d9142b0b5a72ba7a6496d3d4fccc8"
#>  $ :List of 15
#>   ..- attr(*, "id")= chr "deutschewelle-ro-72795b85ad41330a95a633210156a25c"
#>   ..- attr(*, "duplicate")= chr "deutschewelle-sq-72795b85ad41330a95a633210156a25c"
#>  $ :List of 15
#>   ..- attr(*, "id")= chr "deutschewelle-ro-f4fd8c657065e830055b0397da119ea9"
#>   ..- attr(*, "duplicate")= chr "deutschewelle-sr-f4fd8c657065e830055b0397da119ea9"
#>  $ :List of 16
#>   ..- attr(*, "id")= chr "cleantechnica-d2dd2e280252de07a4524aad60621ffd"
#>  - attr(*, "class")= chr [1:2] "finder_docs" "list"
#>  - attr(*, "meta")=List of 6
```
The number of fields in a document varies per document, depending on its content. To look at the structure of the first document:
```r
str(res[[1]], 1)
#> List of 13
#>  $ title      :List of 2
#>  $ link       :List of 1
#>  $ description:List of 2
#>  $ contentType:List of 1
#>  $ pubDate    :List of 1
#>  $ source     :List of 1
#>   ..- attr(*, "url")= chr "http://www.ecuadorenvivo.com/index.php?format=feed&type=rss"
#>   ..- attr(*, "country")= chr "EC"
#>  $ language   :List of 1
#>  $ guid       :List of 1
#>  $ category   :List of 2
#>  $ favicon    :List of 1
#>  $ georss     :List of 1
#>   ..- attr(*, "name")= chr "Guayaquil:Guayaquil:Guayas:Ecuador"
#>   ..- attr(*, "id")= chr "16913475"
#>   ..- attr(*, "lat")= chr "-2.20382"
#>   ..- attr(*, "lon")= chr "-79.8975"
#>   ..- attr(*, "count")= chr "2"
#>   ..- attr(*, "pos")= chr "72,256"
#>   ..- attr(*, ".class")= chr "2"
#>   ..- attr(*, "iso")= chr "EC"
#>   ..- attr(*, "charpos")= chr "72,256"
#>   ..- attr(*, "wordlen")= chr "9,9"
#>  $ tonality   :List of 1
#>  $ text       :List of 1
#>   ..- attr(*, "wordCount")= chr "188"
#>  - attr(*, "id")= chr "ecuadorenvivo-031d3f5dd16162472d0c4726887f69c1"
```
Note that the structure of returned documents is difficult to flatten to a tabular format. Most fields can appear a variable number of times, such as the title for this first document:
```r
res[[1]]$title
#> [[1]]
#> [1] "Solucionan problema de acumulación de aguas en Pascuales | Municipio de Guayaquil"
#>
#> [[2]]
#> [1] "Solve problem of water cumulation Pascuales' municipality of Guayaquil"
#> attr(,"lang")
#> [1] "en"
```
In addition to fields appearing an unpredictable number of times, they also have attributes that are often important to preserve, but are not always predictable. In this case there is an attribute indicating that the second title is an English translation. Preserving attributes also makes flattening more difficult.
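For example, here is a small sketch of how one might pick out the English version of the title by checking each element's "lang" attribute (assuming, as above, that translated entries carry `lang = "en"`):

```r
# Keep only title elements whose "lang" attribute is "en"
title_en <- Filter(
  function(x) identical(attr(x, "lang"), "en"),
  res[[1]]$title
)
title_en
```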
Recall that the default fetch query returns 10 documents. We can see how many total documents match the query parameters by using the convenience function `n_docs()` on our output:
```r
n_docs(res)
#> 9004209
```
It is probably more desirable for a fetch query to pinpoint records of interest rather than to retrieve all documents. This can be done by adding filters to the query.
Filtering is added by filter functions specified for each filterable field, each of which begins with `filter_` and ends with the name of the field being filtered.
The following sections illustrate examples of all of the available filters in this package.
Note that filters can apply to both fetch and facet queries.
Several fields in the data are categorical and can be filtered based on a specified term or set of terms. These are case insensitive.
filter_category()
`filter_category()` allows you to filter on the `category` document field.
Note that it is convenient to build queries using the "pipe" (`%>%`) operator, which allows us to string together multiple commands, including `run()` at the end to run the query.
Here we filter to documents that contain the "CoronavirusInfection" category.
```r
res <- query_fetch(con) %>%
  filter_category("CoronavirusInfection") %>%
  run()
```
To see a list of category values that exist in the data:
```r
head(valid_categories(con))
#> [1] "abrin"           "acaricides"      "acinetobacter"   "acremonium"
#> [5] "acrolein"        "acrylamidfurans"
```
If you specify a vector of terms, all documents will be selected that match any of the provided values (OR logic). For example:
```r
res <- query_fetch(con) %>%
  filter_category(c("CoronavirusInfection", "PAHO")) %>%
  run()
```
If you would like to exclude a term you can prepend it with "-" or "!". For example, if we want to filter to documents that contain the category "CoronavirusInfection" but NOT the category "PAHO":
```r
res <- query_fetch(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 1) %>%
  filter_category(c("CoronavirusInfection", "-PAHO")) %>%
  run()
```
We can use any valid Solr specification of term combinations. For example, for documents that contain the category "CoronavirusInfection" AND the category "PAHO":
```r
res <- query_fetch(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 1) %>%
  filter_category("CoronavirusInfection AND PAHO") %>%
  run()
```
filter_country()
To filter on documents for which the source comes from Germany or France:
```r
res <- query_fetch(con) %>%
  filter_country(c("de", "fr")) %>%
  run()
```
Note that when you specify a vector of values, it matches documents where any of those terms are found.
To see a list of country values that exist in the data:
```r
head(valid_countries(con))
#> [1] "ad" "ae" "af" "ag" "ai" "al"
```
filter_language()
We can use `filter_language()` to filter on languages in the documents:
```r
res <- query_fetch(con) %>%
  filter_language(c("de", "fr")) %>%
  run()
```
To see a list of language values that exist in the data:
```r
head(valid_languages(con))
#> [1] "af" "am" "ar" "az" "be" "bg"
```
filter_source()
To filter on the document source, we can use `filter_source()`.

Here, we filter on documents with source "bbc*", where the "*" is a wildcard.
```r
res <- query_fetch(con) %>%
  filter_source("bbc*") %>%
  run()
```
To inspect the actual sources returned:
```r
unique(unlist(lapply(res, function(x) x$source)))
#> [1] "bbc-swahili"           "bbc-health-html"       "bbc-spanish"
#> [4] "bbc-portuguese-brasil" "bbcnepalirss"          "bbc-turkce"
```
To see a list of source values that exist in the data:
```r
head(valid_sources(con))
#> [1] "055firenze"       "100noticias"      "10minuta"         "112-utrecht"
#> [5] "112achterhoek"    "112brabantnieuws"
```
filter_duplicate()
We can filter on whether or not the document is flagged as a duplicate with `filter_duplicate()`:
```r
res <- query_fetch(con, max = 10) %>%
  filter_duplicate(TRUE) %>%
  run()
```
Duplicate is either `true` or `false`:
```r
head(valid_duplicate(con))
#> [1] "false" "true"
```
The `filter_text()` function allows you to specify a string or a vector of strings to search for in the document's title, description, and document text, including translations.

Below are some examples of using `filter_text()`:
```r
# having a word matching either "european" or "commission"
res <- query_fetch(con) %>%
  filter_text(c("european", "commission")) %>%
  run()

# having a word starting with "euro"
res <- query_fetch(con, max = 20) %>%
  filter_text("euro*") %>%
  run()

# having a word that matches "euro?ean" (question mark is a single wildcard)
res <- query_fetch(con, max = 20) %>%
  filter_text("euro?ean") %>%
  run()

# having a word that sounds like "europese" (with default edit distance 2)
res <- query_fetch(con, max = 20) %>%
  filter_text("europese~") %>%
  run()

# having a word that sounds like "Europian" (with edit distance 1)
res <- query_fetch(con, max = 20) %>%
  filter_text("Europian~1") %>%
  run()
```
Two date filter functions are available, `filter_pubdate()` and `filter_indexdate()`, which filter on the publication date and indexed date, respectively.
These functions have arguments `from` and `to`, which take a "Date" or "POSIXct" object as input. If left unspecified, the filter will be open-ended on that side of the range. Instead of specifying a range, you can also specify `on` to get all articles published on a specific day.
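For instance, a sketch of using `on` to restrict results to articles published on a single day (the date is just an example):

```r
# All documents published on a specific day
res <- query_fetch(con) %>%
  filter_pubdate(on = as.Date("2020-09-10")) %>%
  run()
```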
The example below queries all documents in German that have category "CoronavirusInfection" that were published in the last two days:
```r
res <- query_fetch(con) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  run()
```
Note here that we are stringing multiple filter commands together to get to what we are looking for. You can string as many filters together as you wish, but note that if you apply the same filter twice, the second one will override the first.
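To make the override behavior concrete, here is a small sketch: the second `filter_language()` call replaces the first, so this fetch returns French documents only.

```r
# The second filter_language() overrides the first
res <- query_fetch(con) %>%
  filter_language("de") %>%
  filter_language("fr") %>%
  run()
```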
A few other filter functions are available.
filter_tonality()
`filter_tonality()` is a special range filter that takes numeric values. In addition to having `from` and `to` parameters that behave in the same way as the date range filters, it also has a `value` parameter which can be used to retrieve documents with a specific tonality value.
```r
res <- query_fetch(con) %>%
  filter_tonality(4) %>%
  run()

unlist(lapply(res, function(x) x$tonality))
#> [1] "4" "4" "4" "4" "4" "4" "4" "4" "4" "4"

res <- query_fetch(con) %>%
  filter_tonality(from = 4) %>%
  run()

unlist(lapply(res, function(x) x$tonality))
#> [1] "12" "12" "8"  "23" "6"  "14" "9"  "4"  "4"  "10"
```
Additional filters exist to filter on various IDs: `filter_entityid()`, `filter_georssid()`, and `filter_guid()`.
Below is an example of filtering on guid:
```r
res <- query_fetch(con, max = 5) %>%
  filter_guid("washtimes-*") %>%
  run()

unlist(lapply(res, function(x) x$guid))
#> [1] "washtimes-91e4808dcddb198e95b7583fdb162225"
#> [2] "washtimes-b1f724b70e57280dbd42ed82763fb341"
#> [3] "washtimes-3ddcef4e8631e60586391695c43a82ac"
#> [4] "washtimes-7a44cd4213b3a663664da786b1242a38"
#> [5] "washtimes-73daa638c1adafe89815c7c3f0c42e06"
```
Another operation available only for fetch queries is `sort_by()`, which allows you to specify fields to sort by as part of the fetch.

For example, here we run a query similar to one shown earlier, where we additionally sort the results by date of publication.
```r
res <- query_fetch(con) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  sort_by("pubdate", asc = FALSE) %>%
  run()

unlist(sapply(res, function(x) x$pubDate))
#>  [1] "2020-09-12T21:40+0000" "2020-09-12T21:40+0000" "2020-09-12T21:40+0000"
#>  [4] "2020-09-12T21:39+0000" "2020-09-12T21:35+0000" "2020-09-12T21:35+0000"
#>  [7] "2020-09-12T21:32+0000" "2020-09-12T21:32+0000" "2020-09-12T21:31+0000"
#> [10] "2020-09-12T21:30+0000"
```
To see what fields are available to sort on:
```r
queryable_fields()
```
An operation available only for fetch queries, `select_fields()`, allows us to specify which fields should be returned for each document. This is useful if documents contain some fields that are very large and we don't want to include them in our results.
Note that, to our knowledge, Finder does not provide a way to specify these fields at the time the query is run, so all fields are returned by the query and the unwanted fields are removed after being fetched. This means that there are no performance gains in terms of network transfer time, but there are gains in final file size.
To see which fields can be selected:
```r
selectable_fields()
```
Note that these do not exactly match the fields that we can filter on. What is represented in a document vs. what is exposed by Finder to query on are not necessarily the same thing.
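One way to see the difference is to compare the two helpers directly; this sketch assumes both return character vectors of field names.

```r
# Fields that can be selected but not queried on, and vice versa
setdiff(selectable_fields(), queryable_fields())
setdiff(queryable_fields(), selectable_fields())
```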
Suppose we have our same query as before, but we only want to pull out the document title and publication date:
```r
res <- query_fetch(con) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  sort_by("pubdate", asc = FALSE) %>%
  select_fields(c("title", "pubDate")) %>%
  run()
```
All of the results of queries above have used the default output format of "list", where the xml that is returned from the query is transformed to an R-friendly list as shown earlier.
If we instead specify that the output format is "file", a path is returned which we can read in or do something else with:
```r
res <- query_fetch(con, format = "file") %>%
  run()

res
#> [1] "/tmp/Rtmpf3ZFDa/file13cb810cbc6cb.xml"
```
If we specify the format to be "xml", then the raw xml is read into R using the xml2 package.
```r
res <- query_fetch(con, format = "xml") %>%
  run()

res
#> {xml_document}
#> <rss version="2.0" xmlns:emm="http://emm.jrc.it" xmlns:iso="http://www.iso.org/3166" xmlns:gphin="http://gphin.canada.ca">
#> [1] <channel>\n  <title/>\n  <pubDate>Sat, 12 Sep 2020 21:34:33 UTC</pubDate>\n < ...
```
Note that we can convert this to our list format with the following:
```r
res <- xml_to_list(res)
```
There is also a function, `list_to_df()`, to convert fetch results to a data frame. Note that in general a lot of information can be lost when doing this, such as attributes or entries that have more than one element.

Columns "title" and "description" often have two elements, with the second being the English translation. Because of this, columns "title", "title_en", "description", and "description_en" are created to preserve this. Columns "entity", "category", and "fullgeo" usually have more than one entry per document and as such are added to the data frame as "list columns".
```r
res <- query_fetch(con) %>% run()
res_df <- list_to_df(res)
res_df
#> # A tibble: 10 x 17
#>    title title_en link  description description_en contentType pubDate source
#>    <chr> <chr>    <chr> <chr>       <chr>          <chr>       <chr>   <chr>
#>  1 Dipl… "Diplom… http… "Dans sa r… In the headin… text/html   2020-0… Emmne…
#>  2 Séné… "Senega… http… "L’avocat … The senegales… text/html   2020-0… Emmne…
#>  3 La q… "Quaran… http… "Pour évit… To avoid this… text/html   2020-0… Emmne…
#>  4 Son … "His So… http… "Le consei… The national … text/html   2020-0… Emmne…
#>  5 Covi… "Covid … http… ".\n\n#Aut… . # other cou… text/html   2020-0… Emmne…
#>  6 Pers… "Africa… http… "#Autres p… # other count… text/html   2020-0… Emmne…
#>  7 Coro… "Corona… http… "Le corona… The Coronavir… text/html   2020-0… Emmne…
#>  8 اليو… "اليونا… http… "."         .              text/html   2020-0… cairo…
#>  9 ترام… "ترامب … http… "."         .              text/html   2020-0… cairo…
#> 10 Colo… "Colomb… http… "MADRID, 9… Madrid, 9 Sep… text/html   2020-0… europ…
#> # … with 9 more variables: language <chr>, guid <chr>, category <list>,
#> #   favicon <chr>, entity <list>, georss <chr>, fullgeo <list>, tonality <chr>,
#> #   text <chr>
```
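Since "category", "entity", and "fullgeo" come back as list columns, their per-document entries can be inspected by indexing into the column, for example (a sketch):

```r
# Categories attached to the first document in the data frame
res_df$category[[1]]
```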
A function `write_docs_csv()` also exists that will flatten the list-column variables into strings and write the result to a csv file:
```r
write_docs_csv(res_df, path = "/tmp/tmp.csv")
```
In the previous fetch examples, the returned object `res` has been a list representation of the document content of the query.

In many cases we may wish to do a bulk download of many articles. If we specify a `path` argument to `query_fetch()`, the results will be written in batches to the specified directory.
When running a potentially large query, it is good to first run it with `max = 0` so that no documents are returned. Queries that may involve a large number of documents but do not return any documents can run fairly quickly, and we can examine the results to find out how many documents match the query parameters and whether it would be a good idea to try to download them.
For example, with a query similar to what we specified earlier, suppose we want to get all documents in the last 5 days that are in German and have the "CoronavirusInfection" category.
```r
res <- query_fetch(con, max = 0) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 5) %>%
  sort_by("pubdate", asc = FALSE) %>%
  run()

n_docs(res)
#> [1] 14032
```
We see that there are about 14k documents, which is not a lot to try to download.
Now, to get all of these documents with pagination, we can set `max = -1` and provide a directory to write the output to:
```r
tf <- tempfile()
dir.create(tf)

res <- query_fetch(con, max = -1, format = "file", path = tf) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 5) %>%
  # select_fields("text") %>%
  sort_by("pubdate", asc = FALSE) %>%
  run()
#> 10000 documents fetched (71%)...
#> 14032 documents fetched (100%)...

list.files(tf)
#> [1] "out0001.xml" "out0002.xml"
```
The ~14k documents were downloaded over two files located in the output directory we specified.
Note that since 14k documents isn't that large, we may want to read them all into memory in R in the more convenient list format:
```r
xml <- lapply(list.files(tf, full.names = TRUE), xml2::read_xml)
docs <- unlist(lapply(xml, xml_to_list), recursive = FALSE)
length(docs)
#> [1] 14032
```
Facet queries are constructed using the following functions:

- `query_facet()`
- `facet_by()`
- `facet_date_range()`
To initiate a facet query, we use the function `query_facet()` and pass it our connection object.
```r
query <- query_facet(con)
```
Similarly to fetch queries, we can call `get_query()` or `run()` to print the query string or run the query.
Suppose we want to tabulate the frequency of all of the categories in the documents over the last two days. We can do this by adding `facet_by()` to our query, specifying the field name "category".
```r
query_facet(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  facet_by("category") %>%
  get_query()
#> [1] "op=search&q=*:*&rows=0&facet=true&native=true&facet.field=category&facet.limit=-1&facet.sort=count&facet.mincount=0&facet.offset=0&fq=pubdate%3A%5B2020-09-10T00%3A00%3A00Z%20TO%20%2A%5D"
```
Here we see the facet query string that is constructed by this query specification.
To run the query, we use `run()` as we did with fetch queries. Here, we additionally specify in `facet_by()` that we only want the top 10 categories.
```r
res <- query_facet(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  facet_by("category", limit = 10) %>%
  run()

res
#> # A tibble: 10 x 2
#>    category                             n
#>    <chr>                            <dbl>
#>  1 fifa2018participatingcountries 458411
#>  2 euro                           452639
#>  3 coronavirusinfection           209279
#>  4 paho                           208024
#>  5 wpro                           115488
#>  6 radnucnonelist                 109946
#>  7 emro                            94203
#>  8 usa                             91441
#>  9 italy                           61976
#> 10 implantrisks                    51553
```
Unlike fetch queries, facet queries always return a data frame.
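Because the result is an ordinary data frame, it can be passed straight to standard R tooling. As a minimal sketch using base graphics on the top-10 facet computed above:

```r
# Quick bar chart of the top category counts from the facet query above
barplot(res$n, names.arg = res$category, las = 2, cex.names = 0.7)
```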
It is possible to facet on up to two fields.
```r
res <- query_facet(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  facet_by(c("category", "language"), limit = 100) %>%
  run()

res
#> # A tibble: 4,403 x 3
#>    category language     n
#>    <chr>    <chr>    <dbl>
#>  1 euro     en       55307
#>  2 euro     it       47138
#>  3 euro     es       39303
#>  4 euro     el       35447
#>  5 euro     ru       33457
#>  6 euro     de       32450
#>  7 euro     fr       21507
#>  8 euro     tr       14694
#>  9 euro     pt       13721
#> 10 euro     ar       13421
#> # … with 4,393 more rows
```
To see what fields are available to facet on:
```r
facetable_fields()
```
A faceting function, `facet_date_range()`, is available to specify faceting on dates.
The main parameters for this function are:

- `field`: either "pubdate" (default) or "indexdate"
- `start`, `end`: the start and end dates to facet over, specified as "Date" objects
- `gap`: the bin size to facet over, specified using the `range_gap()` function as illustrated in the examples below

To facet daily by publication date for all "CoronavirusInfection" articles over the past 20 days:
```r
res <- query_facet(con) %>%
  filter_category("coronavirusinfection") %>%
  facet_date_range(
    start = as.Date(Sys.time()) - 20,
    end = as.Date(Sys.time()),
    gap = range_gap(1, "DAY")
  ) %>%
  run()
#> # A tibble: 20 x 2
#>    pubdate                 n
#>    <dttm>              <dbl>
#>  1 2020-08-23 00:00:00 54439
#>  2 2020-08-24 00:00:00 79749
#>  3 2020-08-25 00:00:00 87616
#>  4 2020-08-26 00:00:00 82739
#>  5 2020-08-27 00:00:00 82490
#>  6 2020-08-28 00:00:00 79061
#>  7 2020-08-29 00:00:00 54291
#>  8 2020-08-30 00:00:00 50336
#>  9 2020-08-31 00:00:00 76238
#> 10 2020-09-01 00:00:00 84864
#> 11 2020-09-02 00:00:00 86083
#> 12 2020-09-03 00:00:00 85665
#> 13 2020-09-04 00:00:00 81934
#> 14 2020-09-05 00:00:00 57294
#> 15 2020-09-06 00:00:00 50430
#> 16 2020-09-07 00:00:00 77083
#> 17 2020-09-08 00:00:00 85925
#> 18 2020-09-09 00:00:00 90019
#> 19 2020-09-10 00:00:00 75486
#> 20 2020-09-11 00:00:00 76962
```
The `range_gap()` function takes two parameters: the number of time units the facet window gap should span, and the time unit, which is one of "DAY" (default), "MINUTE", "HOUR", "WEEK", "MONTH", or "YEAR".
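For example, a weekly bin could be specified as follows (a sketch based on the parameters described above):

```r
# Facet in one-week bins
range_gap(1, "WEEK")
# or, using day units, in 7-day bins
range_gap(7, "DAY")
```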
Here we run the same facet but hourly:
```r
res <- query_facet(con) %>%
  filter_category("coronavirusinfection") %>%
  facet_date_range(
    start = as.Date(Sys.time()) - 20,
    end = as.Date(Sys.time()),
    gap = range_gap(1, "HOUR")
  ) %>%
  run()
#> # A tibble: 480 x 2
#>    pubdate                 n
#>    <dttm>              <dbl>
#>  1 2020-08-23 00:00:00  1043
#>  2 2020-08-23 00:00:00  1131
#>  3 2020-08-23 00:00:00  1186
#>  4 2020-08-23 00:00:00  1088
#>  5 2020-08-23 00:00:00  2566
#>  6 2020-08-23 00:00:00  2276
#>  7 2020-08-23 00:00:00  2344
#>  8 2020-08-23 00:00:00  2075
#>  9 2020-08-23 00:00:00  2915
#> 10 2020-08-23 00:00:00  2491
#> # … with 470 more rows
```
There are many ways queries can be constructed with Solr/Finder. The functions for fetching and faceting provided above are meant to cover the vast majority of use cases, but their simplified API might not allow for some very special cases. If one is very familiar with Finder/Solr and wants to use this package to execute their own custom queries, there is a simple mechanism for doing this:
```r
# query_str() allows you to run a query that you have already constructed
# a query string for

# default is to return a list
res <- query_str(con, "op=search&q=*:*&rows=0") %>% run()

str(res$rss$channel, 1)
#> List of 8
#>  $ title   : list()
#>  $ pubDate :List of 1
#>  $ q       :List of 1
#>  $ message : list()
#>  $ QTime   :List of 1
#>  $ numFound:List of 1
#>  $ start   :List of 1
#>  $ rows    :List of 1

# to return in xml format
res <- query_str(con, "op=search&q=*:*&rows=0", format = "xml") %>% run()

res
#> {xml_document}
#> <rss version="2.0" xmlns:emm="http://emm.jrc.it" xmlns:iso="http://www.iso.org/3166" xmlns:gphin="http://gphin.canada.ca">
#> [1] <channel>\n  <title/>\n  <pubDate>Sat, 12 Sep 2020 22:01:54 UTC</pubDate>\n < ...
```
This package is experimental and has not undergone rigorous testing to verify the correctness of the constructed queries. Use at your own risk.
The package has been written to cover a large number of immediate use cases. However, there are many additional features and parameters of Finder/Solr that could be exposed through this interface in the future.