The first thing to do is to establish a connection with the Finder server by specifying its IP address and optionally the port. In our case, 10.49.4.6.
```r
library(finderquery)

con <- finder_connect("10.49.4.6")
```
This connection object is passed to queries so they know which server to run against.
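If your Finder instance listens on a non-default port, it can be supplied when connecting. The snippet below is only a sketch: it assumes `finder_connect()` accepts a `port` argument, and 8983 is used purely as an illustration.

```r
# Hypothetical: connect on an explicit port (argument name and value assumed)
con <- finder_connect("10.49.4.6", port = 8983)
```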
Note that if you are working in an R console on the .18 or .19 machines, you may need to run the following for the Solr server to be reachable:
```r
Sys.unsetenv("NO_PROXY")
Sys.unsetenv("HTTPS_PROXY")
Sys.unsetenv("HTTP_PROXY")
Sys.unsetenv("https_proxy")
Sys.unsetenv("http_proxy")
```
Two major types of queries are currently supported by this package:

- `query_fetch()`
- `query_facet()`
After a query is initiated, it can be built upon by piping various operations:

- `filter_*()`: filter on specified values of a field (`*`), for both fetch and facet queries
- `select_fields()`: specify fields to select in the returned documents, only for fetch queries
- `sort_by()`: specify the field by which to sort the returned documents, only for fetch queries
- `facet_by()`: specify a field to facet by, only for facet queries
- `facet_date_range()`: specify a date range facet, only for facet queries

Fetch queries simply retrieve documents based on filtering criteria.
A fetch query is initialized using `query_fetch()`, which takes as its primary argument the connection object. Additional arguments include:
- `max`: The maximum number of documents to return. The default is 10. If set to `-1` (or anything less than 0), all documents that match the query parameters will be returned. Often it is good to run a query with the default value of `max`, or even `max = 0`, to get a feel for how many documents are in the query before deciding to pull the full set of documents. Even when `max = 0` is set, the query result will include an indication of the total number of documents in the query. An example of this is shown below.
- `format`: One of "list", "xml", or "file". The default is "list", in which case the documents are read into memory and converted to a more R-convenient list format using `xml_to_list()`. If "xml", the documents will be read into R and returned as an xml2 "xml_document" object. If "file", the xml file(s) will simply be downloaded and the path to the directory containing these files will be returned. Note that if `max = -1` and `format` is not "file", it will be forced to "file" (and a temporary directory will be used if no path is specified), since the number of documents returned could potentially be very large. After pulling all the documents, if the number of documents is small enough to read into memory in a manageable way (currently set at <= 100k documents), the original `format` specification will be honored.
- `path`: If `format = "file"`, the path where the downloaded xml files will be stored can be specified. If not specified, a temporary directory will be used. The specified path should ideally be empty.
- `size`: The number of documents to pull in each batch of pagination (see the note below on pagination). The default is 10,000, which is the maximum allowed by Solr.

Note: Fetch queries automatically take care of pagination using cursors to retrieve potentially very large sets of documents. The pagination limit is 10k documents, so iterative queries are run to fetch these in batches and piece them together upon retrieval.
Initiating an "empty" query is as easy as the following:
```r
qry <- query_fetch(con)
```
The query can be built upon by passing the `qry` object to other functions, such as the `filter_*()` functions, to further refine the query.
At any point when constructing a query, the query object can be passed to either of:

- `get_query()`: prints out a Solr API call that can be inspected or pasted into a web browser to retrieve the results
- `run()`: runs the query and returns the documents in the specified format
To see our simple query:
```r
get_query(qry)
```
To run it:
```r
res <- run(qry)
```
As stated earlier, by default the returned object is an R-friendly list. The structure is one list element per document:
```r
str(res, 1)
#> List of 10
#>  $ :List of 13
#>   ..- attr(*, "id")= chr "ecuadorenvivo-031d3f5dd16162472d0c4726887f69c1"
#>  $ :List of 14
#>   ..- attr(*, "id")= chr "ndr-128df0ba5b42818a59251419a7e17f64"
#>  $ :List of 15
#>   ..- attr(*, "id")= chr "ndr-138d21d990ff5b41d1f9247b0d1dea97"
#>  $ :List of 15
#>   ..- attr(*, "id")= chr "ndr-927a20c1c73d32a0c5f49b14064e2e4f"
#>  $ :List of 13
#>   ..- attr(*, "id")= chr "ndr-9b95c471e05bd7783c9253fc8e3637b9"
#>  $ :List of 12
#>   ..- attr(*, "id")= chr "ndr-1c2a88c3fae62353e26b862a4706a0a3"
#>  $ :List of 14
#>   ..- attr(*, "id")= chr "ndr-299d9142b0b5a72ba7a6496d3d4fccc8"
#>  $ :List of 15
#>   ..- attr(*, "id")= chr "deutschewelle-ro-72795b85ad41330a95a633210156a25c"
#>   ..- attr(*, "duplicate")= chr "deutschewelle-sq-72795b85ad41330a95a633210156a25c"
#>  $ :List of 15
#>   ..- attr(*, "id")= chr "deutschewelle-ro-f4fd8c657065e830055b0397da119ea9"
#>   ..- attr(*, "duplicate")= chr "deutschewelle-sr-f4fd8c657065e830055b0397da119ea9"
#>  $ :List of 16
#>   ..- attr(*, "id")= chr "cleantechnica-d2dd2e280252de07a4524aad60621ffd"
#>  - attr(*, "class")= chr [1:2] "finder_docs" "list"
#>  - attr(*, "meta")=List of 6
```
The number of fields in a document varies per document, depending on its content. To look at the structure of the first document:
```r
str(res[[1]], 1)
#> List of 13
#>  $ title      :List of 2
#>  $ link       :List of 1
#>  $ description:List of 2
#>  $ contentType:List of 1
#>  $ pubDate    :List of 1
#>  $ source     :List of 1
#>   ..- attr(*, "url")= chr "http://www.ecuadorenvivo.com/index.php?format=feed&type=rss"
#>   ..- attr(*, "country")= chr "EC"
#>  $ language   :List of 1
#>  $ guid       :List of 1
#>  $ category   :List of 2
#>  $ favicon    :List of 1
#>  $ georss     :List of 1
#>   ..- attr(*, "name")= chr "Guayaquil:Guayaquil:Guayas:Ecuador"
#>   ..- attr(*, "id")= chr "16913475"
#>   ..- attr(*, "lat")= chr "-2.20382"
#>   ..- attr(*, "lon")= chr "-79.8975"
#>   ..- attr(*, "count")= chr "2"
#>   ..- attr(*, "pos")= chr "72,256"
#>   ..- attr(*, ".class")= chr "2"
#>   ..- attr(*, "iso")= chr "EC"
#>   ..- attr(*, "charpos")= chr "72,256"
#>   ..- attr(*, "wordlen")= chr "9,9"
#>  $ tonality   :List of 1
#>  $ text       :List of 1
#>   ..- attr(*, "wordCount")= chr "188"
#>  - attr(*, "id")= chr "ecuadorenvivo-031d3f5dd16162472d0c4726887f69c1"
```
Note that the structure of returned documents is difficult to flatten to a tabular format. Most fields can appear a variable number of times, such as the title for this first document:
```r
res[[1]]$title
#> [[1]]
#> [1] "Solucionan problema de acumulación de aguas en Pascuales | Municipio de Guayaquil"
#>
#> [[2]]
#> [1] "Solve problem of water cumulation Pascuales' municipality of Guayaquil"
#> attr(,"lang")
#> [1] "en"
```
In addition to fields appearing an unpredictable number of times, they also have attributes that are often important to preserve, but are not always predictable. In this case there is an attribute indicating that the second title is an English translation. Preserving attributes also makes flattening more difficult.
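For example, here is a small sketch of how one might pick out the English version of the title by checking each element's "lang" attribute (assuming, as above, that translated entries carry `lang = "en"`):

```r
# Keep only title elements whose "lang" attribute is "en"
title_en <- Filter(
  function(x) identical(attr(x, "lang"), "en"),
  res[[1]]$title
)
title_en
```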
Recall that the default fetch query returns 10 documents. We can see how many total documents match the query parameters by using the convenience function `n_docs()` on our output:
```r
n_docs(res)
#> 9004209
```
It is probably more desirable for a fetch query to pinpoint records of interest rather than to retrieve all documents. This can be done by adding filters to the query.
Filtering is added by filter functions specified for each filterable field, each of which begins with `filter_` and ends with the name of the field being filtered.
The following sections illustrate examples of all of the available filters in this package.
Note that filters can apply to both fetch and facet queries.
Several fields in the data are categorical and can be filtered based on a specified term or set of terms. These are case insensitive.
filter_category()
`filter_category()` allows you to filter on the `category` document field.
Note that it is convenient to build queries using the "pipe" (`%>%`) operator, which allows us to string together multiple commands, including `run()` at the end to run the query.
Here we filter to documents that contain the "CoronavirusInfection" category.
```r
res <- query_fetch(con) %>%
  filter_category("CoronavirusInfection") %>%
  run()
```
To see a list of category values that exist in the data:
```r
head(valid_categories(con))
#> [1] "abrin"           "acaricides"      "acinetobacter"   "acremonium"
#> [5] "acrolein"        "acrylamidfurans"
```
If you specify a vector of terms, all documents will be selected that match any of the provided values (OR logic). For example:
```r
res <- query_fetch(con) %>%
  filter_category(c("CoronavirusInfection", "PAHO")) %>%
  run()
```
If you would like to exclude a term you can prepend it with "-" or "!". For example, if we want to filter to documents that contain the category "CoronavirusInfection" but NOT the category "PAHO":
```r
res <- query_fetch(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 1) %>%
  filter_category(c("CoronavirusInfection", "-PAHO")) %>%
  run()
```
We can use any valid Solr specification of term combinations. For example, for documents that contain the category "CoronavirusInfection" AND the category "PAHO":
```r
res <- query_fetch(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 1) %>%
  filter_category("CoronavirusInfection AND PAHO") %>%
  run()
```
filter_country()
To filter on documents for which the source comes from Germany or France:
```r
res <- query_fetch(con) %>%
  filter_country(c("de", "fr")) %>%
  run()
```
Note that when you specify a vector of values, it matches documents where any of those terms are found.
To see a list of country values that exist in the data:
```r
head(valid_countries(con))
#> [1] "ad" "ae" "af" "ag" "ai" "al"
```
filter_language()
We can use `filter_language()` to filter on languages in the documents:
```r
res <- query_fetch(con) %>%
  filter_language(c("de", "fr")) %>%
  run()
```
To see a list of language values that exist in the data:
```r
head(valid_languages(con))
#> [1] "af" "am" "ar" "az" "be" "bg"
```
filter_source()
To filter on the document source, we can use `filter_source()`.

Here, we filter on documents with source "bbc*", where the "*" is a wildcard.
```r
res <- query_fetch(con) %>%
  filter_source("bbc*") %>%
  run()
```
To inspect the actual sources returned:
```r
unique(unlist(lapply(res, function(x) x$source)))
#> [1] "bbc-swahili"           "bbc-health-html"       "bbc-spanish"
#> [4] "bbc-portuguese-brasil" "bbcnepalirss"          "bbc-turkce"
```
To see a list of source values that exist in the data:
```r
head(valid_sources(con))
#> [1] "055firenze"       "100noticias"      "10minuta"         "112-utrecht"
#> [5] "112achterhoek"    "112brabantnieuws"
```
filter_duplicate()
We can filter on whether or not the document is flagged as a duplicate with `filter_duplicate()`:
```r
res <- query_fetch(con, max = 10) %>%
  filter_duplicate(TRUE) %>%
  run()
```
Duplicate is either `true` or `false`:
```r
head(valid_duplicate(con))
#> [1] "false" "true"
```
The `filter_text()` function allows you to specify a string or a vector of strings to search for in the document's title, description, and document text, including translations.

Below are some examples of using `filter_text()`:
```r
# having a word matching either "european" or "commission"
res <- query_fetch(con) %>%
  filter_text(c("european", "commission")) %>%
  run()

# having a word starting with "euro"
res <- query_fetch(con, max = 20) %>%
  filter_text("euro*") %>%
  run()

# having a word that matches "euro?ean" (question mark is a single wildcard)
res <- query_fetch(con, max = 20) %>%
  filter_text("euro?ean") %>%
  run()

# having a word that sounds like "europese" (with default edit distance 2)
res <- query_fetch(con, max = 20) %>%
  filter_text("europese~") %>%
  run()

# having a word that sounds like "Europian" (with edit distance 1)
res <- query_fetch(con, max = 20) %>%
  filter_text("Europian~1") %>%
  run()
```
Two date filter functions are available, `filter_pubdate()` and `filter_indexdate()`, which filter on the publication date and indexed date, respectively.
These functions have arguments `from` and `to`, which take a "Date" or "POSIXct" object as input. If left unspecified, the filter will be open-ended on that side of the range. Instead of specifying a range, you can also specify `on` to get all articles published on a specific day.
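For instance, a sketch of using `on` to restrict results to articles published on a single day (the date is just an example):

```r
# All documents published on a specific day
res <- query_fetch(con) %>%
  filter_pubdate(on = as.Date("2020-09-10")) %>%
  run()
```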
The example below queries all documents in German that have category "CoronavirusInfection" that were published in the last two days:
```r
res <- query_fetch(con) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  run()
```
Note here that we are stringing multiple filter commands together to get to what we are looking for. You can string as many filters together as you wish, but note that if you apply the same filter twice, the second one will override the first.
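To make the override behavior concrete, here is a small sketch: the second `filter_language()` call replaces the first, so this fetch returns French documents only.

```r
# The second filter_language() overrides the first
res <- query_fetch(con) %>%
  filter_language("de") %>%
  filter_language("fr") %>%
  run()
```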
A few other filter functions are available.
filter_tonality()
`filter_tonality()` is a special range filter that takes numeric values. In addition to having `from` and `to` parameters that behave in the same way as the date range filters, it also has a `value` parameter which can be used to retrieve documents with a specific tonality value.
```r
res <- query_fetch(con) %>%
  filter_tonality(4) %>%
  run()

unlist(lapply(res, function(x) x$tonality))
#> [1] "4" "4" "4" "4" "4" "4" "4" "4" "4" "4"

res <- query_fetch(con) %>%
  filter_tonality(from = 4) %>%
  run()

unlist(lapply(res, function(x) x$tonality))
#> [1] "12" "12" "8"  "23" "6"  "14" "9"  "4"  "4"  "10"
```
Additional filters exist to filter on various IDs: `filter_entityid()`, `filter_georssid()`, and `filter_guid()`.
Below is an example of filtering on guid:
```r
res <- query_fetch(con, max = 5) %>%
  filter_guid("washtimes-*") %>%
  run()

unlist(lapply(res, function(x) x$guid))
#> [1] "washtimes-91e4808dcddb198e95b7583fdb162225"
#> [2] "washtimes-b1f724b70e57280dbd42ed82763fb341"
#> [3] "washtimes-3ddcef4e8631e60586391695c43a82ac"
#> [4] "washtimes-7a44cd4213b3a663664da786b1242a38"
#> [5] "washtimes-73daa638c1adafe89815c7c3f0c42e06"
```
Another operation available only for fetch queries is `sort_by()`, which allows you to specify fields to sort by as part of the fetch.

For example, here we run a query similar to one shown earlier, where we additionally sort the results by date of publication.
```r
res <- query_fetch(con) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  sort_by("pubdate", asc = FALSE) %>%
  run()

unlist(sapply(res, function(x) x$pubDate))
#>  [1] "2020-09-12T21:40+0000" "2020-09-12T21:40+0000" "2020-09-12T21:40+0000"
#>  [4] "2020-09-12T21:39+0000" "2020-09-12T21:35+0000" "2020-09-12T21:35+0000"
#>  [7] "2020-09-12T21:32+0000" "2020-09-12T21:32+0000" "2020-09-12T21:31+0000"
#> [10] "2020-09-12T21:30+0000"
```
To see what fields are available to sort on:
```r
queryable_fields()
```
An operation available only for fetch queries, `select_fields()`, allows us to specify which fields should be returned for each document. This is useful if documents contain some fields that are very large and we don't want to include them in our results.
Note that, to our knowledge, Finder does not provide a way to specify these fields at the time the query is run, so all fields are returned by the query and the unwanted fields are removed after being fetched. This means that there are no performance gains in terms of network transfer time, but there are gains in final file size.
To see which fields can be selected:
```r
selectable_fields()
```
Note that these do not exactly match the fields that we can filter on. What is represented in a document vs. what is exposed by Finder to query on are not necessarily the same thing.
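One way to see the difference is to compare the two helpers directly; this sketch assumes both return character vectors of field names.

```r
# Fields that can be selected but not queried on, and vice versa
setdiff(selectable_fields(), queryable_fields())
setdiff(queryable_fields(), selectable_fields())
```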
Suppose we have our same query as before, but we only want to pull out the document title and publication date:
```r
res <- query_fetch(con) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  sort_by("pubdate", asc = FALSE) %>%
  select_fields(c("title", "pubDate")) %>%
  run()
```
All of the results of queries above have used the default output format of "list", where the xml that is returned from the query is transformed to an R-friendly list as shown earlier.
If we instead specify that the output format is "file", a path is returned which we can read in or do something else with:
```r
res <- query_fetch(con, format = "file") %>%
  run()

res
#> [1] "/tmp/Rtmpf3ZFDa/file13cb810cbc6cb.xml"
```
If we specify the format to be "xml", then the raw xml is read into R using the xml2 package.
```r
res <- query_fetch(con, format = "xml") %>%
  run()

res
#> {xml_document}
#> <rss version="2.0" xmlns:emm="http://emm.jrc.it" xmlns:iso="http://www.iso.org/3166" xmlns:gphin="http://gphin.canada.ca">
#> [1] <channel>\n  <title/>\n  <pubDate>Sat, 12 Sep 2020 21:34:33 UTC</pubDate>\n < ...
```
Note that we can convert this to our list format with the following:
```r
res <- xml_to_list(res)
```
There is also a function, `list_to_df()`, to convert fetch results to a data frame. Note that in general a lot of information can be lost when doing this, such as attributes or entries that have more than one element.

Columns "title" and "description" often have two elements, with the second being the English translation. Because of this, columns "title", "title_en", "description", and "description_en" are created to preserve this. Columns "entity", "category", and "fullgeo" usually have more than one entry per document and as such are added to the data frame as "list columns".
```r
res <- query_fetch(con) %>% run()
res_df <- list_to_df(res)
res_df
#> # A tibble: 10 x 17
#>    title title_en link  description description_en contentType pubDate source
#>    <chr> <chr>    <chr> <chr>       <chr>          <chr>       <chr>   <chr>
#>  1 Dipl… "Diplom… http… "Dans sa r… In the headin… text/html   2020-0… Emmne…
#>  2 Séné… "Senega… http… "L’avocat … The senegales… text/html   2020-0… Emmne…
#>  3 La q… "Quaran… http… "Pour évit… To avoid this… text/html   2020-0… Emmne…
#>  4 Son … "His So… http… "Le consei… The national … text/html   2020-0… Emmne…
#>  5 Covi… "Covid … http… ".\n\n#Aut… . # other cou… text/html   2020-0… Emmne…
#>  6 Pers… "Africa… http… "#Autres p… # other count… text/html   2020-0… Emmne…
#>  7 Coro… "Corona… http… "Le corona… The Coronavir… text/html   2020-0… Emmne…
#>  8 اليو… "اليونا… http… "."         .              text/html   2020-0… cairo…
#>  9 ترام… "ترامب … http… "."         .              text/html   2020-0… cairo…
#> 10 Colo… "Colomb… http… "MADRID, 9… Madrid, 9 Sep… text/html   2020-0… europ…
#> # … with 9 more variables: language <chr>, guid <chr>, category <list>,
#> #   favicon <chr>, entity <list>, georss <chr>, fullgeo <list>, tonality <chr>,
#> #   text <chr>
```
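Since "category", "entity", and "fullgeo" come back as list columns, their per-document entries can be inspected by indexing into the column, for example (a sketch):

```r
# Categories attached to the first document in the data frame
res_df$category[[1]]
```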
A function `write_docs_csv()` also exists that will flatten the list-column variables into strings and write the result to a csv file:
```r
write_docs_csv(res_df, path = "/tmp/tmp.csv")
```
In the previous fetch examples, the returned object `res` has been a list representation of the document content of the query.

In many cases we may wish to do a bulk download of many articles. If we specify a `path` argument to `query_fetch()`, the results will be written in batches to the specified directory.
When running a potentially large query, it is good to first run it with `max = 0` so that no documents are returned. Queries that may involve a large number of documents but do not return any documents can run fairly quickly, and we can examine the results to find out how many documents match the query parameters and whether it would be a good idea to try to download them.
For example, with a query similar to what we specified earlier, suppose we want to get all documents in the last 5 days that are in German and have the "CoronavirusInfection" category.
```r
res <- query_fetch(con, max = 0) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 5) %>%
  sort_by("pubdate", asc = FALSE) %>%
  run()

n_docs(res)
#> [1] 14032
```
We see that there are about 14k documents, which is not a lot to try to download.
Now, to get all of these documents with pagination, we can set `max = -1` and provide a directory to write the output to:
```r
tf <- tempfile()
dir.create(tf)

res <- query_fetch(con, max = -1, format = "file", path = tf) %>%
  filter_language("de") %>%
  filter_category("CoronavirusInfection") %>%
  filter_pubdate(from = as.Date(Sys.time()) - 5) %>%
  # select_fields("text") %>%
  sort_by("pubdate", asc = FALSE) %>%
  run()
#> 10000 documents fetched (71%)...
#> 14032 documents fetched (100%)...

list.files(tf)
#> [1] "out0001.xml" "out0002.xml"
```
The ~14k documents were downloaded over two files located in the output directory we specified.
Note that since 14k documents isn't that large, we may want to read them all into memory in R in the more convenient list format:
```r
xml <- lapply(list.files(tf, full.names = TRUE), xml2::read_xml)
docs <- unlist(lapply(xml, xml_to_list), recursive = FALSE)
length(docs)
#> [1] 14032
```
Facet queries are constructed using the following functions:

- `query_facet()`
- `facet_by()`
- `facet_date_range()`
To initiate a facet query, we use the function `query_facet()` and pass it our connection object.
```r
query <- query_facet(con)
```
Similarly to fetch queries, we can call `get_query()` or `run()` to print the query string or run the query.
Suppose we want to tabulate the frequency of all of the categories in the documents over the last two days. We can do this by adding `facet_by()` to our query, specifying the field name "category".
```r
query_facet(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  facet_by("category") %>%
  get_query()
#> [1] "op=search&q=*:*&rows=0&facet=true&native=true&facet.field=category&facet.limit=-1&facet.sort=count&facet.mincount=0&facet.offset=0&fq=pubdate%3A%5B2020-09-10T00%3A00%3A00Z%20TO%20%2A%5D"
```
Here we see the facet query string that is constructed by this query specification.
To run the query, we use `run()` as we did with fetch queries. Here, we additionally specify in `facet_by()` that we only want the top 10 categories.
```r
res <- query_facet(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  facet_by("category", limit = 10) %>%
  run()

res
#> # A tibble: 10 x 2
#>    category                             n
#>    <chr>                            <dbl>
#>  1 fifa2018participatingcountries 458411
#>  2 euro                           452639
#>  3 coronavirusinfection           209279
#>  4 paho                           208024
#>  5 wpro                           115488
#>  6 radnucnonelist                 109946
#>  7 emro                            94203
#>  8 usa                             91441
#>  9 italy                           61976
#> 10 implantrisks                    51553
```
Unlike fetch queries, facet queries always return a data frame.
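Because the result is an ordinary data frame, it can be passed straight to standard R tooling. As a minimal sketch using base graphics on the top-10 facet computed above:

```r
# Quick bar chart of the top category counts from the facet query above
barplot(res$n, names.arg = res$category, las = 2, cex.names = 0.7)
```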
It is possible to facet on up to two fields.
```r
res <- query_facet(con) %>%
  filter_pubdate(from = as.Date(Sys.time()) - 2) %>%
  facet_by(c("category", "language"), limit = 100) %>%
  run()

res
#> # A tibble: 4,403 x 3
#>    category language     n
#>    <chr>    <chr>    <dbl>
#>  1 euro     en       55307
#>  2 euro     it       47138
#>  3 euro     es       39303
#>  4 euro     el       35447
#>  5 euro     ru       33457
#>  6 euro     de       32450
#>  7 euro     fr       21507
#>  8 euro     tr       14694
#>  9 euro     pt       13721
#> 10 euro     ar       13421
#> # … with 4,393 more rows
```
To see what fields are available to facet on:
```r
facetable_fields()
```
A faceting function, `facet_date_range()`, is available to specify faceting on dates.
The main parameters for this function are:

- `field`: either "pubdate" (default) or "indexdate"
- `start`, `end`: the start and end dates to facet over, specified as "Date" objects
- `gap`: the bin size to facet over, specified using the `range_gap()` function as illustrated in the examples below

To facet daily by publication date for all "CoronavirusInfection" articles over the past 20 days:
```r
res <- query_facet(con) %>%
  filter_category("coronavirusinfection") %>%
  facet_date_range(
    start = as.Date(Sys.time()) - 20,
    end = as.Date(Sys.time()),
    gap = range_gap(1, "DAY")
  ) %>%
  run()
#> # A tibble: 20 x 2
#>    pubdate                 n
#>    <dttm>              <dbl>
#>  1 2020-08-23 00:00:00 54439
#>  2 2020-08-24 00:00:00 79749
#>  3 2020-08-25 00:00:00 87616
#>  4 2020-08-26 00:00:00 82739
#>  5 2020-08-27 00:00:00 82490
#>  6 2020-08-28 00:00:00 79061
#>  7 2020-08-29 00:00:00 54291
#>  8 2020-08-30 00:00:00 50336
#>  9 2020-08-31 00:00:00 76238
#> 10 2020-09-01 00:00:00 84864
#> 11 2020-09-02 00:00:00 86083
#> 12 2020-09-03 00:00:00 85665
#> 13 2020-09-04 00:00:00 81934
#> 14 2020-09-05 00:00:00 57294
#> 15 2020-09-06 00:00:00 50430
#> 16 2020-09-07 00:00:00 77083
#> 17 2020-09-08 00:00:00 85925
#> 18 2020-09-09 00:00:00 90019
#> 19 2020-09-10 00:00:00 75486
#> 20 2020-09-11 00:00:00 76962
```
The `range_gap()` function takes two parameters: the number of time units the facet window gap should span, and the time unit, which is one of "DAY" (default), "MINUTE", "HOUR", "WEEK", "MONTH", or "YEAR".
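For example, a weekly bin could be specified as follows (a sketch based on the parameters described above):

```r
# Facet in one-week bins
range_gap(1, "WEEK")
# or, using day units, in 7-day bins
range_gap(7, "DAY")
```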
Here we run the same facet but hourly:
```r
res <- query_facet(con) %>%
  filter_category("coronavirusinfection") %>%
  facet_date_range(
    start = as.Date(Sys.time()) - 20,
    end = as.Date(Sys.time()),
    gap = range_gap(1, "HOUR")
  ) %>%
  run()
#> # A tibble: 480 x 2
#>    pubdate                 n
#>    <dttm>              <dbl>
#>  1 2020-08-23 00:00:00  1043
#>  2 2020-08-23 00:00:00  1131
#>  3 2020-08-23 00:00:00  1186
#>  4 2020-08-23 00:00:00  1088
#>  5 2020-08-23 00:00:00  2566
#>  6 2020-08-23 00:00:00  2276
#>  7 2020-08-23 00:00:00  2344
#>  8 2020-08-23 00:00:00  2075
#>  9 2020-08-23 00:00:00  2915
#> 10 2020-08-23 00:00:00  2491
#> # … with 470 more rows
```
There are many ways queries can be constructed with Solr/Finder. The functions for fetching and faceting provided above are meant to cover the vast majority of use cases, but their simplified API might not allow for some very special cases. If one is very familiar with Finder/Solr and wants to use this package to execute their own custom queries, there is a simple mechanism for doing this:
```r
# query_str() allows you to run a query that you have already constructed
# a query string for

# default is to return a list
res <- query_str(con, "op=search&q=*:*&rows=0") %>% run()

str(res$rss$channel, 1)
#> List of 8
#>  $ title   : list()
#>  $ pubDate :List of 1
#>  $ q       :List of 1
#>  $ message : list()
#>  $ QTime   :List of 1
#>  $ numFound:List of 1
#>  $ start   :List of 1
#>  $ rows    :List of 1

# to return in xml format
res <- query_str(con, "op=search&q=*:*&rows=0", format = "xml") %>% run()

res
#> {xml_document}
#> <rss version="2.0" xmlns:emm="http://emm.jrc.it" xmlns:iso="http://www.iso.org/3166" xmlns:gphin="http://gphin.canada.ca">
#> [1] <channel>\n  <title/>\n  <pubDate>Sat, 12 Sep 2020 22:01:54 UTC</pubDate>\n < ...
```
This package is experimental and has not undergone rigorous testing to verify the correctness of the constructed queries. Use at your own risk.
The package has been written to cover a large number of immediate use cases. However, there are many additional features and parameters of Finder/Solr that could be exposed through this interface in the future.