solrium


Project Status: Active – The project has reached a stable, usable state and is being actively developed.

A general purpose R interface to Solr

Development now targets Solr v7 and greater, which introduced many changes; many functions here may not work with Solr installations older than v7.

Be aware that some functions only work in certain Solr modes, e.g., collection_create() won't work when you are not in SolrCloud mode. You should, however, get an error message stating as much.

Currently developing against Solr v8.2.0

Package API and ways of using the package

The first thing to look at is SolrClient to instantiate a client connection to your Solr instance. ping and schema are helpful functions to look at after instantiating your client.
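As a minimal sketch (assuming a local Solr install with a core named gettingstarted):

```r
library("solrium")

# connect to a local Solr instance (default host/port)
conn <- SolrClient$new()

# check that the core is up
conn$ping(name = "gettingstarted")

# inspect its schema
conn$schema(name = "gettingstarted")
```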

There are two ways to use solrium:

  1. Call functions on the SolrClient object
  2. Pass the SolrClient object to functions

For example, if we instantiate a client with conn <- SolrClient$new(), then the first way looks like conn$search(...) and the second like solr_search(conn, ...). These two approaches hopefully make the package friendly to more people: those who prefer an object-oriented style and those who prefer a functional one.
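The two styles side by side, as a sketch (assuming a local Solr instance with a core named gettingstarted):

```r
conn <- SolrClient$new()

# 1. call methods on the client object
conn$search(name = "gettingstarted", params = list(q = "*:*", rows = 2))

# 2. pass the client to standalone functions
solr_search(conn, name = "gettingstarted", params = list(q = "*:*", rows = 2))
```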

Collections

Functions that start with collection work with Solr collections when in SolrCloud mode. Note that these functions won't work when in Solr standalone mode.
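For example (a sketch, assuming a SolrCloud instance running on the default local host/port):

```r
conn <- SolrClient$new()

# list existing collections
conn$collection_list()

# check whether a given collection exists
conn$collection_exists(name = "gettingstarted")
```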

Cores

Functions that start with core work with Solr cores when in Solr standalone mode. Note that these functions won't work when in SolrCloud mode.
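For example (a sketch, assuming a standalone Solr instance on the default local host/port):

```r
conn <- SolrClient$new()

# status of all cores
conn$core_status()

# check whether a given core exists
conn$core_exists(name = "gettingstarted")
```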

Documents

The following functions work with documents in Solr:

#>  - add
#>  - delete_by_id
#>  - delete_by_query
#>  - update_atomic_json
#>  - update_atomic_xml
#>  - update_csv
#>  - update_json
#>  - update_xml

Search

Search functions, including solr_parse, which appropriately parses results from the different search functions:

#>  - solr_all
#>  - solr_facet
#>  - solr_get
#>  - solr_group
#>  - solr_highlight
#>  - solr_mlt
#>  - solr_parse
#>  - solr_search
#>  - solr_stats

Install

Stable version from CRAN

install.packages("solrium")

Or development version from GitHub

remotes::install_github("ropensci/solrium")
library("solrium")

Setup

Use SolrClient$new() to initialize your connection. These examples use a remote Solr server, but work on any local Solr server.

(cli <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL))

You can also set whether you want simple or detailed error messages (via errors), whether you want the URL used printed for each function call (via verbose), and your proxy settings (via proxy) if needed. For example:

SolrClient$new(errors = "complete")
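And a sketch of a proxied connection; the proxy fields shown here are an assumption, so check ?SolrClient for the exact ones your version supports:

```r
cli <- SolrClient$new(
  errors = "complete",
  proxy = list(url = "http://proxy.example.com:8080")  # hypothetical proxy
)
```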

Your settings are printed by the print method for the connection object:

cli

For local Solr server setup:

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted example/exampledocs/*.xml

Search

(res <- cli$search(params = list(q='*:*', rows=2, fl='id')))

And you can get search metadata from the attributes:

attributes(res)

Search grouped data

Most recent publication by journal

cli$group(params = list(q='*:*', group.field='journal', rows=5, group.limit=1,
                        group.sort='publication_date desc',
                        fl='publication_date, score'))

First publication by journal

cli$group(params = list(q = '*:*', group.field = 'journal', group.limit = 1,
                        group.sort = 'publication_date asc',
                        fl = c('publication_date', 'score'),
                        fq = "publication_date:[1900-01-01T00:00:00Z TO *]"))

Search group query : Last 3 publications of 2013.

gq <- 'publication_date:[2013-01-01T00:00:00Z TO 2013-12-31T00:00:00Z]'
cli$group(
  params = list(q='*:*', group.query = gq,
                group.limit = 3, group.sort = 'publication_date desc',
                fl = 'publication_date'))

Search group with format simple

cli$group(params = list(q='*:*', group.field='journal', rows=5,
                        group.limit=3, group.sort='publication_date desc',
                        group.format='simple', fl='journal, publication_date'))

Facet

cli$facet(params = list(q='*:*', facet.field='journal', facet.query=c('cell', 'bird')))

Highlight

cli$highlight(params = list(q='alcohol', hl.fl = 'abstract', rows=2))

Stats

out <- cli$stats(params = list(q='ecology', stats.field=c('counter_total_all','alm_twitterCount'), stats.facet='journal'))
out$data

More like this

solr_mlt is a function to return documents similar to the one you search on

out <- cli$mlt(params = list(q='title:"ecology" AND body:"cell"', mlt.fl='title', mlt.mindf=1, mlt.mintf=1, fl='counter_total_all', rows=5))
out$docs
out$mlt

Parsing

solr_parse is a general purpose parser function with extension methods solr_parse.sr_search, solr_parse.sr_facet, and solr_parse.sr_high, for parsing solr_search, solr_facet, and solr_highlight function output, respectively. solr_parse is used internally within those three functions to do parsing. You can optionally get back raw JSON or XML from solr_search, solr_facet, and solr_highlight by setting raw=TRUE, and then parse after the fact with solr_parse. All you need to know is that solr_parse can parse the raw output from any of them.

For example:

(out <- cli$highlight(params = list(q='alcohol', hl.fl = 'abstract', rows=2),
                      raw=TRUE))

Then parse

solr_parse(out, 'df')

Progress bars

Progress bars are only supported in the core search methods: search, facet, group, mlt, stats, high, all.

library(httr)
invisible(cli$search(params = list(q='*:*', rows=100, fl='id'), progress = httr::progress()))
|==============================================| 100%

Advanced: Function Queries

Function queries allow you to query on actual numeric fields in the Solr database, and do addition, multiplication, etc. on one or many fields to sort results. For example, here we search on the product of counter_total_all and alm_twitterCount, using a new temporary field "val":

cli$search(params = list(q='_val_:"product(counter_total_all,alm_twitterCount)"',
  rows=5, fl='id,title', fq='doc_type:full'))

Here, we search for the papers with the most citations

cli$search(params = list(q='_val_:"max(counter_total_all)"',
    rows=5, fl='id,counter_total_all', fq='doc_type:full'))

Or with the most tweets

cli$search(params = list(q='_val_:"max(alm_twitterCount)"',
    rows=5, fl='id,alm_twitterCount', fq='doc_type:full'))

Using specific data sources

USGS BISON service

The occurrences service

conn <- SolrClient$new(scheme = "https", host = "bison.usgs.gov", path = "solr/occurrences/select", port = NULL)
conn$search(params = list(q = '*:*', fl = c('decimalLatitude','decimalLongitude','scientificName'), rows = 2))

The species names service

conn <- SolrClient$new(scheme = "https", host = "bison.usgs.gov", path = "solr/scientificName/select", port = NULL)
conn$search(params = list(q = '*:*'))

PLOS Search API

Most of the examples above use the PLOS search API... :)

Solr server management

This isn't as complete as the search functions shown above, but we're getting there.

Cores

conn <- SolrClient$new()

Many functions, e.g.:

Create a core

conn$core_create(name = "foo_bar")

Collections

Many functions, e.g.:

Create a collection

conn$collection_create(name = "hello_world")

Add documents

Add documents, with support for adding from files (JSON, XML, or CSV format) and from R objects (including data.frame and list types so far):

df <- data.frame(id = c(67, 68), price = c(1000, 500000000))
conn$add(df, name = "books")
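Since add() also accepts lists, the same kind of documents could be added as (a sketch, assuming the books core exists):

```r
docs <- list(
  list(id = 69, price = 100),
  list(id = 70, price = 500)
)
conn$add(docs, name = "books")
```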

Delete documents, by id

conn$delete_by_id(name = "books", ids = c(3, 4))

Or by query

conn$delete_by_query(name = "books", query = "manu:bank")

Meta



ropensci/solrium documentation built on Sept. 12, 2022, 3:01 p.m.