stewid/xapr: Access the Xapian search engine from R

Introduction

xapr is an R package that provides an interface to the Xapian search engine, supporting both indexing and retrieval. A good introduction to Xapian is the guide Getting Started with Xapian.

Indexing

Index the content of a data.frame as documents with the Xapian search engine. A document is the data returned from a search.

The index plan is specified symbolically. An index plan has the form data ~ terms, where data is the blob of data returned from a search and terms are the basis for a search in Xapian. A first-order term indexes the text in the column as free text. A specification of the form prefix:term indicates that the text in the column term should be indexed with the prefix prefix.

The prefix is a short string at the beginning of the term that indicates which field the term indexes. Valid prefixes are: 'A', 'D', 'E', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y' and 'Z'. See http://xapian.org/docs/omega/termprefixes for a list of conventional prefixes.

The specification prefix*term is the same as term + prefix:term. The prefix X creates a user-defined prefix by appending the uppercase term to X. The prefix Q uses the data in the term column as a unique identifier for the document. NA values in indexed columns are skipped.
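The `*` shorthand resembles the way base R expands formula operators, which can be checked without xapr installed. The sketch below uses only base R. `user_prefix` is a hypothetical helper, not part of xapr, and whether xapr relies on `terms()` internally is an assumption.

```r
## Base R expands S*TITLE into the main effects plus the S:TITLE
## interaction. The text describes S*TITLE as TITLE + S:TITLE, so the
## bare prefix symbol S is presumably not treated as a column
## (assumption).
plan <- ~ S*TITLE
term_labels <- attr(terms(plan), "term.labels")
print(term_labels)

## Hypothetical helper (not part of xapr): the user-defined prefix the
## text describes for X, i.e. "X" followed by the uppercase column name.
user_prefix <- function(column) paste0("X", toupper(column))
print(user_prefix("description"))
```

The second result matches the prefix name used for the DESCRIPTION column in the multi-prefix search in the example below.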

If the plan has no response, e.g. ~ term + prefix:term, the row number is written as data to the document.

The specification ~X*. indexes every column both as free text and with a user-defined prefix.

If the response contains one or more columns, e.g. col_1 + col_2 ~ X*., the response is first converted to JSON. A compact way to convert all fields to JSON is . ~ terms. It is also possible to drop response fields, e.g. . - col_1 - col_2 ~ X*. includes all fields in the response except col_1 and col_2.
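The plan forms described above can be collected side by side. This is a minimal sketch, not a reference: the data.frame `df` and its columns are hypothetical, the call `xindex(plan, data, path)` follows the usage in the example below, and indexing only runs when xapr is installed.

```r
## Hypothetical toy data; NA values in indexed columns are skipped.
df <- data.frame(id = c("a1", "a2"),
                 title = c("Pocket watch", "Sundial"),
                 note = c("leather case", NA),
                 stringsAsFactors = FALSE)

## The index plan forms from the text, written as R formulas.
plans <- list(
    row_number_as_data = ~ title + S:title,   # no response
    all_columns        = ~ X*.,               # prefix terms + free text
    two_fields_as_json = id + title ~ X*.,    # response converted to JSON
    all_fields_as_json = . ~ S*title + Q:id,  # compact form for all fields
    drop_one_field     = . - note ~ X*.       # all fields except 'note'
)

## Only index when xapr is installed; each plan gets its own database
## in a fresh temporary directory.
if (requireNamespace("xapr", quietly = TRUE)) {
    for (plan in plans) {
        path <- tempfile(pattern = "xapr-")
        dir.create(path)
        xapr::xindex(plan, df, path)
    }
}
```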

Example

This is an R version of the Python example in Getting Started with Xapian.

We are going to build a simple search system based on museum catalogue data released under the Creative Commons Attribution-NonCommercial-ShareAlike license (http://creativecommons.org/licenses/by-nc-sa/3.0/) by the Science Museum in London, UK (http://api.sciencemuseum.org.uk/documentation/collections/).

library(xapr)

## The first 100 rows of the museum catalogue data are distributed
## with the 'xapr' package
filename <- system.file("extdata/NMSI_100.csv", package="xapr")
nmsi <- read.csv(filename, as.is = TRUE)

## Create a temporary directory to hold the database
path <- tempfile(pattern="xapr-")
dir.create(path)

## Index the 'TITLE' and 'DESCRIPTION' fields both with a suitable
## prefix and without a prefix for general search. Use 'id_NUMBER'
## as the unique identifier. Store all the fields as JSON for display
## purposes.
db <- xindex(. ~ S*TITLE + X*DESCRIPTION + Q:id_NUMBER, nmsi, path)

## Display a summary of the Xapian database
summary(db)

## Run a search and display docid (rowname) and TITLE from each match
xsearch(db, "watch", TITLE ~ .)

## Run a search with multiple words
xsearch(db, "Dent watch", TITLE ~ .)

## Run a search with prefix
xsearch(db, "title:sunwatch", TITLE ~ title:S)

## Run a search with multiple prefixes
xsearch(db,
        "description:\"leather case\" AND title:sundial",
        TITLE ~ title:S + description:XDESCRIPTION)

Installation

The development files for the Xapian search engine must be installed first, for example on Debian or Ubuntu:

$ sudo apt-get install libxapian-dev

To install the development version of xapr, it's easiest to use the devtools package:

# install.packages("devtools")
library(devtools)
install_github("stewid/xapr")

Alternatively, use git and make:

$ git clone https://github.com/stewid/xapr.git
$ cd xapr
$ make install

NOTE: The package is in a very early development phase. Functions and documentation may be incomplete and are subject to change. Suggestions, bug reports, forks and pull requests are appreciated. Get in touch.