Build Status

Introduction

xapr is an R package that provides an interface to the Xapian search engine from R, allowing both indexing and retrieval operations. A great introduction to Xapian is the Getting Started with Xapian.

Indexing

Index the content of a data.frame to documents with the Xapian search engine A document is the data returned from a search.

The index plan is specified symbolically. An index plan has the form data ~ terms where data is the blob of data returned from a search and the terms are the basis for a search in Xapian. A first order term index the text in the column as free text. A specification of the form prefix:term indicates that the text in term should be indexed with the prefix prefix.

The prefix is a short string at the beginning of the term to indicate which field the term indexes. Valid prefixes are: 'A' ,'D', 'E', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y' and 'Z'. See http://xapian.org/docs/omega/termprefixes for a list of conventional prefixes.

The specification prefix*term is the same as term + prefix:term. The prefix X will create a user defined prefix by appending the uppercase term to X. The prefix Q will use data in the term column as a unique identifier for the document. NA values in indexed columns are skipped.

No response e.g. ~ term + prefix:term writes the row number as data to the document.

The specification ~X*. creates prefix terms with all columns plus free text.

If the response contains one or more columns, e.g. col_1 + col_2 ~ X*. the response is first converted to JSON. A compact form to convert all fields to JSON is to use . ~ terms. It is also possible to drop response fields e.g. . - col_1 - col_2 ~ X*. to include all fields in the response except col_1 and col_2.

Example

This is an R version of the Python example in the Getting Started with Xapian

We are going to build a simple search system based on museum catalogue data released under the Creative Commons Attribution-NonCommercial- ShareAlike license (http://creativecommons.org/licenses/by-nc-sa/3.0/) by the Science Museum in London, UK. (http://api.sciencemuseum.org.uk/documentation/collections/)

library(xapr)

## The first 100 rows of the museum catalogue data is distributed with
## the 'xapr' package
filename <- system.file("extdata/NMSI_100.csv", package="xapr")
nmsi <- read.csv(filename, as.is = TRUE)

## Create a temporary directory to hold the database
path <- tempfile(pattern="xapr-")
dir.create(path)

## Index the 'TITLE' and 'DESCRIPTION' fields with both a suitable
## prefix and without a prefix for general search. Use the 'id_NUMBER'
## as unique identifier. Store all the fields as JSON for display
## purposes.
db <- xindex(. ~ S*TITLE + X*DESCRIPTION + Q:id_NUMBER, nmsi, path)

## Display a summary of the Xapian database
summary(db)

## Run a search and display docid (rowname) and TITLE from each match
xsearch(db, "watch", TITLE ~ .)

## Run a search with multiple words
xsearch(db, "Dent watch", TITLE ~ .)

## Run a search with prefix
xsearch(db, "title:sunwatch", TITLE ~ title:S)

## Run a search with multiple prefixes
xsearch(db,
        "description:\"leather case\" AND title:sundial",
        TITLE ~ title:S + description:XDESCRIPTION)

Installation

The development files for the Xapian search engine must be installed.

$ sudo apt-get install libxapian-dev

To install the development version of xapr, it's easiest to use the devtools package:

# install.packages("devtools")
library(devtools)
install_github("stewid/xapr")

Another alternative is to use git and make

$ git clone https://github.com/stewid/xapr.git
$ cd xapr
$ make install

NOTE: The package is in a very early development phase. Functions and documentation may be incomplete and subject to change. Suggestions, bugs, forks and pull requests are appreciated. Get in touch.

License

The xapr package is licensed under the GPL (>= 2). See these files for additional details:



stewid/xapr documentation built on April 19, 2021, 2:03 p.m.