xapr
is an R package that provides an interface to the
Xapian search engine from R, allowing both
indexing and retrieval operations. A great introduction to
Xapian is the
Getting Started with Xapian.
Index the content of a data.frame
to documents with the Xapian
search engine A document
is the data returned from a search.
The index plan is specified symbolically. An index plan has the form
data ~ terms
where data
is the blob of data returned from a search
and the terms
are the basis for a search in Xapian. A first order
term index the text in the column as free text. A specification of the
form prefix:term
indicates that the text in term
should be
indexed with the prefix prefix
.
The prefix is a short string at the beginning of the term to indicate which field the term indexes. Valid prefixes are: 'A' ,'D', 'E', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y' and 'Z'. See http://xapian.org/docs/omega/termprefixes for a list of conventional prefixes.
The specification prefix*term
is the same as term +
prefix:term
. The prefix X
will create a user defined prefix by
appending the uppercase term
to X
. The prefix Q
will use data
in the term
column as a unique identifier for the document. NA
values in indexed columns are skipped.
No response e.g. ~ term + prefix:term
writes the row number as
data to the document.
The specification ~X*.
creates prefix terms with all columns plus
free text.
If the response contains one or more columns, e.g. col_1 + col_2 ~
X*.
the response is first converted to JSON
. A compact form to
convert all fields to JSON
is to use . ~ terms
. It is also
possible to drop response fields e.g. . - col_1 - col_2 ~ X*.
to
include all fields in the response except col_1
and col_2
.
This is an R
version of the Python
example in the
Getting Started with Xapian
We are going to build a simple search system based on museum catalogue data released under the Creative Commons Attribution-NonCommercial- ShareAlike license (http://creativecommons.org/licenses/by-nc-sa/3.0/) by the Science Museum in London, UK. (http://api.sciencemuseum.org.uk/documentation/collections/)
library(xapr) ## The first 100 rows of the museum catalogue data is distributed with ## the 'xapr' package filename <- system.file("extdata/NMSI_100.csv", package="xapr") nmsi <- read.csv(filename, as.is = TRUE) ## Create a temporary directory to hold the database path <- tempfile(pattern="xapr-") dir.create(path) ## Index the 'TITLE' and 'DESCRIPTION' fields with both a suitable ## prefix and without a prefix for general search. Use the 'id_NUMBER' ## as unique identifier. Store all the fields as JSON for display ## purposes. db <- xindex(. ~ S*TITLE + X*DESCRIPTION + Q:id_NUMBER, nmsi, path) ## Display a summary of the Xapian database summary(db) ## Run a search and display docid (rowname) and TITLE from each match xsearch(db, "watch", TITLE ~ .) ## Run a search with multiple words xsearch(db, "Dent watch", TITLE ~ .) ## Run a search with prefix xsearch(db, "title:sunwatch", TITLE ~ title:S) ## Run a search with multiple prefixes xsearch(db, "description:\"leather case\" AND title:sundial", TITLE ~ title:S + description:XDESCRIPTION)
The development files for the Xapian
search engine must be
installed.
$ sudo apt-get install libxapian-dev
To install the development version of xapr
, it's easiest to use the
devtools package:
# install.packages("devtools") library(devtools) install_github("stewid/xapr")
Another alternative is to use git
and make
$ git clone https://github.com/stewid/xapr.git
$ cd xapr
$ make install
NOTE: The package is in a very early development phase. Functions and documentation may be incomplete and subject to change. Suggestions, bugs, forks and pull requests are appreciated. Get in touch.
The xapr
package is licensed under the GPL (>= 2). See these files
for additional details:
xapr
package licenseAdd the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.