xindex: Index
In stewid/xapr: Access the Xapian search engine from R

Index the content of a data.frame with the Xapian search engine.

xindex(formula, data, path, language = c("none", "english", "en", "danish",
  "da", "dutch", "nl", "english_lovins", "lovins", "english_porter", "porter",
  "finnish", "fi", "french", "fr", "german", "de", "german2", "hungarian", "hu",
  "italian", "it", "kraaij_pohlmann", "norwegian", "nb", "nn", "no",
  "portuguese", "pt", "romanian", "ro", "russian", "ru", "spanish", "es",
  "swedish", "sv", "turkish", "tr"))

`formula`	A formula with a symbolic description of the index plan for the columns in the data.frame. The details of the index plan specification are given under 'Details'.
`data`	The `data.frame` to index.
`path`	A character vector specifying the path to a Xapian database. If a database already exists in the specified directory, it is opened. Otherwise, Xapian tries to create a new, empty database there.
`language`	Either the English name of the language or its two-letter ISO 639 code. The default is 'none'.

The index plan for 'xindex' is specified symbolically. An index plan has the form 'data ~ terms', where 'data' is the blob of data returned from a request and 'terms' are the basis for a search in Xapian. A first-order term indexes the text in the column as free text. A specification of the form 'first:second' indicates that the text in 'second' should be indexed with prefix 'first'.

The prefix is a short string at the beginning of the term to indicate which field the term indexes. Valid prefixes are: 'A', 'D', 'E', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y' and 'Z'. See http://xapian.org/docs/omega/termprefixes for a list of conventional prefixes.

The specification 'first*second' is the same as 'second + first:second'. The prefix 'X' creates a user-defined prefix by appending the upper-cased 'second' to 'X'. The prefix 'Q' uses the data in the 'second' column as a unique identifier for the document. NA values in columns to be indexed are skipped.
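The prefix rules above can be sketched in a few lines of base R. `term_prefix` is a hypothetical helper written for illustration only; it is not part of xapr and makes no claim about how the package builds its terms internally.

```r
## Hypothetical helper, not part of xapr: returns the term prefix that the
## rules above imply for a 'first:second' specification.
term_prefix <- function(first, second) {
    valid <- c("A", "D", "E", "G", "H", "I", "K", "L", "M", "N", "O", "P",
               "Q", "R", "S", "T", "U", "V", "X", "Y", "Z")
    stopifnot(first %in% valid)
    ## 'X' is extended with the upper-cased column name; every other
    ## prefix is used as it is.
    if (first == "X") paste0("X", toupper(second)) else first
}

term_prefix("X", "description")  # "XDESCRIPTION"
term_prefix("S", "TITLE")        # "S"
```

This agrees with the search example in this page, where the 'X*DESCRIPTION' term is addressed as 'XDESCRIPTION'.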

If the formula has no response, e.g. '~ second + first:second', the row number is written as data to the document.

The specification '~X*.' creates prefixed terms for all columns and also indexes all columns as free text.

If the response contains one or more columns, e.g. 'col_1 + col_2 ~ X*.', the response is first converted to 'JSON'. A compact way to convert all fields to 'JSON' and to enable free text search on all fields is '.~.'. It is also possible to drop response fields: '. - col_1 - col_2 ~ X*.' includes all fields in the response except 'col_1' and 'col_2'.
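As a rough sketch, the index plans described above can be written down side by side. The `books` data.frame and its column names are made up for illustration. The `xindex` calls only run if xapr is installed, and they follow the pattern used in the examples of this page (one temporary directory per database).

```r
## Toy data; the column names are hypothetical.
books <- data.frame(id    = c("b1", "b2"),
                    title = c("Pocket sundial", "Brass watch"),
                    note  = c("leather case", NA),
                    stringsAsFactors = FALSE)

## Formulas are ordinary R objects, so the plans can be written down
## without xapr being loaded.
plans <- list(
    row_number = ~ title + S:title,  # no response: row number stored as data
    all_prefix = ~ X*.,              # prefixed terms for all columns plus free text
    two_fields = id + title ~ X*.,   # 'id' and 'title' stored as JSON
    everything = . ~ .,              # all fields as JSON, all as free text
    drop_note  = . - note ~ X*.      # all fields except 'note' in the response
)

## Build one database per plan, only if xapr is available.
if (requireNamespace("xapr", quietly = TRUE)) {
    for (plan in plans) {
        path <- tempfile(pattern = "xapr-")
        dir.create(path)
        db <- xapr::xindex(plan, books, path)
    }
}
```

The NA in the 'note' column is there on purpose: according to the rule above, it is skipped during indexing.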

A xapian_database object.

## Not run: 
## This example is borrowed from "Getting Started with Xapian"
## http://getting-started-with-xapian.readthedocs.org/en/latest/index.html
## where the example is implemented in Python.
##
## We are going to build a simple search system based on museum catalogue
## data released under the Creative Commons Attribution-NonCommercial-
## ShareAlike license (http://creativecommons.org/licenses/by-nc-sa/3.0/)
## by the Science Museum in London, UK.
## (http://api.sciencemuseum.org.uk/documentation/collections/)

## The first 100 rows of the museum catalogue data are distributed with
## the 'xapr' package
filename <- system.file("extdata/NMSI_100.csv", package="xapr")
nmsi <- read.csv(filename, as.is = TRUE, na.strings="")

## Create a temporary directory to hold the database
path <- tempfile(pattern="xapr-")
dir.create(path)

## Index the 'TITLE' and 'DESCRIPTION' fields both with a suitable
## prefix and without a prefix for general search. Use 'id_NUMBER'
## as the unique identifier. Store all the fields as JSON for display
## purposes.
db <- xindex(. ~ S*TITLE + X*DESCRIPTION + Q:id_NUMBER, nmsi, path)

## Display a summary of the Xapian database
summary(db)

## Run a search and display docid (rowname) and TITLE from each match
xsearch(db, "watch", TITLE ~ .)

## Run a search with multiple words
xsearch(db, "Dent watch", TITLE ~ .)

## Run a search with prefix
xsearch(db, "title:sunwatch", TITLE ~ title:S)

## Run a search with multiple prefixes
xsearch(db,
        "description:\"leather case\" AND title:sundial",
        TITLE ~ title:S + description:XDESCRIPTION)

## End(Not run)