Functions for querying a Cassandra database

Description

RC.use selects the keyspace (aka database) to use for all subsequent operations. All functions described below require keyspace to be set using this function.

RC.get queries one key and a fixed list of columns

RC.get.range queries one key and a range of columns

RC.mget.range queries multiple keys and a range of columns

RC.get.range.slices queries a range of keys (or tokens) and a range of columns

RC.consistency sets the desired consistency level for all query operations

Usage

RC.use(conn, keyspace, cache.def = TRUE)
RC.get(conn, c.family, key, c.names,
       comparator = NULL, validator = NULL)
RC.get.range(conn, c.family, key, first = "", last = "",
             reverse = FALSE, limit = 1e+07,
             comparator = NULL, validator = NULL)
RC.mget.range(conn, c.family, keys, first = "", last = "",
              reverse = FALSE, limit = 1e+07,
              comparator = NULL, validator = NULL)
RC.get.range.slices(conn, c.family, k.start = "", k.end = "",
                    first = "", last = "", reverse = FALSE,
                    limit = 1e+07, k.limit = 1e+07,
                    tokens = FALSE, fixed = FALSE,
                    comparator = NULL, validator = NULL)
RC.consistency(conn, level = c("one", "quorum", "local.quorum",
               "each.quorum", "all", "any", "two", "three")) 

Arguments

conn

connection handle as returned by RC.connect

keyspace

name of the keyspace to use

cache.def

if TRUE then in addition to setting the keyspace a query on the keyspace definition is sent and the result cached. This allows automatic detection of comparators and validators, see details section for more information.

c.family

column family (aka table) name

key

row key

c.names

vector of column names

comparator

string, type of the column keys (comparator in Cassandra speak) or NULL to rely on cached schema definitions

validator

string, type of the values (validator in Cassandra speak) or NULL to rely on cached schema definitions

first

starting column name

last

ending column name

reverse

if TRUE the result is returned in reverse order

limit

return at most this many columns per key

keys

row keys (character vector)

k.start

start key (or token)

k.end

end key (or token)

k.limit

return at most this many keys (rows)

tokens

if TRUE then keys are interpreted as tokens (i.e. values after hashing)

fixed

if TRUE then the result will be a single data frame with keys as rows and all columns ever encountered as columns, essentially assuming a fixed column structure

level

the desired consistency level for query operations on this connection. "one" is the default if not explicitly set.

Details

The nomenclature can be a bit confusing; it comes from the literature and the Cassandra API. Put in simple terms, a keyspace is comparable to a database, and a column family is somewhat comparable to a table. However, a table may have a different number of columns for each row, so it can be used to create a flexible two-dimensional query structure. A row is defined by a (row) key. A query is performed by first finding out which row(s) will be fetched according to the key (RC.get, RC.get.range), keys (RC.mget.range) or key range (RC.get.range.slices), then selecting the columns of interest. An empty string ("") can be used to denote an unspecified range (so the default is to fetch all columns).

comparator and validator specify the types of column keys and values respectively. Every key or value in Cassandra is simply a byte string, so it can deal with arbitrary values, but sometimes it is convenient to impose some structure on that content by declaring what is represented by that byte string. Unfortunately Cassandra does not include that information in the results, so the user has to define how column names and values are to be interpreted. The default interpretation is simply as a UTF-8 encoded string, but RCassandra also supports the following conversions: "UTF8Type", "AsciiType" (stored as character vectors), "BytesType" (opaque stream of bytes, stored as raw vector), "LongType" (8-byte integer, stored as real vector in R), "DateType" (8-byte integer, stored as POSIXct in R), "BooleanType" (one byte, logical vector in R), "FloatType" (4-byte float, real vector in R), "DoubleType" (8-byte float, real vector in R) and "UUIDType" (16 bytes, stored as UUID-formatted string). No other conversions are supported at this point. If comparator or validator is NULL then RCassandra attempts to guess the proper type by taking into account the schema definition obtained by RC.use(..., cache.def=TRUE), otherwise it falls back to "UTF8Type". You can always get the raw form using "BytesType" and decode the values in R.
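
As a sketch of decoding the raw form in R, the function below converts one 8-byte value by hand. It assumes that the bytes hold a big-endian two's-complement integer and that a single value is available as a raw vector of length 8. decode_long is a hypothetical helper, not part of RCassandra.

```r
# Hypothetical helper (not part of RCassandra): decode one big-endian,
# two's-complement 8-byte integer from a raw vector into an R double.
decode_long <- function(x) {
  stopifnot(is.raw(x), length(x) == 8L)
  b <- as.numeric(x)                 # bytes as numbers in 0..255
  w <- 256^(7:0)                     # place values, most significant first
  if (b[1] < 128) {
    sum(b * w)                       # sign bit clear: non-negative value
  } else {
    -(sum((255 - b) * w) + 1)        # sign bit set: invert, add one, negate
  }
}

decode_long(as.raw(c(0, 0, 0, 0, 0, 0, 1, 0)))   # 256
decode_long(as.raw(rep(255, 8)))                 # -1
```

Because the result is a double, it is exact only for magnitudes below 2^53, the same restriction that applies to any 8-byte integer stored as a real vector in R.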

The comparator also determines how the values of first and last will be interpreted. Regardless of the comparator, it is always possible to pass either NULL, "" (both denoting 0-length value) or a raw vector. Other supported types must match the comparator.

Most users will be happy with the default settings, but if you want to save every nanosecond you can call RC.use(..., cache.def = FALSE) (which saves one extra RC.describe.keyspace request to the Cassandra instance) and always specify both comparator and validator (even if it is just "UTF8Type").
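
A minimal sketch of that setup, assuming every column name and value in the column family is stored as a UTF-8 string. fast_get is a hypothetical wrapper name; the RC.* calls and their arguments are those listed under Usage.

```r
# Hypothetical wrapper: always pass both types explicitly so that no
# cached schema definition is needed.
fast_get <- function(conn, c.family, key, c.names) {
  RC.get(conn, c.family, key, c.names,
         comparator = "UTF8Type", validator = "UTF8Type")
}

## Not run:
## conn <- RC.connect("cassandra-host")
## RC.use(conn, "testdb", cache.def = FALSE)  # skip the schema query
## fast_get(conn, "iris", "1", c("Sepal.Length", "Species"))
## RC.close(conn)
```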

Cassandra collects results in memory so key (k.limit) and column (limit) limits are mandatory. Future versions of RCassandra may abstract this limitation out (by using a limit and repeating queries with new start key/column based on the last result row), but not at this point.
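
The repeated-query idea can be sketched in user code. The helper below is hypothetical and not part of RCassandra; it assumes that a page starts at the column named by first (inclusive), that pages are sorted by column name, and that the key column of the result is a character vector. A mock data frame stands in for the server so the sketch runs without a connection.

```r
# Hypothetical helper: collect all columns of one row in pages.
# fetch_page(first, limit) must return a data frame with a 'key' column,
# sorted by key and starting at 'first' inclusive ("" = from the start).
fetch_all_columns <- function(fetch_page, page_size = 1000L) {
  stopifnot(page_size >= 2L)
  pages <- list()
  first <- ""
  repeat {
    page <- fetch_page(first, page_size)
    n <- nrow(page)                       # size of the page as returned
    # every page after the first begins with a column we already have
    if (length(pages) > 0L && n > 0L) page <- page[-1L, , drop = FALSE]
    if (nrow(page) > 0L) pages[[length(pages) + 1L]] <- page
    if (n < page_size) break              # short page: nothing left
    first <- page$key[nrow(page)]         # resume from the last column seen
  }
  do.call(rbind, pages)
}

# Mock standing in for the server: ten columns in one row.
cols <- data.frame(key = sprintf("c%02d", 1:10), value = 1:10,
                   stringsAsFactors = FALSE)
mock_page <- function(first, limit) {
  head(cols[cols$key >= first, , drop = FALSE], limit)
}
res <- fetch_all_columns(mock_page, page_size = 4L)
```

With a live connection the page function could be, for example, function(first, limit) RC.get.range(conn, "iris", "1", first = first, limit = limit), subject to the same assumptions.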

Note that in Cassandra keys are typically hashed, so key range may be counter-intuitive as it is based on the hash and not on the actual value. Columns are always sorted by their name (=key).

The result of queries may also be counter-intuitive, especially when querying fixed column tables, as it is not returned in the form that would be expected from a relational database. See RC.read.table and RC.write.table for retrieving and storing relational structures in rectangular tables (column families with fixed columns). But you have to keep in mind that Cassandra is essentially key/key/value storage (row key, column key, value) with partitioning on row keys and sorting of column keys, so designing the correct schema for a task needs some thought. Dynamic columns are what makes it so powerful.

Value

RC.use and RC.consistency return conn

RC.get and RC.get.range return a data frame with columns key (column name), value (value in that column) and ts (timestamp).

RC.mget.range and RC.get.range.slices return a named list of data frames as described in RC.get.range with names being the row keys, except if fixed=TRUE in which case the result is a data frame with row names as keys and values as elements (timestamps are not retrieved in that case).
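
A named list of this shape is often easier to handle as one long data frame. The sketch below builds a small mock list by hand (the values and their character type are made up for illustration) and stacks it with base R, adding the row key as a column named row.key, a name chosen here.

```r
# Mock result with the shape described above: one data frame per row key.
res <- list(
  "1" = data.frame(key = c("Sepal.Length", "Species"),
                   value = c("5.1", "setosa"),
                   ts = c(1, 1), stringsAsFactors = FALSE),
  "2" = data.frame(key = "Sepal.Length", value = "4.9",
                   ts = 2, stringsAsFactors = FALSE)
)

# Prepend the row key to each data frame, then stack them.
long <- do.call(rbind, lapply(names(res), function(k)
  data.frame(row.key = k, res[[k]], stringsAsFactors = FALSE)))
rownames(long) <- NULL
```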

Author(s)

Simon Urbanek

See Also

RC.connect, RC.read.table, RC.write.table

Examples

## Not run: 
c <- RC.connect("cassandra-host")
RC.use(c, "testdb")
## you will have to use cassandra-cli to create the schema for the "iris" CF
RC.write.table(c, "iris", iris)
RC.get(c, "iris", "1", c("Sepal.Length", "Species"))
RC.get.range(c, "iris", "1")
## list of 150 data frames
r <- RC.get.range.slices(c, "iris")
## use limit=0 to obtain all row keys without pulling any data
rk <- RC.get.range.slices(c, "iris", limit=0)
y <- RC.read.table(c, "iris")
y <- y[order(as.integer(row.names(y))),]
RC.close(c)

## End(Not run)