Read and write tables into column families in Cassandra

Description

RC.read.table reads the contents of a column family into a data frame.

RC.write.table writes the contents of a data frame into a column family.

Usage

RC.read.table(conn, c.family, convert = TRUE, na.strings = "NA",
              as.is = FALSE, dec = ".")
RC.write.table(conn, c.family, df)

Arguments

conn

connection handle as obtained from RC.connect

c.family

column family name (string)

convert

logical; if TRUE, the resulting data frame is processed using type.convert, otherwise all columns will be character vectors

na.strings

passed to type.convert

as.is

passed to type.convert

dec

passed to type.convert

df

data frame - it must have both row and column names

Details

Cassandra is a key/value store with dynamic columns, so tables are not its native format. Row names are used as keys and columns are treated as fixed. RC.read.table is really just a wrapper for RC.get.range.slices(conn, c.family, fixed=TRUE). RC.write.table uses the same facility as RC.mutate but without actually creating the mutation object on the R side.

Note that all updates in Cassandra are "upserts", i.e., RC.write.table updates any existing row key/column name combinations and creates new ones where they are not present (insert). Additional columns (or even keys) may still exist in the column family and they will not be touched.
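The upsert behaviour can be illustrated without a Cassandra server. The following is only a local simulation in plain R, not how RC.write.table is implemented: the data frame store stands in for an existing column family, and the helper upsert is a hypothetical name introduced for this sketch.

```r
# 'store' stands in for the existing contents of a column family
# (all values are strings, as in a string-based column family).
store <- data.frame(a = c("1", "2"), z = c("keep", "keep"),
                    row.names = c("k1", "k2"), stringsAsFactors = FALSE)

# 'df' is the data frame that would be passed to RC.write.table.
df <- data.frame(a = c("9", "7"), b = c("new", "new"),
                 row.names = c("k2", "k3"), stringsAsFactors = FALSE)

# Hypothetical helper mimicking upsert semantics on a local data frame.
upsert <- function(store, df) {
  # Columns not seen before are created (empty for existing keys).
  for (col in setdiff(names(df), names(store))) store[[col]] <- NA_character_
  # Keys not seen before are created (empty for existing columns).
  for (key in setdiff(row.names(df), row.names(store))) store[key, ] <- NA
  # Existing key/column combinations are overwritten, new ones filled in.
  store[row.names(df), names(df)] <- df
  store
}

result <- upsert(store, df)
print(result)
```

Key k1 and column z are not part of df, so they are left untouched; the value of a for key k2 is overwritten, and key k3 and column b are newly created.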

RC.read.table creates a data frame from all columns that are encountered in at least one key. Where a key does not have a given column, the corresponding value is filled with NA.
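The shape of the result can be sketched in plain R. This is an illustration under assumptions, not the package's actual code: the list rows is made-up data standing in for the string values stored under three keys, and the final step mimics convert = TRUE by calling type.convert on each column.

```r
# Made-up key -> (column name -> value) data; all values are strings.
rows <- list(
  k1 = c(a = "1", b = "x"),
  k2 = c(b = "y", c = "2.5"),
  k3 = c(a = "3")
)

# Columns in the order in which they are first encountered across keys.
cols <- unique(unlist(lapply(rows, names)))

# One pre-allocated character vector per column, NA where a key lacks it.
out <- lapply(cols, function(col) {
  v <- rep(NA_character_, length(rows))
  for (i in seq_along(rows)) {
    if (col %in% names(rows[[i]])) v[i] <- rows[[i]][[col]]
  }
  v
})
names(out) <- cols

df <- as.data.frame(out, stringsAsFactors = FALSE, row.names = names(rows))

# Mimic convert = TRUE: let type.convert pick a type for each column.
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

In this sketch column a becomes integer, c becomes numeric, and b stays character because as.is = TRUE is used; with as.is = FALSE, type.convert would turn character columns into factors.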

Value

RC.read.table returns the resulting data frame

RC.write.table returns conn

Note

IMPORTANT: Cassandra does NOT preserve the order of keys and columns. Internally, keys are ordered by their hash value and columns are ordered lexicographically (treated as bytes). However, because columns are dynamic, the order of columns will vary if keys have different columns: columns are added to the data frame in the sequence in which they are encountered as the keys are loaded. You may want to use df <- df[order(as.integer(row.names(df))),] on the result of RC.read.table for tables with automatic row names to obtain the original order of rows.
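A minimal sketch of the reordering step, using a hand-built data frame in place of a real RC.read.table result (the values and the scrambled order are made up for illustration):

```r
# Stand-in for a data frame returned by RC.read.table: automatic row
# names "1", "2", "3" come back in an arbitrary (hash-based) order.
df <- data.frame(y = c("c", "a", "b"), x = c(30, 10, 20),
                 row.names = c("3", "1", "2"), stringsAsFactors = FALSE)

# Restore the original row order by sorting on the integer row names.
df <- df[order(as.integer(row.names(df))), ]

# Optionally put the columns into a known order as well.
df <- df[, sort(names(df)), drop = FALSE]
print(df)
```

The as.integer conversion only makes sense for automatic (numeric) row names; for other keys, a sort on the row names as strings or an explicit vector of keys would be needed instead.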

RC.read.table is more efficient than RC.get.range.slices because it can store columns into vectors and can pre-allocate the whole structure in advance.

Note that the current implementation of tables (RC.read.table and RC.write.table) supports only string-based representation of columns and values ("UTF8Type", "AsciiType" or similar).

Author(s)

Simon Urbanek

See Also

RC.connect, RC.use, RC.get