sqliteToRDS: Export results of an sqlite query to an .rds file using...

Description Usage Arguments Details Value Note Author(s)

Description

Export results of an sqlite query to an .rds file using limited memory (Linux, OS X; not Windows)

Usage

1
2
sqliteToRDS(con, query, bind.data = data.frame(), out, classes = NULL,
  rowsPerBlock = 10000, stringsAsFactors = TRUE, factorQueries = NULL)

Arguments

con

connection to sqlite database

query

character scalar; query on con; the entire results of the query will be written to the .rds file, but the query can include 'limit' and 'offset' phrases.

bind.data

values for query parameters, if any; see dbGetQuery. Defaults to data.frame(x=0L), i.e. a trivial data.frame meant only to pass the sanity checks imposed by dbGetQuery()

out

character scalar; name of file to which the query results will be saved. Should end in ".rds".

classes

named list of character vectors; classes for a subset of the columns in parameter query

rowsPerBlock

maximum number of rows fetched from the DB at a time; this limits the maximum memory consumed by this function. Default: 10000

stringsAsFactors

should string columns be exported as factors? With the default value of TRUE, size on disk and size in memory of the data.frame upon subsequent read will both be smaller, but at the cost of having to run a separate query on each string column to determine the levels. These separate queries might be just as expensive as the full query, in which case you should specify simpler queries for obtaining factor levels using factorQueries. You can specify FALSE here and still request that specific text columns be exported as factors by including 'factor' in the appropriate slot of the 'classes' parameter.

factorQueries

a named character vector of queries for quickly obtaining the levels for those columns you wish to be factors. This list will be consulted for any string column, if 'stringsAsFactors' is TRUE, and for any column whose entry in classes is 'factor'. Names of this list are column names, and values are the query to perform to get the factor levels for that column. The query should return a distinct set of levels. You don't have to specify ‘factorQueries', but if you don’t, sqliteToRDS might take $N+1$ times as long to run as it would otherwise, where $N$ is the number of factor columns.

Details

Typically, exporting the contents of an sqlite database table as an R .rds file has meant reading the entire table into R as a data.frame, then using saveRDS(). This requires memory proportional to the size of the data.frame, but with a sufficiently large swap partition, this will still work. However, our experience on an Intel core-i7 server with 4 cores @ 3.4 GHz, 32 G RAM, and a 256 SSD swap shows that our largest sites still slow the server down to a grind when processing one site at a time. To permit this all to work on a lower spec server with multiple sites potentially being processed at once, we need to do this with a much smaller memory footprint, even at the expense of considerably longer running time.

This function serializes the results of an SQLite query as a data.frame into an .rds file, using a fixed amount of memory that does not depend on the size of the results. Because sqlite stores data row-by-row, while .rds files store them column by column, the main challenge is to transpose the data without having it all in memory. We do this with a single run of the query, distributing columns to their own files, then concatenating and compressing these into the final .rds file via a shell command.

Value

integer scalar; the number of result rows written to file out.

Note

End users should really be working with an on-disk .sqlite version of their data, but we want to maintain some backward compatibility with users' existing code. Also, the details of the algorithm in this function depend on the encoding inherent in the files serialize.c and Rinternals.h from the R source tree.

The resulting .rds file uses bzip2 compression, which works with recent versions of R, but might break 'readRDS()' in older versions.

The algorithm, ignoring headers and footers, is:

The result is an .rds file in non-XDR little-endian format, which should read more quickly into memory.

Data types are converted as so:

Additionally, class() attributes can be specified for any of the columns. If "logical" or "integer" is specified for a column, it is written as a native vector of that type.

If there are no result rows from the query, then the R value NULL is saved to file out.

This function currently only works on linux (and OS X?) because we use shell commands 'cat' and 'bzip2'. The author would appreciate information on how to implement this for Windows platforms.

Author(s)

John Brzustowski jbrzusto@REMOVE_THIS_PART_fastmail.fm


jbrzusto/motusClient documentation built on May 30, 2019, 4:33 p.m.