Description Usage Arguments Details Value Note Author(s)
Export results of an sqlite query to an .rds file using limited memory (Linux, OS X; not Windows)
1 2 | sqliteToRDS(con, query, bind.data = data.frame(), out, classes = NULL,
rowsPerBlock = 10000, stringsAsFactors = TRUE, factorQueries = NULL)
|
con |
connection to sqlite database |
query |
character scalar; query on con; the entire results of the query will be written to the .rds file, but the query can include 'limit' and 'offset' phrases. |
bind.data |
values for query parameters, if any; see
|
out |
character scalar; name of file to which the query results will be saved. Should end in ".rds". |
classes |
named list of character vectors; classes for a
subset of the columns in parameter |
rowsPerBlock |
maximum number of rows fetched from the DB at a time; this limits the maximum memory consumed by this function. Default: 10000 |
stringsAsFactors |
should string columns be exported as
factors? With the default value of TRUE, size on disk and size
in memory of the data.frame upon subsequent read will both be
smaller, but at the cost of having to run a separate query on
each string column to determine the levels. These separate
queries might be just as expensive as the full query, in which
case you should specify simpler queries for obtaining factor
levels using |
factorQueries |
a named character vector of queries for quickly obtaining the
levels for those columns you wish to be factors. This list will be consulted
for any string column, if 'stringsAsFactors' is TRUE, and for any column
whose entry in |
Typically, exporting the contents of an sqlite database
table as an R .rds file has meant reading the entire table into
R as a data.frame, then using saveRDS()
. This requires
memory proportional to the size of the data.frame, but with a sufficiently
large swap partition, this will still work. However, our experience on
an Intel core-i7 server with 4 cores @ 3.4 GHz, 32 G RAM, and a 256 SSD swap
shows that our largest sites still slow the server down to a grind when
processing one site at a time. To permit this all to work on a lower
spec server with multiple sites potentially being processed at once, we
need to do this with a much smaller memory footprint, even at the expense
of considerably longer running time.
This function serializes the results of an SQLite query as a data.frame into an .rds file, using a fixed amount of memory that does not depend on the size of the results. Because sqlite stores data row-by-row, while .rds files store them column by column, the main challenge is to transpose the data without having it all in memory. We do this with a single run of the query, distributing columns to their own files, then concatenating and compressing these into the final .rds file via a shell command.
integer scalar; the number of result rows written to file out
.
End users should really be working with an on-disk .sqlite version of their data, but we want to maintain some backward compatibility with users' existing code. Also, the details of the algorithm in this function depend on the encoding inherent in the files serialize.c and Rinternals.h from the R source tree.
The resulting .rds file uses bzip2 compression, which works with recent versions of R, but might break 'readRDS()' in older versions.
The algorithm, ignoring headers and footers, is:
for each column in the query result, open a temporary output file
while there are results remaining
fetch a block of query results
distribute data in the block among the column files
when all blocks have been distributed, close the temporary files and concatenate them on disk into the target .rds file
The result is an .rds file in non-XDR little-endian format, which should read more quickly into memory.
Data types are converted as so:
sqlite real: written as 8-byte doubles
sqlite int: written as 4-byte signed integers or logical (see below)
sqlite text: written as a factor
Additionally, class() attributes can be specified for any of the columns. If "logical" or "integer" is specified for a column, it is written as a native vector of that type.
If there are no result rows from the query, then the R value
NULL
is saved to file out
.
This function currently only works on linux (and OS X?) because we use shell commands 'cat' and 'bzip2'. The author would appreciate information on how to implement this for Windows platforms.
John Brzustowski jbrzusto@REMOVE_THIS_PART_fastmail.fm
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.