knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "READMEfigs/" )
The basic idea behind this format is to add a header section to the regular csv format, containing information about the data frame being stored, in such a way as to allow easy reading and writing of data to and from file, without losing the structure of the original data. Specifically designed for storing data frame (or data table) objects, details stored in the header include the dimensions of the data frame, and the class of each column. This means that the data frame can be read back into memory when required, without the need to specify column classes on input, nor perform column manipulations after import. The data will be brought back into memory in (almost) exactly the form it was in when it was saved.
library( rcsv ) library( data.table ) library( dplyr ) library( readr ) library( ggplot2 )
Personally, I like to work with binary files, and not having to worry about things like classes, timezones, etc. However, these binary formats are often not suitable in certain workflows, for example where collaborators may not be working in R*. It can also be beneficial in certain workflows to have meaningful diffs when a data file is changed, in order to work with version control systems like git.
So to make the data as generally accessible as possible, and allowing for meaningful diffs, while maintaining the ability to write out and read in data without losing important attributes, is the aim here.
feather
tries to address this with compatibility in both R and python, but this still can't encompass everyone's collaborators.We'll look at a test data frame, using several different column types. The rcsv
package is specifically designed to work with the following column types, while others may or may not work well (please do your own testing):
character, numeric, integer, logical, factor, Date, POSIXct, ITime, IDate, times
testDF <- function( n ) { data.frame( integers = 1:n, letters = sample( letters, n, replace = TRUE ), dates = sample( seq.Date( as.Date( "1970-01-01" ), as.Date( "2199-12-31" ), by = 1 ), n, replace = TRUE ), posix = as.POSIXct( seq.int( 1E6, by = 1000, length.out = n ), tz = "EST", origin = "1970-01-01 00:00:00" ), itime = setattr( sample( seq_len( 86399 ), n, replace = TRUE ), "class", "ITime" ), logical = sample( c( TRUE, FALSE ), n, replace = T ), factor = factor( sample( c( "small", "medium", "large", "extra-large" ), n, replace = TRUE ), levels = c( "small", "medium", "large", "extra-large" ) ), times = chron::times( seq.int( from = 0.001, to = 0.999, length.out = n ) ), stringsAsFactors = FALSE ) } set.seed( 123 ) df <- testDF( n = 100 ) glimpse( df, width = 80 )
An important feature of the rcsv
format is the ability to include notes on the data within the file itself. We'll do that here for demonstration.
notes_replace( df, "Here's an example note." )
notes_add( df, "And another one." )
A simple call to write_rcsv
will write out the object to csv as with any csv writing function. The notes are printed to the console on file write.
testfile <- "READMEfiles/test.rcsv" write_rcsv( df, testfile )
Notice that at the top of this file though, is a (JSON-like) header:
readLines( testfile, n = 15 )
The first line of the file contains some basic details about the file structure, below which is a series of lines specifying details of each column. This "header" section will be used by read_rcsv
to import the data frame in a manner matching the file written out, without the need for special parameters on the read call, or additional manipulations after import.
Before importing the file, we can take a look at it in a few ways. glimpse_rcsv
shows important details about the file, including notes and a dplyr::glimpse
-alike view of the dataset, without loading the file into memory. (We go a little more into the glimpse
function later.)
glimpse_rcsv( testfile )
notes_rcsv
shows just the notes.
notes_rcsv( testfile )
df.readrcsv <- read_rcsv( testfile )
dplyr::glimpse( df.readrcsv, width = 80 )
All columns have been read in with identical column classes as the original test data frame.
Note that since the header section is designated by a # at the beginning of each line. This means that any csv reader can still read the file, with "#" passed as a comment character.
df.base <- read.csv( testfile, comment.char = "#" ) glimpse( df.base, width = 80 )
Using a regular reader however, will not necessarily import the columns correctly (note several incorrect import formats above). Even with the very good readr
package, which does a great job, we won't always correctly attribute the original column classes, because there is ambiguity on the user's original intent.
df.readr <- readr::read_csv( testfile, comment = "#" ) glimpse( df.readr, width = 80 )
Also try with csvy
, a great package by Thomas J Leeper (https://github.com/leeper/csvy), looking to solve issues with csv use in a similar way to here.
testfile <- "READMEfiles/test.rcsv" csvy::write_csvy( df, "READMEfiles/test.csvy" ) df.readcsvy <- csvy::read_csvy( "READMEfiles/test.csvy" ) dplyr::glimpse( df.readcsvy, width = 80 )
Note some issues are solved, but there is still significant work to do before the data is back to its original format. In particular, datetime and factor formats are returned as character columns, albeit with some helpful information stored as attributes.
sapply( df.readcsvy[,c(3,4,5,7,8)], attributes )
Some of these issues will be easily fixed, like converting factors to strings, or vice-versa. Some are less simple though, for example timezones are generally not printed to csv files, meaning they are often read into memory with different timezone attributes from the original data frame. This can be difficult to notice if special care is not taken, and can have serious consequences.
worth noting that data.table::fwrite
does have a facility for this
cat( "Testing for timezone changes:\n", "from original data frame:\t", as.character( df$posix[1] ), attr( df$posix[1], "tzone" ), "\n", "using read_rcsv:\t\t\t", as.character( df.readrcsv$posix[1] ), attr( df.readrcsv$posix[1], "tzone" ), "\n", "using readr::read_csv:\t\t", as.character( df.readr$posix[1] ), attr( df.readr$posix[1], "tzone" ), "\n", "using base::read.csv:\t\t", as.character( df.base$posix[1] ), attr( df.base$posix[1], "tzone" ), "\n" )
factor
columns also raise an issue here, where they may be correctly read in as factor, but associated incorrect levels, which may have consequences on further data analysis:
cat( "Testing for factor level changes:\n", "from original data frame:\t", levels( df$factor ), "\n", "using read_rcsv:\t\t\t", levels( df.readrcsv$factor ), "\n", "using readr::read_csv:\t\t", levels( df.readr$factor ), "\n", "using csvy::read_csvy:\t\t", levels( df.readcsvy$factor ), "\n", "using base::read.csv:\t\t", levels( df.base$factor ) )
readr
defaults all columns without other conversion triggers to character
class (hence there are no levels to display here).
csvy
correctly stores and retrieves the factor levels, but leaves the column as character
class.
read.csv
defaults these columns to factor
class, but automatically sorts levels in alphabetical order, which may not be appropriate, as is the case here.
rcsv
determines whether or not to convert a column to factor based on the data frame from which it was written, and stores then extracts the correct factor levels, including their order.
The write_rcsv
function includes several options for "compressing" the csv output file by changing the way some columns are represented. We can control each conversion type individually with their respective parameters, or we can turn all of these options on with a single parameter, strings.convert
.
rcsv::write_rcsv( df, testfile, strings.convert = TRUE ) readLines( testfile, n = 15 )
The data itself here looks significantly different from the regular csv. character strings, factors, dates, times, and logical objects are all being represented in integer or numeric form, to be converted back to their respective classes on import using the information stored in the header, without any extra parameters needing to be passed to the import call:
df.readrcsv.convert <- read_rcsv( testfile ) glimpse( df.readrcsv.convert )
By converting columns in this way, we may save disk space*, as well as speed up reading and writing, while retaining the original data, and maintaining meaningful diffs. It is certainly less of a "general" data storage method (although the details needed to convert back to the original data types is mostly human-readable in the header, but I digress...) than without these conversions, but it may be useful in some situations.
To see the file size and speed advantages, we'll create a more meaningful (bigger) dataset, and write it to files with several different methods.
Date
, ITime
, and IDate
formats are all more efficiently stored as integers than character strings, and will generally convert to their proper classes faster from integers than from character strings. POSIXct
is likewise more efficiently converted from numeric, and this also makes storage of greater precision than 1s simpler and more efficient. Similarly, times
format values stored as numeric enables greater than 1s precision values to be stored and retrieved, however in some cases may increase storage size (where the decimal values convert to longer strings than the formatted time value).We test the speed of file writing between a few of the methods available to an R user.
n <- 1E5 df <- testDF( n ) write.times <- microbenchmark::microbenchmark( base_write = write.csv( df, "READMEfiles/base_write.csv" ), csvy_write = csvy::write_csvy( df, "READMEfiles/csvy_write.csvy" ), readr_write = readr::write_csv( df, "READMEfiles/readr_write.csv" ), dt_fwrite = data.table::fwrite( df, "READMEfiles/dt_fwrite.csv" ), rcsv_noconvert = rcsv::write_rcsv( df, "READMEfiles/rcsv_noconvert.rcsv" ), rcsv_convert = rcsv::write_rcsv( df, "READMEfiles/rcsv_convert.rcsv", strings.convert = TRUE ), rds_default = saveRDS( df, "READMEfiles/rds_default.rds" ), fst = fst::write.fst( df, "READMEfiles/fst.fst" ), feather = feather::write_feather( df, "READMEfiles/feather.feather" ), times = 12 ) write.times
write_rcsv
is built around the fantastic data.table::fwrite
function, making it much faster than both base::write.csv
and readr::write_csv
. It even maintains relatively good performance compared with data.table::fwrite
, but is slowed a little by the conversion processes and header writing steps.
ggplot( write.times, mapping = aes( expr, time/1E9, colour = expr ) ) + geom_violin( show.legend = FALSE ) + geom_jitter( size = 1, alpha = 0.4, width = 0.5, show.legend = FALSE ) + ylab( "write time (s)" ) + xlab( "" ) + ylim( c( 0, NA ) ) + labs( title = "File write times.", subtitle = sprintf( "%s row by %s column data frame.", n, ncol( df ) ) )
The speed differences between dt.fwrite
, rcsv.noconvert
, and rcsv.convert
seen here are the result of additional processing in the write_rcsv
function (slowing it down), and the advantage of writing fewer characters to file (speeding it up where strings.convert
== TRUE). Without string conversions, rcsv
generally writes with comparable speed to saveRDS
, while turning the string conversions on can make it significantly faster.
files <- list.files( path = "READMEfiles/", full.names = TRUE ) files <- files[ !grepl( "test.rcsv", files ) ] sizes <- file.info( files )$size ggplot( data = data.frame( files = gsub( ".*\\/|\\..*", "", files ), size = sizes ), mapping = aes( files, sizes/1E6, fill = files ) ) + geom_col( show.legend = FALSE ) + labs( title = "File sizes on disk." ) + xlab( "" ) + ylab( "File size (MB)" ) + theme(axis.text.x = element_text(size = 10, angle = 20, hjust = 1)) + geom_text( aes( label = paste( signif( sizes/1E6, 2 ), "MB" ), y = 0.7 * sizes/1E6 ), position = position_dodge(0.9), vjust = 0 )
Note most csv writers create files of similar sizes, but the rcsv
format with string conversions can (in some circumstances) result in significant file size reductions. We do not have the same file size benefit of the compressed (as per default settings) rds
format, so if file size is of paramount importance, it may be worth sticking with the rds format, or including an extra file compression step to your workflow; that is not the primary concern here though.
read.times <- microbenchmark::microbenchmark( base_read = { df.base <- read.csv( "READMEfiles/base_write.csv" ) }, csvy_read = { df.csvy <- csvy::read_csvy( "READMEfiles/csvy_write.csvy" ) }, readr_read = { df.readr <- readr::read_csv( "READMEfiles/readr_write.csv" ) }, dt_fread = { df.fread <- data.table::fread( "READMEfiles/dt_fwrite.csv" ) }, rcsv_noconvert = { df.noconvert <- rcsv::read_rcsv( "READMEfiles/rcsv_noconvert.rcsv" ) }, rcsv_convert = { df.convert <- rcsv::read_rcsv( "READMEfiles/rcsv_convert.rcsv" ) }, rds_default = { df.rds <- readRDS( "READMEfiles/rds_default.rds" ) }, fst = { df.fst <- fst::read.fst( "READMEfiles/fst.fst" ) }, feather = { df.feather <- feather::read_feather( "READMEfiles/feather.feather" ) }, times = 12 ) read.times
ggplot( data = read.times, mapping = aes( expr, time/1E9, colour = expr ) ) + geom_violin( show.legend = FALSE ) + geom_jitter( size = 1, alpha = 0.4, width = 0.5, show.legend = FALSE ) + ylab( "read time (s)" ) + xlab( "" ) + ylim( c( 0, NA ) ) + labs( title = "File read times.", subtitle = sprintf( "%s row by %s column data frame.", n, ncol( df ) ) )
We can see here that when reading in data, the speed of rcsv
is excellent if conversions are used, and very competitive even without conversions. Note the time spent here is on applying conversions to the data on import. These conversions would usually need to be performed manually after import, meaning that a more appropriate comparison between both forms of read_rcsv
and other read methods may be to include those conversion steps
read.times.with.manipulations <- microbenchmark::microbenchmark( readr_read = { df.readr.manip <- readr::read_csv( "READMEfiles/readr_write.csv", col_types = cols( col_integer(), col_character(), col_date(), col_datetime(), col_time(), col_logical(), col_factor( levels = c( "small", "medium", "large", "extra-large" ) ), col_time() ) ) %>% mutate( posix = lubridate::with_tz( posix, tzone = "EST" ), itime = as.ITime( itime ), times = chron::times( times ) ) }, dt_fread = { df.fread.manip <- data.table::fread( "READMEfiles/dt_fwrite.csv" ) df.fread.manip[ , dates := as.Date( dates ) ][ , posix := fasttime::fastPOSIXct( posix, tz = "EST" ) ][ , itime := as.ITime( itime ) ][ , factor := factor( factor, levels = c( "small", "medium", "large", "extra-large" ) ) ][ , times := chron::times( times ) ] }, rcsv_noconvert = { df.noconvert <- rcsv::read_rcsv( "READMEfiles/rcsv_noconvert.rcsv" ) }, rcsv_convert = { df.convert <- rcsv::read_rcsv( "READMEfiles/rcsv_convert.rcsv" ) }, csvy = { df.csvy <- csvy::read_csvy( "READMEfiles/csvy_write.csvy" ) df.csvy$dates <- as.Date( df.csvy$dates ) df.csvy$posix <- as.POSIXct( df.csvy$posix, tz = attr( df.csvy$posix, "tzone" ) ) df.csvy$itime <- as.ITime( df.csvy$itime ) df.csvy$factor <- factor( df.csvy$factor, levels = attr( df.csvy$factor, "levels" ) ) df.csvy$times <- chron::times( df.csvy$times ) }, times = 12 ) read.times.with.manipulations
ggplot( setDT( read.times.with.manipulations )[ grepl( "^base|^readr|^rcsv|^dt|^csvy", expr ) ], mapping = aes( expr, time/1E9, colour = expr ) ) + geom_violin( show.legend = FALSE ) + geom_jitter( size = 1, alpha = 0.4, width = 0.5, show.legend = FALSE ) + ylab( "read time (s)" ) + xlab( "" ) + ylim( c( 0, NA ) ) + labs( title = "File read times, with column manipulations where necessary.", subtitle = sprintf( "%s row by %s column data frame.", n, ncol( df ) ) )
So, assuming you'll be performing the necessary data conversions on import, fread
is actually significantly slower than rcsv
. In the case of rcsv_convert
, this is due to the fact that converting from numeric variables is often faster than from character. In the case of rcsv_noconvert
, it is largely due to the use of several speed optimisations, including the use of c++ as well as the fasttime
package in converting datetime variables to improve speeds over as.Date
, as.POSIXct
, chron::times
, data.table::as.ITime
, data.table::as.IDate
functions.
To test for correct data read, we look at each column of each imported data frame, and compare it to the matching column in the original data frame. In this way, we see if the data has remained consistent through the write/read process.
tests <- lapply( X = list( df.convert = df.convert, df.noconvert = df.noconvert, df.fread.manip = df.fread.manip, df.readr.manip = df.readr.manip, df.csvy = df.csvy ), FUN = function(x) { sapply( seq_along( df ), function(i) all.equal( df[[i]], x[[i]] ) ) } ) %>% lapply( function(x) paste( x, collapse = ", " ) ) print( tests )
It's clear here that imports using regular csv readers will need some work to bring the data back to the way it should be. This manipulation needs to be repeated every time a dataset is read into memory, which can be quite prohibitive to a smooth workflow. With rcsv
, the data is read back into R in the same form as before it was written out*, with no additional time or effort on the part of the user.
times
values have been stored without numeric conversion. The slight discrepancy in the values retrieved is due to the conversion of greater than 1s precision time values to formatted strings ("h:m:s"), which does not allow for the storage of such high precision. Turning conversion on in the write_rcsv
call solves this by storing as numeric, and therefore retaining the precision of the original data. fwrite
defaults to writing times
values as numeric, but we overwrite this where necessary to create a more portable csv file (times written out as "h:m:s" character strings).Sometimes it's important to include a few brief notes with your dataset. Maybe the specs for a piece of equipment used to collect the data, or maybe something that a user of the dataset should be aware of when working with it. The rcsv
format has this facility built in. You can add a note to your data frame, which is stored as an attribute named notes
.
df <- testDF( 100 ) notes_add( df, "All the data here was collected with a model X super-duper measuring device" )
We can replace any existing notes with notes_replace
, or we can add another note with notes_add
notes_add( df, "The equipment was calibrated using ISO 9000" )
Each time a note is added, the full set of notes are printed to the console. This console print will also happen whenever the data is written to file...
write_rcsv( df, testfile )
And whenever it is read from file.
df <- read_rcsv( testfile )
These notes are stored in the file header in a human-readable way, meaning that they can be accessed even without the rcsv
package.
readLines( testfile, 3 )
The rcsv
package also includes several helper functions to work with rcsv files, without loading the entire dataset into memory. These functions are therefore very fast ways to access details on even very large data files.
glimpse_rcsv
displays a dplyr::glimpse
-like view of the data in the console (and silently returns a list of these details), without loading the entire dataset into memory. This is a very efficient way of viewing key details about a dataset without importing it.
details <- glimpse_rcsv( testfile )
Useful details are returned as a list, meaning they can be referenced using $
or [[
print( details )
details$dim
notes_rcsv
returns only the notes associated with a data file, without loading the dataset.
notes_rcsv( testfile )
This will also accept a data frame with the "notes" attribute. ie: after being read into R
notes_rcsv( df )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.