knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "READMEfigs/"
)

The rcsv format.

The basic idea behind this format is to add a header section to the regular csv format, containing information about the data frame being stored, in such a way as to allow easy reading and writing of data to and from file, without losing the structure of the original data. Specifically designed for storing data frame (or data table) objects, details stored in the header include the dimensions of the data frame, and the class of each column. This means that the data frame can be read back into memory when required, without the need to specify column classes on input, nor perform column manipulations after import. The data will be brought back into memory in (almost) exactly the form it was in when it was saved.

library( rcsv )
library( data.table )
library( dplyr )
library( readr )
library( ggplot2 )

Why not rds, feather, fst...?

Personally, I like to work with binary files, and not having to worry about things like classes, timezones, etc. However, these binary formats are often not suitable in certain workflows, for example where collaborators may not be working in R*. It can also be beneficial in certain workflows to have meaningful diffs when a data file is changed, in order to work with version control systems like git.

So to make the data as generally accessible as possible, and allowing for meaningful diffs, while maintaining the ability to write out and read in data without losing important attributes, is the aim here.

Testing

Set up some test data

We'll look at a test data frame, using several different column types. The rcsv package is specifically designed to work with the following column types, while others may or may not work well (please do your own testing): character, numeric, integer, logical, factor, Date, POSIXct, ITime, IDate, times

testDF <- function( n ) {
    data.frame(
        integers = 1:n,
        letters = sample( letters, n, replace = TRUE ),
        dates = sample(
            seq.Date( as.Date( "1970-01-01" ), as.Date( "2199-12-31" ), by = 1 ),
            n, replace = TRUE ),
        posix = as.POSIXct( seq.int( 1E6, by = 1000, length.out = n ), tz = "EST", origin = "1970-01-01 00:00:00" ),
        itime = setattr( sample( seq_len( 86399 ), n, replace = TRUE ), "class", "ITime" ),
        logical = sample( c( TRUE, FALSE ), n, replace = T ),
        factor = factor( sample( c( "small", "medium", "large", "extra-large" ), n, replace = TRUE ),
                         levels = c( "small", "medium", "large", "extra-large" )  ),
        times = chron::times( seq.int( from = 0.001, to = 0.999, length.out = n ) ),
        stringsAsFactors = FALSE
    )
}

set.seed( 123 )
df <- testDF( n = 100 )
glimpse( df, width = 80 )

Add notes

An important feature of the rcsv format is the ability to include notes on the data within the file itself. We'll do that here for demonstration.

notes_replace( df, "Here's an example note." )
notes_add( df, "And another one." )

write_rcsv with default parameters

A simple call to write_rcsv will write out the object to csv as with any csv writing function. The notes are printed to the console on file write.

testfile <- "READMEfiles/test.rcsv"
write_rcsv( df, testfile )

Notice that at the top of this file though, is a (JSON-like) header:

readLines( testfile, n = 15 )

The first line of the file contains some basic details about the file structure, below which is a series of lines specifying details of each column. This "header" section will be used by read_rcsv to import the data frame in a manner matching the file written out, without the need for special parameters on the read call, or additional manipulations after import.

Before importing the file, we can take a look at it in a few ways. glimpse_rcsv shows important details about the file, including notes and a dplyr::glimpse-alike view of the dataset, without loading the file into memory. (We go a little more into the glimpse function later.)

glimpse_rcsv( testfile )

notes_rcsv shows just the notes.

notes_rcsv( testfile )
df.readrcsv <- read_rcsv( testfile )
dplyr::glimpse( df.readrcsv, width = 80 )

All columns have been read in with identical column classes as the original test data frame.

Note that since the header section is designated by a # at the beginning of each line. This means that any csv reader can still read the file, with "#" passed as a comment character.

df.base <- read.csv( testfile, comment.char = "#" )
glimpse( df.base, width = 80 )

Using a regular reader however, will not necessarily import the columns correctly (note several incorrect import formats above). Even with the very good readr package, which does a great job, we won't always correctly attribute the original column classes, because there is ambiguity on the user's original intent.

df.readr <- readr::read_csv( testfile, comment = "#" )
glimpse( df.readr, width = 80 )

Also try with csvy, a great package by Thomas J Leeper (https://github.com/leeper/csvy), looking to solve issues with csv use in a similar way to here.

testfile <- "READMEfiles/test.rcsv"
csvy::write_csvy( df, "READMEfiles/test.csvy" )
df.readcsvy <- csvy::read_csvy( "READMEfiles/test.csvy" )
dplyr::glimpse( df.readcsvy, width = 80 )

Note some issues are solved, but there is still significant work to do before the data is back to its original format. In particular, datetime and factor formats are returned as character columns, albeit with some helpful information stored as attributes.

sapply( df.readcsvy[,c(3,4,5,7,8)], attributes )

Some of these issues will be easily fixed, like converting factors to strings, or vice-versa. Some are less simple though, for example timezones are generally not printed to csv files, meaning they are often read into memory with different timezone attributes from the original data frame. This can be difficult to notice if special care is not taken, and can have serious consequences. worth noting that data.table::fwrite does have a facility for this

cat( "Testing for timezone changes:\n",
     "from original data frame:\t", as.character( df$posix[1] ), attr( df$posix[1], "tzone" ), "\n",
     "using read_rcsv:\t\t\t", as.character( df.readrcsv$posix[1] ), attr( df.readrcsv$posix[1], "tzone" ), "\n",
     "using readr::read_csv:\t\t", as.character( df.readr$posix[1] ), attr( df.readr$posix[1], "tzone" ), "\n",
     "using base::read.csv:\t\t", as.character( df.base$posix[1] ), attr( df.base$posix[1], "tzone" ), "\n" )

factor columns also raise an issue here, where they may be correctly read in as factor, but associated incorrect levels, which may have consequences on further data analysis:

cat( "Testing for factor level changes:\n",
     "from original data frame:\t", levels( df$factor ), "\n",
     "using read_rcsv:\t\t\t", levels( df.readrcsv$factor ), "\n",
     "using readr::read_csv:\t\t", levels( df.readr$factor ), "\n",
     "using csvy::read_csvy:\t\t", levels( df.readcsvy$factor ), "\n",
     "using base::read.csv:\t\t", levels( df.base$factor ) )

readr defaults all columns without other conversion triggers to character class (hence there are no levels to display here). csvy correctly stores and retrieves the factor levels, but leaves the column as character class. read.csv defaults these columns to factor class, but automatically sorts levels in alphabetical order, which may not be appropriate, as is the case here. rcsv determines whether or not to convert a column to factor based on the data frame from which it was written, and stores then extracts the correct factor levels, including their order.

using write_rcsv with data conversion options

The write_rcsv function includes several options for "compressing" the csv output file by changing the way some columns are represented. We can control each conversion type individually with their respective parameters, or we can turn all of these options on with a single parameter, strings.convert.

rcsv::write_rcsv( df, testfile, strings.convert = TRUE )
readLines( testfile, n = 15 )

The data itself here looks significantly different from the regular csv. character strings, factors, dates, times, and logical objects are all being represented in integer or numeric form, to be converted back to their respective classes on import using the information stored in the header, without any extra parameters needing to be passed to the import call:

df.readrcsv.convert <- read_rcsv( testfile )
glimpse( df.readrcsv.convert )

File writing.

Why use conversions?

By converting columns in this way, we may save disk space*, as well as speed up reading and writing, while retaining the original data, and maintaining meaningful diffs. It is certainly less of a "general" data storage method (although the details needed to convert back to the original data types is mostly human-readable in the header, but I digress...) than without these conversions, but it may be useful in some situations.

To see the file size and speed advantages, we'll create a more meaningful (bigger) dataset, and write it to files with several different methods.

File write speeds

We test the speed of file writing between a few of the methods available to an R user.

n <- 1E5
df <- testDF( n )
write.times <- microbenchmark::microbenchmark(
    base_write = write.csv( df, "READMEfiles/base_write.csv" ),
    csvy_write = csvy::write_csvy( df, "READMEfiles/csvy_write.csvy" ),
    readr_write = readr::write_csv( df, "READMEfiles/readr_write.csv" ),
    dt_fwrite = data.table::fwrite( df, "READMEfiles/dt_fwrite.csv" ),
    rcsv_noconvert = rcsv::write_rcsv( df, "READMEfiles/rcsv_noconvert.rcsv" ),
    rcsv_convert = rcsv::write_rcsv( df, "READMEfiles/rcsv_convert.rcsv", strings.convert = TRUE ),
    rds_default = saveRDS( df, "READMEfiles/rds_default.rds" ),
    fst = fst::write.fst( df, "READMEfiles/fst.fst" ),
    feather = feather::write_feather( df, "READMEfiles/feather.feather" ),
    times = 12
)
write.times

write_rcsv is built around the fantastic data.table::fwrite function, making it much faster than both base::write.csv and readr::write_csv. It even maintains relatively good performance compared with data.table::fwrite, but is slowed a little by the conversion processes and header writing steps.

ggplot( write.times,
        mapping = aes( expr, time/1E9, colour = expr ) ) +
    geom_violin( show.legend = FALSE ) +
    geom_jitter( size = 1, alpha = 0.4, width = 0.5, show.legend = FALSE ) +
    ylab( "write time (s)" ) + xlab( "" ) +
    ylim( c( 0, NA ) ) +
    labs( title = "File write times.",
          subtitle = sprintf( "%s row by %s column data frame.", n, ncol( df ) ) )

The speed differences between dt.fwrite, rcsv.noconvert, and rcsv.convert seen here are the result of additional processing in the write_rcsv function (slowing it down), and the advantage of writing fewer characters to file (speeding it up where strings.convert == TRUE). Without string conversions, rcsv generally writes with comparable speed to saveRDS, while turning the string conversions on can make it significantly faster.

File sizes

files <- list.files( path = "READMEfiles/", full.names = TRUE )
files <- files[ !grepl( "test.rcsv", files ) ]
sizes <- file.info( files )$size

ggplot( data = data.frame( files = gsub( ".*\\/|\\..*", "", files ), size = sizes ),
        mapping = aes( files, sizes/1E6, fill = files ) ) +
    geom_col( show.legend = FALSE ) +
    labs( title = "File sizes on disk." ) +
    xlab( "" ) + ylab( "File size (MB)" ) +
    theme(axis.text.x = element_text(size = 10, angle = 20, hjust = 1)) +
    geom_text( aes( label = paste( signif( sizes/1E6, 2 ), "MB" ), y = 0.7 * sizes/1E6 ),
               position = position_dodge(0.9),
               vjust = 0
    )

Note most csv writers create files of similar sizes, but the rcsv format with string conversions can (in some circumstances) result in significant file size reductions. We do not have the same file size benefit of the compressed (as per default settings) rds format, so if file size is of paramount importance, it may be worth sticking with the rds format, or including an extra file compression step to your workflow; that is not the primary concern here though.

File read times

read.times <- microbenchmark::microbenchmark(
    base_read = { df.base <- read.csv( "READMEfiles/base_write.csv" ) },
    csvy_read = { df.csvy <- csvy::read_csvy( "READMEfiles/csvy_write.csvy" ) },
    readr_read = { df.readr <- readr::read_csv( "READMEfiles/readr_write.csv" ) },
    dt_fread = { df.fread <- data.table::fread( "READMEfiles/dt_fwrite.csv" ) },
    rcsv_noconvert = { df.noconvert <- rcsv::read_rcsv( "READMEfiles/rcsv_noconvert.rcsv" ) },
    rcsv_convert = { df.convert <- rcsv::read_rcsv( "READMEfiles/rcsv_convert.rcsv" ) },
    rds_default = { df.rds <- readRDS( "READMEfiles/rds_default.rds" ) },
    fst = { df.fst <- fst::read.fst( "READMEfiles/fst.fst" ) },
    feather = { df.feather <- feather::read_feather( "READMEfiles/feather.feather" ) },
    times = 12
)
read.times
ggplot( data = read.times,
        mapping = aes( expr, time/1E9, colour = expr ) ) +
    geom_violin( show.legend = FALSE ) +
    geom_jitter( size = 1, alpha = 0.4, width = 0.5, show.legend = FALSE ) +
    ylab( "read time (s)" ) + xlab( "" ) +
    ylim( c( 0, NA ) ) +
    labs( title = "File read times.",
          subtitle = sprintf( "%s row by %s column data frame.", n, ncol( df ) ) )

We can see here that when reading in data, the speed of rcsv is excellent if conversions are used, and very competitive even without conversions. Note the time spent here is on applying conversions to the data on import. These conversions would usually need to be performed manually after import, meaning that a more appropriate comparison between both forms of read_rcsv and other read methods may be to include those conversion steps

read.times.with.manipulations <- microbenchmark::microbenchmark(
    readr_read = {
        df.readr.manip <- readr::read_csv( "READMEfiles/readr_write.csv",
                                           col_types = cols(
                                               col_integer(),
                                               col_character(),
                                               col_date(),
                                               col_datetime(),
                                               col_time(),
                                               col_logical(),
                                               col_factor(
                                                   levels = c( "small", "medium", "large", "extra-large" )
                                               ),
                                               col_time()
                                           ) ) %>%
            mutate( posix = lubridate::with_tz( posix, tzone = "EST" ),
                    itime = as.ITime( itime ),
                    times = chron::times( times ) )
        },
    dt_fread = { 
        df.fread.manip <- data.table::fread( "READMEfiles/dt_fwrite.csv" )
        df.fread.manip[ , dates := as.Date( dates )
                     ][ , posix := fasttime::fastPOSIXct( posix, tz = "EST" )
                     ][ , itime := as.ITime( itime )
                     ][ , factor := factor( factor, levels = c( "small", "medium", "large", "extra-large" ) )
                     ][ , times := chron::times( times ) ] },
    rcsv_noconvert = { df.noconvert <- rcsv::read_rcsv( "READMEfiles/rcsv_noconvert.rcsv" ) },
    rcsv_convert = { df.convert <- rcsv::read_rcsv( "READMEfiles/rcsv_convert.rcsv" ) },
    csvy = {
        df.csvy <- csvy::read_csvy( "READMEfiles/csvy_write.csvy" )
        df.csvy$dates <- as.Date( df.csvy$dates )
        df.csvy$posix <- as.POSIXct( df.csvy$posix, tz = attr( df.csvy$posix, "tzone" ) )
        df.csvy$itime <- as.ITime( df.csvy$itime )
        df.csvy$factor <- factor( df.csvy$factor, levels = attr( df.csvy$factor, "levels" ) )
        df.csvy$times <- chron::times( df.csvy$times )
    },
    times = 12
)
read.times.with.manipulations
ggplot( setDT( read.times.with.manipulations )[ grepl( "^base|^readr|^rcsv|^dt|^csvy", expr ) ],
        mapping = aes( expr, time/1E9, colour = expr ) ) +
    geom_violin( show.legend = FALSE ) +
    geom_jitter( size = 1, alpha = 0.4, width = 0.5, show.legend = FALSE ) +
    ylab( "read time (s)" ) + xlab( "" ) +
    ylim( c( 0, NA ) ) +
    labs( title = "File read times, with column manipulations where necessary.",
          subtitle = sprintf( "%s row by %s column data frame.", n, ncol( df ) ) )

So, assuming you'll be performing the necessary data conversions on import, fread is actually significantly slower than rcsv. In the case of rcsv_convert, this is due to the fact that converting from numeric variables is often faster than from character. In the case of rcsv_noconvert, it is largely due to the use of several speed optimisations, including the use of c++ as well as the fasttime package in converting datetime variables to improve speeds over as.Date, as.POSIXct, chron::times, data.table::as.ITime, data.table::as.IDate functions.

Data integrity

Testing that the data has been read in correctly.

To test for correct data read, we look at each column of each imported data frame, and compare it to the matching column in the original data frame. In this way, we see if the data has remained consistent through the write/read process.

tests <- lapply( X = list( df.convert = df.convert,
                           df.noconvert = df.noconvert,
                           df.fread.manip = df.fread.manip,
                           df.readr.manip = df.readr.manip,
                           df.csvy = df.csvy ),
                 FUN = function(x) {
                     sapply( seq_along( df ),
                             function(i) all.equal( df[[i]], x[[i]] )
                     )
                 }
) %>%
    lapply( function(x) paste( x, collapse = ", " ) )
print( tests )

It's clear here that imports using regular csv readers will need some work to bring the data back to the way it should be. This manipulation needs to be repeated every time a dataset is read into memory, which can be quite prohibitive to a smooth workflow. With rcsv, the data is read back into R in the same form as before it was written out*, with no additional time or effort on the part of the user.

Additional features

Notes, an added detail to your datasets

Sometimes it's important to include a few brief notes with your dataset. Maybe the specs for a piece of equipment used to collect the data, or maybe something that a user of the dataset should be aware of when working with it. The rcsv format has this facility built in. You can add a note to your data frame, which is stored as an attribute named notes.

df <- testDF( 100 )
notes_add( df, "All the data here was collected with a model X super-duper measuring device" )

We can replace any existing notes with notes_replace, or we can add another note with notes_add

notes_add( df, "The equipment was calibrated using ISO 9000" )

Each time a note is added, the full set of notes are printed to the console. This console print will also happen whenever the data is written to file...

write_rcsv( df, testfile )

And whenever it is read from file.

df <- read_rcsv( testfile )

These notes are stored in the file header in a human-readable way, meaning that they can be accessed even without the rcsv package.

readLines( testfile, 3 )

Helper functions

The rcsv package also includes several helper functions to work with rcsv files, without loading the entire dataset into memory. These functions are therefore very fast ways to access details on even very large data files.

glimpse_rcsv

glimpse_rcsv displays a dplyr::glimpse-like view of the data in the console (and silently returns a list of these details), without loading the entire dataset into memory. This is a very efficient way of viewing key details about a dataset without importing it.

details <- glimpse_rcsv( testfile )

Useful details are returned as a list, meaning they can be referenced using $ or [[

print( details )
details$dim

notes_rcsv

notes_rcsv returns only the notes associated with a data file, without loading the dataset.

notes_rcsv( testfile )

This will also accept a data frame with the "notes" attribute. ie: after being read into R

notes_rcsv( df )


rossholmberg/rcsv documentation built on May 27, 2019, 11:36 p.m.