vec_pkgs <- c("readr", "iotools", "LaF", "microbenchmark")
if (!all(sapply(vec_pkgs, 
       function(x) x  %in% installed.packages(), USE.NAMES = FALSE))){
  for (pgk in vec_pkgs){
    if (!pgk %in% installed.packages())
      install.packages(pkgs = pgk, repos = "https://cran.rstudio.com/")
  }  
}

Disclaimer

This vignette compares different methods of reading large amounts of fixed width formatted (fwf) data from a file.

Introduction

We are given a large file (about 13 million lines and 1.3 Gb of data) with fixed formatted data. This data needs to be read into R to do some consistency checks on the data and to write the data to a file in a different format. In a first implementation, we used the function readLines() to read each line, parse the content and store the data in a nested list.

Scaling

This approach worked well for small datasets, but it does not scale to bigger sizes of the input data. Even after several days of wall clock time, the data is still no completely processed. Hence we need a faster method to read the large input file and to process the data.

Remainder

In this document, we are comparing different methods for reading large amounts of data. The focus will be on R-packages iotools and LaF. Because the methods of how the data will be processed is not completely independent of how the data is read from the file, we are going to discuss different approaches on how the data can be processed towards the end of this document.

Methods

We are starting by comparing different methods of how we can read large amounts of data into R.

readlines

The original idea that has shown to be very slow was to read the input line-by-line using readline() and building the required data-structure for consistency checking as nested list.

read.fwf

The standard way of reading fwf-data into an R dataframe is to use the function read.fwf from the base package.

read_fwf

The package readr is part of the tidyverse and it offers the function read_fwf to read fwf-data. According to https://github.com/tidyverse/readr the function data.table::fread() is even faster than readr::read.fwf()

iotools

The package iotools is available from http://www.rforge.net/iotools/ and https://cran.r-project.org/web/packages/iotools/index.html and described in https://arxiv.org/abs/1510.00041.

LaF

The package LaF is available from https://cran.r-project.org/web/packages/LaF/index.html and https://github.com/djvanderlaan/LaF. This package allows to read data from files that are too big to fit into memory which is a special property that is not available in other tools.

Results

We are testing the described tools using a subset of the complete input data. This should make the tests to be runnable in an interactive mode.

sDataFileName <- system.file(file.path("extdata","KLDAT_20170524_10000.txt"), 
                             package = "PedigreeFromTvdData")
vecFormat <- c(22,14,3,31,8,14,3,1,14,3)

The size of this file in MB is

round(file.info(sDataFileName)$size / 1024 / 1024, digits = 2)

This file is read using each of the above mentioned methods and the time it takes to read them is compared. The basic method to which everything is compared is based on reading the input file line by line using readLines() and building up a nested list with the results.

(mb.base <- microbenchmark::microbenchmark(
  data.base = PedigreeFromTvdData::read_tvd_input(psInputFile = sDataFileName),
  times = 5, unit = "s"))
(mb.fwf <- microbenchmark::microbenchmark(
  data.fwf = read.fwf(file = sDataFileName, widths = vecFormat), 
  times = 5, unit = "s"))
(mb_fwf <- microbenchmark::microbenchmark(
  data_fwf = suppressMessages(readr::read_fwf(file = sDataFileName, 
                              col_positions = readr::fwf_widths(widths = vecFormat))),
  times = 5, unit = "s"))
(mb.iot <- microbenchmark::microbenchmark(
  iotools  = iotools::dstrfw(iotools::readAsRaw(sDataFileName), 
                            col_types = rep("character", length(vecFormat)),
                            widths = vecFormat),
  times = 5, unit = "s"))
(mb.laf <- microbenchmark::microbenchmark(
  laf = LaF::laf_open_fwf(filename = sDataFileName, 
                          column_types = rep("character", length(vecFormat)),
                          column_widths = vecFormat),
  times = 5, unit = "s"))

Overview of results

The following table shows the median times (column MedianTime) in seconds for all methods that were compared. The column entitled Factor expresses every median time in terms of the fastest method.

vecMethods <- c("PedigreeFromTvdData::read_tvd_input",
                "base::read.fwf",
                 "readr::read_fwf",
                 "iotools::dstrfw",
                 "LaF::laf_open_fwf")
vecMedTimes <- c(mb.base$time[3],
                 mb.fwf$time[3],
                  mb_fwf$time[3],
                  mb.iot$time[3],
                  mb.laf$time[3])
nMinTime <- vecMedTimes[order(vecMedTimes)][1]
dfMedTime <- data.frame(Methode = vecMethods,
                        MedianTime = round(vecMedTimes*10^(-9), digits = 4),
                        Factor = round(vecMedTimes/nMinTime, digits = 0))
knitr::kable(dfMedTime)

The results above show several things that can be learned. Our experiments showed that r vecMethods[order(vecMedTimes)][1] was the fastest and r vecMethods[order(vecMedTimes)][length(vecMethods)] was the slowest method.

Discussion

The data used so far was a small test example. The real data set will be three orders of magnitude larger. Hence the time requirements will also be larger by a factor of about $1000$.

Resources

Session Info

sessionInfo()

Latest Update

r paste(Sys.time(),paste0("(", Sys.info()[["user"]],")" ))



pvrqualitasag/PedigreeFromTvdData documentation built on Feb. 16, 2018, 2:20 a.m.