vec_pkgs <- c("readr", "iotools", "LaF", "microbenchmark") if (!all(sapply(vec_pkgs, function(x) x %in% installed.packages(), USE.NAMES = FALSE))){ for (pgk in vec_pkgs){ if (!pgk %in% installed.packages()) install.packages(pkgs = pgk, repos = "https://cran.rstudio.com/") } }
This vignette compares different methods of reading large amounts of fixed width formatted (fwf) data from a file.
We are given a large file (about 13 million lines and 1.3 Gb of data) with fixed formatted data. This data needs to be read into R to do some consistency checks on the data and to write the data to a file in a different format. In a first implementation, we used the function readLines()
to read each line, parse the content and store the data in a nested list.
This approach worked well for small datasets, but it does not scale to bigger sizes of the input data. Even after several days of wall clock time, the data is still no completely processed. Hence we need a faster method to read the large input file and to process the data.
In this document, we are comparing different methods for reading large amounts of data. The focus will be on R-packages iotools
and LaF
. Because the methods of how the data will be processed is not completely independent of how the data is read from the file, we are going to discuss different approaches on how the data can be processed towards the end of this document.
We are starting by comparing different methods of how we can read large amounts of data into R.
The original idea that has shown to be very slow was to read the input line-by-line using readline()
and building the required data-structure for consistency checking as nested list.
The standard way of reading fwf-data into an R dataframe is to use the function read.fwf from the base package.
The package readr
is part of the tidyverse
and it offers the function read_fwf
to read fwf-data. According to https://github.com/tidyverse/readr the function data.table::fread()
is even faster than readr::read.fwf()
The package iotools
is available from http://www.rforge.net/iotools/ and https://cran.r-project.org/web/packages/iotools/index.html and described in https://arxiv.org/abs/1510.00041.
The package LaF
is available from https://cran.r-project.org/web/packages/LaF/index.html and https://github.com/djvanderlaan/LaF. This package allows to read data from files that are too big to fit into memory which is a special property that is not available in other tools.
We are testing the described tools using a subset of the complete input data. This should make the tests to be runnable in an interactive mode.
sDataFileName <- system.file(file.path("extdata","KLDAT_20170524_10000.txt"), package = "PedigreeFromTvdData") vecFormat <- c(22,14,3,31,8,14,3,1,14,3)
The size of this file in MB is
round(file.info(sDataFileName)$size / 1024 / 1024, digits = 2)
This file is read using each of the above mentioned methods and the time it takes to read them is compared. The basic method to which everything is compared is based on reading the input file line by line using readLines()
and building up a nested list with the results.
(mb.base <- microbenchmark::microbenchmark( data.base = PedigreeFromTvdData::read_tvd_input(psInputFile = sDataFileName), times = 5, unit = "s")) (mb.fwf <- microbenchmark::microbenchmark( data.fwf = read.fwf(file = sDataFileName, widths = vecFormat), times = 5, unit = "s")) (mb_fwf <- microbenchmark::microbenchmark( data_fwf = suppressMessages(readr::read_fwf(file = sDataFileName, col_positions = readr::fwf_widths(widths = vecFormat))), times = 5, unit = "s")) (mb.iot <- microbenchmark::microbenchmark( iotools = iotools::dstrfw(iotools::readAsRaw(sDataFileName), col_types = rep("character", length(vecFormat)), widths = vecFormat), times = 5, unit = "s")) (mb.laf <- microbenchmark::microbenchmark( laf = LaF::laf_open_fwf(filename = sDataFileName, column_types = rep("character", length(vecFormat)), column_widths = vecFormat), times = 5, unit = "s"))
The following table shows the median times (column MedianTime) in seconds for all methods that were compared. The column entitled Factor
expresses every median time in terms of the fastest method.
vecMethods <- c("PedigreeFromTvdData::read_tvd_input", "base::read.fwf", "readr::read_fwf", "iotools::dstrfw", "LaF::laf_open_fwf") vecMedTimes <- c(mb.base$time[3], mb.fwf$time[3], mb_fwf$time[3], mb.iot$time[3], mb.laf$time[3]) nMinTime <- vecMedTimes[order(vecMedTimes)][1] dfMedTime <- data.frame(Methode = vecMethods, MedianTime = round(vecMedTimes*10^(-9), digits = 4), Factor = round(vecMedTimes/nMinTime, digits = 0)) knitr::kable(dfMedTime)
The results above show several things that can be learned. Our experiments showed that r vecMethods[order(vecMedTimes)][1]
was the fastest and r vecMethods[order(vecMedTimes)][length(vecMethods)]
was the slowest method.
The data used so far was a small test example. The real data set will be three orders of magnitude larger. Hence the time requirements will also be larger by a factor of about $1000$.
sessionInfo()
r paste(Sys.time(),paste0("(", Sys.info()[["user"]],")" ))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.