fwrite_: Convenience wrapper around data.table::fwrite with preferred...

fwrite_R Documentation

Convenience wrapper around data.table::fwrite with preferred defaults

Description

As write.csv but much faster (e.g. 2 seconds versus 1 minute) and just as flexible. Modern machines almost surely have more than one CPU so fwrite uses them; on all operating systems including Linux, Mac and Windows.

Usage

fwrite_(x, file, sep = ",", na = "NA", col.names = TRUE, ...)

Arguments

x

Any list of same length vectors; e.g. data.frame and data.table. If matrix, it gets internally coerced to data.table preserving col names but not row names

file

Output file name. "" indicates output to the console.

sep

The separator between columns. Default is ",".

na

The string to use for missing values in the data. Default is a blank string "".

col.names

Should the column names (header row) be written? The default is TRUE for new files and when overwriting existing files (append=FALSE). Otherwise, the default is FALSE to prevent column names appearing again mid-file when stacking a set of data.tables or appending rows to the end of a file.

...

Extra arguments to data.table::fwrite

Details

fwrite began as a community contribution with pull request #1613 by Otto Seiskari. This gave Matt Dowle the impetus to specialize the numeric formatting and to parallelize: https://h2o.ai/blog/fast-csv-writing-for-r/. Final items were tracked in issue #1664 such as automatic quoting, bit64::integer64 support, decimal/scientific formatting exactly matching write.csv between 2.225074e-308 and 1.797693e+308 to 15 significant figures, row.names, dates (between 0000-03-01 and 9999-12-31), times and sep2 for list columns where each cell can itself be a vector.

To save space, fwrite prefers to write wide numeric values in scientific notation – e.g. 10000000000 takes up much more space than 1e+10. Most file readers (e.g. fread) understand scientific notation, so there's no fidelity loss. Like in base R, users can control this by specifying the scipen argument, which follows the same rules as options('scipen'). fwrite will see how much space a value will take to write in scientific vs. decimal notation, and will only write in scientific notation if the latter is more than scipen characters wider. For 10000000000, then, 1e+10 will be written whenever scipen<6.

CSVY Support:

The following fields will be written to the header of the file and surrounded by --- on top and bottom:

  • source - Contains the R version and data.table version used to write the file

  • creation_time_utc - Current timestamp in UTC time just before the header is written

  • schema with element fields giving name-type (class) pairs for the table; multi-class objects (e.g. c('POSIXct', 'POSIXt')) will have their first class written.

  • header - same as col.names (which is header on input)

  • sep

  • sep2

  • eol

  • na.strings - same as na

  • dec

  • qmethod

  • logical01

References

https://howardhinnant.github.io/date_algorithms.html
https://en.wikipedia.org/wiki/Decimal_mark

See Also

data.table::fwrite()

Examples

DF = data.frame(A=1:3, B=c("foo","A,Name","baz"))
fwrite(DF)
write.csv(DF, row.names=FALSE, quote=FALSE)  # same

fwrite(DF, row.names=TRUE, quote=TRUE)
write.csv(DF)                                # same

DF = data.frame(A=c(2.1,-1.234e-307,pi), B=c("foo","A,Name","bar"))
fwrite(DF, quote='auto')        # Just DF[2,2] is auto quoted
write.csv(DF, row.names=FALSE)  # same numeric formatting

DT = data.table(A=c(2,5.6,-3),B=list(1:3,c("foo","A,Name","bar"),round(pi*1:3,2)))
fwrite(DT)
fwrite(DT, sep="|", sep2=c("{",",","}"))

## Not run: 

set.seed(1)
DT = as.data.table( lapply(1:10, sample,
         x=as.numeric(1:5e7), size=5e6))                            #     382MB
system.time(fwrite(DT, "/dev/shm/tmp1.csv"))                        #      0.8s
system.time(write.csv(DT, "/dev/shm/tmp2.csv",                      #     60.6s
                      quote=FALSE, row.names=FALSE))
system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv")                  # identical

set.seed(1)
N = 1e7
DT = data.table(
  str1=sample(sprintf("%010d",sample(N,1e5,replace=TRUE)), N, replace=TRUE),
  str2=sample(sprintf("%09d",sample(N,1e5,replace=TRUE)), N, replace=TRUE),
  str3=sample(sapply(sample(2:30, 100, TRUE), function(n)
     paste0(sample(LETTERS, n, TRUE), collapse="")), N, TRUE),
  str4=sprintf("%05d",sample(sample(1e5,50),N,TRUE)),
  num1=sample(round(rnorm(1e6,mean=6.5,sd=15),2), N, replace=TRUE),
  num2=sample(round(rnorm(1e6,mean=6.5,sd=15),10), N, replace=TRUE),
  str5=sample(c("Y","N"),N,TRUE),
  str6=sample(c("M","F"),N,TRUE),
  int1=sample(ceiling(rexp(1e6)), N, replace=TRUE),
  int2=sample(N,N,replace=TRUE)-N/2
)                                                                   #     774MB
system.time(fwrite(DT,"/dev/shm/tmp1.csv"))                         #      1.1s
system.time(write.csv(DT,"/dev/shm/tmp2.csv",                       #     63.2s
                      row.names=FALSE, quote=FALSE))
system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv")                  # identical

unlink("/dev/shm/tmp1.csv")
unlink("/dev/shm/tmp2.csv")

## End(Not run)

shaliulab/idocr documentation built on June 1, 2025, 4:59 p.m.