sample_csv: Read Sample of CSV

View source: R/sample_csv.r

Description

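With the default "proportional" method, the function reads (as csv) approximately p*nlines lines of the file, where p is the proportion given by param and nlines is the number of lines in the file. So if p=.1, we get roughly (probably not exactly) 10% of the data. This is the sampling analogue of the base R function read.csv().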
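
To give a feel for what "roughly" means, here is a small sketch. It assumes each line is kept independently with probability p, which is a modelling assumption made for this illustration; the values of n and p are made up.

```r
# Assumed model: each of n lines is kept independently with probability p,
# so the number of lines kept follows a Binomial(n, p) distribution.
n <- 1e6   # hypothetical number of lines in the file
p <- 0.1   # proportion to retain (the `param` argument)

expected <- n * p                  # mean number of lines kept
spread <- sqrt(n * p * (1 - p))    # standard deviation of that count

cat("expected:", expected, "sd:", spread, "\n")
```

Under that assumption, the count for such a file would usually land within a few hundred lines of 100000 rather than on it exactly.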

Usage

sample_csv(
  file,
  param,
  method = "proportional",
  reader = utils::read.csv,
  header = TRUE,
  nskip = 0,
  nmax = 0,
  verbose = FALSE,
  ...
)

Arguments

file

Location of the file (as a string) to be subsampled.

param

The downsampling parameter. For the "proportional" method, this is the proportion of lines to retain and should be a numeric value between 0 and 1. For the "exact" method, this is the total number of lines to read in.

method

A string indicating the type of read method to use. Options are "proportional" and "exact".

reader

A function specifying the reader to use. The default is utils::read.csv. Other options include data.table::fread() and readr::read_csv(). Note that the first argument of the reader should be the file to read in and the second should be the header/col_names argument. For fread(), this requires writing a small wrapper.

header

Is there a header (a line of column names) on the first line of the csv file?

nskip

Number of lines to skip. If header=TRUE, then this only applies to lines after the header.

nmax

Max number of lines to read. If nmax==0, then there is no read cap. Ignored if method="exact".

verbose

Should the line count of the input file and the number of lines sampled be printed?

...

Additional arguments passed to the csv reader.
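
The wrapper mentioned under reader could look something like the sketch below. The name fread_reader is hypothetical, and the sketch assumes sample_csv() calls the reader with the file first and the header flag second, as described above. The call to sample_csv() only runs if both filesampler and data.table are installed.

```r
# Hypothetical wrapper: takes `file` first and `header` second, then
# forwards both by name to data.table::fread().
fread_reader <- function(file, header = TRUE, ...) {
  data.table::fread(file = file, header = header, ...)
}

# Guarded usage, so the snippet is harmless when the packages are missing.
if (requireNamespace("filesampler", quietly = TRUE) &&
    requireNamespace("data.table", quietly = TRUE)) {
  file <- system.file("rawdata/small.csv", package = "filesampler")
  data <- filesampler::sample_csv(file, param = 0.05, reader = fread_reader)
}
```

Note that fread() returns a data.table rather than a plain dataframe, so the result may differ in class from what the default reader gives.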

Details

This function scans over the text of the input file and, at each step, randomly chooses whether or not to include the current line in a downsampled file. Each selected line is placed in a temporary file, which is then read into R via the reader (read.csv() by default). Additional arguments to this function (those passed through ...) are handed to the reader, so if their behavior is unclear, you should examine the reader's help file, for example that of read.csv().
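
The procedure described above can be sketched in plain R as follows. This is an illustration only, not the package's implementation: sample_csv_sketch is a hypothetical name, it loads the whole file with readLines() (which a real line sampler would avoid), and it assumes every csv record sits on a single line.

```r
# Illustrative sketch: keep each data line with probability p, write the
# kept lines to a temporary file, then hand that file to the reader.
sample_csv_sketch <- function(file, p, header = TRUE,
                              reader = utils::read.csv, ...) {
  lines <- readLines(file)
  if (header && length(lines) > 0) {
    head_line <- lines[1]    # the header is always retained
    body <- lines[-1]
  } else {
    head_line <- character(0)
    body <- lines
  }
  keep <- stats::runif(length(body)) < p   # one random draw per line
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))                     # remove the temporary file
  writeLines(c(head_line, body[keep]), tmp)
  reader(tmp, header, ...)                 # file first, header second
}

# Small demonstration on a generated file.
set.seed(1)
src <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000, y = rnorm(1000)), src, row.names = FALSE)
sub <- sample_csv_sketch(src, p = 0.1)
nrow(sub)   # roughly 100; the exact count depends on the random draws
```

Because each line is decided independently in this sketch, the number of rows returned varies from run to run, which matches the "approximately" in the description of the "proportional" method.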

If verbose=TRUE, then something like:

Read 12207 lines (0.001%) of 12174948 line file.

will be printed to the terminal. This counts the header (if there is one) as one of the lines read and as one of the lines possible.

Value

A dataframe, as with read.csv().

Examples

library(filesampler)
file = system.file("rawdata/small.csv", package="filesampler")

# Read in a 5% random subsample of the rows.
data = sample_csv(file, param=.05)

# Read in 10 randomly sampled rows.
data = sample_csv(file, param=10, method="exact")

wrathematics/lineSampler documentation built on Feb. 27, 2020, 8:01 p.m.