file_sample_exact: Exact File Sampler


View source: R/file_sample_exact.r

Description

Randomly sample lines from an input text file.

Usage

file_sample_exact(
  nlines,
  infile,
  outfile = tempfile(),
  header = TRUE,
  nskip = 0,
  verbose = FALSE
)

Arguments

nlines

The (exact) number of lines to sample from the input file.

infile

Location of the file (as a string) to be subsampled.

outfile

Output file location (as a string).

header

Is there a header (a line of column names) on the first line of the input file?

nskip

Number of lines to skip. If header=TRUE, then this only applies to lines after the header.

verbose

Should the line count of the input file and the number of lines sampled be printed?

Details

The sampling is done in two passes over the input file. First, the number of lines in the input file is determined by scanning through the file as quickly as possible (i.e., this step should be completely I/O bound). Next, an index of lines to keep is produced by reservoir sampling. Finally, the input file is scanned again line by line, and the chosen lines are written to the output file (a temporary file by default).

If the output file (the one pointed to by the return of this function) is "large" and is to be read into memory (which isn't really appropriate for text files in the first place!), then this strategy is probably not appropriate.
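
The two-pass strategy described above can be illustrated with a minimal pure-R sketch. This is not the package's implementation. The helper name sample_lines_sketch is hypothetical, sample.int() stands in for the reservoir sampling step, and copying the header to the output and returning the output path are assumptions made only for this illustration.

```r
# Hypothetical helper, NOT part of lineSampler: a rough pure-R sketch of
# exact line sampling in two passes over the input file.
sample_lines_sketch <- function(nlines, infile, outfile = tempfile(),
                                header = TRUE, nskip = 0) {
  # Pass 1: count the lines by reading the file in chunks.
  con <- file(infile, open = "r")
  total <- 0
  repeat {
    chunk <- readLines(con, n = 10000L, warn = FALSE)
    if (length(chunk) == 0L) break
    total <- total + length(chunk)
  }
  close(con)

  # Lines eligible for sampling start after the header and skipped lines.
  first <- as.integer(header) + nskip + 1
  eligible <- max(total - first + 1, 0)
  if (nlines > eligible) stop("fewer eligible lines than requested")

  # Choose exactly nlines distinct line numbers, in increasing order.
  # (sample.int() is used here in place of reservoir sampling.)
  keep <- sort(first - 1 + sample.int(eligible, nlines))

  # Pass 2: scan line by line and write the chosen lines to outfile.
  in_con <- file(infile, open = "r")
  on.exit(close(in_con), add = TRUE)
  out_con <- file(outfile, open = "w")
  on.exit(close(out_con), add = TRUE)
  i <- 0  # current line number
  k <- 1  # position of the next line number in keep
  repeat {
    line <- readLines(in_con, n = 1L, warn = FALSE)
    if (length(line) == 0L) break
    i <- i + 1
    if (header && i == 1) {
      writeLines(line, out_con)  # assumption: the header is copied over
    } else if (k <= length(keep) && i == keep[k]) {
      writeLines(line, out_con)
      k <- k + 1
    }
  }
  invisible(outfile)
}

# Usage: sample 10 of 100 data lines from a small csv file with a header.
infile <- tempfile(fileext = ".csv")
writeLines(c("x,y", paste(1:100, 101:200, sep = ",")), infile)
out <- sample_lines_sketch(10, infile)
res <- readLines(out)
```

Because the sampled line numbers are sorted before the second pass, the chosen lines appear in the output in the same order as in the input.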

Value

NULL


wrathematics/lineSampler documentation built on Feb. 27, 2020, 8:01 p.m.