sampleCSV: Drawing a random sample of lines from a CSV file


Description

Function for obtaining a random sample of lines from a very large CSV file without having to load the full data into memory. It targets situations where the full data does not fit in the computer's memory, so the standard sample function cannot be used.

Usage

sampleCSV(file, percORn, nrLines, header=TRUE, mxPerc=0.5)

Arguments

file

A file name (a string)

percORn

Either the percentage of the file's rows that the sample should contain, or the actual number of rows the sample should have

nrLines

Optionally, the number of rows of the file, if you know it beforehand; otherwise the function will count them for you

header

Whether the file has a header line or not (a Boolean value)

mxPerc

A maximum threshold for the percentage of the file's rows that the sample is allowed to have (defaults to 0.5)

Details

This function can be used to draw a random sample of lines from a very large CSV file. This is particularly useful when you cannot afford to load the file into memory and use R functions like sample to obtain the sample.

The function obtains the sample of rows without actually loading the full data into memory - only the final sample is loaded into main memory.

The function relies on Unix utility programs (perl and wc), so it is limited to Unix-like platforms. It will not run on other platforms (it checks the system variable .Platform$OS.type), although you may wish to inspect the function code and see whether you can adapt it to your platform.
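
For readers on platforms without perl and wc, the following is a minimal pure-R sketch of the same idea. It is not the method used by sampleCSV: it swaps in reservoir sampling over a file connection, so that only the sampled lines (plus one chunk of the file) are held in memory at any time. The function name sampleLinesR and its chunk argument are hypothetical and are not part of DMwR2. The sketch assumes one record per line, so CSV files with line breaks inside quoted fields are not handled.

```r
# Hypothetical helper, NOT part of DMwR2: draws n data lines uniformly at
# random from a CSV file using reservoir sampling (Algorithm R).
sampleLinesR <- function(file, n, header = TRUE, chunk = 10000L) {
  con <- file(file, open = "r")
  on.exit(close(con))

  # Keep the header line aside so it is never sampled as data
  hdr <- if (header) readLines(con, n = 1L) else character(0)

  reservoir <- character(0)
  seen <- 0  # numeric rather than integer, to avoid overflow on huge files

  repeat {
    lines <- readLines(con, n = chunk)  # read the next chunk of lines
    if (length(lines) == 0L) break
    for (ln in lines) {
      seen <- seen + 1
      if (seen <= n) {
        # Fill the reservoir with the first n data lines
        reservoir[seen] <- ln
      } else {
        # Replace a random slot with probability n / seen
        j <- sample.int(seen, 1L)
        if (j <= n) reservoir[j] <- ln
      }
    }
  }

  # Only the header and the sampled lines are parsed into a data frame
  read.csv(text = c(hdr, reservoir), header = header, stringsAsFactors = FALSE)
}

# Small illustration on a temporary file created just for this example
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, x = rnorm(1000)), tmp, row.names = FALSE)
s <- sampleLinesR(tmp, 50)
```

The per-line loop is slow in R for very large files, so this is meant to illustrate the technique and not to replace the perl-based implementation. The sampled rows are returned in reservoir order, not in file order.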

Value

A data frame

Author(s)

Luis Torgo ltorgo@dcc.fc.up.pt

References

Torgo, L. (2016) Data Mining using R: learning with case studies, second edition, Chapman & Hall/CRC (ISBN-13: 978-1482234893).

http://ltorgo.github.io/DMwR2

See Also

nrLinesFile, sample, sampleDBMS
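
Examples

A short usage sketch, offered under several assumptions: the DMwR2 package is installed, the platform is Unix-like with perl and wc available, and percentages are given as proportions between 0 and 1 (which the 0.5 default of mxPerc suggests, although this page does not state it). The data file is a temporary one created only for the illustration.

```r
# Runs only when DMwR2 is installed and the platform is unix-like
if (requireNamespace("DMwR2", quietly = TRUE) && .Platform$OS.type == "unix") {
  # Hypothetical data file, created only for this illustration
  f <- tempfile(fileext = ".csv")
  write.csv(data.frame(id = 1:10000, x = rnorm(10000)), f, row.names = FALSE)

  # Ask for a sample of about 10% of the rows
  # (assumes percORn is read as a proportion when given this way)
  s1 <- DMwR2::sampleCSV(f, 0.1)

  # Ask for 200 rows, passing the line count so it is not recomputed
  n <- DMwR2::nrLinesFile(f)
  s2 <- DMwR2::sampleCSV(f, 200, nrLines = n)
}
```

Passing nrLines is worthwhile mainly when several samples are drawn from the same file, since counting the lines of a very large file takes time.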


ltorgo/DMwR2 documentation built on May 21, 2019, 8:41 a.m.