Function for obtaining a random sample of lines from a very large CSV
file without having to load the full data into memory. It targets
situations where the full data does not fit in the computer's memory, so
the standard sample function cannot be used.
file | A file name (a string)
percORn | Either the percentage of the file's rows or the actual number of rows that the sample should have
nrLines | Optionally, the number of rows in the file, if you know it beforehand; otherwise the function will count them for you
header | Whether the file has a header line (a Boolean value)
mxPerc | The maximum percentage of the file's rows that the sample is allowed to have (defaults to 0.5)
This function can be used to draw a random sample of lines from a very
large CSV file. This is particularly useful when you cannot afford
to load the file into memory to use R functions like sample to
obtain the sample.
The function obtains the sample of rows without actually loading the full data into memory - only the final sample is loaded into main memory.
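The idea of sampling rows without holding the whole file in memory can be sketched in plain R. This is only an illustration under stated assumptions, not the function's actual code (which relies on external utilities): the helper name sample_csv_lines and its chunk argument are hypothetical, and the file is streamed twice, once to count lines and once to keep the chosen ones.

```r
# Hypothetical helper, not part of the package: draw n data rows from a CSV
# while holding at most one chunk of lines (plus the sample) in memory.
sample_csv_lines <- function(file, n, header = TRUE, chunk = 10000L) {
  # Pass 1: count the lines by streaming through the file in chunks.
  con <- file(file, open = "r")
  total <- 0L
  repeat {
    k <- length(readLines(con, n = chunk))
    if (k == 0L) break
    total <- total + k
  }
  close(con)

  first <- if (header) 2L else 1L          # first line that holds data
  n_data <- total - first + 1L             # number of data rows
  stopifnot(n >= 1, n <= n_data)
  # Line numbers (within the file) of the rows to keep, in ascending order.
  keep <- sort(sample.int(n_data, n)) + first - 1L

  # Pass 2: stream again and retain only the selected lines.
  con <- file(file, open = "r")
  on.exit(close(con))
  picked <- character(0)
  seen <- 0L                               # lines consumed so far
  if (header) {
    picked <- readLines(con, n = 1L)       # keep the header line
    seen <- 1L
  }
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    idx <- keep[keep > seen & keep <= seen + length(lines)] - seen
    picked <- c(picked, lines[idx])
    seen <- seen + length(lines)
  }
  # Only the sampled lines are parsed into a data frame.
  read.csv(text = picked, header = header)
}
```

Because it works line by line, this sketch assumes that every record occupies exactly one line, so quoted fields containing embedded newlines would break it.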
The function is based on Unix utility programs (perl and wc), so
it is limited to Unix-like platforms. It checks the system variable
.Platform$OS.type and will not run on other platforms, although you
may wish to inspect the function code to see whether you can adapt it
to your platform.
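The platform check and the use of wc can be illustrated as follows. This is a minimal sketch, assuming a Unix-like system where wc is on the search path; count_lines is a hypothetical helper written for this example and is not claimed to match the function's internals.

```r
# Hypothetical helper: count the lines of a file with the Unix 'wc' utility,
# refusing to run on platforms where that utility cannot be assumed.
count_lines <- function(file) {
  if (.Platform$OS.type != "unix")
    stop("this sketch needs the Unix 'wc' utility")
  # 'wc -l < file' prints only the number of newline characters in the file.
  out <- system(paste("wc -l <", shQuote(file)), intern = TRUE)
  as.integer(trimws(out))
}
```

Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline is reported as one line shorter.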
A data frame
Luis Torgo ltorgo@dcc.fc.up.pt
Torgo, L. (2016) Data Mining using R: learning with case studies, second edition, Chapman & Hall/CRC (ISBN-13: 978-1482234893).
nrLinesFile, sample, sampleDBMS