txt2hdd: Transforms text data into a HDD file
In hdd: Easy Manipulation of Out of Memory Data Sets

txt2hdd

R Documentation

Transforms text data into a HDD file

Description

Imports text data and saves it into a HDD file. It uses read_delim_chunked to extract the data. It also lets you preprocess the data before it is saved.

Usage

txt2hdd(
  path,
  dirDest,
  chunkMB = 500,
  rowsPerChunk,
  col_names,
  col_types,
  nb_skip,
  delim,
  preprocessfun,
  replace = FALSE,
  encoding = "UTF-8",
  verbose = 0,
  locale = NULL,
  ...
)

Arguments

`path`	Character vector that represents the path to the data. It can also be a pattern when multiple files whose names match that pattern are to be imported (if so, it must be a fixed pattern, NOT a regular expression).
`dirDest`	The destination directory, where the new HDD data should be saved.
`chunkMB`	The chunk sizes in MB, defaults to 500MB. Instead of using this argument, you can alternatively use the argument `rowsPerChunk` which decides the size of chunks in terms of lines.
`rowsPerChunk`	Number of rows per chunk. By default it is missing: its value is deduced from argument `chunkMB` and the size of the file. If provided, replaces any value provided in `chunkMB`.
`col_names`	The column names. By default it uses the ones of the data set. If the data set lacks column names, you must provide them.
`col_types`	The column types, in the `readr` fashion. You can use `guess_col_types` to find them.
`nb_skip`	Number of lines to skip.
`delim`	The delimiter. By default the function tries to find the delimiter, but sometimes it fails.
`preprocessfun`	A function that is applied to the data before saving. Default is missing. Note that if a function is provided, it MUST return a data.frame; any other return value is ignored.
`replace`	If the destination directory already exists, you need to set the argument `replace=TRUE` to overwrite all the HDD files in it.
`encoding`	Character scalar containing the encoding of the file to be read. By default it is "UTF-8" and is passed to the `readr` function `locale` which is used in `read_delim_chunked` (the reading function). A common encoding in Western Europe is "ISO-8859-1" (simply use "file filename" in a non-Windows console to get the encoding). Note that this argument is ignored if the argument `locale` is not NULL.
`verbose`	Logical scalar or `NULL` (default). If `TRUE`, then the progress of the importing process as well as the time to import are reported. If `NULL`, it becomes `TRUE` when the data to import is greater than 5GB or there is more than one chunk.
`locale`	Either `NULL` (default) or an object created with `locale`. This object will be passed to the reading function `read_delim_chunked` and handles how the data is imported.
`...`	Other arguments to be passed to `read_delim_chunked`. For example, `quote = ""` is sometimes useful.

Details

This function uses read_delim_chunked from readr to read a large text file chunk by chunk and generate a HDD data set.
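As an illustration of how a number of rows per chunk could be derived from a size in MB, here is a minimal sketch. It is an assumption about the general idea, not the code used by txt2hdd: the helper `estimate_rows_per_chunk` is hypothetical, it estimates the average line size from the first lines of the file, and it takes 1MB to be 1e6 bytes.

```r
# Hypothetical helper (NOT part of hdd): estimates how many rows fit in a
# chunk of `chunkMB` megabytes, based on the average size of the first lines.
estimate_rows_per_chunk <- function(path, chunkMB = 500, n_sample = 1000) {
  sample_lines <- readLines(path, n = n_sample, warn = FALSE)
  # average number of bytes per line (+1 for the newline character)
  bytes_per_row <- mean(nchar(sample_lines, type = "bytes") + 1)
  # assumption: 1MB = 1e6 bytes; at least one row per chunk
  max(1, floor(chunkMB * 1e6 / bytes_per_row))
}

# Toy text file on disk
path <- tempfile()
write.csv(iris, path, row.names = FALSE, quote = FALSE)

# A tiny chunk size so that the result is visible on a small file
rows <- estimate_rows_per_chunk(path, chunkMB = 0.001)
rows
```

If this kind of estimate is not appropriate for a given file (for instance when line lengths vary a lot), the argument rowsPerChunk sets the number of rows directly.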

Since the main function for importation uses readr, the column specification must also be in readr's style (namely cols or cols_only).
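As a sketch of what a readr-style column specification looks like when passed to col_types, the following assumes the packages hdd and readr are installed and reuses the iris toy data. The column names are those of iris, and the temporary paths are only for illustration.

```r
library(hdd)
library(readr)

# Toy text file on disk
path <- tempfile()
write.csv(iris, path, row.names = FALSE, quote = FALSE)

# Column specification in readr's style: Species is read as character,
# every other column as double
spec <- cols(
  Species = col_character(),
  .default = col_double()
)

# The specification is passed through the argument col_types
dest <- tempfile()
txt2hdd(path, dirDest = dest, col_types = spec)
```

Using cols_only instead of cols would import only the columns that are explicitly listed.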

By default, the column types are guessed from the first 10,000 rows. The guess is obtained by applying guess_col_types to these rows.

Note that by default, columns that are found to be integers are imported as double (for want of an integer64 type in readr). Note that for large data sets, integer-like identifiers can sometimes be larger than 16 digits: in these cases you must import them as character to avoid losing information.
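The loss of information mentioned above comes from the double type itself, which represents integers exactly only up to 2^53 (about 9e15). The following base R sketch illustrates the problem; the identifier `big_id` is made up for the example.

```r
# Beyond 2^53, two distinct integers can map to the same double
collide <- (2^53 == 2^53 + 1)
collide  # TRUE

# A made-up 20-digit identifier
big_id <- "12345678901234567890"

# Read as double, the digits are no longer guaranteed to be preserved
id_as_double <- as.numeric(big_id)
format(id_as_double, scientific = FALSE)

# Kept as character, the identifier is stored unchanged
id_as_char <- big_id
nchar(id_as_char)  # 20
```

In a readr column specification, such an identifier column would be declared with col_character().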

The delimiter is found with the function guess_delim, which relies on the delimiter detection of fread. Note that fixed-width files are not supported.

Value

This function does not return anything in R. Instead it creates a folder on disk containing .fst files. These files represent the data that has been imported and converted to the hdd format.

You can then read the created data with the function hdd().

Author(s)

Laurent Berge

Examples


# Toy example with iris data
library(hdd)
library(data.table) # provides fwrite and the data.table syntax used below

# we create a text file on disk
iris_path = tempfile()
fwrite(iris, iris_path)

# destination path
hdd_path = tempfile()
# reading the text file with HDD, with approx. 50 rows per chunk:
txt2hdd(iris_path, dirDest = hdd_path, rowsPerChunk = 50)

base_hdd = hdd(hdd_path)
summary(base_hdd)

# Same example with preprocessing
sl_keep = sort(unique(sample(iris$Sepal.Length, 40)))
fun = function(x){
	# we keep only some observations & vars + renaming
	res = x[Sepal.Length %in% sl_keep, .(sl = Sepal.Length, Species)]
	# we create some variables
	res[, sl2 := sl^2]
	res
}
# reading with preprocessing
hdd_path_preprocess = tempfile()
txt2hdd(iris_path, hdd_path_preprocess,
		preprocessfun = fun, rowsPerChunk = 50)

base_hdd_preprocess = hdd(hdd_path_preprocess)
summary(base_hdd_preprocess)

hdd documentation built on Aug. 25, 2023, 5:19 p.m.