FileFormat: Dataset file formats

FileFormat    R Documentation

Dataset file formats

Description

A FileFormat holds information about how to read and parse the files included in a Dataset. There are subclasses corresponding to the supported file formats (ParquetFileFormat and IpcFileFormat).

Factory

FileFormat$create() takes the following arguments:

  • format: A string identifier of the file format. Currently supported values:

    • "parquet"

    • "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported

    • "csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)

    • "tsv", equivalent to passing format = "text", delimiter = "\t"

  • ...: Additional format-specific options

    format = "parquet":

    • dict_columns: Names of columns which should be read as dictionaries.

    • Any Parquet options from FragmentScanOptions.

    format = "text": see CsvParseOptions. You can specify these options either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or with the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.). Not all readr options are currently supported; please file an issue if you encounter one that arrow should support.

    The following options are also supported. From CsvReadOptions:

    • skip_rows

    • column_names. Note that if a Schema is specified, column_names must match those specified in the schema.

    • autogenerate_column_names

    From CsvFragmentScanOptions (these values can be overridden at scan time):

    • convert_options: a CsvConvertOptions

    • block_size

It returns the appropriate subclass of FileFormat (e.g. ParquetFileFormat).
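As a rough illustration of the factory described above, the sketch below creates formats from a few of the listed identifiers and checks which subclass comes back. It then passes dict_columns for a Parquet dataset. The data frame, the column names x and y, and the file name part-0.parquet are made up for this example, and it assumes an arrow build with Parquet and dataset support.

```r
library(arrow)

# Each format string maps to a FileFormat subclass.
parquet_format <- FileFormat$create(format = "parquet")
feather_format <- FileFormat$create(format = "feather")  # alias of "ipc"/"arrow"

inherits(parquet_format, "ParquetFileFormat")
inherits(feather_format, "IpcFileFormat")

# Hypothetical data: a small Parquet file with a string column `x`.
parquet_dir <- tempfile()
dir.create(parquet_dir)
write_parquet(
  data.frame(x = c("a", "b", "a"), y = 1:3),
  file.path(parquet_dir, "part-0.parquet")
)

# Ask for `x` to be read as a dictionary column.
dict_format <- FileFormat$create(format = "parquet", dict_columns = "x")
ds <- open_dataset(parquet_dir, format = dict_format)

# Print the schema to check how `x` and `y` were read.
ds$schema

# Count rows while the files still exist, then clean up.
n_rows <- ds$num_rows
unlink(parquet_dir, recursive = TRUE)
```

Creating the format object once and reusing it keeps the reading options in one place when several datasets share the same layout.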

Examples


library(arrow)

## Semi-colon delimited files
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
# tf is a directory, so removing it requires recursive = TRUE
on.exit(unlink(tf, recursive = TRUE))
write.table(mtcars, file.path(tf, "file1.txt"), sep = ";", row.names = FALSE)

# Create FileFormat object
format <- FileFormat$create(format = "text", delimiter = ";")

open_dataset(tf, format = format)
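
The page notes that text options can be given with either the Arrow C++ naming or the readr-style naming. A minimal sketch of that equivalence follows, assuming the same kind of semicolon-delimited file as above. The directory and object names (txt_dir, fmt_cpp, fmt_readr) are illustrative only.

```r
library(arrow)

# Write a semicolon-delimited copy of mtcars into a fresh directory.
txt_dir <- tempfile()
dir.create(txt_dir)
write.table(mtcars, file.path(txt_dir, "file1.txt"), sep = ";", row.names = FALSE)

# The same delimiter, spelled two ways.
fmt_cpp <- FileFormat$create(format = "text", delimiter = ";")  # Arrow C++ naming
fmt_readr <- FileFormat$create(format = "text", delim = ";")    # readr-style naming

ds_cpp <- open_dataset(txt_dir, format = fmt_cpp)
ds_readr <- open_dataset(txt_dir, format = fmt_readr)

# Both datasets should expose the same column names and row count.
cols_cpp <- ds_cpp$schema$names
cols_readr <- ds_readr$schema$names
rows_cpp <- ds_cpp$num_rows
rows_readr <- ds_readr$num_rows

identical(cols_cpp, cols_readr)
rows_cpp == rows_readr

# Clean up the temporary directory.
unlink(txt_dir, recursive = TRUE)
```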


arrow documentation built on Sept. 11, 2024, 8:02 p.m.