df_from_file: Read Parquet, CSV, and other files using DuckDB


df_from_file R Documentation

Read Parquet, CSV, and other files using DuckDB

Description

df_from_file() uses arbitrary table functions to read data. See https://duckdb.org/docs/data/overview for documentation of the available functions and their options. To read multiple files with the same schema, pass a wildcard or a character vector to the path argument.
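The two ways of reading several files at once can be sketched as follows. This is a minimal illustration, not taken from the package sources: the directory, the file names part1.csv and part2.csv, and the sample data are hypothetical, and the choice of "read_csv_auto" as the table function is an assumption made for the example.

```r
library(duckplyr)

# Hypothetical sample data: two CSV files with the same schema
data_dir <- tempfile("duckplyr_multi_")
dir.create(data_dir)
paths <- file.path(data_dir, c("part1.csv", "part2.csv"))
write.csv(data.frame(a = 1:3, b = letters[1:3]), paths[1], row.names = FALSE)
write.csv(data.frame(a = 4:6, b = letters[4:6]), paths[2], row.names = FALSE)

# Option 1: pass a character vector of paths
from_vector <- df_from_file(paths, "read_csv_auto")

# Option 2: pass a glob pattern that matches both files
from_glob <- df_from_file(file.path(data_dir, "*.csv"), "read_csv_auto")

# Counting rows accesses the data, so the files are read here
n_vector <- nrow(from_vector)
n_glob <- nrow(from_glob)

# Clean up only after the data has been read
unlink(data_dir, recursive = TRUE)
```

Both calls should yield the combined rows of the two files. The files are deleted only after the row counts are taken, because the data is not read until the data frame is accessed.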

duckplyr_df_from_file() is a thin wrapper around df_from_file() that calls as_duckplyr_df() on the output.

These functions ingest data from a file using a table function. The results are transparently converted to a data frame, but the data is only read when the resulting data frame is actually accessed.

df_from_csv() reads a CSV file using the read_csv_auto() table function.

duckplyr_df_from_csv() is a thin wrapper around df_from_csv() that calls as_duckplyr_df() on the output.

df_from_parquet() reads a Parquet file using the read_parquet() table function.

duckplyr_df_from_parquet() is a thin wrapper around df_from_parquet() that calls as_duckplyr_df() on the output.

df_to_parquet() writes a data frame to a Parquet file via DuckDB. If the data frame is a duckplyr_df, the materialization occurs outside of R. An existing file will be overwritten. This function requires duckdb >= 0.10.0.

Usage

df_from_file(path, table_function, ..., options = list(), class = NULL)

duckplyr_df_from_file(
  path,
  table_function,
  ...,
  options = list(),
  class = NULL
)

df_from_csv(path, ..., options = list(), class = NULL)

duckplyr_df_from_csv(path, ..., options = list(), class = NULL)

df_from_parquet(path, ..., options = list(), class = NULL)

duckplyr_df_from_parquet(path, ..., options = list(), class = NULL)

df_to_parquet(data, path)

Arguments

path

Path to files; the glob patterns * and ? are supported.

table_function

The name of a table-valued DuckDB function such as "read_parquet", "read_csv", "read_csv_auto" or "read_json".

...

These dots are for future extensions and must be empty.

options

Arguments to the DuckDB function indicated by table_function.

class

The class of the output. By default, a tibble is created. The returned object will always be a data frame. Use class = "data.frame" or class = character() to create a plain data frame.

data

A data frame to be written to disk.
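As an illustration of the options argument, the sketch below reads a semicolon-separated file by passing delim through to the DuckDB table function. The file contents are hypothetical; delim is the same option used in the Examples section, and header detection is left to DuckDB.

```r
library(duckplyr)

# Hypothetical semicolon-separated file
path_semi <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1;x", "2;y"), path_semi)

# Entries of `options` are forwarded as named arguments to read_csv
df_semi <- df_from_file(
  path_semi,
  "read_csv",
  options = list(delim = ";")
)

# Inspect the result; accessing a column reads the data
col_names <- names(df_semi)
col_a <- df_semi$a

unlink(path_semi)
```

Which options are valid depends on the chosen table function; the DuckDB documentation linked in the Description lists them.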

Value

A data frame for df_from_file(), or a duckplyr_df for duckplyr_df_from_file(), extended by the provided class.
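A short sketch of how the class argument and the duckplyr_ variants affect the returned object, based on the descriptions above. The sample file is hypothetical, and the printed class vectors are left for the reader to inspect rather than stated here.

```r
library(duckplyr)

# Hypothetical one-column CSV file
path_cls <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3), path_cls, row.names = FALSE)

df_default <- df_from_csv(path_cls)                        # tibble by default
df_plain   <- df_from_csv(path_cls, class = "data.frame")  # plain data frame
df_duck    <- duckplyr_df_from_csv(path_cls)               # duckplyr_df

# Compare the class vectors of the three results
class(df_default)
class(df_plain)
class(df_duck)

# Remove the file only once the data frames are no longer needed,
# since the data is read lazily
unlink(path_cls)
```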

Examples

# Create simple CSV file
path <- tempfile("duckplyr_test_", fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[4:6]), path, row.names = FALSE)

# Reading is immediate
df <- df_from_csv(path)

# Materialization only upon access
names(df)
df$a

# Return as tibble, specify column types:
df_from_file(
  path,
  "read_csv",
  options = list(delim = ",", types = list(c("DOUBLE", "VARCHAR"))),
  class = class(tibble())
)

# Read multiple files at once
path2 <- tempfile("duckplyr_test_", fileext = ".csv")
write.csv(data.frame(a = 4:6, b = letters[7:9]), path2, row.names = FALSE)

duckplyr_df_from_csv(file.path(tempdir(), "duckplyr_test_*.csv"))

unlink(c(path, path2))

# Write a Parquet file:
path_parquet <- tempfile(fileext = ".parquet")
df_to_parquet(df, path_parquet)

# With a duckplyr_df, the materialization occurs outside of R:
df %>%
  as_duckplyr_df() %>%
  mutate(b = a + 1) %>%
  df_to_parquet(path_parquet)

duckplyr_df_from_parquet(path_parquet)

unlink(path_parquet)

duckplyr documentation built on Sept. 12, 2024, 9:36 a.m.