Scanner: Scan the contents of a dataset
In arrow: Integration to 'Apache' 'Arrow'

Scanner

R Documentation

Scan the contents of a dataset

Description

A Scanner iterates over a Dataset's fragments and returns data according to the given row filter and column projection. A ScannerBuilder can help create one.

Factory

Scanner$create() wraps the ScannerBuilder interface to make a Scanner. It takes the following arguments:

dataset: A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.
projection: A character vector of column names to select, or a named list of expressions.
filter: An Expression to filter the scanned rows by, or TRUE (default) to keep all rows.
use_threads: logical: should scanning use multithreading? Default TRUE
...: Additional arguments, currently ignored
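
The arguments above can be combined as in the following minimal sketch. It assumes the arrow package is installed with dataset support; the temporary directory, the choice of the mtcars data, and the hp > 100 threshold are illustrative only.

```r
library(arrow)

# Illustrative setup: write mtcars as a partitioned dataset in a temp directory
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

# Build a Scanner in one call: keep two columns and only rows with hp > 100
scanner <- Scanner$create(
  ds,
  projection = c("mpg", "hp"),
  filter = Expression$field_ref("hp") > 100
)

# Evaluate the scan and convert the resulting Table to a data frame
result <- as.data.frame(scanner$ToTable())

# Clean up the temporary dataset directory
unlink(tf, recursive = TRUE)
```

Passing projection and filter here has the same effect as calling $Project() and $Filter() on a ScannerBuilder before $Finish().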

Methods

ScannerBuilder has the following methods:

$Project(cols): Indicate that the scan should only return columns given by cols, a character vector of column names or a named list of Expression.
$Filter(expr): Filter rows by an Expression.
$UseThreads(threads): logical: should the scan use multithreading? The method's default input is TRUE, but you must call the method to enable multithreading because the scanner default is FALSE.
$BatchSize(batch_size): integer: Maximum row count of scanned record batches, default is 32K. If scanned record batches are overflowing memory then this method can be called to reduce their size.
$schema: Active binding, returns the Schema of the Dataset
$Finish(): Returns a Scanner

Scanner has a $ToTable() method, which evaluates the query and returns an Arrow Table. As the example below shows, $ToRecordBatchReader() is also available and returns the results as a RecordBatchReader.
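
As a rough sketch of how the ScannerBuilder methods fit together, the code below inspects $schema, limits the batch size, enables threads, and then reads the result batch by batch. The batch size of 5 and the temporary mtcars dataset are assumptions chosen to keep the example small, not recommended settings.

```r
library(arrow)

# Illustrative setup: write mtcars as a partitioned dataset in a temp directory
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

builder <- ds$NewScan()

# The $schema active binding lists the fields available to project or filter on
field_names <- builder$schema$names

builder$Project(c("mpg", "hp"))
builder$UseThreads(TRUE)  # threads are off unless this method is called
builder$BatchSize(5L)     # illustrative cap on rows per record batch
scanner <- builder$Finish()

# Stream the results; read_next_batch() returns NULL once the scan is exhausted
reader <- scanner$ToRecordBatchReader()
batch_rows <- integer(0)
while (!is.null(batch <- reader$read_next_batch())) {
  batch_rows <- c(batch_rows, batch$num_rows)
}
total_rows <- sum(batch_rows)

# Clean up the temporary dataset directory
unlink(tf, recursive = TRUE)
```

Reading through a RecordBatchReader avoids materializing the whole result at once, which is the situation where a smaller $BatchSize() is most likely to help.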

Examples


library(arrow)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
# recursive = TRUE is needed for unlink() to remove a directory
on.exit(unlink(tf, recursive = TRUE))

write_dataset(mtcars, tf, partitioning = "cyl")

ds <- open_dataset(tf)

scan_builder <- ds$NewScan()
scan_builder$Filter(Expression$field_ref("hp") > 100)
scan_builder$Project(list(hp_times_ten = 10 * Expression$field_ref("hp")))

# Once configured, call $Finish()
scanner <- scan_builder$Finish()

# Can get results as a table
as.data.frame(scanner$ToTable())

# Or as a RecordBatchReader
scanner$ToRecordBatchReader()

arrow documentation built on Aug. 8, 2025, 7:16 p.m.