track_data: Track Data Provenance

View source: R/data_provenance.R

track_data R Documentation

Track Data Provenance

Description

Records comprehensive provenance information for data files including checksums, sources, timestamps, and metadata. Supports fast hashing for large files.

Usage

track_data(
  data_path,
  source = c("downloaded", "generated", "manual", "reference", "other"),
  source_url = NULL,
  description = NULL,
  metadata = NULL,
  fast_hash = TRUE,
  size_threshold_gb = 1,
  registry_file
)

Arguments

data_path

Character. Path to data file or directory.

source

Character. Source of the data: one of "downloaded", "generated", "manual", "reference", or "other".

source_url

Character. URL if data was downloaded. Optional.

description

Character. Description of the data. Optional.

metadata

List. Additional metadata. Optional.

fast_hash

Logical. If TRUE, use the faster xxHash algorithm for large files (those above size_threshold_gb, 1 GB by default). Default TRUE.

size_threshold_gb

Numeric. File size threshold, in GB, above which the fast hash is used when fast_hash is TRUE. Default 1.

registry_file

Character. Path to the provenance registry file. Required; there is no default.

Value

A list containing the data provenance information.
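
As a rough illustration of the kind of information such a list might carry (checksum, source, timestamps), the sketch below builds a record with base R only. The helper build_record() and all of its field names are hypothetical and are not the structure returned by track_data(). It uses an MD5 checksum from the tools package as a stand-in and handles single files only, not directories.

```r
# Hypothetical sketch: field names are assumptions for illustration,
# not the actual return value of track_data().
build_record <- function(data_path, source = "other", description = NULL) {
  stopifnot(file.exists(data_path))
  info <- file.info(data_path)
  list(
    path        = normalizePath(data_path),
    source      = source,
    description = description,
    size_bytes  = info$size,
    modified    = format(info$mtime, "%Y-%m-%dT%H:%M:%S%z"),
    # MD5 via base R's tools package, used here as a stand-in checksum
    checksum    = unname(tools::md5sum(data_path)),
    tracked_at  = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z")
  )
}

# Build a record for a small temporary CSV file
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10"), tmp)
rec <- build_record(tmp, source = "generated", description = "Toy CSV")
str(rec)
```

Recomputing the checksum later and comparing it with the stored value is one way to detect that a tracked file has changed.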

Examples

## Not run: 
# Track a downloaded dataset
track_data("data/mydata.csv",
  source = "downloaded",
  source_url = "https://example.com/data.csv",
  description = "Customer data from API",
  registry_file = tempfile(fileext = ".json")
)

# Track generated data
track_data("results/simulation.rds",
  source = "generated",
  description = "Monte Carlo simulation results",
  registry_file = tempfile(fileext = ".json")
)

# Track large file with fast hashing
track_data("data/large_file.bam",
  source = "generated",
  fast_hash = TRUE,
  registry_file = tempfile(fileext = ".json")
)

## End(Not run)
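
The interaction between fast_hash and size_threshold_gb can be pictured with a minimal sketch. The helper choose_hash_strategy() is hypothetical and not part of Capsule; it also assumes that 1 GB means 1024^3 bytes, whereas the package may use 1e9 bytes instead.

```r
# Hypothetical helper, not part of Capsule: report which hashing strategy
# a file of a given size would get under the documented arguments.
choose_hash_strategy <- function(size_bytes, fast_hash = TRUE,
                                 size_threshold_gb = 1) {
  # Assumption: 1 GB is taken as 1024^3 bytes.
  threshold_bytes <- size_threshold_gb * 1024^3
  if (isTRUE(fast_hash) && size_bytes > threshold_bytes) "fast" else "standard"
}

choose_hash_strategy(5 * 1024^3)                     # above threshold: "fast"
choose_hash_strategy(10 * 1024^2)                    # small file: "standard"
choose_hash_strategy(5 * 1024^3, fast_hash = FALSE)  # fast hashing off: "standard"
```

For a file on disk, the size could be supplied with file.size(data_path).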

Capsule documentation built on Nov. 11, 2025, 5:14 p.m.