dataUpdate    R Documentation
Update the local data records by recursively reading the YAML (.yml) files in the specified directory.
dataUpdate(
  dir,
  cachePath = "ReUseData",
  outMeta = FALSE,
  keepTags = TRUE,
  cleanup = FALSE,
  cloud = FALSE,
  remote = FALSE,
  checkData = TRUE,
  duplicate = FALSE
)
dir: A character string for the directory where all data are saved. Data information will be collected recursively within this directory.
cachePath: A character string specifying the name of the cache in which the data records are stored. Default is "ReUseData".
outMeta: Logical. If TRUE, a "meta_data.csv" file will be generated in `dir`, containing information about all available datasets in that directory. Default is FALSE.
keepTags: Whether to keep the previously assigned data tags. Default is TRUE.
cleanup: Whether to remove any invalid intermediate files. Default is FALSE. In cases where one data recipe (with the same parameter values) was evaluated multiple times, the same data file(s) will match multiple intermediate files (e.g., .yml).
cloud: Whether to return the pregenerated data from the Google Cloud bucket of ReUseData. Default is FALSE.
remote: Whether to use the csv file (containing information about pregenerated data on Google Cloud) from GitHub, which is the most up-to-date. Only works when `cloud = TRUE`. Default is FALSE.
checkData: Whether to check that the data (listed as "# output: " in the yml file) exists. If it does not, it is excluded from the output csv file. This argument is added for internal testing purposes. Default is TRUE.
duplicate: Whether to remove duplicates. If TRUE, older versions of duplicated data records will be removed. Default is FALSE.
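As a minimal sketch of how these arguments combine (the `outdir` path below is illustrative, not part of the package), a local data directory can be updated like this:

## illustrative output directory holding data generated by getData()
outdir <- file.path(tempdir(), "SharedData")

## update the local cache and write a "meta_data.csv" into `outdir`
dataUpdate(dir = outdir, outMeta = TRUE)

## additionally remove invalid intermediate files and drop older duplicates
dataUpdate(dir = outdir, cleanup = TRUE, duplicate = TRUE)

## include pregenerated data from the ReUseData Google Cloud bucket,
## using the most up-to-date csv file from GitHub
dataUpdate(dir = outdir, cloud = TRUE, remote = TRUE)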
Users can directly retrieve information for all available datasets by using meta_data(dir=), which generates a data frame in R with the same information as described above and can be saved out. dataUpdate does an extra check for all datasets (checking the file path in the "output" column), removes invalid ones, e.g., those with an empty or non-existing file path, and creates a data cache for all valid datasets.
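For example, assuming `outdir` is the data directory used above, the metadata can be pulled into a data frame and saved out (a sketch; `meta_data()` is the function referenced above, the output file name is illustrative):

mt <- meta_data(dir = outdir)
head(mt)
## save the metadata table out, e.g., for sharing
write.csv(mt, file = file.path(outdir, "meta_data.csv"), row.names = FALSE)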
A dataHub object containing the information about the local data cache, e.g., data name, data path, etc.
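A brief sketch of inspecting the returned object, assuming the dataHub accessors dataNames() and dataPaths() are available in your version of ReUseData (treat these accessors as assumptions, not part of this help page):

dh <- dataUpdate(dir = outdir)
dh                 ## prints a summary of the cached data records
dataNames(dh)      ## assumed accessor: names of the cached datasets
dataPaths(dh)      ## assumed accessor: file paths of the cached datasets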
## Generate data
## Not run:
library(ReUseData)
library(Rcwl)
outdir <- file.path(tempdir(), "SharedData")
echo_out <- recipeLoad("echo_out")
Rcwl::inputs(echo_out)
echo_out$input <- "Hello World!"
echo_out$outfile <- "outfile"
res <- getData(echo_out,
               outdir = outdir,
               notes = c("echo", "hello", "world", "txt"),
               showLog = TRUE)
ensembl_liftover <- recipeLoad("ensembl_liftover")
Rcwl::inputs(ensembl_liftover)
ensembl_liftover$species <- "human"
ensembl_liftover$from <- "GRCh37"
ensembl_liftover$to <- "GRCh38"
res <- getData(ensembl_liftover,
               outdir = outdir,
               notes = c("ensembl", "liftover", "human", "GRCh37", "GRCh38"),
               showLog = TRUE)
## Update data cache (with or without prebuilt data sets from ReUseData cloud bucket)
dataUpdate(dir = outdir)
dataUpdate(dir = outdir, cloud = TRUE)
## newly generated data are now cached and searchable
dataSearch(c("hello", "world"))
dataSearch(c("ensembl", "liftover")) ## both locally generated data and google cloud data!
## End(Not run)