knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

BiocProject is a (pending) Bioconductor package that provides a way to use Portable Encapsulated Projects (PEPs) within Bioconductor framework.

This vignette assumes you are already familiar with PEPs. If not, see pep.databio.org to learn more about PEP, and the pepr documentation to learn more about reading PEPs in R.

BiocProject uses objects of Project class (from pepr) to handle your project metadata, and allows you to provide a data loading/processing function so that you can load both project metadata and data for an entire project with a single line of R code.

The output of the BiocProject function is the object that your function returns, but enriched with the PEP in its metadata slot. This way of metadata storage is uniform across all objects within Bioconductor project (see: ?Annotated-class for details).

Installation

You must first install pepr:

devtools::install_github(repo='pepkit/pepr')

Then, install BiocProject:

devtools::install_github(repo='pepkit/BiocProject')

How to use BiocProject

Introduction to PEP components

In order to use the BiocProject package, you first need a PEP. For this vignette, we have included a basic example PEP within the package, but if you like, you can create your own, or download an example PEP.

The central component of a PEP is the project configuration file. Let's load up BiocProject and grab the path to our example configuration file:

library(BiocProject)

configFile = system.file(
  "extdata",
  "example_peps-master",
  "example_BiocProject",
  "project_config.yaml",
  package = "BiocProject"
)
configFile
# Run some stuff we need for the vignette
processFunction = system.file(
  "extdata",
  "example_peps-master",
  "example_BiocProject",
  "readBedFiles.R",
  package = "BiocProject"
)
source(processFunction)
bp = BiocProject(file=configFile)

This path points to a YAML project config file, that looks like this:

library(pepr)
.printNestedList(yaml::read_yaml(configFile))

This configuration file points to the second major part of a PEP: the sample annotation CSV file (r { basename(config(bp)$sample_table) }). Here are the contents of that file:

library(knitr)
sampleAnnotation = system.file(
"extdata",
"example_peps-master",
"example_BiocProject",
"sample_table.csv",
package = "BiocProject"
)
sampleAnnotationDF = read.table(sampleAnnotation, sep=",", header=TRUE)
knitr::kable(sampleAnnotationDF, format = "html")

In this example, our PEP has two samples, which have two attributes: sample_name, and file_path, which points the location for the data.

The configuration file also points to a third file (r { basename(config(bp)$bioconductor$readFunPath) }). This file holds a single R function called r { basename(config(bp)$bioconductor$readFunName) }, which has these contents:

get(config(bp)$bioconductor$readFunName)

And that's all there is to it! This PEP consists really of 3 components:

  1. the project configuration file (which points to an annotation sheet and specifies your function name)
  2. the annotation sheet
  3. an R file that holds a function that knows how to process this data.

With that, we're ready to see how BiocProject works.

How to use the BiocProject function

With a PEP in hand, it takes only a single line of code to do all the magic with BiocProject:

bp = BiocProject(file=configFile)

This loads the project metadata from the PEP, then loads and calls the actual data processing function, and returns the R object that the data processing function produces, but enriched with the PEP metadata. Consequently, the object contains all your project metadata and data! Let's inspect the it:

bp

Since the data processing function returned GenomicRanges::GRangesList object, the final result of the BiocProject function is an object of the same class.

How to interact with the returned object

The created object provides all the pepr::Project methods (which you can find in the reference documentation) for pepr.

sampleTable(bp)
config(bp)

Finally, there are a few methods specific to BiocProject objects:

getProject(bp)

How to provide a data load function

In the basic case the function name (and path to source file, if necessary) is specified in the YAML config file itself, like:

bioconductor:
  readFunName: function_name

or

bioconductor:
  readFunName: function_name
  readFunPath: /path/to/the/file.R

The function specified can be a data processing function of any complexity, but has to follow 3 rules listed below.

Rules:

  1. must take at least a single argument,
  2. the argument must be a pepr::Project object (should use that input to load all the relevant data into R),
  3. must return an object of class that extends the class Annotated.

Listed below are some of the classes that extend the class Annotated:

showClass("Annotated")

Consider the readBedFilesfunction as an example of a function that can be used with BiocProject package:

processFunction =  system.file(
  "extdata",
  "example_peps-master",
  "example_BiocProject",
  "readBedFiles.R",
  package = "BiocProject"
)
source(processFunction)
readBedFiles

Data reading function error/warning handling

The BiocProject function provides a way to rigorously monitor exceptions related to your data reading function. All the produced warnings and errors are caught, processed and displayed in an organized way:

configFile = system.file(
  "extdata",
  "example_peps-master",
  "example_BiocProject_exceptions",
  "project_config.yaml",
  package = "BiocProject"
)

bpExceptions = BiocProject(configFile)

As indicated in the warning messages above -- no data is being returned. Instead a S4Vectors::List with a PEP is its metadata slot is produced.

bpExceptions

Further reading

See "More arguments than just a PEP in your function?" vignette if you want to:

See the "Working with remote data" vignette to learn how to download the data from the Internet, process it and store it conveniently with related metadata in any object from the Bioconductor project.

See the "Working with large datasets - simpleCache" vignette to learn how the simpleCache R package can be used to prevent copious and lengthy results recalculations when working with large datasets.



pepkit/BiocProject documentation built on July 28, 2023, 2:49 p.m.