Home

/

GitHub

/

In pepkit/BiocProject: Bioconductor Management with Portable Encapsulated Project (PEP) Objects

knitr::opts_chunk$set(collapse=FALSE, message=FALSE)

Introduction

Prerequisites

This vignette demonstrates how to integrate BiocProject with the tximeta Bioconductor package for a really slick start-to-finish analysis of RNA-seq data. We assume you're familiar with BiocProject; if not, please start with Getting started with BiocProject vignette for basic instructions.

Introduction to Tximeta

Tximeta is a package that imports transcript quantification files from the salmon transcript quantifier. When importing, tximeta automatically annotates the data with the transcriptome used. How it works is that salmon records a unique identifier of the transcriptome it uses during quantification; then, tximeta reads this identifier and looks up metadata about those sequences using a local database of known transcriptome identifiers. For more details, refer to the tximeta GitHub repository or publication in PLoS Computational Biology.

The tximeta::tximeta function takes as input a data.frame (coldata) object that, for Salmon results, points to a quantification results directory for each sample. The tximeta function reads the *.sa files and returns a single SummarizedExperiment object with the Salmon-generated metadata in the object metadata slot.

Since SummarizedExperiment inherits from the Bioconductor Annotated class, it fits perfectly into BiocProject output object class requirements.

suppressPackageStartupMessages(library(BiocProject))
suppressPackageStartupMessages(library(SummarizedExperiment))
is(SummarizedExperiment(), "Annotated")

Advantages of using BiocProject with tximeta

If we add BiocProject in to the tximeta workflow, then sample metadata from the PEP project specification can be easily plugged in! For example, if a researcher used a PEP to run Salmon to quantify reads across multiple samples with PEP-compatible workflow management engine/job scatterer like Snakemake, CWL, or looper, the same PEP would be ready to use with tximeta as long as the samples had files attribute defined. This could be done either via a files column in the sample table, or by using one of the sample modifiers provided by the PEP framework. The advantages of calling tximport within BiocProject include:

project portability, inherent to projects following PEP specification
single source of metadata from start of the analysis to finish -- all the PEP-defined metadata will be propagated to the output object of the tximeta function automatically. It will be accessible from within your R session using the pepr API, or with @PEP in the metadata slot of the SummarizedExperiment object, just as any other metadata attached to the result by tximeta function.

Let's show you how this work with a simple demo.

Demo of the BiocProject + tximeta workflow

Download example data

First, let's download some RNA-seq counts from salmon, described in PEP format:

if (basename(getwd()) != "long_vignettes") setwd("long_vignettes")
pth = BiocFileCache::bfcrpath(
  BiocFileCache::BiocFileCache(getwd()), 
  "http://big.databio.org/example_data/tximeta_pep.tar.gz"
  )
utils::untar(tarfile=pth)
abs_pep_path = file.path(getwd(), "tximeta_pep")
abs_cfg_path = file.path(abs_pep_path, "project_config.yaml")

Let's take a look at what we have here...

Examine and load the PEP into R

The Biocproject + tximeta workflow requires a PEP. The example we just downloaded looks like this:

library(pepr)
.printNestedList(yaml::read_yaml(abs_cfg_path))

As you can see, this PEP configuration file uses a $TXIMPORTDATA environment variable to specify a file path. This is just an optional way to make this PEP work in any computing environment without being changed, so you can share your sample metadata more easily. For this vignette, we need to set the variable to the output directory where our downloaded results are stored:

Sys.setenv("TXIMPORTDATA"=file.path(abs_pep_path, "/tximportData"))

# Run some stuff we need for the vignette
p=Project(abs_cfg_path)

Now, look at the sample_table key in the configuration file. It points to the second major part of a PEP: the sample table CSV file (r { basename(config(p)$sample_table) }). Check out the contents of that file:

library(knitr)
coldataDF = read.table(p@config$sample_table, sep=",", header=TRUE)
knitr::kable(coldataDF, format = "html")

This sample table lacks the files column required by tximeta -- but this file is sufficient, since BiocProject, or more specifically pepr, will take care of constructing the portable files sample attribute automatically via sample_modifiers.derive, where the config file above specifies the files attribute and its path.

Now we can load the file with BiocProject... but first, a short detour

Detour: the magic of PEP sample modifiers

Before we jump into using BiocProject, let's take a minute to demonstrate how using the PEP helps us out here. Let's read in our PEP using the the generic Project function from pepr:

p=Project(abs_cfg_path)

We now have our PEP project read in, and we can see what is found in the sample table:

sampleTable(p)

See how our sample table has now been automatically updated with the files attribute? That is the magic of the PEP sample modifiers. It's that simple. Now, let's move on to demonstrate what BiocProject adds.

The BiocProject data processing function

If you look again at our configuration file above, you'll notice the biconductor section in the configuration file, which defines a function name and R script. These specify the BiocProject data processing function, which in this case, is simply a tximeta call that uses the PEP-managed processed sample table its input. Here's what that function looks like:

source(file.path(abs_pep_path, "readTximeta.R"))
get(config(p)$bioconductor$readFunName)

Loading in the data with BiocProject

We have everything we need: a salmon output file, a PEP that specifies a sample table and provides the files column, and a function that uses tximeta to create the final SummarizedExperiment object. Now, we can call the BiocProject function:

require(tximeta)
bp = BiocProject(abs_cfg_path)

The output of BiocProject function, the bp object in our case, is magical. In one object, it supports the functionality of SummarizedExperiment, tximeta, and pepr. Observe:

First, it is a RangedSummarizedExperiment, so it supports all methods defined in SummarizedExperiment:

suppressPackageStartupMessages(library(SummarizedExperiment))
colData(bp)
assayNames(bp)
rowRanges(bp)

Naturally, we can use tximeta methods:

retrieveDb(bp)

But wait, there's more! The PEP metadata information has been attached to the metadata as well. Let's extract the Project object from the result with getProject method:

getProject(bp)

You can use the pepr API for any R-based PEP processing tools:

sampleTable(bp)
config(bp)

Conclusion

If you format your project metadata according to the PEP specification, it will be ready to use with tximeta and the resulting object will include project-wide metadata and expose pepr API for any PEP-compatible R packages for downstream analysis.

pepkit/BiocProject documentation built on July 28, 2023, 2:49 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com