knitr::opts_chunk$set(error=TRUE, message=TRUE, warning=TRUE, eval=FALSE)
library(BiocStyle)

Introduction

This vignette walks through analysis of the B cell lymphoma RNA-seq data included in the Andrews, J. et al., eBioMedicine 2021 paper. It uses the convenience functions included in this package to streamline the process.

Several command line tools as well as python are also utilized - each code block should indicate whether it is bash, R, or python code. If your system is set up properly, you can still run these code chunks in Rmarkdown files within RStudio, assuming you have all of the software installed.

The large majority of this vignette can be skipped if you want, the raw gene counts (or FASTQs if you want the whole shebang) is available in GSE145842. That accession also contains a table with the normalized gene counts as output from DESeq2, if you just want to spot-check your gene of interest in this dataset.

Required Software:

Suggested Reading

That is, if you ask me a question and haven't read these first, I'm going to tell you to read them.

Generating gene counts

I prefer the salmon -> DESeq2 route, as it's fast and accurate. This requires R, salmon v0.14, and DESeq2 (along with a few other packages). salmon needs mashmap and bedtools as well if generating "decoy-aware" transcriptomes. The Google can show you how to install those.

Create decoy reference transcriptome

Newer versions of salmon recommend using a decoy-aware transcriptome to generate the index file to be used for selective alignment. This process is a bit annoying, so the authors have pre-computed some decoy transcriptomes for hg19 and hg38 with different annotations. In this case, I used the GENCODE v31 annotations that had been lifted over to hg19, so I had to generate them myself.

This had to be run on a cluster with 128 GB RAM. If you can avoid this step by using their pre-generated decoys, I recommend doing so.

You will likely have a different directory stucture, so paths will have to be edited as appropriate.

First, we'll download all the annotation info and sequence data we need.

```{bash dl-annotations, eval=FALSE}

bash

mkdir -p RNA_seq/ref

annotations

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gtf.gz mv gencode.v31lift37.annotation.gtf.gz ./RNA_Seq/ref/gencode.v31lift37.annotation.gtf.gz gunzip ./RNA_seq/ref/gencode.v31lift37.annotation.gtf.gz

transcriptome

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.transcripts.fa.gz mv gencode.v31lift37.transcripts.fa.gz ./RNA_Seq/ref/gencode.v31lift37.transcripts.fa.gz gunzip ./RNA_seq/ref/gencode.v31lift37.transcripts.fa.gz

genome

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz mv GRCh37.primary_assembly.genome.fa.gz ./RNA_Seq/ref/GRCh37.primary_assembly.genome.fa.gz gunzip ./RNA_seq/ref/GRCh37.primary_assembly.genome.fa.gz

Download the script to generate the decoys.

```{bash dl-decoy, eval=FALSE}
# bash
# Script for generating decoy genome.
wget https://raw.githubusercontent.com/COMBINE-lab/SalmonTools/master/scripts/generateDecoyTranscriptome.sh

This script crashed a lot for me, so I ended up editing it to just snag the last 10 commands or so (those with the step comments) and removing the variable parsing at the beginning, as it uses some deprecated linux commands (readpath) that I couldn't install on our computing cluster.

```{bash salmon-decoy, eval=FALSE}

bash

bash generateDecoyTranscriptome.sh -j 2 -a ./RNA_Seq/ref/gencode.v31lift37.annotation.gtf -g ./RNA_Seq/ref/GRCh37.primary_assembly.genome.fa -t ./RNA_Seq/ref/gencode.v31lift37.transcripts.fa -o ./RNA_Seq/ref/decoys/

### Index the transcriptome

This will generate the *decoy-aware* transcriptome index for our annotations.

```{bash salmon-index, eval=FALSE}
# Bash
# Now actually create the index for salmon. We lower k as some of our samples use 50 bp reads.
salmon index -t ./RNA_Seq/ref/decoys/gentrome.fa -i ./RNA_Seq/ref/gencodev31_grch37_index --decoys ./RNA_Seq/ref/decoys/decoys.txt -k 25 --gencode

Quantify

Actually generate the gene counts with salmon now.

```{bash salmon-quant, eval=FALSE}

bash

for f in ./RNA_Seq/FASTQs/; do samp="${f##/}" count=$(ls -1 "$f" | wc -l) echo "$count" if [ $count -gt 1 ] then salmon quant --validateMappings --threads 2 --seqBias -l A -1 "$f"/"$samp"_1.fq.gz -2 "$f"/"$samp"_2.fq.gz -o ./RNA_Seq/quants/"$samp" --index ./RNA_Seq/ref/gencodev31_grch37_index else salmon quant --validateMappings --threads 2 --seqBias -l A -r "$f"/"$samp".RNA.fq.gz -o ./RNA_Seq/quants/"$samp" --index ./RNA_Seq/ref/gencodev31_grch37_index fi done

### Make Transcript-to-Gene Mappings
For gene-level summarization and readability.

```{python tx2gene, eval=FALSE}
# python
outter = open("./RNA_Seq/ref/tx2gene.txt", "w")
with open("./RNA_Seq/ref/gencode.v31lift37.transcripts.fa") as f:
    for line in f:
        if not line.startswith(">"):
            continue
        line = line.strip().strip(">").split("|")
        tx = line[0]
        symb = line[5]
        out = "\t".join([tx, symb])
        print(out, file = outter)

outter.close()

Check Mapping Rates

An easy diagnostic to ensure everything worked properly is just to check the mapping rates for each sample. Depending on the experimental setup and annotations, a mapping rate of 50-90% may be expected. Really low rates (<40% or so) are worrying, and such samples should be removed from the analysis if possible.

for f in ./RNA_Seq/quants/*; do
  echo "${f##*/}";
  grep "Mapping Rate" "$f"/logs/salmon_quant.log;
done

Analysis with R

Install this package if you haven't already with devtools::install_github("j-andrews7/OneLinerOmics"). If you don't have devtools, it can be installed with install.packages("devtools").

# R
suppressPackageStartupMessages(library(OneLinerOmics))

The only thing you need now is a sample sheet, which is just a table with sample names and metadata columns. It can include as many metadata columns as you'd like.

Run DESeq2

DESeq2 is a very popular R package for performing differential expression analysis. Raw RNA-seq counts follow a negative binomial distribution, which is how DESeq2 models the data. It is quick, flexible, and generally works quite well with little to no parameter tweaking. It can also help to mitigate the technical or batch effects on the analysis.

# R
bcl.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/BCL.v.Tonsils", quants.path = "./RNA_Seq/quants", 
          samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
          level = "status", padj.thresh = c(0.05, 0.01, 0.001), plot.box = TRUE,
          plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)

subtypes.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/Subtypes.v.Tonsils", quants.path = "./RNA_Seq/quants", 
          samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
          level = "disease", padj.thresh = c(0.05, 0.01, 0.001), plot.box = FALSE,
          plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)

Next Steps

Yes, that's it. Explore the plots, tables, enrichments, etc. Re-run if necessary, tweaking parameters as you see fit.

Session Information

sessionInfo()


j-andrews7/AndrewsBCellLymphoma documentation built on Sept. 9, 2021, 11 a.m.