In j-andrews7/AndrewsBCellLymphoma: One-liner RNA-seq and ChIP-seq Analyses

knitr::opts_chunk$set(error=TRUE, message=TRUE, warning=TRUE, eval=FALSE)
library(BiocStyle)

Introduction

This vignette walks through analysis of the B cell lymphoma RNA-seq data included in the Andrews, J. et al., eBioMedicine 2021 paper. It uses the convenience functions included in this package to streamline the process.

Several command line tools as well as python are also utilized - each code block should indicate whether it is bash, R, or python code. If your system is set up properly, you can still run these code chunks in Rmarkdown files within RStudio, assuming you have all of the software installed.

The large majority of this vignette can be skipped if you want, the raw gene counts (or FASTQs if you want the whole shebang) is available in GSE145842. That accession also contains a table with the normalized gene counts as output from DESeq2, if you just want to spot-check your gene of interest in this dataset.

Required Software:

R (3.6+)
python
salmon (v0.14+)
bedtools - only needed if generating salmon decoy transcriptome
mashmap - only needed if generating salmon decoy transcriptome

bash

mkdir -p RNA_seq/ref

annotations

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gtf.gz mv gencode.v31lift37.annotation.gtf.gz ./RNA_Seq/ref/gencode.v31lift37.annotation.gtf.gz gunzip ./RNA_seq/ref/gencode.v31lift37.annotation.gtf.gz

transcriptome

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.transcripts.fa.gz mv gencode.v31lift37.transcripts.fa.gz ./RNA_Seq/ref/gencode.v31lift37.transcripts.fa.gz gunzip ./RNA_seq/ref/gencode.v31lift37.transcripts.fa.gz

genome

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz mv GRCh37.primary_assembly.genome.fa.gz ./RNA_Seq/ref/GRCh37.primary_assembly.genome.fa.gz gunzip ./RNA_seq/ref/GRCh37.primary_assembly.genome.fa.gz

Download the script to generate the decoys.

```{bash dl-decoy, eval=FALSE}
# bash
# Script for generating decoy genome.
wget https://raw.githubusercontent.com/COMBINE-lab/SalmonTools/master/scripts/generateDecoyTranscriptome.sh

This script crashed a lot for me, so I ended up editing it to just snag the last 10 commands or so (those with the step comments) and removing the variable parsing at the beginning, as it uses some deprecated linux commands (readpath) that I couldn't install on our computing cluster.

```{bash salmon-decoy, eval=FALSE}

bash

bash generateDecoyTranscriptome.sh -j 2 -a ./RNA_Seq/ref/gencode.v31lift37.annotation.gtf -g ./RNA_Seq/ref/GRCh37.primary_assembly.genome.fa -t ./RNA_Seq/ref/gencode.v31lift37.transcripts.fa -o ./RNA_Seq/ref/decoys/

### Index the transcriptome

This will generate the *decoy-aware* transcriptome index for our annotations.

```{bash salmon-index, eval=FALSE}
# Bash
# Now actually create the index for salmon. We lower k as some of our samples use 50 bp reads.
salmon index -t ./RNA_Seq/ref/decoys/gentrome.fa -i ./RNA_Seq/ref/gencodev31_grch37_index --decoys ./RNA_Seq/ref/decoys/decoys.txt -k 25 --gencode

Quantify

Actually generate the gene counts with salmon now.

```{bash salmon-quant, eval=FALSE}

bash

for f in ./RNA_Seq/FASTQs/; do samp="${f##/}" count=$(ls -1 "$f" | wc -l) echo "$count" if [ $count -gt 1 ] then salmon quant --validateMappings --threads 2 --seqBias -l A -1 "$f"/"$samp"_1.fq.gz -2 "$f"/"$samp"_2.fq.gz -o ./RNA_Seq/quants/"$samp" --index ./RNA_Seq/ref/gencodev31_grch37_index else salmon quant --validateMappings --threads 2 --seqBias -l A -r "$f"/"$samp".RNA.fq.gz -o ./RNA_Seq/quants/"$samp" --index ./RNA_Seq/ref/gencodev31_grch37_index fi done

### Make Transcript-to-Gene Mappings
For gene-level summarization and readability.

```{python tx2gene, eval=FALSE}
# python
outter = open("./RNA_Seq/ref/tx2gene.txt", "w")
with open("./RNA_Seq/ref/gencode.v31lift37.transcripts.fa") as f:
    for line in f:
        if not line.startswith(">"):
            continue
        line = line.strip().strip(">").split("|")
        tx = line[0]
        symb = line[5]
        out = "\t".join([tx, symb])
        print(out, file = outter)

outter.close()

Check Mapping Rates

An easy diagnostic to ensure everything worked properly is just to check the mapping rates for each sample. Depending on the experimental setup and annotations, a mapping rate of 50-90% may be expected. Really low rates (<40% or so) are worrying, and such samples should be removed from the analysis if possible.

for f in ./RNA_Seq/quants/*; do
  echo "${f##*/}";
  grep "Mapping Rate" "$f"/logs/salmon_quant.log;
done

Analysis with R

Install this package if you haven't already with devtools::install_github("j-andrews7/OneLinerOmics"). If you don't have devtools, it can be installed with install.packages("devtools").

# R
suppressPackageStartupMessages(library(OneLinerOmics))

The only thing you need now is a sample sheet, which is just a table with sample names and metadata columns. It can include as many metadata columns as you'd like.

Run DESeq2

DESeq2 is a very popular R package for performing differential expression analysis. Raw RNA-seq counts follow a negative binomial distribution, which is how DESeq2 models the data. It is quick, flexible, and generally works quite well with little to no parameter tweaking. It can also help to mitigate the technical or batch effects on the analysis.

# R
bcl.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/BCL.v.Tonsils", quants.path = "./RNA_Seq/quants", 
          samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
          level = "status", padj.thresh = c(0.05, 0.01, 0.001), plot.box = TRUE,
          plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)

subtypes.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/Subtypes.v.Tonsils", quants.path = "./RNA_Seq/quants", 
          samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
          level = "disease", padj.thresh = c(0.05, 0.01, 0.001), plot.box = FALSE,
          plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)

Next Steps

Yes, that's it. Explore the plots, tables, enrichments, etc. Re-run if necessary, tweaking parameters as you see fit.

Session Information

sessionInfo()

j-andrews7/AndrewsBCellLymphoma documentation built on Sept. 9, 2021, 11 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

j-andrews7/AndrewsBCellLymphoma
One-liner RNA-seq and ChIP-seq Analyses

In j-andrews7/AndrewsBCellLymphoma: One-liner RNA-seq and ChIP-seq Analyses

Introduction

Suggested Reading

Generating gene counts

Create decoy reference transcriptome

bash

annotations

transcriptome

genome

bash

Quantify

bash

Check Mapping Rates

Analysis with R

Run DESeq2

Next Steps

Session Information

R Package Documentation

Browse R Packages

We want your feedback!

j-andrews7/AndrewsBCellLymphoma One-liner RNA-seq and ChIP-seq Analyses

In j-andrews7/AndrewsBCellLymphoma: One-liner RNA-seq and ChIP-seq Analyses

Introduction

Suggested Reading

Generating gene counts

Create decoy reference transcriptome

bash

annotations

transcriptome

genome

bash

Quantify

bash

Check Mapping Rates

Analysis with R

Run DESeq2

Next Steps

Session Information

R Package Documentation

Browse R Packages

We want your feedback!

j-andrews7/AndrewsBCellLymphoma
One-liner RNA-seq and ChIP-seq Analyses