Process GEO RNA-seq data using GREP2

GREP2 pipeline workflow

To consistently process GEO RNA-seq datasets through a robust and uniform system, we have built GEO RNA-seq evenly processing pipeline (GREP2). The whole processing workflow can be summarized in the following steps:

  1. The pipeline starts with a valid GEO series accession ID. Currently the pipeline works for human, mouse, and rat species. We then retrieve metadata for the GEO series accession using Bioconductor package GEOquery [@davis2007geoquery]. We also download metadata file from the sequence read archive (SRA) to get corresponding run information.

  2. Download corresponding experiment run files from the SRA using ascp utility of Aspera Connect or regular downloading. All the downloaded files are stored in the local repository until processed. You can skip this step by downloading fastq files directly.

  3. Convert SRA files to fastq format using NCBI SRA toolkit or download fastq files directly.

  4. Run FastQC on each fastq file to generate quality control (QC) reports.

  5. Remove adapter sequences if necessary using Trimmomatic [@bolger2014trimmomatic].

  6. Quantify transcript abundances using Salmon [@patro2017salmon]. Transcript level estimates are then summarized to gene level using Bioconductor package tximport [@soneson2015differential]. We obtained gene annotation for Homo sapiens (GRCh38), Mus musculus (GRCm38), and Rattus norvegicus (Rnor_6.0) from Ensemble.

  7. Compile FastQC reports and Salmon log files into a single interactive HTML report using MultiQC [@ewels2016multiqc].

You can run individual functions for each step or run the whole pipeline using process_geo_rnaseq function. To demonstrate the usage of the package, we will process a small dataset from GEO: GSE102170

#dataset
geo_series_acc="GSE102170"
# Species
species="human"

Steps to process the data

options(warn=-1)
#Step 1: get metadata
library(GREP2)
metadata <- get_metadata(geo_series_acc="GSE102170",destdir=tempdir(),
geo_only=FALSE,download_method="auto")

#Step 2: Get SRA data files
srr_id <- metadata$metadata_sra$Run
for(i in 1:length(srr_id)){
    get_srr(srr_id=srr_id[i], destdir=tempdir(), ascp=FALSE,
    prefetch_workspace=NULL,ascp_path=NULL)
}

# Step 3: Get fastq files
library_layout <- metadata$metadata_sra$LibraryLayout
for(i in 1:length(srr_id)){
    get_fastq(srr_id=srr_id[i],library_layout=library_layout[i],
    use_sra_file=FALSE,sra_files_dir=NULL,n_thread=2,
    destdir=tempdir())
}

# Step 4: Run FastQC
run_fastqc(destdir=tempdir(),fastq_dir=tempdir(),
n_thread=2)

# Step 5: Run Trimmomatic
for(i in 1:length(srr_id)){
    trim_fastq(srr_id=srr_id[i],fastq_dir=tempdir(),
    instrument="MiSeq",library_layout=library_layout[i],destdir=tempdir(),n_thread=2)
}

# Step 6: Run Salmon and tximport
# Before running Salmon, you will have to build index first.
build_index(species="human",kmer=31,ens_release=92,
destdir=tempdir())
# Run Salmon
for(i in 1:length(srr_id)){
    run_salmon(srr_id=srr_id[i],library_layout=library_layout[i],
    index_dir=tempdir(),destdir=tempdir(),
    fastq_dir=tempdir(),use_trimmed_fastq=FALSE,
    other_opts=NULL,n_thread=2)
}
# Run tximport
counts_list <- run_tximport(srr_id=srr_id, species="human",
salmon_dir=paste0(tempdir(),"/salmon"),countsFromAbundance="lengthScaledTPM")

# Step 7: Run MultiQC
run_multiqc(fastqc_dir=tempdir(),salmon_dir=tempdir(),
destdir=tempdir())

# All of the above steps are combined into the following single function. We would recommend using this function for processing GEO RNA-seq data.
process_geo_rnaseq (geo_series_acc=geo_series_acc,destdir=tempdir(),
download_method="auto",
ascp=FALSE,prefetch_workspace=NULL,
ascp_path=NULL,use_sra_file=FALSE,trim_fastq=FALSE,
index_dir=tempdir(),species="human",
countsFromAbundance="lengthScaledTPM",n_thread=1)

References



Try the GREP2 package in your browser

Any scripts or data that you put into this service are public.

GREP2 documentation built on Sept. 3, 2019, 9:05 a.m.