The Gene Expression Omnibus (GEO) is a public repository of gene expression data that hosts more than 6,000 RNA-seq datasets and this number is increasing. Most of these datasets are deposited in raw sequencing format which needs to be downloaded and processed. With an aim to transform all these datasets in an analysis-ready format, we have developed a comprehensive pipeline to simultaneously download and process RNA-seq data sets from GEO. This R-based automated pipeline can process the available RNA-seq data of human, mouse, and rat from GEO. This package is recommended to use in the unix environment as many of the features are not available in windows.
Before installing GREP2
, you need to install the following software packages first:
You will also need to install the following R and Bioconductor packages:
install.packages(c("devtools", "XML", "parallel", "utils", "rentrez", "RCurl")
source("https://bioconductor.org/biocLite.R")
biocLite(c("GEOquery", "Biobase", "tximport", "EnsDb.Hsapiens.v86", "EnsDb.Rnorvegicus.v79", "EnsDb.Mmusculus.v79",
"AnnotationDbi", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db"))
Once you install the above packages, you can now install GREP2
using CRAN
:
install.packages("GREP2")
You can also install GREP2
using devtools
:
library(devtools)
install_github("uc-bd2k/GREP2")
To consistently process GEO RNA-seq datasets through a robust and uniform system, we have built GEO RNA-seq evenly processing pipeline (GREP2). To demonstrate the usage of the package, we demonstrate the processing steps with a small dataset from GEO: GSE102170. The whole processing workflow can be summarized in the following steps:
library(GREP2)
metadata <- get_metadata(geo_series_acc="GSE102170",destdir=tempdir(),
geo_only=FALSE,download_method="auto")
ascp
utility of Aspera Connect or
regular download. All the downloaded files are stored in the local repository until processed. You can skip this
step by downloading fastq files directly.srr_id <- metadata$metadata_sra$Run
for(i in 1:length(srr_id)){
get_srr(srr_id=srr_id[i], destdir=tempdir(), ascp=FALSE,
prefetch_workspace=NULL,ascp_path=NULL)
}
library_layout <- metadata$metadata_sra$LibraryLayout
for(i in 1:length(srr_id)){
get_fastq(srr_id=srr_id[i],library_layout=library_layout[i],
use_sra_file=FALSE,sra_files_dir=NULL,n_thread=2,
destdir=tempdir())
}
run_fastqc(destdir=tempdir(),fastq_dir=tempdir(),
n_thread=2)
for(i in 1:length(srr_id)){
trim_fastq(srr_id=srr_id[i],fastq_dir=tempdir(),
instrument="MiSeq",library_layout=library_layout[i],destdir=tempdir(),n_thread=2)
}
# Before running Salmon, you will have to build index first.
build_index(species="human",kmer=31,ens_release=92,
destdir=tempdir())
# Run Salmon
for(i in 1:length(srr_id)){
run_salmon(srr_id=srr_id[i],library_layout=library_layout[i],
index_dir=tempdir(),destdir=tempdir(),
fastq_dir=tempdir(),use_trimmed_fastq=FALSE,
other_opts=NULL,n_thread=2)
}
# Run tximport
counts_list <- run_tximport(srr_id=srr_id, species="human",
salmon_dir=paste0(tempdir(),"/salmon"),countsFromAbundance="lengthScaledTPM")
run_multiqc(fastqc_dir=tempdir(),salmon_dir=tempdir(),
destdir=tempdir())
You can run the above individual functions for each step or run the
whole pipeline using the following process_geo_rnaseq
function.
All of the above steps are combined into the following single function.
We would recommend using this function for processing GEO RNA-seq data.
process_geo_rnaseq (geo_series_acc=geo_series_acc,destdir=tempdir(),
download_method="auto",
ascp=FALSE,prefetch_workspace=NULL,
ascp_path=NULL,use_sra_file=FALSE,trim_fastq=FALSE,
index_dir=tempdir(),species="human",
countsFromAbundance="lengthScaledTPM",n_thread=1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.