process_geo_rnaseq: A complete pipeline to process GEO RNA-seq data
In uc-bd2k/GREP2: GEO RNA-Seq Experiments Processing Pipeline


process_geo_rnaseq downloads and processes GEO RNA-seq data for a given GEO series accession ID. It filters the metadata to keep RNA-seq samples only. The pipeline uses the SRA Toolkit to download SRA data, Trimmomatic for optional read trimming, and Salmon for read mapping.
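Because the pipeline depends on external command-line tools, it can help to check that they are on the PATH before starting a long run. The following is a minimal sketch using only base R. The executable names (`prefetch`, `fastq-dump`, `salmon`, `ascp`) are assumptions about a typical installation and are not taken from the package; Trimmomatic is left out because it is commonly distributed as a Java jar rather than a plain executable.

```r
# Assumed executable names; adjust them to the local installation.
tools <- c("prefetch", "fastq-dump", "salmon", "ascp")

# Sys.which() returns the full path of each executable, or "" if not found.
paths <- Sys.which(tools)
found <- setNames(unname(paths) != "", tools)

status <- data.frame(tool = tools, path = unname(paths),
                     found = unname(found), row.names = NULL)
print(status)

# If Aspera is not available, ascp = FALSE is the safer choice.
use_ascp <- found[["ascp"]]
```

The resulting `use_ascp` value could then be passed as the `ascp` argument, with `ascp_path` and `prefetch_workspace` set only when it is TRUE.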

process_geo_rnaseq(geo_series_acc, destdir, download_method = "auto",
  ascp = TRUE, prefetch_workspace, ascp_path, use_sra_file = FALSE,
  trim_fastq = FALSE, index_dir, other_opts = NULL,
  species = c("human", "mouse", "rat"), countsFromAbundance = c("no",
  "scaledTPM", "lengthScaledTPM"), n_thread)

`geo_series_acc`	GEO series accession ID.
`destdir`	directory where all the results will be saved.
`download_method`	download method for GEOquery.
`ascp`	logical, whether to use Aspera Connect to download SRA run files. If FALSE, wget is used instead, which may be slower than an `'ascp'` download.
`prefetch_workspace`	directory where SRA run files will be downloaded. This parameter is needed when `ascp=TRUE`. The location of this directory can be found by going to the Aspera directory (/.aspera/connect/bin/) and typing `'vdb-config -i'`. A new window will pop up, and the location is shown under `'Workspace Name'`. The default is usually `'/home/username/ncbi/public'`.
`ascp_path`	path to the Aspera software.
`use_sra_file`	logical, whether to download the SRA file first and extract the fastq files from it afterwards.
`trim_fastq`	logical, whether to trim the fastq files.
`index_dir`	directory of the indexing files needed for read mapping using Salmon. See function `'build_index'`.
`other_opts`	options other than default to use for read mapping. See Salmon documentation for the available options.
`species`	name of the species. Only `'human'`, `'mouse'`, and `'rat'` are supported.
`countsFromAbundance`	whether to generate counts based on abundance. Available options are: `'no'`, `'scaledTPM'` (abundance-based estimated counts scaled up to library size), and `'lengthScaledTPM'` (default, scaled using the average transcript length over samples and the library size). See the Bioconductor package tximport for further details.
`n_thread`	number of cores to use.
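To make the `countsFromAbundance` options more concrete, here is a small base-R sketch of the scaling they describe. It is an illustration of the idea, not the tximport or GREP2 source: the helper `counts_from_abundance` and all input values are hypothetical toy data.

```r
# Toy inputs: 3 transcripts (rows) x 2 samples (columns). All values are made up.
counts    <- matrix(c(10, 20, 70,   30, 30, 40), nrow = 3)
abundance <- matrix(c(100, 300, 600,   200, 300, 500), nrow = 3)     # TPM-like
eff_len   <- matrix(c(1000, 500, 2000,   1200, 500, 1800), nrow = 3) # lengths

# Hypothetical helper mirroring the three options described above.
counts_from_abundance <- function(counts, abundance, eff_len,
                                  method = c("no", "scaledTPM", "lengthScaledTPM")) {
  method <- match.arg(method)
  if (method == "no") return(counts)
  new_counts <- if (method == "lengthScaledTPM") {
    # Weight abundance by the average transcript length over samples.
    abundance * rowMeans(eff_len)
  } else {
    abundance
  }
  # Rescale each sample (column) so that it sums to its original library size.
  sweep(new_counts, 2, colSums(counts) / colSums(new_counts), FUN = "*")
}

scaled <- counts_from_abundance(counts, abundance, eff_len, "lengthScaledTPM")
colSums(scaled)  # equals colSums(counts), up to floating-point error
```

In both scaled variants the per-sample totals are preserved; only the distribution of counts across transcripts changes.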

A list of metadata from GEO and SRA, saved in `destdir`. A second list of gene- and transcript-level estimated counts, summarized by the Bioconductor package 'tximport', is also saved in `destdir`.
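The summarization step can be sketched by calling tximport directly on Salmon output. This is a hedged illustration, not necessarily how the package invokes tximport internally: the helper name `summarize_salmon` is hypothetical, `quant_files` is assumed to be a vector of paths to Salmon `quant.sf` files, and `tx2gene` is assumed to be a two-column data frame mapping transcript IDs to gene IDs.

```r
# Hypothetical helper: summarize Salmon quantifications at both levels.
summarize_salmon <- function(quant_files, tx2gene,
                             countsFromAbundance = "lengthScaledTPM") {
  stopifnot(all(file.exists(quant_files)))
  if (!requireNamespace("tximport", quietly = TRUE)) {
    stop("The Bioconductor package 'tximport' is required.")
  }
  list(
    # Gene-level estimates: transcripts are collapsed using tx2gene.
    gene = tximport::tximport(quant_files, type = "salmon",
                              tx2gene = tx2gene,
                              countsFromAbundance = countsFromAbundance),
    # Transcript-level estimates: txOut = TRUE skips the gene summarization.
    transcript = tximport::tximport(quant_files, type = "salmon",
                                    txOut = TRUE,
                                    countsFromAbundance = countsFromAbundance)
  )
}
```

Each element of the returned list is a tximport result containing `abundance`, `counts`, and `length` matrices.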

Rob Patro, Geet Duggal, Michael I. Love, Rafael A. Irizarry, and Carl Kingsford (2017): Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417. https://www.nature.com/articles/nmeth.4197

Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015): Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research. http://dx.doi.org/10.12688/f1000research.7563.1

Philip Ewels, Mans Magnusson, Sverker Lundin, and Max Kaller (2016): MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047-3048. https://doi.org/10.1093/bioinformatics/btw354

library(GREP2)

geo_series_acc <- "GSE102170"

# The Salmon index must be built before running process_geo_rnaseq().
build_index(species = "human", kmer = 31, ens_release = 92,
  destdir = tempdir())

process_geo_rnaseq(geo_series_acc = geo_series_acc, destdir = tempdir(),
  download_method = "auto",
  ascp = FALSE, prefetch_workspace = NULL,
  ascp_path = NULL, use_sra_file = FALSE, trim_fastq = FALSE,
  index_dir = tempdir(), species = "human",
  countsFromAbundance = "lengthScaledTPM", n_thread = 1)