process_geo_rnaseq: A complete pipeline to process GEO RNA-seq data


View source: R/process_geo_rnaseq.R

Description

process_geo_rnaseq downloads and processes GEO RNA-seq data for a given GEO series accession ID. It filters the metadata to keep RNA-seq samples only, then uses the SRA Toolkit to download SRA data, Trimmomatic for optional read trimming, and Salmon for read mapping.

Usage

process_geo_rnaseq(geo_series_acc, destdir, download_method = "auto",
  ascp = TRUE, prefetch_workspace, ascp_path, use_sra_file = FALSE,
  trim_fastq = FALSE, index_dir, other_opts = NULL,
  species = c("human", "mouse", "rat"), countsFromAbundance = c("no",
  "scaledTPM", "lengthScaledTPM"), n_thread)

Arguments

geo_series_acc

GEO series accession ID.

destdir

directory where all the results will be saved.

download_method

download method for GEOquery.

ascp

logical, whether to use Aspera Connect to download SRA run files. If FALSE, wget is used instead, which may be slower than an 'ascp' download.

prefetch_workspace

directory where SRA run files will be downloaded. This parameter is needed when ascp=TRUE. To find the location of this directory, go to the Aspera directory (/.aspera/connect/bin/) and type 'vdb-config -i'. A new window will pop up, and the location is listed under 'Workspace Name'. The default is usually '/home/username/ncbi/public'.

ascp_path

path to the Aspera software.

use_sra_file

logical, whether to download the SRA file first and get the fastq files from it afterwards.

trim_fastq

logical, whether to trim the fastq files.

index_dir

directory of the indexing files needed for read mapping using Salmon. See function 'build_index'.

other_opts

options other than default to use for read mapping. See Salmon documentation for the available options.

species

name of the species. Only 'human', 'mouse', and 'rat' are supported.

countsFromAbundance

whether to generate counts based on abundance. Available options are: 'no', 'scaledTPM' (abundance-based estimated counts scaled up to library size), and 'lengthScaledTPM' (default, scaled using the average transcript length over samples and the library size). See the Bioconductor package tximport for further details.

n_thread

number of cores to use.
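
The two scaling options described under countsFromAbundance can be illustrated with a few lines of base R. The block below is a minimal sketch of the arithmetic as worded in the argument description, not the code that tximport or GREP2 actually runs. The helper name scale_counts and all toy values are hypothetical and exist only for this illustration.

```r
# Toy inputs: 3 transcripts (rows) x 2 samples (columns). All values are made up.
dn <- list(paste0("tx", 1:3), c("s1", "s2"))
abundance <- matrix(c(10, 30, 60,   20, 20, 60),    nrow = 3, dimnames = dn)  # TPM-like
counts    <- matrix(c(100, 300, 600, 400, 400, 1200), nrow = 3, dimnames = dn)
tx_length <- matrix(c(1000, 2000, 500, 1200, 1800, 500), nrow = 3, dimnames = dn)

# Hypothetical helper (not part of GREP2 or tximport).
scale_counts <- function(abundance, counts, tx_length,
                         method = c("scaledTPM", "lengthScaledTPM")) {
  method <- match.arg(method)
  lib_size <- colSums(counts)  # per-sample library size from the estimated counts

  # scaledTPM starts from abundance alone; lengthScaledTPM first multiplies each
  # transcript by its average length over samples.
  new_counts <- if (method == "lengthScaledTPM") {
    abundance * rowMeans(tx_length)
  } else {
    abundance
  }

  # Rescale every sample (column) so its total matches the original library size.
  sweep(new_counts, 2, lib_size / colSums(new_counts), "*")
}

scaled_tpm        <- scale_counts(abundance, counts, tx_length, "scaledTPM")
length_scaled_tpm <- scale_counts(abundance, counts, tx_length, "lengthScaledTPM")
```

In both cases the column sums of the result equal the column sums of the original counts; only the way counts are distributed across transcripts within a sample differs.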

Value

a list of metadata from GEO and SRA, saved in destdir. A second list, containing gene- and transcript-level estimated counts summarized by the Bioconductor package 'tximport', is also saved in destdir.

References

Rob Patro, Geet Duggal, Michael I. Love, Rafael A. Irizarry, and Carl Kingsford (2017): Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4), 417. https://www.nature.com/articles/nmeth.4197

Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015): Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research. http://dx.doi.org/10.12688/f1000research.7563.1

Philip Ewels, Mans Magnusson, Sverker Lundin, and Max Kaller (2016): MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047-3048. https://doi.org/10.1093/bioinformatics/btw354

Examples

geo_series_acc <- "GSE102170"

# Build the Salmon index before running this function.
build_index(species = "human", kmer = 31, ens_release = 92,
            destdir = tempdir())

process_geo_rnaseq(geo_series_acc = geo_series_acc, destdir = tempdir(),
                   download_method = "auto",
                   ascp = FALSE, prefetch_workspace = NULL,
                   ascp_path = NULL, use_sra_file = FALSE, trim_fastq = FALSE,
                   index_dir = tempdir(), species = "human",
                   countsFromAbundance = "lengthScaledTPM", n_thread = 1)
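
Because the pipeline downloads and maps a full series, it may help to check the inputs before starting a long run. The following is a minimal sketch using base R only. The helper check_inputs is hypothetical and not part of GREP2, and the choice of one fewer thread than the detected core count is an assumption rather than a package recommendation.

```r
# Hypothetical pre-flight helper (not part of GREP2): returns a character
# vector of problems, which is empty when every check passes.
check_inputs <- function(geo_series_acc, destdir, index_dir, n_thread) {
  problems <- character(0)
  # GEO series accessions consist of "GSE" followed by digits.
  if (!grepl("^GSE[0-9]+$", geo_series_acc)) {
    problems <- c(problems, "geo_series_acc does not look like a GEO series accession")
  }
  if (!dir.exists(destdir)) {
    problems <- c(problems, "destdir does not exist")
  }
  if (!dir.exists(index_dir)) {
    problems <- c(problems, "index_dir does not exist")
  }
  if (!is.numeric(n_thread) || n_thread < 1) {
    problems <- c(problems, "n_thread must be a number >= 1")
  }
  problems
}

# Pick a thread count: leave one core free, falling back to 1 when the
# core count cannot be detected (detectCores() may return NA).
n_cores  <- parallel::detectCores()
n_thread <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

problems <- check_inputs("GSE102170", destdir = tempdir(),
                         index_dir = tempdir(), n_thread = n_thread)
if (length(problems) > 0) stop(paste(problems, collapse = "; "))
```

If no problem is reported, the same values can be passed on to process_geo_rnaseq as in the example above.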

uc-bd2k/GREP2 documentation built on Oct. 29, 2019, 5:15 a.m.