GEOfastq: Downloads ENA Fastqs With GEO Accessions

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)

library(GEOfastq)

Installation

GEOfastq can be installed from Bioconductor as follows:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GEOfastq")

Overview of GEOfastq

The NCBI Gene Expression Omnibus (GEO) offers a convenient interface for exploring high-throughput experimental data such as RNA-seq. Raw RNA-seq data for GEO records is deposited in the Sequence Read Archive (SRA) as sra files, which can be converted to fastq files using fastq-dump. This conversion can be quite slow, so it is usually more convenient to download the fastq files that the European Nucleotide Archive (ENA) generates for the same accession. GEOfastq crawls GEO to retrieve metadata and ENA fastq URLs, and then downloads the fastq files.

Getting Started using GEOfastq

To get fastq data for a GEO series, we first retrieve the metadata for a GEO accession:

gse_name <- 'GSE133758'
gse_text <- crawl_gse(gse_name)

Next, we extract the sample accessions for this study and retrieve the GEO metadata and ENA fastq URL for one example sample:

gsm_names <- extract_gsms(gse_text)
gsm_name <- gsm_names[182]
srp_meta <- crawl_gsms(gsm_name)
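To illustrate what extracting sample accessions from crawled text involves, here is a minimal, self-contained sketch that needs no network access. The helper `find_gsms` and the `page_text` input are hypothetical and written only for this illustration. They are not GEOfastq's implementation of `extract_gsms`, which may work differently.

```r
# Hypothetical helper (NOT GEOfastq's extract_gsms): collect the unique
# GSM accessions that appear anywhere in a character vector.
find_gsms <- function(text) {
    # gregexpr finds every match per element; regmatches extracts them
    hits <- regmatches(text, gregexpr("GSM[0-9]+", text))
    unique(unlist(hits))
}

# Made-up text standing in for a crawled GEO series page
page_text <- c("Samples: GSM315559, GSM315560", "see also GSM315559")
find_gsms(page_text)
#> [1] "GSM315559" "GSM315560"
```

Duplicates are dropped so that each sample appears once, which is what you would want before selecting a sample by position.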

Now that we have retrieved the necessary metadata, we are ready to download the fastq files for this sample:

data_dir <- tempdir()

# example using a smaller file: replace srp_meta with a minimal one-row data frame
srp_meta <- data.frame(
    run = 'SRR014242',
    row.names = 'SRR014242',
    gsm_name = 'GSM315559',
    ebi_dir = get_dldir('SRR014242'),
    stringsAsFactors = FALSE
)

res <- get_fastqs(srp_meta, data_dir)
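The `ebi_dir` value above comes from `get_dldir`, which turns a run accession into a directory on the ENA server. As a rough sketch of the idea, the hypothetical helper `ena_fastq_dir` below builds a relative path using ENA's documented fastq layout: the first six characters of the accession, then (for accessions longer than nine characters) the trailing digits zero-padded to three, then the accession itself. This is an assumption-based illustration and not GEOfastq's code, so the exact string returned by `get_dldir` may differ.

```r
# Hypothetical helper (NOT GEOfastq's get_dldir): relative ENA fastq
# directory for a run accession, assuming ENA's documented layout.
ena_fastq_dir <- function(run) {
    stopifnot(grepl("^[SED]RR[0-9]{6,9}$", run))
    prefix <- substr(run, 1, 6)
    if (nchar(run) == 9) {
        # six-digit accessions have no intermediate directory
        return(paste(prefix, run, sep = "/"))
    }
    # digits beyond the sixth, zero-padded to three characters ("7" -> "007")
    tail_digits <- substr(run, 10, nchar(run))
    sub_dir <- sprintf("%03d", as.integer(tail_digits))
    paste(prefix, sub_dir, run, sep = "/")
}

ena_fastq_dir("SRR014242")
#> [1] "SRR014/SRR014242"
ena_fastq_dir("SRR1234567")
#> [1] "SRR123/007/SRR1234567"
```

The accession SRR1234567 is made up to show the padded case. Only SRR014242 comes from the example above.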