library(BiocStyle)
library(knitr)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Obtaining the read sequences

The first task is to download the relevant ChIP-seq libraries from the NCBI Gene Expression Omnibus (GEO). These files are obtained from the data series GSE38046, using the Sequence Read Accession (SRA) numbers listed below. Multiple technical replicates exist for some libraries, and are indicated as those files with the same grouping.

sra.numbers <- list(
    c("SRR499718", "SRR499719"),
    c("SRR499720", "SRR499721"), 
    c("SRR499734", "SRR499735", "SRR499736", "SRR499737"), 
    "SRR499738"
)
grouping  <- rep(c(
    "h3k9ac-proB-8113", 
    "h3k9ac-proB-8108", 
    "h3k9ac-matureB-8059", 
    "h3k9ac-matureB-8086"
), lengths(sra.numbers))

sra.numbers <- unlist(sra.numbers)
data.frame(SRA=sra.numbers, Group=grouping)

These files need to be downloaded and unpacked to the FASTQ format prior to alignment. This can be done using the fastq-dump utility from the SRA Toolkit. Note that the SRA files themselves are saved in the local home directory and need to be cleared out.

all.fastq <- paste0(sra.numbers, ".fastq.gz")
for (i in seq_along(sra.numbers)) {
    # Skipping download if the file already exists.
    if (file.exists(all.fastq[i])) { next } 

    # Downloading from NCBI.
    cmd <- "fastq-dump --gzip --skip-technical --dumpbase --clip" 
    code <- system(paste(cmd, sra.numbers[i]))
    stopifnot(code==0L)

    # Cleaning out the cache to reduce disk usage.
    extras <- list.files("~/ncbi/public/sra", 
        pattern=sprintf("^%s\\.", sra.numbers[i]), full=TRUE)
    unlink(extras)
}

Aligning reads to the mouse genome

Reads from technical replicates are pooled together into a single FASTQ file prior to further processing. This reflects the fact that they originate from a single library of DNA fragments. Note that gzipped files can be directly concatenated, hence the use of cat.

by.group <- split(all.fastq, grouping)
for (group in names(by.group)) {
    code <- system(paste(c("cat", by.group[[group]], ">", 
        paste0(group, ".fastq.gz")), collapse=" "))
    stopifnot(code==0L)
}
group.fastq <- paste0(names(by.group), ".fastq.gz")

Reads are aligned to the mm10 build of the mouse genome using r Biocpkg("Rsubread"). This assumes that an index has already been constructed with the prefix index/mm10, see the buildindex() function for details. The type parameter is set to optimize for genomic alignment, rather than alignment to the transcriptome. We only consider uniquely mapped reads to avoid ambiguity from paralogous regions or repeats.

library(Rsubread)
bam.files <- paste0(names(by.group), ".bam")
align(index="index/mm10", readfile1=group.fastq, type='dna', 
    unique=TRUE, input_format="gzFASTQ", output_file=bam.files)

Post-processing of the BAM files

We sort the alignments by their coordinates in the BAM files.

library(Rsamtools)
for (bam in bam.files) {
    out <- suppressWarnings(sortBam(bam, "h3k9ac_temp"))
    file.rename(out, bam)
}

Potential PCR duplicates are marked using the MarkDuplicates tool from the Picard software suite. For some reason, MarkDuplicates uses BAM index files if they're available. We don't want it using old indices, so we delete them beforehand if any are present.

indices <- paste0(bam.files, ".bai")
exist.indices <- file.exists(indices)
if (any(exist.indices)) { unlink(indices[exist.indices]) }

# Marking duplicates.
temp.bam <- "h3k9ac_temp.bam"
temp.file <- "h3k9ac_metric.txt"
temp.dir <- "h3k9ac_working"
dir.create(temp.dir)
for (bam in bam.files) {
    code <- system(sprintf("MarkDuplicates I=%s O=%s M=%s \\
        TMP_DIR=%s AS=true REMOVE_DUPLICATES=false \\
        VALIDATION_STRINGENCY=SILENT", bam, temp.bam,   
        temp.file, temp.dir))
    stopifnot(code==0L)
    file.rename(temp.bam, bam)
}

Finally, we create indices for all of the BAM files.

indexBam(bam.files)

Wrapping up

We delete all of the unnecessary files that were generated during this procedure.

unlink(all.fastq)
unlink(group.fastq)
unlink(temp.dir, recursive=TRUE)
unlink(temp.file)

We also show the session information.

sessionInfo()


LTLA/chipseqDBData documentation built on July 3, 2021, 9:09 a.m.