ChIP-seq signal quantifier (CSSQ)"
In CSSQ: Chip-seq Signal Quantifier Pipeline

library(BiocStyle)
# Decide whether to display parts for BioC (TRUE) or F1000 (FALSE)
on.bioc <- TRUE
library(knitr)
# Use fig.width = 7 for html and fig.width = 6 for pdf
fig.width <- ifelse(on.bioc, 8, 6)
knitr::opts_chunk$set(cache = 2, warning = FALSE, message = FALSE, error = FALSE,
  cache.path = "cache/", fig.path = "figure/", fig.width = fig.width)

Introduction

This vignette introduces the package Bioconductor package r Biocpkg("CSSQ"). This package is designed to perform differential binding analysis for ChIP-seq samples. Differential binding is when there is a significant difference in the signal intensities at a region between two groups. The idea behind CSSQ is to implement a statistically robust pipeline which is built for ChIP-seq datasets to identify regions of differential binding among a set of interesting regions. It contains functions to quantify the signal over a predefined set of regions from the user, transform the data, normalize across samples and perform statistical tests to determine differential binding. There is an ever-increasing number of ChIP-seq samples, and this opens the door to compare these binding events between two groups of samples. CSSQ does this while considering the signal to noise ratios, paired control experiments and the lack of multiple (>2) replicates in a typical ChIP-seq experiment. It controls the false discovery rates while not compromising in its sensitivity. This document will give a brief overview of the processing steps.

Processing summary

The workflow for CSSQ is as follows

Quantify regions of interest - A bed file with the regions of interest is used to count the number of reads aligning to these regions for the samples and their respective controls.
Data processing - The raw count data is used to perform background correction, data transformation and normalization across samples.
Identifying differential regions - The normalized count data is used to perform statistical tests to identify differentially bound regions.

Example

To use CSSQ, you will require a bed file containing the regions of interest as well as sample information file which contains the necessary information about the samples being analyzed. Below are examples of both files.

library(CSSQ)
regionBed <- read.table(system.file("extdata", "chr19_regions.bed", package="CSSQ",mustWork = TRUE))
sampleInfo <- read.table(system.file("extdata", "sample_info.txt", package="CSSQ",mustWork = TRUE),sep="\t",header=TRUE)
head(regionBed)
sampleInfo

CSSQ provides two options for obtaining raw count data used to perform the analysis. The first option is to ask CSSQ to count the reads and perform background correction. The following commands will perform the steps and return an object with count data and meta data.

countData <- getRegionCounts(system.file("extdata", "chr19_regions.bed", package="CSSQ"),sampleInfo,sampleDir = system.file("extdata", package="CSSQ"))
countData
head(assays(countData)$countData)
colData(countData)
rowRanges(countData)

The second option is to provide CSSQ with the count data in a tab separated format. The count data given will directly be used in transformation and normalization steps and no background correction will be performed.

countData <- loadCountData(system.file("extdata", "sample_count_data.txt", package="CSSQ",mustWork = TRUE),system.file("extdata", "chr19_regions.bed", package="CSSQ"),sampleInfo)
countData
head(assays(countData)$countData)
colData(countData)
rowRanges(countData)

Anscombe transformation is then applied to the count data using the ansTransform function. This function also has an option parameter to plot the data distribution of the data before and after transformation.

ansCountData <- ansTransform(countData)
ansCountData
head(assays(ansCountData)$ansCount)

Anscombe transformed data is then normalized across samples. Normalization in CSSQ using the normalizeData function. It takes as input the number of clusters to use in the underlying k-means algorithm that is used in the normalization process. It is advised to choose the number of clusters by analyzing the data distribution.

normInfo <- normalizeData(ansCountData,numClusters=4)
normInfo
head(assays(normInfo)$normCount)

The normalized data is then used by the DBAnalyze function to perform statistical tests to identify differentially bound regions. CSSQ utilized non-parametric approaches to calculate the test statistics and to calculate the p-value from the test statistics. This is done due to prevalent trend observed in ChIP-seq datasets of not having more than two replicates. The resulting GRanges object from this function contains the Benjamini Hochberg adjusted P-value and a Fold change for each of the input regions.

result <- DBAnalyze(normInfo,comparison=c("HSMM","HESC"))
result

For convenience, most of the above steps are wrapped in preprocessData function to perform them all at one go. All the above steps can be performed using the below commands.

#To let CSSQ calculate the counts
processedData <- preprocessData(system.file("extdata", "chr19_regions.bed", package="CSSQ"),system.file("extdata", "sample_info.txt", package="CSSQ"),sampleDir = system.file("extdata", package="CSSQ"),numClusters=4,noNeg=TRUE,plotData=FALSE)
#To provide CSSQ with the counts 
processedData <- preprocessData(system.file("extdata", "chr19_regions.bed", package="CSSQ"),system.file("extdata", "sample_info.txt", package="CSSQ"),inputCountData = system.file("extdata", "sample_count_data.txt", package="CSSQ",mustWork = TRUE),numClusters=4,noNeg=TRUE,plotData=FALSE)
#To find differential binding sites
result <- DBAnalyze(processedData,comparison=c("HSMM","HESC"))
processedData
result