predictStrand

Share:

Description

The function evaluates transcription initiation within a peak region by comparing RNA-seq read densities upstream and downstream of an empirically determined transcription start sites. Putative transcription of both forward and reverse genomic strands is tested and the results are stored with each ChIP-seq peak.

Usage

1
2
3
4
5
6
predictStrand(cdsObj, tdsObj, coverage.cutoff, quant.cutoff = 0.1,
  win.size = 2500, prob.cutoff)

## S4 method for signature 'ChipDataSet,TranscriptionDataSet'
predictStrand(cdsObj, tdsObj,
  coverage.cutoff, quant.cutoff = 0.1, win.size = 2500, prob.cutoff)

Arguments

cdsObj

A ChipDataSet object.

tdsObj

A TranscriptionDataSet object.

coverage.cutoff

Numeric. A cutoff value to discard regions with the low fragments coverage, representing expression noise. By default, the value stored in the coverageCutoff slot of the supplied TranscriptionDataSet object is used. The optimal cutoff value can be calculated by estimateBackground function call.

quant.cutoff

Numeric. A cutoff value for the cumulative distribution of the RNA-seq signal along the ChIP-seq peak region. Must be in a range (0, 1). For the details, see step 1 in the "Details" section below. Default: 0.1.

win.size

Numeric. The size of the q1 and q2 regions flanking transcription start position at the 5' and 3', respectively. For the details, see step 2 in the "Details" section below. Default: 2500.

prob.cutoff

Numeric. A cutoff value for the probability of reads to be sampled from the q2 flanking region. If not supplied, the value estimated from the data will be used. Must be in a range (0, 1). For the details, see step 6 in the "Details" section below.

Details

RNA-seq data is incorporated to find direct evidence of active transcription from every putatively gene associated peak. In order to do this, we determine the 'strandedness' of the ChIP-seq peaks, using strand specific RNA-seq data. The following assumptions are made in order to retrieve the peak 'strandedness':

  • The putatively gene associated ChIP-seq peaks are commonly associated with transcription initiation.

  • This transcription initiation occurs within the ChIP peak region.

  • When a ChIP peak is associated with a transcription initiation event, we expect to see a strand-specific increase in RNA-seq fragment count downstream the transcription initiation site.

Each peak in the data set is tested for association with transcription initiation on both strands of DNA. Steps 1-5 are performed for both forward and reverse DNA strand separately and step 6 combines the data from both strands. If the peak is identified as associated with the transcription on both strands, than it is considered to be a bidirectional.

ChIP peak 'strandedness' prediction steps:

  1. Identify a location within the ChIP-seq peak near the transcription start site. This is accomplished by calculating the cumulative distribution of RNA-seq fragments within a peak region. The position is determined where 100% - 'quant.cutoff' * 100% of RNA-seq fragments are located downstream. This approach performs well on both gene-poor and gene-dense regions where transcripts may overlap.

  2. Two equally sized regions are defined (q1 and q2), flanking the position identified in (1) on both sides. RNA-seq fragments are counted in each region.

  3. ChIP peaks with an RNA-seq fragment coverage below an estimated threshold are discarded from the analysis.

  4. The probability is calculated for RNA-seq fragments to be sampled from either q1 or q2. Based on the assumptions we stated above, a ChIP peak that is associated with transcription initiation should have more reads in q2 (downstream of the transcription start position) compared to q1, and subsequently, the probability of a fragment being sampled from q2 would be higher.

  5. ChIP-seq peaks are divided into gene associated and background based on the prediction.

  6. Iteratively, the optimal P(q2) threshold is identified, which balances out the False Discovery Rate (FDR) and False Negative Rate (FNR). Peaks with the P(q2) exceeding the estimated threshold are considered to be associated with the transcription initiation event.

Value

The slot strandPrediction of the provided ChipDataSet object will be updated by the the following elements: 'predicted.strand', 'probability.cutoff', 'results.plus' and 'results.minus'.

Author(s)

Armen R. Karapetyan

See Also

ChipDataSet constructCDS

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
### Load TranscriptionDataSet object
data(tds)

### Load ChipDataSet object
data(cds)

### Classify peaks on gene associated and background
predictTssOverlap(object = cds, feature = "pileup", p = 0.75)

### Predict peak 'strand'
predictStrand(cdsObj = cds, tdsObj = tds, coverage.cutoff = 5,
quant.cutoff = 0.1, win.size = 2500)

### View a short summary of the 'strand' prediction
cds

### View 'strand' prediction
getPeaks(cds)