Filtering methods | R Documentation |
Implementations of the direct and trended filtering strategies for bin pair abundances.
filterDirect(data, prior.count=2, reference=NULL, assay.data=1, assay.ref=1)
filterTrended(data, span=0.25, prior.count=2, reference=NULL, assay.data=1, assay.ref=1)
data |
an InteractionSet object produced by |
span |
a numeric scalar specifying the bandwidth for loess curve fitting |
prior.count |
a numeric scalar indicating the prior count to use for calculating the average abundance |
reference |
another InteractionSet object, usually containing data for larger bin pairs |
assay.data |
a string or integer scalar specifying the count matrix to use from |
assay.ref |
a string or integer scalar specifying the count matrix to use from |
The filterDirect
function implements the direct filtering strategy.
The rate of non-specific ligation is estimated as the median of average abundances from inter-chromosomal bin pairs.
This rate or some multiple thereof can be used as a minimum threshold for filtering, to keep only high-abundance bin pairs.
When calculating the median, some finesse is required to consider empty parts of the interaction space, i.e., areas that are not represented by bin pairs.
The filterTrended
function implements the trended filtering strategy.
The rate of non-specific compaction is estimated by fitting a trend to the average abundances against the log-distance for all intra-chromosomal bin pairs.
This rate can then be used as a minimum threshold for filtering.
For inter-chromosomal bin pairs, the threshold is the same as that from the direct filter.
Curve fitting in filterTrended
is done using loessFit
with a bandwidth of span
.
Lower values may need to be used for a more accurate fit when the trend is highly non-linear.
The bin size is also added to the distance prior to log-transformation, to avoid problems with undefined values when distances are equal to zero.
Empty parts of the interaction space are considered by inferring the abundances and distances of the corresponding bin pairs (though this is skipped if too much of the space is empty).
If reference
is specified, it will be used to compute filter thresholds instead of data
.
This is intended for large bin pairs that have been loaded with filter=1
.
Larger bins provide larger counts for more precise threshold estimates, while the lack of filtering ensures that estimates are not biased.
All threshold estimates are adjusted to account for differences in bin sizes between reference
and data
.
The final values can be used to directly filter on abundances in data
; check out the user's guide for more details.
A list is returned containing abundances
, a numeric vector with the average abundances of all bin pairs in data
.
For filterDirect
, the list contains a numeric scalar threshold
, i.e., the non-specific ligation rate.
For filterTrended
, the list contains threshold
, a numeric vector containing the threshold for each bin pair; and log.distance
, a numeric vector with the log-distances for each bin pair.
If reference
is specified in either function, an additional list named ref
is also returned.
This contains the filtering information for the bin pairs in reference
, same as that reported above for each bin pair in data
.
Aaron Lun
Lin, YC et al. (2012) Global changes in the nuclear positioning of genes and intra- and interdomain genomic interactions that orchestrate B cell fate. Nat. Immunol. 13. 1196-1204
squareCounts
,
scaledAverage
# Setting up the object.
a <- 10
b <- 20
regions <- GRanges(rep(c("chrA", "chrB"), c(a, b)), IRanges(c(1:a, 1:b), c(1:a, 1:b)))
set.seed(138153)
npairs <- 500
all.anchor1 <- sample(length(regions), npairs, replace=TRUE)
all.anchor2 <- as.integer(runif(npairs, 1, all.anchor1+1))
counts <- matrix(rnbinom(npairs*4, mu=10, size=10), npairs, 4)
y <- InteractionSet(list(counts=counts), GInteractions(anchor1=all.anchor1,
anchor2=all.anchor2, regions=regions, mode="reverse"),
colData=DataFrame(totals=colSums(counts)), metadata=List(width=1))
# Requiring at least 1.5-fold change.
direct <- filterDirect(y)
keep <- direct$abundances > direct$threshold + log2(1.5)
y[keep,]
# Requiring to be above the threshold.
trended <- filterTrended(y)
keep <- trended$abundances > trended$threshold
y[keep,]
# Running reference comparisons, using larger bin pairs.
w <- 5L
a2 <- a/w
b2 <- b/w
rep.regions <- GRanges(rep(c("chrA", "chrB"), c(a2, b2)),
IRanges(c(1:a2, 1:b2)*w-w+1L, c(1:a2, 1:b2)*w))
npairs2 <- 20
rep.anchor1 <- sample(length(rep.regions), npairs2, replace=TRUE)
rep.anchor2 <- as.integer(runif(npairs2, 1, rep.anchor1+1))
y2 <- InteractionSet(list(counts=matrix(rnbinom(npairs2*4, mu=10*w^2, size=10), npairs2, 4)),
GInteractions(anchor1=rep.anchor1, anchor2=rep.anchor2, regions=rep.regions, mode="reverse"),
colData=DataFrame(totals=y$totals), metadata=List(width=w))
direct2 <- filterDirect(y, reference=y2)
sum(direct2$abundances > direct2$threshold + log2(1.5))
trended2 <- filterTrended(y, reference=y2)
sum(trended2$abundances > trended2$threshold)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.