featureScores: Get scores at regular sample points around genomic features.
In markrobinsonuzh/Repitools: Epigenomic tools

featureScores

R Documentation

Get scores at regular sample points around genomic features.

Description

Given a GRanges / GRangesList object, or BAM file paths, of reads for each experimental condition, or a matrix or an AffynetrixCelSet, or a numeric matrix of array data, where the rows are probes and the columns are the different samples,and an anntotation of features of interest, scores at regularly spaced positions around the features is calculated. In the case of sequencing data, it is the smoothed coverage of reads divided by the library size. In the case of array data, it is array intensity.

Usage

The ANY,data.frame method:
featureScores(x, anno, ...)
The ANY,GRanges method:
featureScores(x, anno, up = NULL, down = NULL, ...)

Arguments

x:: Paths to BAM files, a collection of mapped short reads, or a collection of microarray data.
anno:: Annotation of the features to sample around.
p.anno:: A data.frame with columns chr, position, an optionally index. Only provide this if x is array data. If index is not provided, the rows are assumed to be in the same order as the elements of x.
mapping:: A mapping between probes and genes, as made by annotationLookup. Avoids re-computing the mapping if it has already been done. Only provide this if x is array data.
chrs:: A mapping between chromosome names in an ACP file to the user's feature annotation. Only provide this if x is an AffymetrixCelSet. There is no need to provide this if the feature annotation uses the same chromosome names as the ACP files do. Element i of this vector is the name to give to the chromosome numbered i in the ACP information.
up:: How far to go from the features' reference points in one direction.
down:: How far to go from the features' reference points in the opposite direction.
dist:: The type of distance measure to use, in determining the boundaries of the sampling area. Only provide this if x is sequencing data. Default: "base". "percent" is also accepted.
freq:: Score sampling frequency.
log2.adj:: Whether to log2 scale the array intensities. Only provide this if x is array data. Default: TRUE.
s.width:: The width of smoothing to apply to the coverage. Only provide this if x is sequencing data. This argument is optional. If not provided, then no smoothing is done.
mappability:: A BSgenome object, or list of such objects, the same length as x that has bases for which no mappable reads start at masked by N. If this was provided, then either s.width or tag.len must be provided (but not both).
map.cutoff:: The percentage of bases in a window around each sampling position that must be mappable. Otherwise, the score at that position is repalced by NA. Default: 0.5
tag.len:: Provide this if mappability was provided, but s.width was not.
use.strand:: Whether to only consider reads on the same strand as the feature. Useful for RNA-seq applications.
verbose:: Whether to print the progess of processing. Default: TRUE.

Details

If x is a vector of paths or GRangesList object, then names(x) should contain the types of the experiments.

If anno is a data.frame, it must contan the columns chr, start, and end. Optional columns are strand and name. If anno is a GRanges object, then the name can be present as a column called name in the element metadata of the GRanges object. If names are given, then the coverage matrices will use the names as their row names.

An approximation to running mean smoothing of the coverage is used. Reads are extended to the smoothing width, rather than to their fragment size, and coverage is used directly. This method is faster than a running mean of the calculated coverage, and qualtatively almost identical.

If providing a matrix of array intensity values, the column names of this matrix are used as the names of the samples.

The annotation can be stranded or not. if the annotation is stranded, then the reference point is the start coordinate for features on the + strand, and the end coordinate for features on the - strand. If the annotation is unstranded (e.g. annotation of CpG islands), then the midpoint of the feature is used for the reference point.

The up and down values give how far up and down from the reference point to find scores. The semantics of them depend on if the annotation is stranded or not. If the annotation is stranded, then they give how far upstream and downstream will be sampled. If the annotation is unstranded, then up gives how far towards the start of a chromosome to go, and down gives how far towards the end of a chromosome to go.

If sequencing data is being analysed, and dist is "percent", then they give how many percent of each feature's width away from the reference point the sampling boundaries are. If dist is "base", then the boundaries of the sampling region are a fixed width for every feature, and the units of up and down are bases. up and down must be identical if the features are unstranded. The units of freq are percent for dist being "percent", and bases for dist being "base".

In the case of array data, the sequence of positions described by up, down, and freq actually describe the boundaries of windows, and the probe that is closest to the midpoint of each window is chosen as the representative score of that window. On the other hand, when analysing sequencing data, the sequence of positions refer to the positions that coverage is taken for.

Providing a mappability object for sequencing data is recommended. Otherwise, it is not possible to know if a score of 0 is because the window around the sampling position is unmappable, or if there were really no reads mapping there in the experiment. Coverage is normalised by dividing the raw coverage by the total number of reads in a sample. The coverage at a sampling position is multiplied by 1 / mappability. Any positions that have mappabilty below the mappability cutoff will have their score set to NA.

Value

A ScoresList object, that holds a list of score matrices, one for each experiment type, and the parameters that were used to create the score matrices.

Author(s)

Dario Strbenac, with contributions from Matthew Young at WEHI.

Examples

  data(chr21genes)
  data(samplesList) # Loads 'samples.list.subset'.

  fs <- featureScores(samples.list.subset[1:2], chr21genes, up = 2000, down = 1000,
                      freq = 500, s.width = 500)

markrobinsonuzh/Repitools documentation built on March 20, 2024, 6:04 a.m.