fitBiasModels: Fit bias models over single-isoform genes

View source: R/fit_bias.R

fitBiasModelsR Documentation

Fit bias models over single-isoform genes

Description

This function estimates parameters for one or more bias models for a single sample over a set of single-isoform genes. ~100 medium to highly expressed genes should be sufficient to estimate the parameters robustly.

Usage

fitBiasModels(genes, bam.file, fragtypes, genome, models, readlength, minsize,
  maxsize, speedglm = TRUE, gc.knots = seq(from = 0.4, to = 0.6, length =
  3), gc.bk = c(0, 1), relpos.knots = seq(from = 0.25, to = 0.75, length =
  3), relpos.bk = c(0, 1))

Arguments

genes

a GRangesList with the exons of different single-isoform genes

bam.file

a character string pointing to an indexed BAM file

fragtypes

the output of buildFragtypes. must contain the potential fragment types for the genes named in genes

genome

a BSgenome object

models

a list of lists: the outer list describes multiple models. each element of the inner list has two elements: formula and offset. formula should be a character strings of an R formula describing the bias models, e.g. "count ~ ns(gc) + gene". The end of the string is required to be "+ gene". offset should be a character vector listing possible bias offsets to be used ("fraglen" or "vlmm"). Either offset or formula can be NULL for a model. See vignette for recommendations and details.

readlength

the read length

minsize

the minimum fragment length to model

maxsize

the maximum fragment length to model

speedglm

logical, whether to use speedglm to estimate the coefficients. Default is TRUE.

gc.knots

knots for the GC splines

gc.bk

boundary knots for the GC splines

relpos.knots

knots for the relative position splines

relpos.bk

boundary knots for the relative position splines

Value

a list with elements: coefs, summary, models, model.params, and optional offets: fraglen.density, vlmm.fivep, and vlmm.threep.

  • coefs gives the estimated coefficients for the different models that specified formula.

  • summary gives the tables with coefficients, standard errors and p-values,

  • models stores the incoming models list,

  • model.params stores parameters for the models, such as knot locations

  • fraglen.density is a estimated density object for the fragment length distribution,

  • vlmm.fivep and vlmm.threep store the observed and expected tabulations for the different orders of the VLMM for read start bias.

References

The complete bias model including fragment sequence bias is described in detail in the Supplemental Note of the following publication:

Love, M.I., Hogenesch, J.B., and Irizarry, R.A., Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nature Biotechnologyh (2016) doi: 10.1038/nbt.3682

The read start variable length Markov model (VLMM) for addressing bias introduced by random hexamer priming was introduced in the following publication (the sequence bias model used in Cufflinks):

Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L., and Pachter, L., Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011) doi: 10.1186/gb-2011-12-3-r22

Examples


# see vignette for a more realistic example

# these next lines just write out a BAM file from R
# typically you would already have a BAM file
library(alpineData)
library(GenomicAlignments)
library(rtracklayer)
gap <- ERR188088()
dir <- system.file(package="alpineData", "extdata")
bam.file <- c("ERR188088" = file.path(dir,"ERR188088.bam"))
export(gap, con=bam.file)

library(GenomicRanges)
library(BSgenome.Hsapiens.NCBI.GRCh38)
data(preprocessedData)

readlength <- 75
minsize <- 125 # see vignette how to choose
maxsize <- 175 # see vignette how to choose

# here a very small subset, should be ~100 genes
gene.names <- names(ebt.fit)[6:8] 
names(gene.names) <- gene.names
fragtypes <- lapply(gene.names, function(gene.name) {
                      buildFragtypes(ebt.fit[[gene.name]],
                                     Hsapiens, readlength,
                                     minsize, maxsize)
})
models <- list(
  "GC" = list(formula = "count ~ ns(gc,knots=gc.knots, Boundary.knots=gc.bk) + gene",
              offset=c("fraglen","vlmm"))
)

fitpar <- fitBiasModels(genes=ebt.fit[gene.names],
                        bam.file=bam.file,
                        fragtypes=fragtypes,
                        genome=Hsapiens,
                        models=models,
                        readlength=readlength,
                        minsize=minsize,
                        maxsize=maxsize)


mikelove/alpine documentation built on June 9, 2024, 11:37 a.m.