calrate: Calculate gene elongation rate from a pair of Pro-seq or...
In yuabrahamliu/proRate: proRate is an R package to infer gene transcription rates with a novel least sum of squares method.

calrate

R Documentation

Calculate gene elongation rate from a pair of Pro-seq or Gro-seq data

Description

Calculate gene elongation rate from a pair of Pro-seq or Gro-seq data, using the LSS (least sum of squares) or HMM (hidden Markov model) method.

Usage

calrate(
  time1file,
  time2file,
  targetfile = NULL,
  gene_ids = NULL,
  genomename = "mm10",
  time,
  strandmethod = 1,
  threads = 1,
  lencutoff = 70000,
  fpkmcutoff = 1,
  startshorten = 1000,
  endshorten = 1000,
  window_num = 40,
  method = "LSS",
  pythonpath = NULL,
  hmmseed = 1234,
  difftype = 1,
  utr = FALSE,
  utrexts = NULL,
  textsize = 13,
  titlesize = 15,
  face = "bold"
)

Arguments

`time1file`	The reference Pro-seq/Gro-seq bam file, corresponding to the experimental condition of no transcriptional inhibitor treatment. Can be a string indicating the directory of the file, or a GAlignmentPairs object, a GAlignments object, or a GRanges object from the original bam file.
`time2file`	The treatment Pro-seq/Gro-seq bam file, corresponding to the experimental condition of transcriptional inhibitor treatment for a specific time (e.g. DRB treatment for 15 min). Can be a string indicating the directory of the bam file, or a GAlignmentPairs object, a GAlignments object, or a GRanges object from the original file.
`targetfile`	A txt file with the genes whose transcriptional rates need to be calculated. Should contain columns named as chr, start, end, strand, and gene_id. It can also be NULL, so that the genes in the genome set by the parameter `genomename` will be analyzed. However, in any case, the genes should have a length longer than the one set by the parameter `lencutoff`, and also longer than the one of 2*(`startshorten` + `endshorten`), which is set by the parameters `startshorten` and `endshorten`.
`gene_ids`	A vector with gene symbols indicating the ones need to be analyzed. In addition to `targetefile` and `genomename`, this parameter also indicates the genes to be analyzed. The final ones should belong to the intersection of these parameters, and they also need to have a length longer than the one set by the parameter `lencutoff`, and also longer than the one of 2*(`startshorten` + `endshorten`), which is set by the parameters `startshorten` and `endshorten`. Can also be NULL, so that no restriction will be added from it.
`genomename`	Specify the genome of the genes to be analyzed, when the parameter `targetfile` is NULL. Can be "mm10" for mouse or "hg38" for human.
`time`	An integer indicating the inhibitor treatment time difference between the `time1file` and the `time2file`, using min as its unit.
`strandmethod`	Indicate the strand specific method used when preparing the sequencing library, can be 1 for the directional ligation method, 2 for the dUTP method, and 0 for a non-strand specific library. In addition, if the sample is sequenced using a single strand method, set it as 3.
`threads`	Number of threads to perform parallelization. Default is 1.
`lencutoff`	The cutoff on gene length (bp). Only genes longer than this cutoff can be considered for analysis. Default is 70000.
`fpkmcutoff`	The cutoff value on gene FPKM. Only genes with an FPKM value greater than the cutoff in `time1file` can be considered for analysis. Default is 1.
`startshorten`	Before inferring a gene's transcription rate, its first 1000 bp (or other length) and last 1000 bp (or other length) regions will be discarded to avoid the unstable reads at the transcription starting and ending stages. However, these regions' lengths can be changed by setting this parameter `startshorten` and the other `endshorten`. This one is used to set the length of the transcription starting region. Its default value is 1000, so that the first 1000 bp region will be discarded.
`endshorten`	Before inferring a gene's transcription rate, its first 1000 bp (or other length) and last 1000 bp (or other length) regions will be discarded to avoid the unstable reads at the transcription starting and ending stages. However, these regions' lengths can be changed by setting this parameter `endshorten` and the other `startshorten`. This one is used to set the length of the transcription ending region. Default is 1000, so that the last 1000 bp region will be discarded.
`window_num`	Before inferring a gene's transcription rate, the function will divide this gene into 40 bins (or other bin number). For each bin, the normalized read count ratio between the treatment and the reference files will be calculated, so a vector with 40 ratios (or other bin number) will be generated. Then, the LSS or HMM method will be used to find the transition bin between the gene's transcription inhibited region and the normal reads region. After that, this identified transition bin and its downstream neighbor will be merged and expanded to the single-base-pair level, and the LSS or HMM method will be further used on them to find the transition base pair in this region. The parameter `window_num` here is used to set the bin number to be divided for each gene. Default value is 40.
`method`	The method to be used for transcription rate inference. The default value is "LSS", so that the least sum of squares method will be used. Can also be "HMM", so that the hidden Markov model will be used.
`pythonpath`	The HMM method is base on `Python`, so the directory of the `Python` interpreter you want to use should be transferred to the function via this parameter, and two `Python` modules should be installed in your `Python` environment, including `numpy` and `hmmlearn`.
`hmmseed`	The HMM method involves random processes, so a random seed should be set via this parameter to repeat the results. Default value is 1234, can also be other integers, such as 2023.
`difftype`	In most cases, the treatment and reference Pro-seq/Gro-seq files are from experiments treating cells with transcription inhibitors, such as DRB (5,6-dichloro-1-beta-d-ribofuranosylbenzimidazole), so that the normal transcription will be repressed for a specific time, generating a reads-depleted region upstream of the normal transcription region. For such inhibitor-based experiments, this parameter `difftype` should be set as 1. However, in some cases, the treatment and reference Pro-seq/Gro- seq files can also come from experiments treating cells with transcription activators, e.g., treating MCF-7 human breast cancer cells with E2 (17- beta-estradiol), making the reads-depleted region downstream, rather than upstream, of the normal transcription region, which is in contrast to the DRB (inhibitor) experiments. For such activator-based experiments, this parameter should be set as 2. In addition, this function `calrate` can also infer proximal polyA alternative sites for genes, and in this case, the parameter `time1file` should be an RNA-seq file with genes using distal polyA sites; the parameter `time2file` needs an RNA-seq file with genes using proximal polyA sites; another parameter `utr` should be set as TRUE; and the parameter `difftype` here should be set as 2. The default value of `difftype` is 1.
`utr`	In addition to inferring transcription rates from Pro-seq/Gro-seq data, `calrate` can also infer proximal polyA alternative sites for genes. In this case, the parameter `time1file` should be an RNA-seq file with genes using distal polyA sites; the parameter `time2file` needs an RNA-seq file with genes using proximal polyA sites; the parameter `difftype` should be set as 2; and the current parameter `utr` should be set as TRUE. The default value of `utr` is FALSE, so the function will perform inference on transcription rates, not on proximal polyA sites.
`utrexts`	When the former parameter `utr` is set as TRUE to infer proximal polyA sites for genes, this parameter can be used to provide a txt file with the genes' last exons whose proximal polyA sites need to be identified. Should contain columns named as chr, start, end, strand, and gene_id. It can also be set as NULL, so that the genes' last exons in the genome set by the parameter `genomename` will be analyzed. However, in the latter case, the original last exons from the genome will be first adjusted so that for the ones with a length > 10000 bp, the proximal polyA sites will be inferred directly in them, but for the ones with a length <= 10000 bp, their lengths will be extended to 10000 bp first, and then the proximal sites will be identified within the extended exons. On the other hand, if the last exons are provided with this parameter `utrexts`, they will never be extended to 10000 bp, and the proximal polyA inference will be performed directly on them. In addition to the proximal sites, the distal ones will also be defined by the function from the `time1file` and `time2file`'s reads. It is performed with a sliding window method on the last exon regions, and this step is before the proximal polyA sites inference but after the exon extension step. It should also be noted that the last exons should have an FPKM value in the `time1file` greater than the cutoff set by `fpkmcutoff`.
`textsize`	In addition to returning a data frame to show the inference results, this function will also generate several plots to show them, and the font size for the plot texts is set by this parameter. Default is 13.
`titlesize`	The font size for the plot titles. Default is 15.
`face`	The font face for the plot texts. Default is "bold".

Value

A list including a slot named "report", which is a data frame with the inferred transcription elongation rates, or genes' proximal and distal polyA sites, as well as other information, such as the genes' coordinates, the results' significance, etc. In addition, the result list also contains other slots, such as "binplots" and "expandplots", which contains the data that can be used to plot the inference results.

Examples

library(proRate)

wt0file <- system.file("extdata", "wt0.bam", package = "proRate")
wt15file <- system.file("extdata", "wt15.bam", package = "proRate")

wtrates <- calrate(time1file = wt0file, 
                  time2file = wt15file, 
                  time = 15, 
                  strandmethod = 1, 
                  
                  genomename = "mm10", 
                  lencutoff = 40000, 
                  fpkmcutoff = 1, 
                  
                  threads = 4, 
                  
                  startshorten = 1000, 
                  endshorten = 1000, 
                  window_num = 40, 
                  
                  method = "LSS", 
                  pythonpath = NULL, 
                  
                  difftype = 1)

yuabrahamliu/proRate documentation built on Nov. 3, 2024, 10:14 a.m.

yuabrahamliu/proRate index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

yuabrahamliu/proRate
proRate is an R package to infer gene transcription rates with a novel least sum of squares method.

calrate: Calculate gene elongation rate from a pair of Pro-seq or...
In yuabrahamliu/proRate: proRate is an R package to infer gene transcription rates with a novel least sum of squares method.

Calculate gene elongation rate from a pair of Pro-seq or Gro-seq data

Description

Usage

Arguments

Value

Examples

Related to calrate in yuabrahamliu/proRate...

R Package Documentation

Browse R Packages

We want your feedback!

yuabrahamliu/proRate proRate is an R package to infer gene transcription rates with a novel least sum of squares method.

calrate: Calculate gene elongation rate from a pair of Pro-seq or... In yuabrahamliu/proRate: proRate is an R package to infer gene transcription rates with a novel least sum of squares method.

Calculate gene elongation rate from a pair of Pro-seq or Gro-seq data

Description

Usage

Arguments

Value

Examples

Related to calrate in yuabrahamliu/proRate...

R Package Documentation

Browse R Packages

We want your feedback!

yuabrahamliu/proRate
proRate is an R package to infer gene transcription rates with a novel least sum of squares method.

calrate: Calculate gene elongation rate from a pair of Pro-seq or...
In yuabrahamliu/proRate: proRate is an R package to infer gene transcription rates with a novel least sum of squares method.