mcalrate: Calculate gene elongation rate for multiple pairs of Pro-seq...
In yuabrahamliu/proRate: proRate is an R package to infer gene transcription rates with a novel least sum of squares method.

mcalrate

R Documentation

Calculate gene elongation rate for multiple pairs of Pro-seq or Gro-seq data

Description

Calculate gene elongation rate for multiple pairs of Pro-seq or Gro-seq data with the LSS (least sum of squares) or HMM (hidden Markov model) method.

Usage

mcalrate(
  time1files,
  time2files,
  targetfile = NULL,
  gene_ids = NULL,
  genomename = "mm10",
  times,
  strandmethod = 0,
  threads = 1,
  mergerefs = TRUE,
  mergecases = FALSE,
  lencutoff = 70000,
  fpkmcutoff = 1,
  startshorten = 1000,
  endshorten = 1000,
  window_num = 40,
  method = "LSS",
  pythonpath = NULL,
  hmmseed = 1234,
  difftype = 1,
  utr = FALSE,
  utrexts = NULL,
  textsize = 13,
  titlesize = 15,
  face = "bold"
)
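
A call might look like the sketch below. The argument names come from the signature above; the bam file names, the `strandmethod` and `threads` values, and the `can_run` guard are illustrative assumptions, not part of the package. The guard only attempts the call when proRate is installed and the files exist.

```r
# Hypothetical input files; replace with real bam paths.
time1files <- c("ctrl_rep1.bam", "ctrl_rep2.bam")   # untreated reference samples
time2files <- c("drb_15min.bam", "drb_30min.bam")   # inhibitor-treated samples
times <- c(15, 30)                                  # minutes, one per matched pair

# Each treatment file needs a matching elapsed time.
stopifnot(length(times) == length(time2files))

# Only run when the package and the (hypothetical) files are available.
can_run <- requireNamespace("proRate", quietly = TRUE) &&
  all(file.exists(c(time1files, time2files)))

if (can_run) {
  res <- proRate::mcalrate(
    time1files = time1files,
    time2files = time2files,
    genomename = "mm10",
    times = times,
    strandmethod = 1,   # assumed library type; set to match your protocol
    threads = 2,
    mergerefs = TRUE,
    method = "LSS"
  )
  # Per the Value section, each sub-list carries a "report" data frame.
  head(res[[1]]$report)
}
```
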

Arguments

`time1files`	The reference Pro-seq/Gro-seq bam files, from the experimental condition without transcriptional inhibitor treatment. Should be a character vector in which each element is the path to a bam file.
`time2files`	The treatment Pro-seq/Gro-seq bam files, from cells treated with a transcriptional inhibitor for specific times (e.g. DRB treatment for 15 min, 30 min, etc.). Should be a character vector in which each element is the path to a bam file.
`targetfile`	A txt file listing the genes whose transcription rates need to be calculated. It should contain columns named chr, start, end, strand, and gene_id. It can also be NULL, in which case the genes of the genome set by the parameter `genomename` will be analyzed. In either case, a gene is analyzed only if it is longer than `lencutoff` and also longer than 2*(`startshorten` + `endshorten`).
`gene_ids`	A vector of gene symbols indicating the genes to be analyzed. It restricts the gene set in addition to `targetfile` and `genomename`: the final genes are the intersection of these parameters, and each must be longer than `lencutoff` and also longer than 2*(`startshorten` + `endshorten`). It can also be NULL, in which case it adds no restriction.
`genomename`	Specify the genome of the genes to be analyzed, when the parameter `targetfile` is NULL. Can be "mm10" for mouse or "hg38" for human.
`times`	The treatment time differences between the `time1files` and the `time2files`, in minutes. Should be a vector in which each element is the time difference between the matched elements of the `time1files` and `time2files` vectors.
`strandmethod`	The strand-specific method used when preparing the sequencing libraries. Can be 1 for the directional ligation method, 2 for the dUTP method, or 0 for non-strand-specific libraries. If the samples were sequenced using a single strand method, set it to 3.
`threads`	Number of threads used for parallelization. Default is 1.
`mergerefs`	Whether to merge all the reference data contained in the `time1files` vector to one, and then use it as a unified reference for all the `time2files`. Default is TRUE.
`mergecases`	Whether to merge all the treatment data contained in the `time2files` vector to one. Default is FALSE.
`lencutoff`	The cutoff on gene length (bp). Only genes longer than this cutoff can be considered for analysis. Default is 70000.
`fpkmcutoff`	The cutoff value on gene FPKM. Only genes with an FPKM value greater than the cutoff in the reference data can be considered for analysis. Default is 1.
`startshorten`	Before a gene's transcription rate is inferred, regions at its start and end are discarded to avoid the unstable reads at the transcription starting and ending stages. This parameter sets the length (bp) of the discarded starting region; `endshorten` sets the length of the discarded ending region. Default is 1000, so that the first 1000 bp will be discarded.
`endshorten`	The length (bp) of the region discarded at the end of each gene before its transcription rate is inferred, the counterpart of `startshorten` for the transcription ending stage. Default is 1000, so that the last 1000 bp will be discarded.
`window_num`	The number of bins each gene is divided into before its transcription rate is inferred. For each bin, the normalized read count ratio between the treatment and the reference files is calculated, giving a vector of `window_num` ratios. The LSS or HMM method is then used to find the transition bin between the gene's transcription-inhibited region and its normal-reads region. Next, the transition bin and its downstream neighbor are merged and expanded to single-base-pair resolution, and the LSS or HMM method is applied again to find the transition base pair within this region. Default is 40.
`method`	The method to be used for transcription rate inference. The default value is "LSS", so that the least sum of squares method will be used. Can also be "HMM", so that the hidden Markov model will be used.
`pythonpath`	The HMM method is based on `Python`, so the directory of the `Python` interpreter you want to use should be passed to the function via this parameter. Two `Python` modules, `numpy` and `hmmlearn`, must be installed in that `Python` environment.
`hmmseed`	The HMM method involves random processes, so a random seed should be set via this parameter to make the results reproducible. Default is 1234; other integers, such as 2023, can also be used.
`difftype`	In most cases, the treatment and reference Pro-seq/Gro-seq files come from experiments that treat cells with transcription inhibitors, such as DRB (5,6-dichloro-1-beta-d-ribofuranosylbenzimidazole). Normal transcription is then repressed for a specific time, generating a reads-depleted region upstream of the normal transcription region. For such inhibitor-based experiments, `difftype` should be set to 1. In some cases, the files instead come from experiments that treat cells with transcription activators, e.g. treating MCF-7 human breast cancer cells with E2 (17-beta-estradiol). The reads-depleted region then lies downstream, rather than upstream, of the normal transcription region, the opposite of the DRB (inhibitor) experiments. For such activator-based experiments, `difftype` should be set to 2. In addition, `mcalrate` can also infer alternative proximal polyA sites for genes. To perform this analysis, `time1files` needs to contain RNA-seq files with genes using distal polyA sites, `time2files` should contain RNA-seq files with genes using proximal polyA sites, `utr` needs to be TRUE, and `difftype` should be set to 2. The default value of `difftype` is 1.
`utr`	In addition to inferring transcription rates from Pro-seq/Gro-seq data, `mcalrate` can also infer alternative proximal polyA sites for genes. In this case, `time1files` should contain RNA-seq files with genes using distal polyA sites, `time2files` should contain RNA-seq files with genes using proximal polyA sites, `difftype` should be set to 2, and `utr` should be set to TRUE. The default value of `utr` is FALSE, so the function infers transcription rates, not proximal polyA sites.
`utrexts`	When the parameter `utr` is set to TRUE to infer proximal polyA sites for genes, this parameter can provide a txt file with the genes' last exons whose proximal polyA sites need to be identified. It should contain columns named chr, start, end, strand, and gene_id. It can also be NULL, in which case the last exons of the genes in the genome set by the parameter `genomename` will be analyzed. In that case, the original last exons from the genome are adjusted first: for exons longer than 10000 bp, the proximal polyA sites are inferred directly in them, while exons of 10000 bp or shorter are first extended to 10000 bp, and the proximal sites are then identified within the extended exons. If the last exons are provided via `utrexts`, they are never extended to 10000 bp, and the proximal polyA inference is performed directly on them. In addition to the proximal sites, the function also defines the distal polyA sites from the Pro-seq/Gro-seq pairs given by `time1files` and `time2files`. This uses a sliding window method on the last exons and runs after the exon extension step but before the proximal polyA site inference. Note that each file in `time1files` also sets a cutoff for the last exons to be analyzed: their FPKM values in that file should be greater than `fpkmcutoff`.
`textsize`	In addition to returning a data frame to show the inference results, this function will also generate several plots to show them, and the font size for the plot texts is set by this parameter. Default is 13.
`titlesize`	The font size for the plot titles. Default is 15.
`face`	The font face for the plot texts. Default is "bold".
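
The gene selection rules described for `targetfile`, `gene_ids`, `lencutoff`, `startshorten`, and `endshorten` can be illustrated with a small base R sketch. It is not the package's internal code: the helper `filter_genes`, the toy gene table, and the length definition `end - start + 1` are assumptions made for illustration.

```r
# Toy gene table with the columns required of `targetfile`.
genes <- data.frame(
  chr     = c("chr1", "chr1", "chr2", "chr3"),
  start   = c(1000, 50000, 2000, 10000),
  end     = c(120000, 90000, 3500, 95000),
  strand  = c("+", "-", "+", "-"),
  gene_id = c("GeneA", "GeneB", "GeneC", "GeneD"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep genes that pass both length rules and,
# when `gene_ids` is given, that also appear in it.
filter_genes <- function(genes, gene_ids = NULL, lencutoff = 70000,
                         startshorten = 1000, endshorten = 1000) {
  len <- genes$end - genes$start + 1   # assumed 1-based inclusive length
  keep <- len > lencutoff & len > 2 * (startshorten + endshorten)
  if (!is.null(gene_ids)) {
    keep <- keep & genes$gene_id %in% gene_ids
  }
  genes[keep, , drop = FALSE]
}

eligible <- filter_genes(genes, gene_ids = c("GeneA", "GeneB", "GeneD"))
eligible$gene_id
```

With the default cutoffs, GeneB and GeneC are dropped for being shorter than 70000 bp, so only GeneA and GeneD remain.
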

Value

A list with several sub-lists. Each sub-list includes a slot named "report", which is a data frame with the inferred transcription rates, or the genes' proximal and distal polyA sites, as well as other information, such as the genes' coordinates and the results' significance. Each sub-list also contains other slots, such as "binplots" and "expandplots", which contain the data that can be used to plot the inference results.
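
To make the bin-level search described under `window_num` concrete, the sketch below finds a transition bin in a vector of per-bin ratios with a generic two-segment least-squares fit. This is an assumption about what a least sum of squares search can look like, not a reproduction of proRate's objective function; the helper `lss_transition` and the toy ratios are hypothetical.

```r
# Toy ratios for 40 bins: 15 depleted bins (about 0.1) followed by
# 25 bins at the normal level (about 1).
window_num <- 40
ratios <- c(rep(c(0.08, 0.12), length.out = 15),
            rep(c(0.95, 1.05), length.out = window_num - 15))

# Hypothetical helper: try every split point k, fit one mean to bins 1..k
# and another to bins (k+1)..n, and return the k with the smallest
# total sum of squared residuals.
lss_transition <- function(x) {
  n <- length(x)
  sse <- vapply(seq_len(n - 1), function(k) {
    left <- x[seq_len(k)]
    right <- x[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  which.min(sse)
}

transition_bin <- lss_transition(ratios)
transition_bin
```

Here the split lands on bin 15, the last depleted bin. In the procedure described above, such a bin and its downstream neighbor would then be expanded to base-pair resolution and searched again.
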

yuabrahamliu/proRate documentation built on Nov. 3, 2024, 10:14 a.m.
