calrate: Calculate gene elongation rate from a pair of Pro-seq or...

View source: R/inference_V2_new.R

calrateR Documentation

Calculate gene elongation rate from a pair of Pro-seq or Gro-seq data

Description

Calculate gene elongation rate from a pair of Pro-seq or Gro-seq data, using the LSS (least sum of squares) or HMM (hidden Markov model) method.

Usage

calrate(
  time1file,
  time2file,
  targetfile = NULL,
  gene_ids = NULL,
  genomename = "mm10",
  time,
  strandmethod = 1,
  threads = 1,
  lencutoff = 70000,
  fpkmcutoff = 1,
  startshorten = 1000,
  endshorten = 1000,
  window_num = 40,
  method = "LSS",
  pythonpath = NULL,
  hmmseed = 1234,
  difftype = 1,
  utr = FALSE,
  utrexts = NULL,
  textsize = 13,
  titlesize = 15,
  face = "bold"
)

Arguments

time1file

The reference Pro-seq/Gro-seq bam file, corresponding to the experimental condition of no transcriptional inhibitor treatment. Can be a string indicating the directory of the file, or a GAlignmentPairs object, a GAlignments object, or a GRanges object from the original bam file.

time2file

The treatment Pro-seq/Gro-seq bam file, corresponding to the experimental condition of transcriptional inhibitor treatment for a specific time (e.g. DRB treatment for 15 min). Can be a string indicating the directory of the bam file, or a GAlignmentPairs object, a GAlignments object, or a GRanges object from the original file.

targetfile

A txt file with the genes whose transcriptional rates need to be calculated. Should contain columns named as chr, start, end, strand, and gene_id. It can also be NULL, so that the genes in the genome set by the parameter genomename will be analyzed. However, in any case, the genes should have a length longer than the one set by the parameter lencutoff, and also longer than the one of 2*(startshorten + endshorten), which is set by the parameters startshorten and endshorten.

gene_ids

A vector with gene symbols indicating the ones need to be analyzed. In addition to targetefile and genomename, this parameter also indicates the genes to be analyzed. The final ones should belong to the intersection of these parameters, and they also need to have a length longer than the one set by the parameter lencutoff, and also longer than the one of 2*(startshorten + endshorten), which is set by the parameters startshorten and endshorten. Can also be NULL, so that no restriction will be added from it.

genomename

Specify the genome of the genes to be analyzed, when the parameter targetfile is NULL. Can be "mm10" for mouse or "hg38" for human.

time

An integer indicating the inhibitor treatment time difference between the time1file and the time2file, using min as its unit.

strandmethod

Indicate the strand specific method used when preparing the sequencing library, can be 1 for the directional ligation method, 2 for the dUTP method, and 0 for a non-strand specific library. In addition, if the sample is sequenced using a single strand method, set it as 3.

threads

Number of threads to perform parallelization. Default is 1.

lencutoff

The cutoff on gene length (bp). Only genes longer than this cutoff can be considered for analysis. Default is 70000.

fpkmcutoff

The cutoff value on gene FPKM. Only genes with an FPKM value greater than the cutoff in time1file can be considered for analysis. Default is 1.

startshorten

Before inferring a gene's transcription rate, its first 1000 bp (or other length) and last 1000 bp (or other length) regions will be discarded to avoid the unstable reads at the transcription starting and ending stages. However, these regions' lengths can be changed by setting this parameter startshorten and the other endshorten. This one is used to set the length of the transcription starting region. Its default value is 1000, so that the first 1000 bp region will be discarded.

endshorten

Before inferring a gene's transcription rate, its first 1000 bp (or other length) and last 1000 bp (or other length) regions will be discarded to avoid the unstable reads at the transcription starting and ending stages. However, these regions' lengths can be changed by setting this parameter endshorten and the other startshorten. This one is used to set the length of the transcription ending region. Default is 1000, so that the last 1000 bp region will be discarded.

window_num

Before inferring a gene's transcription rate, the function will divide this gene into 40 bins (or other bin number). For each bin, the normalized read count ratio between the treatment and the reference files will be calculated, so a vector with 40 ratios (or other bin number) will be generated. Then, the LSS or HMM method will be used to find the transition bin between the gene's transcription inhibited region and the normal reads region. After that, this identified transition bin and its downstream neighbor will be merged and expanded to the single-base-pair level, and the LSS or HMM method will be further used on them to find the transition base pair in this region. The parameter window_num here is used to set the bin number to be divided for each gene. Default value is 40.

method

The method to be used for transcription rate inference. The default value is "LSS", so that the least sum of squares method will be used. Can also be "HMM", so that the hidden Markov model will be used.

pythonpath

The HMM method is base on Python, so the directory of the Python interpreter you want to use should be transferred to the function via this parameter, and two Python modules should be installed in your Python environment, including numpy and hmmlearn.

hmmseed

The HMM method involves random processes, so a random seed should be set via this parameter to repeat the results. Default value is 1234, can also be other integers, such as 2023.

difftype

In most cases, the treatment and reference Pro-seq/Gro-seq files are from experiments treating cells with transcription inhibitors, such as DRB (5,6-dichloro-1-beta-d-ribofuranosylbenzimidazole), so that the normal transcription will be repressed for a specific time, generating a reads-depleted region upstream of the normal transcription region. For such inhibitor-based experiments, this parameter difftype should be set as 1. However, in some cases, the treatment and reference Pro-seq/Gro- seq files can also come from experiments treating cells with transcription activators, e.g., treating MCF-7 human breast cancer cells with E2 (17- beta-estradiol), making the reads-depleted region downstream, rather than upstream, of the normal transcription region, which is in contrast to the DRB (inhibitor) experiments. For such activator-based experiments, this parameter should be set as 2. In addition, this function calrate can also infer proximal polyA alternative sites for genes, and in this case, the parameter time1file should be an RNA-seq file with genes using distal polyA sites; the parameter time2file needs an RNA-seq file with genes using proximal polyA sites; another parameter utr should be set as TRUE; and the parameter difftype here should be set as 2. The default value of difftype is 1.

utr

In addition to inferring transcription rates from Pro-seq/Gro-seq data, calrate can also infer proximal polyA alternative sites for genes. In this case, the parameter time1file should be an RNA-seq file with genes using distal polyA sites; the parameter time2file needs an RNA-seq file with genes using proximal polyA sites; the parameter difftype should be set as 2; and the current parameter utr should be set as TRUE. The default value of utr is FALSE, so the function will perform inference on transcription rates, not on proximal polyA sites.

utrexts

When the former parameter utr is set as TRUE to infer proximal polyA sites for genes, this parameter can be used to provide a txt file with the genes' last exons whose proximal polyA sites need to be identified. Should contain columns named as chr, start, end, strand, and gene_id. It can also be set as NULL, so that the genes' last exons in the genome set by the parameter genomename will be analyzed. However, in the latter case, the original last exons from the genome will be first adjusted so that for the ones with a length > 10000 bp, the proximal polyA sites will be inferred directly in them, but for the ones with a length <= 10000 bp, their lengths will be extended to 10000 bp first, and then the proximal sites will be identified within the extended exons. On the other hand, if the last exons are provided with this parameter utrexts, they will never be extended to 10000 bp, and the proximal polyA inference will be performed directly on them. In addition to the proximal sites, the distal ones will also be defined by the function from the time1file and time2file's reads. It is performed with a sliding window method on the last exon regions, and this step is before the proximal polyA sites inference but after the exon extension step. It should also be noted that the last exons should have an FPKM value in the time1file greater than the cutoff set by fpkmcutoff.

textsize

In addition to returning a data frame to show the inference results, this function will also generate several plots to show them, and the font size for the plot texts is set by this parameter. Default is 13.

titlesize

The font size for the plot titles. Default is 15.

face

The font face for the plot texts. Default is "bold".

Value

A list including a slot named "report", which is a data frame with the inferred transcription elongation rates, or genes' proximal and distal polyA sites, as well as other information, such as the genes' coordinates, the results' significance, etc. In addition, the result list also contains other slots, such as "binplots" and "expandplots", which contains the data that can be used to plot the inference results.

Examples

library(proRate)

wt0file <- system.file("extdata", "wt0.bam", package = "proRate")
wt15file <- system.file("extdata", "wt15.bam", package = "proRate")

wtrates <- calrate(time1file = wt0file, 
                  time2file = wt15file, 
                  time = 15, 
                  strandmethod = 1, 
                  
                  genomename = "mm10", 
                  lencutoff = 40000, 
                  fpkmcutoff = 1, 
                  
                  threads = 4, 
                  
                  startshorten = 1000, 
                  endshorten = 1000, 
                  window_num = 40, 
                  
                  method = "LSS", 
                  pythonpath = NULL, 
                  
                  difftype = 1)




yuabrahamliu/proRate documentation built on Nov. 3, 2024, 10:14 a.m.