splitDataByGene: Split methylation data into regions based on the genes...

View source: R/utils.R

splitDataByGene    R Documentation

Split methylation data into regions based on gene annotations

Description

This function splits the methylation data into regions based on gene annotations. The annotations come from the Bioconductor package annotatr.

Usage

splitDataByGene(
  dat,
  chr,
  organism = "human",
  build = "hg38",
  types = "promoter",
  gap = -1,
  min.cpgs = 50,
  max.cpgs = 2000,
  verbose = TRUE
)

Arguments

dat

a data frame whose rows are the individual CpGs appearing in all the samples. The first 4 columns should contain Meth_Counts (methylated counts), Total_Counts (read depths), Position (genomic position of the CpG site) and ID (sample ID). Covariate information, such as disease status or cell type composition, is listed in column 5 onwards.

chr

character vector containing the chromosome information. Its length should be equal to the number of rows in dat.
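As an illustration of the layout expected for dat and chr, the sketch below builds a small toy input in base R. All values, and the covariate names disease and cell_type, are invented for illustration; only the first four column names come from the description above.

```r
# Toy input with the column layout described for `dat`.
# All values, and the covariate names `disease` and `cell_type`,
# are invented for illustration.
toy_dat <- data.frame(
  Meth_Counts  = c(3L, 0L, 5L, 2L),              # methylated read counts
  Total_Counts = c(10L, 4L, 5L, 8L),             # read depth at the CpG
  Position     = c(1001L, 1001L, 1050L, 1050L),  # genomic position
  ID           = c("s1", "s2", "s1", "s2"),      # sample ID
  disease      = c(1, 0, 1, 0),                  # covariate (column 5)
  cell_type    = c(0.4, 0.6, 0.4, 0.6)           # covariate (column 6)
)

# One chromosome label per row of `toy_dat`, as required for `chr`
toy_chr <- rep("chr4", times = nrow(toy_dat))

# Basic sanity checks on the layout
stopifnot(
  identical(names(toy_dat)[1:4],
            c("Meth_Counts", "Total_Counts", "Position", "ID")),
  all(toy_dat$Meth_Counts <= toy_dat$Total_Counts),
  length(toy_chr) == nrow(toy_dat)
)
```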

organism

character defining the organism of interest. By default, only Homo sapiens ("human") is available. Additional packages are required for Mus musculus ("mouse"), Rattus norvegicus ("rat") and Drosophila melanogaster ("fly"). The matching is case-insensitive. The default value is "human".

build

character defining the version of the genome build to which the methylation data have been mapped. The default build is "hg38"; the build "hg19" is also available for Homo sapiens. Once the additional packages are installed, the following organisms and builds are also available:

  • "mm9" and "mm10" for Mus musculus;

  • "rn4", "rn5" and "rn6" for Rattus norvegicus;

  • "dm3" and "dm6" for Drosophila melanogaster.
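The organism and build combinations listed above can be summarised as a small lookup table. The sketch below is a hypothetical helper written in base R, not the package's internal validation code; the names supported_builds and is_supported_build are assumptions made for illustration.

```r
# Hypothetical lookup table built from the builds listed above;
# this is not the package's internal code.
supported_builds <- list(
  human = c("hg19", "hg38"),
  mouse = c("mm9", "mm10"),
  rat   = c("rn4", "rn5", "rn6"),
  fly   = c("dm3", "dm6")
)

# Hypothetical helper: TRUE if `build` is listed for `organism`.
# The organism is lower-cased first, mirroring the case-insensitive
# matching described for the `organism` argument.
is_supported_build <- function(organism, build) {
  organism <- tolower(organism)
  if (!organism %in% names(supported_builds)) {
    return(FALSE)
  }
  build %in% supported_builds[[organism]]
}

is_supported_build("Human", "hg38")  # listed combination
is_supported_build("mouse", "hg38")  # build belongs to another organism
```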

types

character vector defining the type of genic annotations to use among the following options:

  • "upstream" for the annotations included 1-5Kb upstream of the TSS;

  • "promoter" for the annotations included < 1Kb upstream of the TSS;

  • "threeprime" for the annotations included in 3' UTR;

  • "fiveprime" for the annotations included in the 5' UTR;

  • "exon" for the annotations included in the exons;

  • "intron" for the annotations included in the introns;

  • "all" for all the aforementioned annotations.

The default value is "promoter".
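annotatr names its gene annotations with codes of the form build_genes_suffix, for example "hg38_genes_promoters". The sketch below shows one plausible way the types values could correspond to those codes. The mapping and the helper annotation_codes are assumptions made for illustration; the package may translate the types differently.

```r
# Assumed correspondence between `types` values and annotatr's
# gene annotation suffixes; the package may map them differently.
type_to_suffix <- c(
  upstream   = "1to5kb",
  promoter   = "promoters",
  threeprime = "3UTRs",
  fiveprime  = "5UTRs",
  exon       = "exons",
  intron     = "introns"
)

# Hypothetical helper: build annotatr-style codes such as
# "hg38_genes_promoters" from a build and a vector of types.
annotation_codes <- function(build = "hg38", types = "promoter") {
  if ("all" %in% types) {
    types <- names(type_to_suffix)   # expand "all" to every type
  }
  unknown <- setdiff(types, names(type_to_suffix))
  if (length(unknown) > 0) {
    stop("Unknown type(s): ", paste(unknown, collapse = ", "))
  }
  paste0(build, "_genes_", unname(type_to_suffix[types]))
}

annotation_codes("hg38", c("promoter", "exon"))
```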

gap

integer defining the maximum gap allowed between two regions for them to be considered overlapping. As in the GenomicRanges::findOverlaps function, the gap between 2 ranges is the number of positions that separate them. The gap between 2 adjacent ranges is 0. By convention, when one range has its start or end strictly inside the other (i.e. non-disjoint ranges), the gap is considered to be -1. Decimal values are rounded to the nearest integer. The default value is -1.
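The gap convention can be checked with a few lines of base R. The functions range_gap and within_gap below are hypothetical helpers that reproduce the convention for two closed integer ranges; they do not call GenomicRanges and are not the package's implementation.

```r
# Gap between two closed integer ranges [s1, e1] and [s2, e2],
# following the convention described above:
#   disjoint ranges -> number of positions separating them
#   adjacent ranges -> 0
#   non-disjoint    -> -1
range_gap <- function(s1, e1, s2, e2) {
  separation <- max(s1, s2) - min(e1, e2) - 1
  if (separation < 0) -1 else separation
}

range_gap(1, 5, 8, 10)   # positions 6 and 7 lie between the ranges
range_gap(1, 5, 6, 10)   # adjacent ranges
range_gap(1, 5, 3, 10)   # overlapping ranges

# With a maximum gap `gap`, two ranges are treated as overlapping
# when their gap does not exceed it (decimal values are rounded).
within_gap <- function(s1, e1, s2, e2, gap = -1) {
  range_gap(s1, e1, s2, e2) <= round(gap)
}
```

With the default gap of -1, only ranges that share at least one position count as overlapping; setting gap to 0 also merges adjacent ranges.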

min.cpgs

positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50.

max.cpgs

positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000.

verbose

logical indicating whether the algorithm should report progress information. The default value is TRUE.

Value

A named list of data frames, each containing the data of one independent region.
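A list of this shape can be inspected with standard base R tools. The sketch below uses a toy stand-in for the returned value (region names and values are invented) and counts the distinct CpG positions per region. The bounds used here are illustrative only; how the function itself treats regions outside min.cpgs and max.cpgs is not stated in this page.

```r
# Toy stand-in for the returned value: a named list of data frames,
# one per region. Region names and values are invented.
toy_regions <- list(
  geneA = data.frame(Position = c(100L, 100L, 150L),
                     ID = c("s1", "s2", "s1")),
  geneB = data.frame(Position = c(900L, 950L, 990L, 990L),
                     ID = c("s1", "s1", "s1", "s2"))
)

# Number of distinct CpG positions in each region
n_cpgs <- vapply(toy_regions,
                 function(region) length(unique(region$Position)),
                 integer(1))

# Flag regions whose CpG count lies within illustrative bounds
min_cpgs <- 3
max_cpgs <- 2000
in_range <- n_cpgs >= min_cpgs & n_cpgs <= max_cpgs
```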

Author(s)

Audrey Lemaçon

Examples

#------------------------------------------------------------#
library(SOMNiBUS)
data(RAdat)
# Add a column containing the chromosome information
RAdat$Chr <- "chr4"
# Drop CpGs with zero read depth and rows with missing values
RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ])
# Pass the chromosome column so that chr agrees with the data
results <- splitDataByGene(dat = RAdat.f,
                           chr = RAdat.f$Chr, verbose = FALSE)


kaiqiong/SOMNiBUS documentation built on Feb. 24, 2023, 5:38 a.m.