splitDataByChromatin: Split methylation data into regions based on the chromatin...
In kaiqiong/SOMNiBUS: Smooth modeling of bisulfite sequencing

splitDataByChromatin

R Documentation

Split methylation data into regions based on the chromatin states

Description

This function splits the methylation data into regions based on the chromatin states predicted by the ChromHMM software (Ernst and Kellis (2012)). The annotations come from the Bioconductor package annotatr. Chromatin states determined by ChromHMM are available in hg19 for nine cell lines (Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, and Nhlf).

Usage

splitDataByChromatin(
  dat,
  chr,
  cell.line,
  states,
  gap = -1,
  min.cpgs = 50,
  max.cpgs = 2000,
  verbose = TRUE
)

Arguments

`dat`	a data frame with rows as individual CpGs appearing in all the samples. The first 4 columns should contain the information of `Meth_Counts` (methylated counts), `Total_Counts` (read depths), `Position` (genomic position of the CpG site) and `ID` (sample ID). The covariate information, such as disease status or cell type composition, is listed in column 5 and onwards.
`chr`	character vector containing the chromosome information. Its length should be equal to the number of rows in `dat`.
`cell.line`	character defining the cell line of interest. Nine cell lines are available: `"gm12878"`: Lymphoblastoid cells GM12878, `"h1hesc"`: Embryonic cells H1 hESC, `"hepg2"`: Liver carcinoma HepG2, `"hmec"`: Mammary epithelial cells HMEC, `"hsmm"`: Skeletal muscle myoblasts HSMM, `"huvec"`: Umbilical vein endothelial HUVEC, `"k562"`: Myelogenous leukemia K562, `"nhek"`: Keratinocytes NHEK, `"nhlf"`: Normal human lung fibroblasts NHLF.
`states`	character vector defining the chromatin states of interest among the following available options: `"ActivePromoter"`: Active Promoter; `"WeakPromoter"`: Weak Promoter; `"PoisedPromoter"`: Poised Promoter; `"StrongEnhancer"`: Strong Enhancer; `"WeakEnhancer"`: Weak/poised Enhancer; `"Insulator"`: Insulator; `"TxnTransition"`: Transcriptional Transition; `"TxnElongation"`: Transcriptional Elongation; `"WeakTxn"`: Weak Transcribed; `"Repressed"`: Polycomb-Repressed; `"Heterochrom"`: Heterochromatin; low signal; `"RepetitiveCNV"`: Repetitive/Copy Number Variation. Use `states = "all"` to select all the states simultaneously.
`gap`	integer defining the maximum gap allowed between two regions for them to be considered as overlapping. According to the `GenomicRanges::findOverlaps` function, the gap between two ranges is the number of positions that separate them. The gap between two adjacent ranges is 0. By convention, when one range has its start or end strictly inside the other (i.e. non-disjoint ranges), the gap is considered to be -1. Decimal values will be rounded to the nearest integer. The default value is `-1`.
`min.cpgs`	positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50.
`max.cpgs`	positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000.
`verbose`	logical indicating whether the algorithm should report progress information. The default value is TRUE.
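
The gap convention described for the `gap` argument can be illustrated with a minimal base R sketch. The helpers `range_gap()` and `is_hit()` are hypothetical and exist only for this illustration; they are not part of SOMNiBUS or GenomicRanges, and they assume closed integer ranges of width at least 1.

```r
# Hypothetical helper (not a SOMNiBUS or GenomicRanges function):
# gap between two closed integer ranges [s1, e1] and [s2, e2].
range_gap <- function(s1, e1, s2, e2) {
  # Number of positions strictly between the two ranges
  g <- max(s1, s2) - min(e1, e2) - 1L
  # Non-disjoint ranges are assigned a gap of -1 by convention
  if (g < 0L) -1L else as.integer(g)
}

gap_overlapping <- range_gap(1L, 5L, 3L, 10L)  # ranges share positions 3 to 5
gap_adjacent    <- range_gap(1L, 5L, 6L, 10L)  # no position in between
gap_separated   <- range_gap(1L, 5L, 8L, 10L)  # positions 6 and 7 in between

# Hypothetical helper: two ranges are treated as overlapping when
# their gap does not exceed the chosen threshold.
is_hit <- function(gap_value, max_gap) gap_value <= max_gap

is_hit(gap_adjacent, max_gap = -1L)  # adjacent ranges, default threshold
is_hit(gap_adjacent, max_gap = 0L)   # adjacent ranges, threshold of 0
```

Under this reading, the default `gap = -1` accepts only ranges that share at least one position, while `gap = 0` would also accept adjacent ranges.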

Value

A list of data frames, each containing the data of one independent region.
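
As a rough sketch of how such a list might be inspected, the snippet below builds a mock list of data frames and counts the distinct CpG positions and samples in each. The object `regions`, its element names, and all values are made up for illustration; whether the real output carries element names is not stated here.

```r
# Mock stand-in for the returned list; names and values are invented
regions <- list(
  region_a = data.frame(
    Meth_Counts  = c(1, 0, 2, 3),
    Total_Counts = c(4, 2, 5, 6),
    Position     = c(100, 100, 150, 150),
    ID           = c(1, 2, 1, 2)
  ),
  region_b = data.frame(
    Meth_Counts  = c(2, 2, 0, 1, 5, 4),
    Total_Counts = c(3, 6, 1, 1, 9, 8),
    Position     = c(900, 900, 950, 950, 990, 990),
    ID           = c(1, 2, 1, 2, 1, 2)
  )
)

# Number of distinct CpG positions in each region
n_cpgs <- vapply(regions, function(r) length(unique(r$Position)), integer(1))

# Number of distinct samples in each region
n_samples <- vapply(regions, function(r) length(unique(r$ID)), integer(1))

print(data.frame(n_cpgs = n_cpgs, n_samples = n_samples))
```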

Author(s)

Audrey Lemaçon

Examples

#------------------------------------------------------------#
data(RAdat)
RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ])
results <- splitDataByChromatin(
  dat = RAdat.f,
  cell.line = "huvec",
  chr = rep(x = "chr4", times = nrow(RAdat.f)),
  states = "Insulator",
  verbose = FALSE
)
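
For readers without the `RAdat` data set at hand, the input layout described under Arguments can be mocked up in base R. This is only a sketch: the object names `toy` and `toy_chr`, the covariate `disease`, and all counts are invented, and such a tiny table is not expected to yield useful regions.

```r
# Toy input: 3 CpG positions measured in 2 samples (all values invented)
toy <- expand.grid(Position = c(1000L, 1050L, 1100L), ID = 1:2)
toy$Total_Counts <- c(10L, 8L, 12L, 9L, 11L, 7L)  # read depths
toy$Meth_Counts  <- c(3L, 8L, 0L, 4L, 6L, 7L)     # methylated counts
toy$disease      <- ifelse(toy$ID == 1, 0, 1)     # hypothetical covariate

# Put the four required columns first, covariates from column 5 onwards
toy <- toy[, c("Meth_Counts", "Total_Counts", "Position", "ID", "disease")]

# One chromosome label per row of the data frame
toy_chr <- rep("chr4", times = nrow(toy))

# Basic sanity checks on the layout
stopifnot(all(toy$Meth_Counts <= toy$Total_Counts))
stopifnot(length(toy_chr) == nrow(toy))
```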

kaiqiong/SOMNiBUS documentation built on Feb. 24, 2023, 5:38 a.m.