View source: R/createTADdata.R
createTADdata | R Documentation |
Creates the data matrix used to build a predictive model that classifies boundary regions from functional genomic elements.
createTADdata( bounds.GR, resolution, genomicElements.GR, featureType = "distance", resampling, trainCHR, predictCHR = NULL, genome = "hg19" )
bounds.GR |
a GRanges object with chromosomal coordinates of TAD
boundaries used to identify positive cases (can be obtained using
|
resolution |
Numeric, the width to bin the genome at, should match the resolution that TADs were called at. Required. |
genomicElements.GR |
a GRangesList object containing GRanges objects
for each ChIP-seq data to leverage in the random forest model (can be
obtained using the |
featureType |
Character, controls how the feature space is constructed. One of "binary" (overlap yes/no), "oc" (overlap counts, the number of overlaps), "op" (overlap percent, the percent of bin width covered by the genomic annotation), or "distance" (log2-transformed distance from the center of the nearest genomic annotation to the center of the bin). Default is "distance". Required. |
resampling |
Character, controls if and how the data should be resampled to create balanced classes of boundary vs. nonboundary regions. One of "none" (no resampling), "ros" (Random Over-Sampling), or "rus" (Random Under-Sampling). Required. |
trainCHR |
Character vector of chromosomes to use to build the binned data matrix for training. Required. |
predictCHR |
Character vector of chromosomes to use to build the binned data matrix for testing. Default is NULL, indicating that no test data is created. If trainCHR and predictCHR are the same, a 7:3 train/test split is created. |
genome |
Version of the human genome assembly. Used to filter out bases overlapping centromeric regions. Accepted values: "hg19" (default) or "hg38". |
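The four featureType encodings can be easier to compare on a small worked case. The sketch below is not the package's implementation: it uses base R only, three made-up bins and one made-up annotation track, and it assumes that "op" is reported as a fraction of the bin width and that the distance is transformed as log2(d + 1) to avoid log2(0). Neither detail is stated on this page.

```r
# Toy coordinates (made up for illustration); intervals are closed [start, end].
resolution <- 5000
bins  <- data.frame(start = c(1, 5001, 10001), end = c(5000, 10000, 15000))
annot <- data.frame(start = c(4001, 6001, 7001), end = c(5500, 6500, 7200))

# Overlap width (in bases) between one bin and every annotation; 0 if disjoint.
overlap_width <- function(bin_start, bin_end) {
  pmax(0, pmin(bin_end, annot$end) - pmax(bin_start, annot$start) + 1)
}

# Rows are bins, columns are annotations.
ov <- t(vapply(seq_len(nrow(bins)),
               function(i) overlap_width(bins$start[i], bins$end[i]),
               numeric(nrow(annot))))

# Distance from each bin center to the nearest annotation center.
bin_center   <- (bins$start + bins$end) / 2
annot_center <- (annot$start + annot$end) / 2
nearest <- apply(abs(outer(bin_center, annot_center, "-")), 1, min)

features <- data.frame(
  binary   = as.numeric(rowSums(ov > 0) > 0),  # any overlap: 1, otherwise 0
  oc       = rowSums(ov > 0),                  # number of overlapping annotations
  op       = rowSums(ov) / resolution,         # covered share (toy annotations do not overlap each other)
  distance = log2(nearest + 1)                 # assumption: +1 offset before log2
)
print(features)
```

In this toy case the second bin overlaps all three annotations, so it has the largest "oc" and "op" values and the smallest "distance", while the third bin overlaps nothing and is described only by its distance.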
A list object containing two data.frames: 1) the training data, 2) the test data (only if predictCHR is not NULL, otherwise it is NA). In each data.frame, "y" indicates whether the corresponding bin is a TAD boundary, and the subsequent columns hold the association measures between bins and the genomic annotations.
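The class balance of "y" in the returned training data depends on the resampling argument. As a rough illustration of what "ros" and "rus" mean, here is a base R sketch on a toy data.frame. The helper resample_classes() is hypothetical and is not part of the package, and the "Yes"/"No" labels are assumed for the example.

```r
set.seed(123)

# Toy data: 5 boundary bins ("Yes") versus 45 non-boundary bins ("No").
toy <- data.frame(
  y = factor(rep(c("Yes", "No"), times = c(5, 45)), levels = c("No", "Yes")),
  x = rnorm(50)
)

# Hypothetical helper: balance the classes of dat$y by over- or under-sampling.
resample_classes <- function(dat, method = c("none", "ros", "rus")) {
  method <- match.arg(method)
  if (method == "none") return(dat)

  idx_by_class <- split(seq_len(nrow(dat)), dat$y)
  sizes  <- lengths(idx_by_class)
  target <- if (method == "ros") max(sizes) else min(sizes)

  keep <- unlist(lapply(idx_by_class, function(idx) {
    if (length(idx) == target) {
      idx
    } else if (method == "ros") {
      # Keep every original row and add random duplicates up to the target size.
      c(idx, idx[sample.int(length(idx), target - length(idx), replace = TRUE)])
    } else {
      # Keep a random subset of the larger class, without replacement.
      idx[sample.int(length(idx), target)]
    }
  }), use.names = FALSE)

  dat[keep, , drop = FALSE]
}

none <- resample_classes(toy, "none")
ros  <- resample_classes(toy, "ros")
rus  <- resample_classes(toy, "rus")
print(table(ros$y))
print(table(rus$y))
```

Over-sampling keeps all 50 original rows and grows the minority class to 45, whereas under-sampling shrinks the majority class to 5, so the two options trade duplicated rows against discarded rows.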
# Create training data for CHR21 and testing data for CHR22 with
# 5 kb binning, oc-type predictors from 26 different transcription factor
# binding sites from the GM12878 cell line, and random under-sampling

# Read in ARROWHEAD-called TADs at 5kb
data(arrowhead_gm12878_5kb)

# Extract unique boundaries
bounds.GR <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb,
                               filter = FALSE,
                               CHR = c("CHR21", "CHR22"),
                               resolution = 5000)

# Read in GRangesList of 26 TFBS
data(tfbsList)

tadData <- createTADdata(bounds.GR = bounds.GR,
                         resolution = 5000,
                         genomicElements.GR = tfbsList,
                         featureType = "oc",
                         resampling = "rus",
                         trainCHR = "CHR21",
                         predictCHR = "CHR22")
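When trainCHR and predictCHR name the same chromosome, the predictCHR description says a 7:3 split is created. This page does not describe how the split is drawn (for example, whether it is stratified by "y"), so the following base R sketch shows only one possible reading: a plain random 70/30 split of the rows of a toy binned data.frame. All object names here are made up for illustration.

```r
set.seed(42)

# Toy binned data: 100 bins, 10 of them boundaries.
binned <- data.frame(
  y = rep(c("No", "Yes"), times = c(90, 10)),
  x = seq_len(100)
)

# Draw 70% of the rows for training; the remaining 30% form the test set.
n_train   <- round(0.7 * nrow(binned))
train_idx <- sample.int(nrow(binned), n_train)

split_data <- list(
  train = binned[train_idx, ],
  test  = binned[-train_idx, ]
)

print(vapply(split_data, nrow, integer(1)))
```

A stratified split would instead sample 70% within each level of "y", which keeps the boundary rate the same in both parts.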