buildTFBSmodel: Builds a predictive model of transcription factor binding...

View source: R/Wimtrap.R

buildTFBSmodelR Documentation

Builds a predictive model of transcription factor binding site (TFBS) location

Description

buildTFBSmodel learns (and optionally validates) a predictive model of transcription factor(TF) binding sites based on the integration of genomic features extracted at the location of potential binding sites identified by a prior analysis (such as pattern-matching). This function implements an algorithm of supervised machine learning, the extreme gradient boosting (XGBoost), which will be fed by a dataset of potential binding sites of training TFs either labelled as 'positive' or 'negative' (i.e. ChIP-seq 'validated' or 'not-validated' in a given condition).

Usage

buildTFBSmodel(
  TFBSdata,
  ChIPpeaks,
  ChIPpeaks_length = 400,
  TFs_validation = NULL,
  model_assessment = TRUE,
  xgb_modeling = TRUE,
  balancing_only = FALSE
)

Arguments

TFBSdata

A named character vector as output by the getTFBSdata() function, defining the local paths to files encoding for the results of pattern-matching and geonmic feature extraction for the training TFs and/or studied TFs. Alternatively, it can be a list of 2 data.table objects, as output by buildTFBSmode with the option xgb_modeling = FALSE. The fist object represents the training dataset, the second, the validation. It can be also a GRanges object, as output with the option balancing_only = TRUE.

ChIPpeaks

A named character vector defining the local paths to BED files encoding the location of ChIP-peaks. The vector is named according to the training transcription factors that are described by the files indicated. Caution: the names of the ChIPpeaks have to find a match with those of TFBSdata.

ChIPpeaks_length

An integer setting a fixed length for the ChIP-peaks, that are defined as the intervals of ChIPpeaks_length bp that are centered on the regions encoded in the ChIPpeaks files. Default velue = 400.

TFs_validation

NULL (by default) or a character vector of names of training TFs. If NULL, the model performances will be assessed from a validation dataset obtained by sampling randomly 20% of the potential binding sites of all the training TFs. Otherwise the model will be validated with the data related to the training TFs named in the TFs_validation parameter and trained using the remaining training TFs.

model_assessment

A logical. If TRUE (by default), 1) the imortance of features is represented, 2) the confusion matrix is printed and 3) the ROC and AUC are plotted.

xgb_modeling

A logical. If TRUE (by default), the step of modeling will be achieved and a model will be output. If FALSE, only the steps of balancing, labelling and splitting between a training and a validation datasets will be carried on and the output will be a list of two data.table corresponding to the training and validation datasets.

balancing_only

If TRUE, only the balancing and labelling are achieved and a GRanges object is output. Default is set to FALSE.

Value

An object of xgb.Booster class if xgb_modeling = TRUE. A list of two data.table corresponding each to the training and validation datasets otherwise.

See Also

getTFBSdata() for obtaining the datasets and predictTFBS() to predict transcription factor binding site location

Examples

genomic_data.ex <- c(CE = system.file("extdata/conserved_elements_example.bed", package = "Wimtrap"),
                      DGF = system.file("extdata/DGF_example.bed", package = "Wimtrap"),
                      DHS = system.file("extdata/DHS_example.bed", package = "Wimtrap"),
                      X5UTR = system.file("extdata/x5utr_example.bed", package = "Wimtrap"),
                      CDS = system.file("extdata/cds_example.bed", package = "Wimtrap"),
                      Intron = system.file("extdata/intron_example.bed", package = "Wimtrap"),
                      X3UTR = system.file("extdata/x3utr_example.bed", package = "Wimtrap")
                     )
imported_genomic_data.ex <- importGenomicData(biomart = FALSE,
                                              genomic_data = genomic_data.ex,
                                              tss = system.file("extdata/tss_example.bed", package = "Wimtrap"),
                                              tts = system.file("extdata/tts_example.bed", package = "Wimtrap"))
TFBSdata.ex <- getTFBSdata(pfm = system.file("extdata/pfm_example.pfm", package = "Wimtrap"),
                           TFnames = c("PIF3", "TOC1"),
                           organism = NULL,
                           genome_sequence = system.file("extdata/genome_example.fa", package = "Wimtrap"),
                           imported_genomic_data = imported_genomic_data.ex)
TFBSmodel.ex <- buildTFBSmodel(TFBSdata = TFBSdata.ex,
                               ChIPpeaks = c(PIF3 = system.file("extdata/PIF3_example.bed", package = "Wimtrap"),
                                             TOC1 = system.file("extdata/TOC1_example.bed", package = "Wimtrap")),
                               TFs_validation = "PIF3")

RiviereQuentin/Wimtrap documentation built on June 29, 2024, 7:17 p.m.