testTargetPredictions: Test the performances of predicting gene targets based on the...

View source: R/Wimtrap.R

testTargetPredictions    R Documentation

Test the performance of gene target prediction based on the location of potential TFBSs identified by Wimtrap

Description

This function determines the optimal threshold to apply to the TFBS prediction score output by Wimtrap in order to infer the potential gene targets of transcription factors. The performance of gene target prediction is then assessed for each transcription factor, using the whole set of potential TFBSs on a given chromosome (unbalanced dataset).

Usage

testTargetPredictions(
  TFBSdata,
  TFBSmodel,
  chrTest = 1,
  tss,
  ChIPpeaks = NULL,
  ChIPpeaks_length = 400,
  targets = NULL
)

Arguments

TFBSdata

A named character vector as output by the getTFBSdata() function, giving the local paths to the files that encode the results of pattern matching and genomic feature extraction for the training TFs and/or studied TFs.

TFBSmodel

An xgb.Booster object as output by the buildTFBSmodel() function.

chrTest

An integer specifying the number of the chromosome used to assess the performance of TF gene target prediction. Default = 1.

tss

A list of GRanges objects as output by importGenomicData(), or the local path to a BED file defining the transcription start site (TSS), name and orientation of each protein-coding transcript of the organism.

ChIPpeaks

A named character vector giving the local paths to BED files that encode the location of ChIP-peaks. The vector is named after the transcription factors described by the respective files. Caution: the names of ChIPpeaks must match those of TFBSdata. Default is NULL and

ChIPpeaks_length

An integer setting a fixed length for the ChIP-peaks, which are defined as the intervals of ChIPpeaks_length bp centered on the regions encoded in the ChIPpeaks files. Default = 400.

targets

A named character vector giving the local paths to text files that encode the manually curated transcriptional targets of each transcription factor. The vector is named after the transcription factors described by the respective files. Caution: the names of targets must match those of TFBSdata. Default = NULL (transcriptional targets are then predicted from the ChIP-peaks).
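
The fixed-length peaks described for ChIPpeaks_length can be pictured with the base R sketch below. The toy coordinates, the column names and the exact centering convention (rounding of the midpoint, 1-based inclusive ends) are assumptions made for illustration and may differ from what Wimtrap does internally.

```r
# Hypothetical peak regions, as they might be read from a BED file.
peaks <- data.frame(
  chr   = c("1", "1"),
  start = c(1000, 5000),
  end   = c(1100, 5900)
)
peak_length <- 400  # plays the role of ChIPpeaks_length

# Midpoint of each region (rounded down; an assumed convention).
centre <- floor((peaks$start + peaks$end) / 2)

# Replace each region by a peak_length bp interval centered on its midpoint.
fixed_peaks <- data.frame(
  chr   = peaks$chr,
  start = centre - peak_length / 2,
  end   = centre + peak_length / 2 - 1
)

# Every interval now spans exactly peak_length positions.
widths <- fixed_peaks$end - fixed_peaks$start + 1
```

Whatever the size of the original region, short or long, the resulting interval has the same width, so all peaks contribute equally when genes are labelled.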

Details

Each gene is first scored with the highest prediction score among the Wimtrap-predicted TFBSs associated with it. Each gene is then labelled as positive or negative: the positive genes are those whose TSS is the closest to an occurrence, within a ChIP-peak, of the primary motif of the cognate TF. This makes it possible to draw a ROC curve from a balanced dataset obtained from all the chromosomes but one, and to identify the best threshold to apply to the prediction score in order to predict TF gene targets. Finally, the performance is assessed for each TF on the whole set of predicted TFBSs on the left-out chromosome.
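
The scoring and thresholding steps can be sketched in base R as follows. This is a minimal illustration on toy data, not the package's implementation: the column names, the labels and the use of Youden's J (TPR - FPR) as the criterion for the "best" ROC threshold are all assumptions.

```r
# Hypothetical toy data: one row per predicted TFBS.
tfbs <- data.frame(
  gene  = c("g1", "g1", "g2", "g3", "g3", "g4"),
  score = c(0.91, 0.62, 0.55, 0.80, 0.58, 0.52)
)

# Step 1: score each gene with its highest TFBS prediction score.
gene_scores <- vapply(split(tfbs$score, tfbs$gene), max, numeric(1))

# Step 2: label each gene as positive (TRUE) or negative (FALSE).
# Here the labels are made up; in practice they come from ChIP-peaks or curated targets.
is_target <- c(g1 = TRUE, g2 = FALSE, g3 = TRUE, g4 = FALSE)[names(gene_scores)]

# Step 3: scan the candidate thresholds along the ROC curve and keep the one
# that maximises Youden's J (an assumed criterion).
thresholds <- sort(unique(gene_scores))
youden <- sapply(thresholds, function(th) {
  predicted <- gene_scores >= th
  tpr <- sum(predicted & is_target) / sum(is_target)
  fpr <- sum(predicted & !is_target) / sum(!is_target)
  tpr - fpr
})
best_threshold <- thresholds[which.max(youden)]
```

With these toy values, the genes g1 and g3 are separated perfectly from g2 and g4 at a threshold of 0.80.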

Value

A data.frame that gives, for each TF considered, the performance of the prediction of the transcriptional targets encoded on the test chromosome, taking into account all the TFBSs predicted by Wimtrap (prediction score >= 0.5) on that chromosome. Because the dataset is highly imbalanced, the performance is expressed in terms of recall, precision, accuracy and F-score. The last column gives the optimal threshold obtained when including all the input TFs.
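
For reference, the four reported metrics can be computed from predicted and actual labels as in the sketch below. The labels are invented, and the F-score is assumed here to be the F1-score (harmonic mean of precision and recall); the documentation above does not state which variant is used.

```r
# Hypothetical labels for ten genes on the test chromosome.
actual    <- c(TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Confusion-matrix counts.
tp <- sum(predicted & actual)    # true positives
fp <- sum(predicted & !actual)   # false positives
fn <- sum(!predicted & actual)   # false negatives
tn <- sum(!predicted & !actual)  # true negatives

recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
accuracy  <- (tp + tn) / length(actual)
f_score   <- 2 * precision * recall / (precision + recall)  # F1 (assumed)
```

In this toy case the accuracy is 0.8 although only half of the true targets are recovered, which shows why recall, precision and F-score are more informative than accuracy alone on an imbalanced dataset.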

See Also

plotPredictions() to visualize the results for a given potential target gene.

Examples

genomic_data.ex <- c(CE = system.file("extdata/conserved_elements_example.bed", package = "Wimtrap"),
                      DGF = system.file("extdata/DGF_example.bed", package = "Wimtrap"),
                      DHS = system.file("extdata/DHS_example.bed", package = "Wimtrap"),
                      X5UTR = system.file("extdata/x5utr_example.bed", package = "Wimtrap"),
                      CDS = system.file("extdata/cds_example.bed", package = "Wimtrap"),
                      Intron = system.file("extdata/intron_example.bed", package = "Wimtrap"),
                      X3UTR = system.file("extdata/x3utr_example.bed", package = "Wimtrap")
                     )
imported_genomic_data.ex <- importGenomicData(biomart = FALSE,
                                              genomic_data = genomic_data.ex,
                                              tss = system.file("extdata/tss_example.bed", package = "Wimtrap"),
                                              tts = system.file("extdata/tts_example.bed", package = "Wimtrap"))
TFBSdata.ex <- getTFBSdata(pfm = system.file("extdata/pfm_example.pfm", package = "Wimtrap"),
                           TFnames = c("PIF3", "TOC1"),
                           organism = NULL,
                           genome_sequence = system.file("extdata/genome_example.fa", package = "Wimtrap"),
                           imported_genomic_data = imported_genomic_data.ex)
TFBSmodel.ex <- buildTFBSmodel(TFBSdata = TFBSdata.ex,
                               ChIPpeaks = c(PIF3 = system.file("extdata/PIF3_example.bed", package = "Wimtrap"),
                                             TOC1 = system.file("extdata/TOC1_example.bed", package = "Wimtrap")),
                               TFs_validation = "PIF3")
## Determine the optimal score threshold
targetPerformances <- testTargetPredictions(TFBSdata = TFBSdata.ex["TOC1"],
                                            TFBSmodel = TFBSmodel.ex,
                                            tss = imported_genomic_data.ex,
                                            ChIPpeaks = c(TOC1 = system.file("extdata/TOC1_example.bed", package = "Wimtrap")))
optimal_threshold <- targetPerformances$threshold[1]
PIF3BS.predictions <- predictTFBS(TFBSmodel.ex,
                                  TFBSdata.ex,
                                  studiedTFs = "PIF3",
                                  score_threshold = optimal_threshold)
## To get the transcripts whose expression is potentially regulated by PIF3
## (each transcript listed once):
PIF3_regulated.predictions <- unique(as.character(PIF3BS.predictions$transcript))
## To consider only the gene model, drop the transcript suffix
## (everything from the first dot onwards) and remove duplicates:
PIF3_regulated.predictions <- unique(sub("[.].*$", "", PIF3_regulated.predictions))

RiviereQuentin/Wimtrap documentation built on June 29, 2024, 7:17 p.m.