knitr::opts_chunk$set(
  tidy=FALSE, cache=TRUE,
  dev="png",
  message=FALSE, error=FALSE, warning=TRUE
)

The servers linked by TFregulomeR

TFregulomeR data compendium is hosted in Singapore and Canada. As an API, TFregulomeR (from v2.0.0) will dynamically access and retrieve the data either from Singapore (default) or Canada server. The functions in TFregulomeR that dynamically link to the servers include dataBrowser(), loadPeaks(), searchMotif(), commonPeaks(), exclusivePeaks(), intersectPeakMatrix(), motifDistrib(), genomeAnnotate() and toTFBSTools(). By default, these functions are linking to Singapore server. If users opt to choose the Canada server, they can use the input parameter server='ca' when using these functions.

Browse TFregulomeR data compendium

TFregulomeR allows users to easily browse TFregulomeR compendium using only one simple function dataBrowser (it’s called TFBSBrowser before v1.2.0). Users can search the TFBS according to species, organ, sample type, cell/tissue name, TF name, disease state, and source. A data.frame will be returned upon searching, including the information in TFregulomeR ID (important for downstream analysis), species, organ, sample type, cell/tissue name, description of cell/tissue, disease state, TF name, source, TF source ID, number of peaks and number of peaks with motif.

In TFregulomeR project, we used MEME-ChIP to perform motif de novo discovery in each ChIP-seq. Highly and centrally enriched motifs were selected and compared with the existing TFBS databases, such as HOCOMOCO and JASPAR. 91 highly enriched motifs were not consistent with the TFBS databases. it might be due to the fact that in those cell types, the given TFs are indirectly recruited to genome, and/or that highly abundant presence of cohesion and polycomb group proteins masks the motif enrichment of the ChIP’ed TF. Besides, 136 motifs were not recorded for their corresponding TFs in the databases. In order to confirm the reliability of these 227 motifs, we used HOMER to perform a de novo motif discovery again. Motif results by HOMER were compared with those by MEME-ChIP and their similarity were measured by normalised Pearson correlation coefficient using compare-matrices function in RSAT with the formula: Ncor = cor * w / w_smaller, where cor is raw Pearson correlation coefficient, w is the alignment width of two matrices from MEME-ChIP and HOMER (the minimum value of w was set as 5), and w_smaller is the width of smaller motifs from MEME-ChIP and HOMER. We found that majority of those PWM matrices generated by MEME-ChIP, a combined algorithm suite of expectation maximization and regular expressions, were able to be recapitulated by HOMER, which takes advantage of hypergeometric enrichment (Figure 1). We have added the information into the last two columns of dataBrowser output (from v1.2.0).

In particular, if no input is given for the function, all records in TFregulomeR compendium will be returned.

knitr::include_graphics("../inst/extdata/motif_consistency.png")
library(TFregulomeR)
# browse all records in TFregulomeR TFBS compendium
all_record <- dataBrowser() # or TFBSBrowser() before v1.2.0
#> 1468 record(s) found: ...
#> ... covering 415 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 29 organ(s):
#> ... ... stem_cell, blood_and_lymph, connective_tissue, colorectum, brain, bone, stomach, prostate, breast, pancreas, skin, kidney, lung, eye, esophagus, heart, muscle, uterus, spleen, cervix, testis, liver, adrenal_gland, neck_and_mouth, pleura, ovary, thymus, fallopian, vagina
#> ... in 3 sample type(s):
#> ... ... primary_cells, cell_line, tissue
#> ... in  414  different cell(s) or tissue(s)
#> ... in 8 type(s) of disease state(s):
#> ... ... normal, tumor, Simpson_Golabi_Behmel_syndrome, progeria, metaplasia, unknown, immortalized, premetastatic
#> ... from the source(s): GTRD, MethMotif

# returned table
head(all_record)
#>                                               ID species             organ   sample_type
#> 1 GTRD-EXP000061_HSA_embryonic-stem-cells_PRDM14   human         stem_cell primary_cells
#> 2          GTRD-EXP000080_HSA_CD4pos-T-cells_YY1   human   blood_and_lymph primary_cells
#> 3                 GTRD-EXP000128_HSA_EWS502_FLI1   human connective_tissue     cell_line
#> 4                  GTRD-EXP000132_HSA_HUVEC_FLI1   human connective_tissue     cell_line
#> 5                GTRD-EXP000140_HSA_LS180_TCF7L2   human        colorectum     cell_line
#> 6                 GTRD-EXP000142_HSA_LS180_CEBPB   human        colorectum     cell_line
#>       cell_tissue_name                      description disease_state     TF source source_ID
#> 1 embryonic-stem-cells             embryonic stem cells        normal PRDM14   GTRD EXP000061
#> 2       CD4pos-T-cells                     CD4+ T-cells        normal    YY1   GTRD EXP000080
#> 3               EWS502                    Ewing sarcoma         tumor   FLI1   GTRD EXP000128
#> 4                HUVEC umbilical vein endothelial cells        normal   FLI1   GTRD EXP000132
#> 5                LS180                     colon cancer         tumor TCF7L2   GTRD EXP000140
#> 6                LS180                     colon cancer         tumor  CEBPB   GTRD EXP000142
#>   peak_num peak_with_motif_num Consistent_with_HOCOMOCO_JASPAR Ncor_between_MEME_ChIP_and_HOMER
#> 1    17482                5656                             YES                               NA
#> 2    18298                2038                             YES                               NA
#> 3    89523               28914                             YES                               NA
#> 4    71437               36126                             YES                               NA
#> 5    16517                2312                             YES                               NA
#> 6   128542               67747                             YES                               NA

# browse TFBSs in blood and lymph
blood_and_lymph_record <- dataBrowser(organ = "blood_and_lymph") # or TFBSBrowser() before v1.2.0
#> 494 record(s) found: ...
#> ... covering 197 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 1 organ(s):
#> ... ... blood_and_lymph
#> ... in 3 sample type(s):
#> ... ... primary_cells, cell_line, tissue
#> ... in  129  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... normal, tumor
#> ... from the source(s): GTRD, MethMotif

# browse all CEBPB TFBSs
CEBPB_record <- dataBrowser(tf = "CEBPB") # or TFBSBrowser() before v1.2.0
#> 16 record(s) founded: ...
#> ... covering 1 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 9 organ(s):
#> ... ... colorectum, uterus, blood_and_lymph, stem_cell, bone, lung, cervix, liver, 
#> breast
#> ... in 2 sample type(s):
#> ... ... cell_line, primary_cells
#> ... in  16  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... tumor, normal
#> ... from the source(s): GTRD, MethMotif

Retrieve datasets from TFregulomeR data warehouse

We have designed some useful functions for data retrieval from TFregulomeR compendium. Retrieval of motif matrix and DNA methylation matrix if the source is MethMotif (If the source is GTRD, no DNA methylation information is available) can be achieved using searchMotif. Further, these obtained matrices can be easily saved locally using exportMMPFM, and corresponding (Meth)Motif logo (MethMotif logo, if the source is MethMotif; motif logo, if the source is GTRD) can be simply plotted using plotLogo. Here, we introduced an object of class "MethMotif" using S4 class in order for an easy and intuitive storage, manipulation and conversion (with other packages such as "TFBSTools") of a MethMotif matrix, which contains a TF motif weight position matrix and its DNA methylation matrix (beta score matrix). What's more, we allow users to directly load peak regions of a TF of interest (all peaks or peaks with motif only) from TFregulomeR compendium using loadPeaks.

# according to TFBSBrowser results for all CEBPB TFBS query, we select two CEBPB TFBSs 
# from MethMotif and GTRD: MM1_HSA_K562_CEBPB, GTRD-EXP040801_HSA_HL-60_CEBPB.

# loading MethMotif object in "MEME" format. Currently we support "MEME" and "TRANSFAC".
K562_CEBPB <- searchMotif(id = "MM1_HSA_K562_CEBPB", motif_format = "MEME")
#> There are a matched record exported in a MethMotif object.
HL60_CEBPB <- searchMotif(id = "GTRD-EXP040801_HSA_HL-60_CEBPB", motif_format = "MEME")
#> There are a matched record exported in a MethMotif object.
class(K562_CEBPB)
#> [1] "MethMotif"
#> attr(,"package")
#> [1] "TFregulomeR"

After obtaining a MethMotif matrix, user can use plotLogo to plot logo as below (Figure 2). If the TFBS source is MethMotif, then a MethMotif logo will be saved. Two options are available for motif logo, "entropy" and "frequency", and also different methylation levels ("all", "methylated" and "unmethylated") can be opted for methylation bar charts. However, if the TFBS source is GTRD, only a motif logo will be saved due to the unknown DNA methylation within motif.

In the plot, the number of peaks with motif will be printed. It should be NOTED that the motif logo is generated by the all possible TFBSs (initally scanned by FIMO after de novo motif discovery using MEME-ChIP suite with a p-value less than 1e-4) in the peak regions (+/-100bp around peak summits). It's possbile that one peak region contains more than one significant TFBSs. Hence, the number of TFBSs in the peak regions could be larger than the number of peaks with motif.

For each MethMotif logo, bar plot above motif logo denotes the cytosines in CpG context covered by WGBS at each base position in motif and these CpGs are segregated into three groups shown in different colors, namely homogeneously methylated (orange bar, beta score > 90%), heterogeneously methylated (green bar, beta score 10-90%) and homogenously unmethylated (blue bar, beta score < 10%).

plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "all")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "methylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy-methylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "unmethylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy-unmethylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "all")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "methylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency-methylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "unmethylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency-unmethylated-only.pdf' has been saved!

# plot motif logo for GTRD-EXP040801_HSA_HL-60_CEBPB. No DNA methylation states available here
plotLogo(HL60_CEBPB, logo_type = "entropy")
#> Success: a PDF named 'GTRD-EXP040801_HSA_HL-60_CEBPB-logo-entropy.pdf' has been saved!
knitr::include_graphics("../inst/extdata/plotLogo.png")

Motif matrix as well as methylation matrix (beta score matrix), if available, can be saved locally using exportMMPFM. To be noted, the function exportMMPFM is also able to export (Meth)Motif matrix for the outputs of commonPeaks, exclusivePeaks and intersectPeakMatrix by specifying "fun = ". We will introduce it in the following section.

# export MethMotif matrix for MM1_HSA_K562_CEBPB
exportMMPFM(fun_output = K562_CEBPB, fun = "searchMotif", save_motif_PFM = TRUE, save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export searchMotif
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB-motif-MEME.txt'.

# export motif matrix for GTRD-EXP040801_HSA_HL-60_CEBPB
exportMMPFM(fun_output = HL60_CEBPB, fun = "searchMotif", save_motif_PFM = TRUE, save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export searchMotif
#> ... ... ... ... No beta score matrix is available. Skip!
#> ... ... ... ... Motif PFM has been saved as 'GTRD-EXP040801_HSA_HL-60_CEBPB-motif-MEME.txt'.

# exprot motif matrix in TRANSFAC format
K562_CEBPB_TRANSFAC <- searchMotif(id = "MM1_HSA_K562_CEBPB", motif_format = "TRANSFAC")
#> There are a matched record exported in a MethMotif object.
exportMMPFM(fun_output <- K562_CEBPB_TRANSFAC, fun = "searchMotif", save_motif_PFM = TRUE, save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export searchMotif
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB-motif-TRANSFAC.txt'.

More importantly, we allow the users to load peak regions (all peaks or peaks only with motif) of all TFs in TFregulomeR compendium using loadPeaks. To be noted, the peak regions of a given TF in TFregulomeR compendium are the peak summits (hg38 for human), and the TFBS is enriched in a +/- 100bp window surrounding the peak summits. For each peak region, we also provide its tag (read) fold change (fifth column of returned peaks). This read enrichment value is obtained from MACS2 and denotes the fold change of reads in TF ChIP-seq compared to input ChIP-seq.

K562_CEBPB_peaks <- loadPeaks(id = "MM1_HSA_K562_CEBPB", includeMotifOnly = TRUE)
#> Success: peak file has been returned in a data frame!
head(K562_CEBPB_peaks)
#>    chr     start       end                                    id tag_fold_change
#> 1 chr3 101823721 101823722 MM1_HSA_K562_CEBPB_peaks_with_motif_1        22.91742
#> 2 chr3 101850619 101850620 MM1_HSA_K562_CEBPB_peaks_with_motif_2        13.09647
#> 3 chr3 102182290 102182291 MM1_HSA_K562_CEBPB_peaks_with_motif_3        17.28870
#> 4 chr3 105626970 105626971 MM1_HSA_K562_CEBPB_peaks_with_motif_4        23.36092
#> 5 chr3 105647238 105647239 MM1_HSA_K562_CEBPB_peaks_with_motif_5        34.80412
#> 6 chr3 105899733 105899734 MM1_HSA_K562_CEBPB_peaks_with_motif_6        13.61153

Study TFBS propensity

Common peak regions

knitr::include_graphics("../inst/extdata/commonPeaks_tutorial.png")

TFregulomeR provides the functionality to find the common peak regions along with DNA methylation profiles and read (tag) enrichments using commonPeaks.

For the target peak sets, users can directly use TFregulomeR TF peaks (hg38 for human) by inputting its TFregulomeR IDs in target_peak_id (all peaks or peaks with motif only can be opted using motif_only_for_target_peak), or their own peak regions in user_target_peak_list. If customised peak sets are provided, all peak sets should be stored in an R list(), and each peak set should be a bed-format data.frame with the first three columns as chromosome (starting with 'chr'), start and end. It's recommended that users provide the UNIQUE IDs for their customised peak list in user_target_peak_id (also should be unique to the provided TFregulomeR ID list in target_peak_id). If unavailable, the function will automatically assign IDs for the user's peak sets. It should be noted that if the customised peak set is derived TFregulomeR compendium, it's highly recommended that its TFregulomeR ID should be provided correspondingly in user_target_peak_id if one opts to profile the DNA methylation levels. Even though TFregulomeR peak sets are peak summits, the function is able to recognise it with the provided TFregulomeR ID in user_target_peak_id and automatically expand +/- 100bp during the analysis.

For the compared peak sets, same rules are applicable to options compared_peak_id, motif_only_for_compared_peak, user_compared_peak_list and user_compared_peak_id when loading compared peak sets.

During the analysis, EACH of target peak set from TFregulomeR compendium using target_peak_id and/or user provided using user_target_peak_list will be compared with ALL input compared peak sets (compared_peak_id and user_compared_peak_list), to get a final target sub-ensemble peaks shared by all compared peak sets. If methylation_profile_in_narrow_region=TRUE, DNA methylation profiling in +/- 100bp surrounding peak summit will be performed for each target common peak sub-ensemble, if its ID labeled in target_peak_id and user_target_peak_list matches a MethMotif ID ("MM1_HSA_") in TFregulomeR.

# read my local file. Here we use the HCT116 CEBPB binding sites from the publication PMID: 30380113
my_peak_path <- system.file("extdata", "HCT116_CEBPb_binding_sites.txt", package = "TFregulomeR")
my_peak <- read.delim(my_peak_path, sep = "\t", header = FALSE)
head(my_peak)
#>      V1        V2        V3
#> 1  chr1  58585814  58585827
#> 2 chr12 122925699 122925712
#> 3  chr9   5818111   5818124
#> 4 chr19   5850889   5850902
#> 5 chr10   5951274   5951287
#> 6  chr1   8175025   8175038

# To get the sub-ensemble peaks of K562 CEBPB peaks and my peaks, which are
# share with the CEBPB peaks in all cell types from TFregulomeR compendium,
# and at the same time to profile the DNA methylation states in the final subsets.

# 1) Get all CEBPB records in TFregulomeR compendium
CEBPB_record <- dataBrowser(tf = "CEBPB") # or TFBSBrowser() before v1.2.0
#> 16 records(s) founded: ...
#> ... covering 1 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 9 organ(s):
#> ... ... colorectum, uterus, blood_and_lymph, stem_cell, bone, lung, cervix, liver, breast
#> ... in 2 sample type(s):
#> ... ... cell_line, primary_cells
#> ... in  16  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... tumor, normal
#> ... from the source(s): GTRD, MethMotif

# 2) Start commonPeaks analysis
commonPeak_output <- commonPeaks(target_peak_id = "MM1_HSA_K562_CEBPB",
                                 motif_only_for_target_peak = TRUE, 
                                 user_target_peak_list = list(my_peak), 
                                 user_target_peak_id = c("HCT116_CEBPB"), 
                                 compared_peak_id = CEBPB_record$ID, 
                                 motif_only_for_compared_peak = TRUE, 
                                 methylation_profile_in_narrow_region = TRUE)
#> TFregulomeR::commonPeaks() starting ... ...
#> You chose to profile the methylation levels in 200bp window around peak summits, 
#> if there is any peak loaded from TFregulome
#> Loading target peak list ... ...
#> ... You have 1 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_target_peak' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_CEBPB'
#> ... Done loading TFBS(s) from TFregulome
#> ... You have 1 customised peak set(s)
#> Loading compared peak list ... ...
#> ... You have 16 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_compared_peak' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'GTRD-EXP000142_HSA_LS180_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP010975_HSA_Ishikawa_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP030173_HSA_LoVo_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP030702_HSA_blood-monocytes_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP034967_HSA_mesenchymal-stem-cells_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP036478_HSA_fetal-osteoblasts_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP040652_HSA_monocyte-derived-macrophages_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP040801_HSA_HL-60_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_A549_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_H1-hESC_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HCT116_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HeLa-S3_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HepG2_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_IMR-90_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_MCF-7_CEBPB'
#> ... Done loading TFBS(s) from TFregulome
#> Start analysing: MM1_HSA_K562_CEBPB... ...
#> Start analysing: HCT116_CEBPB... ...
#> Done analysing.

After getting the output of commonPeaks, you can use commonPeakResult to get 1) the summary, 2) the common peak regions, 3) DNA methylation levels in 200bp around common peak summits if the input TF source is MethMotif in TFregulomeR compendium and 4) the (Meth)Motif logo if TF input is from TFregulomeR warehouse or with TFregulomeR ID.

commonPeak_result <- commonPeakResult(commonPeaks = commonPeak_output,
                                      return_common_peak_sites = TRUE, 
                                      save_MethMotif_logo = TRUE, 
                                      return_methylation_profile = TRUE, 
                                      return_summary = TRUE)
#> Start getting the results of commonPeakResult ...
#> ... ... You chose to return common peak sites;
#> ... ... You chose to return methylation profile;
#> ... ... You chose to return common peak summary;
#> ... ... ... ALL of common peak sets, methylation profiles and peak summary will be 
#> stored in a list, and named with 'common_peak_list', 'methylation_profile' and 
#> 'peak_summary' in the list. Use 'names()' in the output for its detials.
#> ... ... You chose to save MethMotif logo in PDF if any;
#> ... ... ... You chose entropy logo;
#> ... ... ... You chose to show all methylation levels;
#> Success: a PDF named 'MM1_HSA_K562_CEBPB_common_peaks-logo-entropy.pdf' has been saved!
#> ... ... ... The input peak set for the results 'HCT116_CEBPB_common_peaks' was 
#> not originated from TFregulomeR or the number of direct binding sites in the 
#> common peaks is 0, so no motif logo available.

# the contents in commonPeak_result
names(commonPeak_result)
#> [1] "common_peak_list"    "methylation_profile" "peak_summary" 
common_peak_list <- commonPeak_result$common_peak_list
methylation_profile <- commonPeak_result$methylation_profile
peak_summary <- commonPeak_result$peak_summary

# peak summary: 1.137225% of K562 CEBPB peaks and 2.517540% of my peaks are common with 
# the CEBPB peaks in all cell types available in TFregulomeR warehouse.
peak_summary
#>                                 percentage_in_original_inputs(%)
#> MM1_HSA_K562_CEBPB_common_peaks                         1.137225
#> HCT116_CEBPB_common_peaks                               2.517540

# common peak regions
K562_CEBPB_common_peak <- common_peak_list$MM1_HSA_K562_CEBPB_common_peaks
head(K562_CEBPB_common_peak)
#>      chr     start       end                                      id tag_fold_change
#> 39  chr3 126261153 126261154  MM1_HSA_K562_CEBPB_peaks_with_motif_39        91.77499
#> 145 chr3 152100699 152100700 MM1_HSA_K562_CEBPB_peaks_with_motif_145        74.01529
#> 276 chr3 188239782 188239783 MM1_HSA_K562_CEBPB_peaks_with_motif_276        13.06116
#> 283 chr3 194003766 194003767 MM1_HSA_K562_CEBPB_peaks_with_motif_283        26.67740
#> 435 chr4  73704747  73704748 MM1_HSA_K562_CEBPB_peaks_with_motif_435        54.26637
#> 456 chr4  77158137  77158138 MM1_HSA_K562_CEBPB_peaks_with_motif_456        31.97526

# methylation profile in common peak regions
names(methylation_profile)
#> [1] "MM1_HSA_K562_CEBPB_common_peaks" "HCT116_CEBPB_common_peaks"
methylation_profile$MM1_HSA_K562_CEBPB_common_peaks
#>         CpG_num
#> 0-10%       161  #161 CpG methylation scores are less than 0.1 (homogeneously unmethylated) in +/-100bp window around common peak summits
#> 10-20%        9
#> 20-30%       17
#> 30-40%        2
#> 40-50%        1
#> 50-60%        2
#> 60-70%        9
#> 70-80%        5
#> 80-90%       13
#> 90-100%      25 #25 CpG methylation scores are more than 0.9 (homogeneously methylated) in +/-100bp window around common peak summits

# customised input peaks are not originated from TFregulomeR compendium, so no DNA methylation states
methylation_profile$HCT116_CEBPB_common_peaks
#>      [,1]
#> [1,]   NA
knitr::include_graphics("../inst/extdata/MM1_HSA_K562_CEBPB_common_peaks-logo-entropy.png")

Exclusive peak regions

knitr::include_graphics("../inst/extdata/exclusivePeaks_tutorial.png")

Exclusive peak regions are important to study a TF's context-specific function. Hence, we implemented such functionality to achieve the extraction of the context dependent peak loci along with DNA methylation profiles, exclusivePeaks.

For the target peak sets, users can directly use TFregulomeR TF peaks (hg38 for human) by inputting the TFregulomeR IDs in target_peak_id (all peaks or peaks with motif only can be selected using motif_only_for_target_peak), or their own peak regions in user_target_peak_list. If customised peak sets are provided, all peak sets should be stored in an R list(), and each peak set should be a bed-format data.frame with the first three columns as chromosome (starting with 'chr'), start and end. It's recommended that users provide the UNIQUE IDs for their customised peak list in user_target_peak_id (also should be unique to the provided TFregulomeR ID list in target_peak_id). If unavailable, the function will automatically assign IDs for the provided peak sets. It should be noted that if the customised target peak set is derived from TFregulomeR compendium, it's highly recommended that its TFregulomeR ID should be provided correspondingly in user_target_peak_id if one opts to profile the DNA methylation levels . Even though TFregulomeR peak sets are peak summits, the function is able to recognise it with the provided TFregulomeR ID in user_target_peak_id and automatically expand +/- 100bp during the analysis.

For the excluded peak sets, same rules are applicable to options excluded_peak_id, motif_only_for_excluded_peak, user_excluded_peak_list and user_excluded_peak_id when loading excluded peak sets.

During the analysis, EACH of target peak set from TFregulomeR compendium using target_peak_id and/or user provided using user_target_peak_list will be compared with ALL input excluded peak sets from excluded_peak_id and user_excluded_peak_list, to get a final target peak sub-ensemble which is exclusive from all input excluded peak sets. If methylation_profile_in_narrow_region=TRUE, DNA methylation profiling in +/- 100bp surrounding peak summit will be performed for each target exclusive peak sub-ensemble, if its ID labeled in target_peak_id and user_target_peak_list matches a MethMotif ID record ("MM1_HSA_") in TFregulomeR compendium.

# To get the exclusive subset of K562 CEBPB peaks and at the same time to profile the DNA 
# methylation states in the exclusive subset

# 1) Get all CEBPB records in TFregulomeR warehouse
CEBPB_record <- dataBrowser(tf = "CEBPB") # or TFBSBrowser() before v1.2.0
#> 16 record(s) founded: ...
#> ... covering 1 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 9 organ(s):
#> ... ... colorectum, uterus, blood_and_lymph, stem_cell, bone, lung, cervix, liver, breast
#> ... in 2 sample type(s):
#> ... ... cell_line, primary_cells
#> ... in  16  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... tumor, normal
#> ... from the source(s): GTRD, MethMotif

# 2) All CEBPB TFregulomeR IDs except MM1_HSA_K562_CEBPB
CEBPB_record_ID_noK562 <- CEBPB_record$ID[!(CEBPB_record$ID %in% "MM1_HSA_K562_CEBPB")]
exclusivePeak_output <- exclusivePeaks(target_peak_id = "MM1_HSA_K562_CEBPB", 
                                       motif_only_for_target_peak = TRUE, 
                                       excluded_peak_id = CEBPB_record_ID_noK562, 
                                       motif_only_for_excluded_peak = TRUE, 
                                       methylation_profile_in_narrow_region = TRUE)
#> TFregulomeR::exclusivePeaks() starting ... ...
#> You chose to profile the methylation levels in 200bp window around peak summits, 
#> if there is any peak loaded from TFregulome
#> Loading target peak list ... ...
#> ... You have 1 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_target_peak' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_CEBPB'
#> ... Done loading TFBS(s) from TFregulome
#> Loading excluded peak list ... ...
#> ... You have 15 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_excluded_peak' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'GTRD-EXP000142_HSA_LS180_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP010975_HSA_Ishikawa_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP030173_HSA_LoVo_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP030702_HSA_blood-monocytes_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP034967_HSA_mesenchymal-stem-cells_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP036478_HSA_fetal-osteoblasts_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP040652_HSA_monocyte-derived-macrophages_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP040801_HSA_HL-60_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_A549_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_H1-hESC_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HCT116_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HeLa-S3_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HepG2_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_IMR-90_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_MCF-7_CEBPB'
#> ... Done loading TFBS(s) from TFregulome
#> Start analysing: MM1_HSA_K562_CEBPB... ...
#> Done analysing.

After getting the output of exclusivePeaks, you can use exclusivePeakResult to get 1) the summary, 2) the exclusive peak regions, 3) DNA methylation levels in exclusive peak subsets if the input source is MethMotif in TFregulomeR compendium and 4) the (Meth)Motif logo if the input target is from TFregulomeR compendium or comes with a TFregulomeR ID.

exclusivePeak_result <- exclusivePeakResult(exclusivePeaks = exclusivePeak_output,
                                            return_exclusive_peak_sites = TRUE,
                                            save_MethMotif_logo = TRUE, 
                                            return_methylation_profile = TRUE,
                                            return_summary = TRUE)
#> Start getting the results of exclusivePeaks ...
#> ... ... You chose to return exclusive peak sites;
#> ... ... You chose to return methylation profile;
#> ... ... You chose to return exclusive peak summary;
#> ... ... ... ALL of exclusive peak sets, methylation profiles and peak summary will 
#> be stored in a list, and named with 'exclusive_peak_list', 'methylation_profile' and 
#> 'peak_summary' in the list. Use 'names()' in the output for its detials.
#> ... ... You chose to save MethMotif logo in PDF if any;
#> ... ... ... You chose entropy logo;
#> ... ... ... You chose to show all methylation levels;
#> Success: a PDF named 'MM1_HSA_K562_CEBPB_exclusive_peaks-logo-entropy.pdf' has been saved!

# the contents in exclusivePeak_result
names(exclusivePeak_result)
#> [1] "exclusive_peak_list" "methylation_profile" "peak_summary" 
exclusive_peak_list <- exclusivePeak_result$exclusive_peak_list
peak_summary <- exclusivePeak_result$peak_summary
methylation_profile <- exclusivePeak_result$methylation_profile

# peak summary, 9.110437% of K562 CEBPB peaks are unique compared with 
# all CEBPB TFBSs in TFregulomeR warehouse
peak_summary
#>                                    percentage_in_original_inputs(%)
#> MM1_HSA_K562_CEBPB_exclusive_peaks                         9.110437

K562_CEBPB_exclusive_peak <- exclusive_peak_list$MM1_HSA_K562_CEBPB_exclusive_peaks
head(K562_CEBPB_exclusive_peak)
#>     chr     start       end                                     id tag_fold_change
#> 4  chr3 105626970 105626971  MM1_HSA_K562_CEBPB_peaks_with_motif_4        23.36092
#> 22 chr3 119695686 119695687 MM1_HSA_K562_CEBPB_peaks_with_motif_22        10.89517
#> 45 chr3 128415444 128415445 MM1_HSA_K562_CEBPB_peaks_with_motif_45        40.57037
#> 57 chr3 130184388 130184389 MM1_HSA_K562_CEBPB_peaks_with_motif_57        23.26712
#> 58 chr3 130343212 130343213 MM1_HSA_K562_CEBPB_peaks_with_motif_58        13.14518
#> 68 chr3 133629332 133629333 MM1_HSA_K562_CEBPB_peaks_with_motif_68        20.32872

# methylation profile in exclusive peak regions
names(methylation_profile)
#> [1] "MM1_HSA_K562_CEBPB_exclusive_peaks"
methylation_profile$MM1_HSA_K562_CEBPB_exclusive_peaks
#>         CpG_num
#> 0-10%       852  #852 CpG methylation scores are less than 0.1 (homogeneously unmethylated) in +/-100bp window around exclusive peak summits
#> 10-20%      149
#> 20-30%       86
#> 30-40%       63
#> 40-50%       45
#> 50-60%       39
#> 60-70%       25
#> 70-80%       46
#> 80-90%       29
#> 90-100%      40 #40 CpG methylation scores are more than 0.9 ((homogeneously methylated)) in +/-100bp window around exclusive peak summits
knitr::include_graphics("../inst/extdata/MM1_HSA_K562_CEBPB_exclusive_peaks-logo-entropy.png")

Intersected peak matrix

knitr::include_graphics("../inst/extdata/intersectPeakMatrix_tutorial.png")

TFregulomeR allows users to portray the co-binding landscapes between two collections of TFs, along with DNA methylation states in the pair-wise intersected peaks, using intersectPeakMatrix. This functionality is particularly useful to study TF cofactor and interactome in a cell type. Different from commonPeaks, intersectPeakMatrix perform an exhaustive pair-wise intersection analysis between peak list x and peak list y, to form an x*y intersection matrix. Therefore, it is required for users to provide the two lists of peak sets.

For peak list x, users can directly use TFregulomeR peaks by providing TFregulomeR ID in peak_id_x and indicating whether loading peaks with motif only using motif_only_for_id_x. In addition, customised peak sets can also be input in user_peak_list_x. It's recommended that UNIQUE IDs (also unique to IDs in peak_id_x) be provided for each customised peak set in user_peak_x_id. If the customised peak set is derived from TFregulomeR compendium, it's highly recommended that the corresponding TFregulomeR ID be provided in user_peak_x_id, which allows the function to recognise the source of peak set and to properly profile the DNA methylation states (if methylation_profile_in_narrow_region=TRUE) as well as read enrichments in the intersected regions. Even though TFregulomeR peak sets are peak summits, the function is able to recognise it with the provided TFregulomeR ID in peak_id_x and automatically expand +/- 100bp during the analysis.

Same principles are applicable for peak list y.

Advice: It's to be noted that function intersectPeakMatrix could take up several minutes to hours depending on the number of input TFs. It is a wise choice to save a intersectPeakMatrix output into R data format using saveRDS(object, file="my_data.rds") and restore it using readRDS(file="my_data.rds") if you need to re-use the output to extract other results by function intersectPeakMatrixResult in the future.

# profile the co-binding landscapes of all K562 TFs in TFregulomeR warehouse surrounding
# K562 CEBPB common and exclusive peaks

# browse all TFBS record in K562
K562_TFBS = dataBrowser(cell_tissue_name = "K562") # or TFBSBrowser() before v1.2.0
#> 131 record(s) found: ...
#> ... covering 131 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 1 organ(s):
#> ... ... blood_and_lymph
#> ... in 1 sample type(s):
#> ... ... cell_line
#> ... in  1  different cell(s) or tissue(s)
#> ... in 1 type(s) of disease state(s):
#> ... ... tumor
#> ... from the source(s): MethMotif

# co-binding landscapes in K562 CEBPB common peaks
intersectMatrix_common <- intersectPeakMatrix(user_peak_list_x = list(K562_CEBPB_common_peak),
                                              user_peak_x_id =  c("MM1_HSA_K562_CEBPB"), 
                                              peak_id_y = K562_TFBS$ID, 
                                              motif_only_for_id_y = TRUE, 
                                              methylation_profile_in_narrow_region = TRUE)
#> TFregulomeR::intersectPeakMatrix() starting ... ...
#> You chose to profile the methylation levels in 200bp window around peak summits, 
#> if there is any peak loaded from TFregulome. It will make the program slow. 
#> Disable it if you want a speedy analysis and do not care about methylation
#> Loading peak list x ... ...
#> ... You have 1 customised peak set(s)
#> Loading peak list y ... ...
#> ... You have 131 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_id_y' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_AFF1'
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_ARID2'
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_ARID3A'
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_ATF1'
#>     ... ...
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_ZSCAN29'
#> ... Done loading TFBS(s) from TFregulome
#> Start analysing list x:MM1_HSA_K562_CEBPB... ...
#> ... ... Start analysing list y:MM1_HSA_K562_AFF1
#> ... ... Start analysing list y:MM1_HSA_K562_ARID2
#> ... ... Start analysing list y:MM1_HSA_K562_ARID3A
#> ... ... Start analysing list y:MM1_HSA_K562_ATF1
#>     ... ...
#> ... ... Start analysing list y:MM1_HSA_K562_ZSCAN29

# K562 CEBPB exclusive peaks
intersectMatrix_exclusive <- intersectPeakMatrix(user_peak_list_x = list(K562_CEBPB_exclusive_peak), 
                                                 user_peak_x_id = c("MM1_HSA_K562_CEBPB"), 
                                                 peak_id_y = K562_TFBS$ID,
                                                 motif_only_for_id_y = TRUE,
                                                 methylation_profile_in_narrow_region = TRUE)
#> TFregulomeR::intersectPeakMatrix() starting ... ...
#> You chose to profile the methylation levels in 200bp window around peak summits, 
#> if there is any peak loaded from TFregulome. It will make the program slow. 
#> Disable it if you want a speedy analysis and do not care about methylation
#> Loading peak list x ... ...
#> ... You have 1 customised peak set(s)
#> Loading peak list y ... ...
#> ... You have 108 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_id_y' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_AFF1'
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_ARID2'
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_ARID3A'
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_ATF1'
#>     ... ...
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_ZSCAN29'
#> ... Done loading TFBS(s) from TFregulome
#> Start analysing list x:MM1_HSA_K562_CEBPB... ...
#> ... ... Start analysing list y:MM1_HSA_K562_AFF1
#> ... ... Start analysing list y:MM1_HSA_K562_ARID2
#> ... ... Start analysing list y:MM1_HSA_K562_ARID3A
#> ... ... Start analysing list y:MM1_HSA_K562_ATF1
#>     ... ...
#> ... ... Start analysing list y:MM1_HSA_K562_ZSCAN29

We have implemented intersectPeakMatrixResult in TFregulomeR package, allowing an easy extraction and interpretation of intersectPeakMatrix output. It is worth noting that there are two ways of interpretations in intersection between set A and B, that is, 1) the percentage of A overlapped with B, and 2) the percentage of B intersected with A. Same principle is applicable for the output of intersectPeakMatrix.

The output of intersectPeakMatrix is a X*Y matrix table (X peak sets in peak list x and Y peak sets in peak list y) and each table cell contains an intersectPeakMatrix class object. IntersectPeakMatrix class is a exclusively-designed S4 class and each object contains:

1) Information of peak x subset overlapped with peak y:

i) overlap percentage in peak x;

ii) motif in peak x overlapped with peak y;

iii) tag enrichment in peak x overlapped with peak y;

iv) DNA methylation in peak x overlapped with peak y, if peak x is derived from MethMotif in TFregulomeR compendium.

and

2) Information of peak y subset overlapped with peak x:

i) overlap percentage in peak y;

ii) motif in peak y overlapped with peak x;

iii) tag enrichment in peak y overlapped with peak x;

iv) DNA methylation in peak y overlapped with peak x, if peak y is derived from MethMotif in TFregulomeR compendium.

By using return_intersection_matrix = TRUE and angle_of_matrix = "x" in function
intersectPeakMatrixResult, user can obtain an intersection matrix table. Row i and column j table cell denotes the percent of peak x(i) overlapped with peak y(j). If angle_of_matrix = "y", then row i and column j table cell denotes the percent of peak y(j) overlapped with peak x(i).

By using return_methylation_profile = TRUE and angle_of_methylation_profile = "x" in function
intersectPeakMatrixResult, user can obtain a DNA methylation matrix table. Row i and column j table cell contains a vector about the statistics of CpG methylation states within the peak x(i) overlapped with peak y(j). If angle_of_matrix = "y", then row i and column j table cell contains a vector about the statistics of CpG methylation states within the peak y(j) overlapped with peak x(i).

By using save_MethMotif_logo = TRUE and angle_of_logo = "x" in function
intersectPeakMatrixResult, the function will plot and save (Meth)Motif logo for each peak x intersected with peak y (if any of peak x is derived from TFregulomeR compendium). If angle_of_logo = "y", the function will plot and save (Meth)Motif logo for each peak y intersected with peak x (if any of peak y is derived from TFregulomeR compendium). Indeed, it's not necessary to plot all (Meth)Motif logos for every pair of intersection at the same time. One can focus only on some subsets of peak_list_x and peak_list_y, using saving_MethMotif_logo_x_id and saving_MethMotif_logo_y_id respectively.

Lastly, By using return_tag_density = TRUE and angle_of_tag_density = "x" in function
intersectPeakMatrixResult, user can obtain a read enrichment matrix table. Row i and column j table cell denotes a read enrichment value within the peak x(i) overlapped with peak y(j). If angle_of_tag_density = "y", then row i and column j table cell denotes a read enrichment value within the peak y(j) overlapped with peak y(x).There are five read enrichment values to be selected using tag_density_value: median, mean, SD (standard deviation), quartile_25 (first quartile) and quartile_75 (third quartile).

In order to make cofactor analysis more straightforward, a function cofactorReport has been exclusively designed. Just by simply inputting the output of intersectPeakMatrix into the function, cofactor report will be saved as a PDF file for each peak x. In the report, top cofactors of peak x will be reported along with motif sequences, DNA methylation within motif and 200bp peak regions as well as read enrichment scores (median, first quartile and third quartile).

# get the intersection matrix for K562 common peaks
IM_K562_common_result <- intersectPeakMatrixResult(intersectPeakMatrix = intersectMatrix_common, 
                                                   return_intersection_matrix = TRUE, 
                                                   angle_of_matrix = "x", 
                                                   return_methylation_profile = TRUE, 
                                                   angle_of_methylation_profile = "x",
                                                   return_tag_density = TRUE,
                                                   angle_of_tag_density = "x",
                                                   tag_density_value = "median")
#> Start getting the results of intersectPeakMatrix ...
#> ... ... You chose to return intersection matrix;
#> ... ... ... You chose x-wise intersection matrix;
#> ... ... You chose to return tag density;
#> ... ... ... the tag density value you chose to return is median
#> ... ... ... You chose to return tag density for peak list x;
#> ... ... You chose to return methylation profile;
#> ... ... ... You chose to return methylation profile for peak list x;
#> ... ... You chose NOT to save MethMotif logo in PDF if any;

names(IM_K562_common_result)
#> [1] "intersection_matrix"        "tag_density_matrix"         "methylation_profile_matrix"

# return intersection matrix table for K562 CEBPB shared peaks
IM_K562_common_intersect_matrix <- IM_K562_common_result$intersection_matrix
dim(IM_K562_common_intersect_matrix)
#> [1]   1 131 # 1 row represents the 1 input peak x, 131 columns represent the 131 peak y

# e.g. 1.111111%, 3.333333% and  1.111111% of K562 common peaks overlapped with MM1_HSA_K562_AFF1, MM1_HSA_K562_ARID2 and MM1_HSA_K562_ARID3A respectively.
IM_K562_common_intersect_matrix[1,1:3]
#>                    MM1_HSA_K562_AFF1 MM1_HSA_K562_ARID2 MM1_HSA_K562_ARID3A
#> MM1_HSA_K562_CEBPB          1.111111           3.333333            1.111111

# find the top 10 TFs co-binding in K562 CEBPB shared peaks
IM_K562_common_result_t <- as.data.frame(t(IM_K562_common_intersect_matrix))
attach(IM_K562_common_result_t)
IM_K562_common_result_order <- as.data.frame(IM_K562_common_result_t[order(-MM1_HSA_K562_CEBPB),,drop = FALSE])
detach(IM_K562_common_result_t)
head(IM_K562_common_result_order, n = 10)
#>                    MM1_HSA_K562_CEBPB
#> MM1_HSA_K562_CEBPB         100.000000
#> MM1_HSA_K562_CEBPD          36.666667
#> MM1_HSA_K562_CTCF           12.222222
#> MM1_HSA_K562_E4F1           10.000000
#> MM1_HSA_K562_JUND           10.000000
#> MM1_HSA_K562_FOS             8.888889
#> MM1_HSA_K562_JUN             8.888889
#> MM1_HSA_K562_CTCFL           7.777778
#> MM1_HSA_K562_ELF4            7.777778
#> MM1_HSA_K562_ATF4            6.666667

# The highest co-binding factor in K562 CEBPB common peaks is CEBPD.
# Then plot MethMotif logo for the K562 CEBPB common sites intersected with CEBPD peaks
intersectPeakMatrixResult(intersectPeakMatrix = intersectMatrix_common, 
                          save_MethMotif_logo = TRUE, 
                          angle_of_logo = "x", 
                          saving_MethMotif_logo_y_id = c("MM1_HSA_K562_CEBPD"))
#> Start getting the results of intersectPeakMatrix ...
#> ... ... You chose NOT to return intersection matrix;
#> ... ... You chose NOT to return methylation profile;
#> ... ... You chose to save MethMotif logo in PDF if any;
#> ... ... ... You chose x-wise MethMotif logo;
#> ... ... ... You chose entropy logo;
#> ... ... ... You chose to show all methylation levels;
#> Success: a PDF named 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_CEBPD-logo-entropy.pdf' 
#> has been saved!

# return DNA methylation matrix table for K562 CEBPB shared peaks
IM_K562_common_methylation <- IM_K562_common_result$methylation_profile_matrix
dim(IM_K562_common_methylation)
#> [1]   1 131 # 1 row represents the 1 input peak x, 131 columns represent the 131 peak y
rownames(IM_K562_common_methylation)
#> [1] "MM1_HSA_K562_CEBPB"
colnames(IM_K562_common_methylation)[1:5]
#> [1] "MM1_HSA_K562_AFF1"   "MM1_HSA_K562_ARID2"  "MM1_HSA_K562_ARID3A"
#> [4] "MM1_HSA_K562_ATF1"   "MM1_HSA_K562_ATF2"  

# the methylation level in the intersected regions between CEBPB and CEBPBD in K562
# each cell in the matrix is a list()
IM_K562_common_methylation["MM1_HSA_K562_CEBPB","MM1_HSA_K562_CEBPD"][[1]]
#>         CpG_num
#> 0-10%        64 #64 CpG methylation scores are less than 0.1 (homogeneously unmethylated) in +/-100bp window around intersected peak summits
#> 10-20%        8
#> 20-30%        7
#> 30-40%        2
#> 40-50%        1
#> 50-60%        1
#> 60-70%        8
#> 70-80%        3
#> 80-90%       10
#> 90-100%      16 #16 CpG methylation scores are more than 0.9 (homogeneously methylated) in +/-100bp window around intersected peak summits


# return read enrichment matrix table for K562 CEBPB shared peaks
IM_K562_common_read_enrichment <- IM_K562_common_result$tag_density_matrix
IM_K562_common_read_enrichment[,seq(1,3,1)]
#>                    MM1_HSA_K562_AFF1 MM1_HSA_K562_ARID2 MM1_HSA_K562_ARID3A
#> MM1_HSA_K562_CEBPB          54.27878           34.56091            17.68113
# the read ernichment median value in the K562 CEBPB peaks overlapped with AFF1 is 54.27878

# Simply using cofactorReport(), cofactors for each peak x will be automatically reported
cofactorReport(intersectPeakMatrix = intersectMatrix_common)
#> Start cofactorReport ...
#> ... The maximum number of cofactors to be reported is 10
#> ... The minimum percent of co-binding peaks for a cofactor is 5%
#> ... Each peak set derived from TFregulomeR compendium in 'peak list x' will be reported in an individual PDF file
#> ... ... Start reporting peak id 'MM1_HSA_K562_CEBPB' ...
#> ... ... ... The number of cofactors passing 'cobinding_threshold' for peak id 'MM1_HSA_K562_CEBPB' is 15. Only top 10 will be selected.
#> ... ... ... Cofactor report for id 'MM1_HSA_K562_CEBPB' has been saved as MM1_HSA_K562_CEBPB_cofactor_report.pdf

# get the intersection matrix for K562 exclusive peaks
IM_K562_exclusive_result <- intersectPeakMatrixResult(intersectPeakMatrix = intersectMatrix_exclusive, 
                                                      return_intersection_matrix = TRUE, 
                                                      angle_of_matrix = "x", 
                                                      return_methylation_profile = TRUE, 
                                                      angle_of_methylation_profile = "x",
                                                      return_tag_density = TRUE,
                                                      angle_of_tag_density = "x",
                                                      tag_density_value = "median")
#> Start getting the results of intersectPeakMatrix ...
#> ... ... You chose to return intersection matrix;
#> ... ... ... You chose x-wise intersection matrix;
#> ... ... You chose to return tag density;
#> ... ... ... the tag density value you chose to return is median
#> ... ... ... You chose to return tag density for peak list x;
#> ... ... You chose to return methylation profile;
#> ... ... ... You chose to return methylation profile for peak list x;
#> ... ... You chose NOT to save MethMotif logo in PDF if any;

names(IM_K562_exclusive_result)
#> [1] "intersection_matrix"        "tag_density_matrix"         "methylation_profile_matrix"

# return intersection matrix table for K562 CEBPB exclusive peaks
IM_K562_exclusive_intersect_matrix <- IM_K562_exclusive_result$intersection_matrix
# e.g. 0.4160888%, 0% and  0.554785% of K562 exclusive peaks overlapped with MM1_HSA_K562_AFF1, MM1_HSA_K562_ARID2 and MM1_HSA_K562_ARID3A respectively.
IM_K562_exclusive_intersect_matrix[1,1:3]
#>                    MM1_HSA_K562_AFF1 MM1_HSA_K562_ARID2 MM1_HSA_K562_ARID3A
#> MM1_HSA_K562_CEBPB         0.4160888                  0            0.554785

# find the top 10 TFs co-binding in K562 exclusive results
IM_K562_exclusive_result_t <- as.data.frame(t(IM_K562_exclusive_intersect_matrix))
attach(IM_K562_exclusive_result_t)
IM_K562_exclusive_result_order <- as.data.frame(IM_K562_exclusive_result_t[order(-MM1_HSA_K562_CEBPB),,drop = FALSE])
detach(IM_K562_exclusive_result_t)
head(IM_K562_exclusive_result_order, n = 10)
#>                      MM1_HSA_K562_CEBPB
#> MM1_HSA_K562_CEBPB           100.000000
#> MM1_HSA_K562_ATF4             35.367545
#> MM1_HSA_K562_TCF12            17.337032
#> MM1_HSA_K562_CBFA2T3          16.643551
#> MM1_HSA_K562_TAL1             16.366158
#> MM1_HSA_K562_GATA1            11.650485
#> MM1_HSA_K562_FOXM1             9.431345
#> MM1_HSA_K562_ATF7              8.876560
#> MM1_HSA_K562_NFE2              8.599168
#> MM1_HSA_K562_MAFG              7.073509

# The highest co-binding factor in K562 CEBPB exclusive peaks is ATF4.
# Then plot MethMotif logo for the K562 CEBPB exclusive sites intersected with ATF4 peaks
intersectPeakMatrixResult(intersectPeakMatrix = intersectMatrix_exclusive, 
                          save_MethMotif_logo = TRUE, 
                          angle_of_logo = "x", 
                          saving_MethMotif_logo_y_id = c("MM1_HSA_K562_ATF4"))
#> Start getting the results of intersectPeakMatrix ...
#> ... ... You chose NOT to return intersection matrix;
#> ... ... You chose NOT to return methylation profile;
#> ... ... You chose to save MethMotif logo in PDF if any;
#> ... ... ... You chose x-wise MethMotif logo;
#> ... ... ... You chose entropy logo;
#> ... ... ... You chose to show all methylation levels;
#> Success: a PDF named 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_ATF4-logo-entropy.pdf' 
#> has been saved!

# the methylation levels in the intersected regions between CEBPB and ATF4
IM_K562_exclusive_methylation <- IM_K562_exclusive_result$methylation_profile_matrix
IM_K562_exclusive_methylation["MM1_HSA_K562_CEBPB","MM1_HSA_K562_ATF4"][[1]]
#>         CpG_num
#> 0-10%       303
#> 10-20%       44
#> 20-30%       22
#> 30-40%       19
#> 40-50%       12
#> 50-60%       11
#> 60-70%        8
#> 70-80%       14
#> 80-90%       12
#> 90-100%       8

# Simply using cofactorReport(), cofactors for each peak x will be automatically reported
cofactorReport(intersectPeakMatrix = intersectMatrix_exclusive)
#> Start cofactorReport ...
#> ... The maximum number of cofactors to be reported is 10
#> ... The minimum percent of co-binding peaks for a cofactor is 5%
#> ... Each peak set derived from TFregulomeR compendium in 'peak list x' will be reported in an individual PDF file
#> ... ... Start reporting peak id 'MM1_HSA_K562_CEBPB' ...
#> ... ... ... The number of cofactors passing 'cobinding_threshold' for peak id 'MM1_HSA_K562_CEBPB' is 13. Only top 10 will be selected.
#> ... ... ... Cofactor report for id 'MM1_HSA_K562_CEBPB' has been saved as MM1_HSA_K562_CEBPB_cofactor_report.pdf
knitr::include_graphics("../inst/extdata/MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_CEBPD-logo-entropy.png")
knitr::include_graphics("../inst/extdata/MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_ATF4-logo-entropy.png")
knitr::include_graphics("../inst/extdata/cofactor_report.png")

Export motif PFM and beta score matrix

As aforementioned, exportMMPFM is not only designed for searchMotif outputs, but also compatible with the outputs from commonPeaks, exclusivePeaks, intersectPeakMatrix. Here, we want to export the motif PFMs and beta score matrices for K562 CEBPB common and exclusive peaks, as well as the common peaks intersected with CEBPD peaks and the exclusive peaks intersected with ATF4 peaks.

# export motif PFM and beta score matrix for K562 CEBPB common peaks
exportMMPFM(fun_output = commonPeak_output, 
            fun = "commonPeaks", 
            save_motif_PFM = TRUE, 
            save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export commonPeaks...
#> ... ... ... export id = MM1_HSA_K562_CEBPB_common_peaks
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB_common_peaks-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB_common_peaks-motif-MEME.txt'.
#> ... ... ... export id = HCT116_CEBPB_common_peaks
#> ... ... ... the original peaks of HCT116_CEBPB_common_peaks is not loaded from TFregulomeR, 
#> or in the common peak the number of TFBS is zero. Hence no further action for this id!

# export motif PFM and beta score matrix for K562 CEBPB exclusive peaks
exportMMPFM(fun_output = exclusivePeak_output, 
            fun = "exclusivePeaks", 
            save_motif_PFM = TRUE, 
            save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export exclusivePeaks...
#> ... ... ... export id = MM1_HSA_K562_CEBPB_exclusive_peaks
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB_exclusive_peaks-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB_exclusive_peaks-motif-MEME.txt'.

# export motif PFM and beta score matrix for K562 CEBPB common peaks intersected with K562 CEBPD peaks
exportMMPFM(fun_output = intersectMatrix_common, 
            fun = "intersectPeakMatrix", 
            save_motif_PFM = TRUE, 
            save_betaScore_matrix = TRUE, 
            angle_of_matrix_for_intersectPeakMatrix = "x", 
            saving_id_y_for_intersectPeakMatrix = c("MM1_HSA_K562_CEBPD"))
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export intersectPeakMatrix...
#> ... ... we will export in the x wide of intersectPeakMatrix output since the input angle_of_matrix_for_intersectPeakMatrix = 'x'
#> ... ... ... export id = MM1_HSA_K562_CEBPB
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_CEBPD-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_CEBPD-motif-MEME.txt'.

# export motif PFM and beta score matrix for K562 CEBPB exclusive peaks intersected with K562 ATF4 peaks
exportMMPFM(fun_output = intersectMatrix_exclusive, 
            fun = "intersectPeakMatrix", 
            save_motif_PFM = TRUE, 
            save_betaScore_matrix = TRUE, 
            angle_of_matrix_for_intersectPeakMatrix = "x", 
            saving_id_y_for_intersectPeakMatrix = c("MM1_HSA_K562_ATF4"))
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export intersectPeakMatrix...
#> ... ... we will export in the x wide of intersectPeakMatrix output since the input angle_of_matrix_for_intersectPeakMatrix = 'x'
#> ... ... ... export id = MM1_HSA_K562_CEBPB
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_ATF4-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_ATF4-motif-MEME.txt'.

Motif distribution

TFregulomeR also allows users to plot the distributions of a given TFBS from the TFregulomeR warehouse in a list of peak sets, using motifDistrib and plotDistrib sequentially. By providing the TFregulomeR ID in id as the input of the motifDistrib, motifDistrib will calculate the occurrences the TFBSs in the given list of peak sets input in peak_list. The unique IDs corresponding to peak_list is required to be provided at the same time using peak_id. If a peak set is derived from TFregulomeR compendium, the TFregulomeR ID should be provided correspondingly; if it is self provided, you can name it with a unique ID yourself.

It should be noted that even though the loaded peak regions from the TFregulomeR compendium are the peak summits, you don't need to expand the regions. Once you tell motifDistrib the peak set is a TFregulomeR TF subset by providing TFregulomeR ID in the peak_id, it will automatically operate on the peaks itself.

The output of motifDistrib is the input of plotDistrib. In each motif distribution plot, x-axis is the relative distance (bp) to the peak center, while y-axis is the percentage of the TFBS at the position.

Here, we show the distributions of K562 CEBPB motif in K562 CEBPB exclusive peaks and our own peaks locally loaded previously.

# loading my peaks
my_peak_path <- system.file("extdata", "HCT116_CEBPb_binding_sites.txt", package = "TFregulomeR")
my_peak <- read.delim(my_peak_path, sep = "\t", header = FALSE)

motifDistrib_output <- motifDistrib(id = "MM1_HSA_K562_CEBPB", 
                                    peak_list = list(K562_CEBPB_exclusive_peak,
                                                     my_peak),
                                    peak_id = c("MM1_HSA_K562_CEBPB","my_peak"))
#> motifDistrib starts analysing for MethMotif ID = MM1_HSA_K562_CEBPB
#> ... ... analysing peak set MM1_HSA_K562_CEBPB
#> ... ... analysing peak set my_peak
plotDistrib(motifDistrib = motifDistrib_output)
#> Distribution of motif MM1_HSA_K562_CEBPB in peak set MM1_HSA_K562_CEBPB has been saved!
#> Distribution of motif MM1_HSA_K562_CEBPB in peak set my_peak has been saved!
knitr::include_graphics("../inst/extdata/motif_distribution.png")

Annotate TFBS locations

TFregulomeR is able to annotate TFBS genomic locations using genomeAnnotate. The annotation process is following the order: promoter, TTS, exon, 5' UTR exon, 3' UTR exon, intron and intergenic region. Specifically, promoter is defined as the range from 1000bp upstream of TSS to 100bp downstream of TSS, and TTS is defined as the range from 100bp upstream of TTS to 1000bp downstream of TTS. Users can change the parameters using promoter_range and TTS_range respectively. The annotation output of genomeAnnotate is intuitive, not only will a data.frame containing annotation results be returned, but also an HTML report will be saved.

# annotate the locations of K562 CEBPB exclusive peaks
# loading UCSC knownGene
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

K562_CEBPB_exclusivePeak_loc <- genomeAnnotate(peaks = K562_CEBPB_exclusive_peak, 
                                               return_annotation = TRUE, 
                                               return_html_report = TRUE)
#> Start genomeAnnotate ...
#> ... ... You chose to return annotated results in a data.frame.
#> ... ... You chose to return an HTML report.
#> ... ... annotating promoters defined as upstream 1000bp and downstream 100bp
#> ... ... annotating TTS defined as upstream 100bp and downstream 1000bp
#> ... ... annotating exons
#> ... ... annotating 5' UTR
#> ... ... annotating 3' UTR
#> ... ... annotating introns
#> ... ... annotating intergenic regions
#> ... ... An html report has been generated as 'genomeAnnotate_result.html'!
#> ... ... The annotation results have been returned in a data.frame!

head(K562_CEBPB_exclusivePeak_loc)
#>    chr     start       end                      id   annotation geneName
#> 1 chr3 133629333 133629333  genomeAnnotate_peak_68 promoter-TSS   TOPBP1
#> 2 chr1 155858041 155858041 genomeAnnotate_peak_255 promoter-TSS    GON4L
#> 3 chr3 193662882 193662882 genomeAnnotate_peak_280 promoter-TSS     OPA1
#> 4 chr4  72038972  72038972 genomeAnnotate_peak_426 promoter-TSS   NPFFR2
#> 5 chr4  74448009  74448009 genomeAnnotate_peak_447 promoter-TSS     AREG
#> 6 chr4 143905637 143905637 genomeAnnotate_peak_635 promoter-TSS     GYPE
#>                                                                                                    transcript
#> 1                                                                         ENST00000513818.1;ENST00000506779.1
#> 2 ENST00000368331.5;ENST00000271883.9;ENST00000620426.4;ENST00000361040.9;ENST00000471341.5;ENST00000622608.1
#> 3                                                                                           ENST00000445863.1
#> 4                                                                         ENST00000395999.5;ENST00000358749.3
#> 5                                                                                           ENST00000511560.1
#> 6                                                       ENST00000358615.8;ENST00000506264.5;ENST00000437468.2
#>             distanceToTSS
#> 1                 941;602
#> 2 745;745;745;836;823;745
#> 3                       8
#> 4                 205;401
#> 5                     629
#> 6               77;117;73
knitr::include_graphics("../inst/extdata/genomeAnnotate.png")

Annotate TFBS functions

The key function of transcription factors is to regulate gene expression. By working with Genomic Regions Enrichment of Annotations Tool (GREAT), TFregulomeR allows users to annotate the functions of TFBSs using greatAnnotate. Given that GREAT server doesn't support hg38, liftOver R package has been incorporated in TFregulomeR to convert hg38 to hg19. The annotation output of greatAnnotate is intuitive, not only will a data.frame containing annotation results be returned, but also an HTML report will be saved. The HTML report takes advantage of rbokeh package, which presents a vivid and dynamic interface (Figure 13).

# annotate the functions of K562 CEBPB exclusive peaks
# loading GREAT R package 'rGREAT'
library(rGREAT)
# the peak assembly is "hg38", and 'liftOver' is needed for conversion
library(liftOver)
# 'rbokeh' is required for an HTML report generation
library(rbokeh)

K562_CEBPB_exclusivePeak_func <- greatAnnotate(peaks = K562_CEBPB_exclusive_peak, 
                                               return_annotation = TRUE, 
                                               return_html_report = TRUE)
#> Start greatAnnotate ...
#> ... ... You chose to return annotated results in a data.frame.
#> ... ... You chose to return an HTML report.
#> ... ... assembly is hg38. Now converting to hg19 using liftOver...
#> ... ... number of the original input regions is 721
#> ... ... number of the regions successfully converted to hg19 is 721
#> ... ... start GREAT analysis
#> ... ... An html report has been generated as 'greatAnnotate_result.html'!
#> ... ... The annotation results have been returned in a data.frame!

head(K562_CEBPB_exclusivePeak_func)
#>   category         ID                             name
#> 1       MF GO:0005488                          binding
#> 2       MF GO:0005515                  protein binding
#> 3       BP GO:0065007            biological regulation
#> 4       BP GO:0050794   regulation of cellular process
#> 5       BP GO:0019222  regulation of metabolic process
#> 6       BP GO:0050789 regulation of biological process
#>   number_of_targeting_genes adjusted_pvalue
#> 1                       817    0.0001415278
#> 2                       497    0.0007221064
#> 3                       653    0.0059313854
#> 4                       602    0.0059313854
#> 5                       386    0.0072758686
#> 6                       628    0.0072758686
knitr::include_graphics("../inst/extdata/greatAnnotate.png")

Connect with TFBSTools

TFregulomeR is not working alone. We have built a function allowing conversion of the motif matrix in TFregulomeR warehouse to the subclass PFMatrix in TFBSTools, using toTFBSTools.

library(TFBSTools)
K562_CEBPB_TFBS_PFM <- toTFBSTools(id = "MM1_HSA_K562_CEBPB")
K562_CEBPB_TFBS_PFM
#> An object of class PFMatrix
#> ID: MM1_HSA_K562_CEBPB
#> Name: CEBPB
#> Matrix Class: Unknown
#> strand: *
#> Tags: 
#> list()
#> Background: 
#>      A      C      G      T 
#> 0.2717 0.2283 0.2283 0.2717 
#> Matrix: 
#>   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#> A   93  305    0    0   49    0  280   91  290   533    15
#> C   61   74    0    0    0  533   59  205  242     0   215
#> G  192  141    0    0  390    0  139    3    0     0    73
#> T  187   13  533  533   94    0   55  234    1     0   230


linquynus/TFregulomeR-dev documentation built on April 23, 2022, 11:11 a.m.