assemble_training_data: Assembles TOP training data for all TF x cell type...
In HarteminkLab/TOP: Predict transcription factor occupancy using DNase- or ATAC-seq data

assemble_training_data

R Documentation

Assembles TOP training data for all TF x cell type combinations, then splits the training data into 10 partitions

Description

Prepares the training data for fitting TOP models. It splits the training data into `n_partitions` partitions (default: 10) and, for each partition, assembles the training data for all TF x cell type combinations.

Usage

assemble_training_data(
  tf_cell_table,
  logistic_model = FALSE,
  chip_col = "chip",
  training_chrs = paste0("chr", seq(1, 21, 2)),
  n_partitions = 10,
  n_cores = n_partitions,
  max_sites = 50000,
  seed = 1
)

Arguments

`tf_cell_table`	A data frame listing all TF x cell type combinations and the training data for each combination. It should have at least three columns: TF names, cell types, and file names of the individual training data for each TF x cell type combination. The individual training data should be in .rds or text (.txt or .csv) format.
`logistic_model`	Logical. If `logistic_model = TRUE`, prepare assembled data for the logistic version of TOP model. If `logistic_model = FALSE`, prepare assembled data for the quantitative occupancy model (default).
`chip_col`	The column name of the ChIP data in the individual training data (default: `"chip"`).
`training_chrs`	Chromosomes used for training the model (default: odd chromosomes, chr1, chr3, ..., chr21)
`n_partitions`	Number of partitions to split the training data into (default: 10).
`n_cores`	Number of cores to run in parallel (default: equal to `n_partitions`).
`max_sites`	Maximum number of candidate sites to keep for each TF x cell type combination (default: 50000). To reduce computation time, if the number of candidate sites for a combination exceeds `max_sites`, `max_sites` candidate sites are randomly selected for that combination.
`seed`	A number for the seed used when sampling sites (default: 1).
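
As a minimal sketch of the inputs described above, the following builds a small `tf_cell_table` and spells out the default `training_chrs`. The column names `tf_name`, `cell_type`, and `data_file` and the file names are taken from the table shown in the Examples section; they are illustrative, since the documentation only requires three columns holding TF names, cell types, and file names.

```r
# Illustrative input table; column and file names follow the Examples section
# and are placeholders, not files shipped with the package.
tf_cell_table <- data.frame(
  tf_name   = c("CTCF", "CTCF", "CTCF"),
  cell_type = c("K562", "A549", "GM12878"),
  data_file = c("CTCF.K562.data.rds",
                "CTCF.A549.data.rds",
                "CTCF.GM12878.data.rds"),
  stringsAsFactors = FALSE
)

# The default value of `training_chrs`: odd chromosomes chr1, chr3, ..., chr21.
training_chrs <- paste0("chr", seq(1, 21, 2))
print(training_chrs)
```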

Value

A list of `n_partitions` data frames (default: 10), each containing one partition of the training data for all TF x cell type combinations.
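
To make the shape of the returned value concrete, here is a toy sketch in base R of the steps the arguments describe: keep sites on the training chromosomes, randomly down-sample to `max_sites`, and split the result into `n_partitions` data frames. This is not the package's implementation. The toy data, the `chr` column name, and the random assignment of sites to partitions are assumptions made for illustration only.

```r
# Toy candidate sites: 20 sites on each of chr1..chr22 (hypothetical data).
toy_sites <- data.frame(
  chr  = rep(paste0("chr", 1:22), each = 20),
  chip = seq_len(22 * 20),
  stringsAsFactors = FALSE
)

training_chrs <- paste0("chr", seq(1, 21, 2))
max_sites     <- 100   # small value so the down-sampling step is exercised
n_partitions  <- 10

set.seed(1)

# Keep only sites on the training chromosomes.
train <- toy_sites[toy_sites$chr %in% training_chrs, ]

# Randomly keep at most `max_sites` sites.
if (nrow(train) > max_sites) {
  train <- train[sample(nrow(train), max_sites), ]
}

# Assumed scheme: shuffle balanced partition labels, then split by label.
partition_id <- sample(rep(seq_len(n_partitions), length.out = nrow(train)))
partitions   <- split(train, partition_id)

length(partitions)        # number of partitions
sapply(partitions, nrow)  # sites per partition
```

The result is a plain list, so individual partitions can be accessed with `partitions[[1]]` or processed with `lapply()`.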

Examples

## Not run: 

#  tf_cell_table should have three columns with:
#  TF names, cell types, and paths to the training data files, like:
#  |   tf_name    |   cell_type   |        data_file         |
#  |:------------:|:-------------:|:------------------------:|
#  |     CTCF     |     K562      |   CTCF.K562.data.rds     |
#  |     CTCF     |     A549      |   CTCF.A549.data.rds     |
#  |     CTCF     |    GM12878    |   CTCF.GM12878.data.rds  |
#  |     ...      |     ...       |   ...                    |

# Assembles training data for the quantitative occupancy model,
# uses odd chromosomes for training, keeps at most 50000 candidate sites for
# each TF x cell type combination, and splits training data into 10 partitions.
assembled_training_data <- assemble_training_data(tf_cell_table,
                                                  logistic_model = FALSE,
                                                  chip_col = 'chip',
                                                  training_chrs = paste0('chr', seq(1, 21, 2)),
                                                  n_partitions = 10,
                                                  max_sites = 50000)

# Assembles training data for the logistic version of the model
assembled_training_data <- assemble_training_data(tf_cell_table,
                                                  logistic_model = TRUE,
                                                  chip_col = 'chip_label',
                                                  training_chrs = paste0('chr', seq(1, 21, 2)),
                                                  n_partitions = 10,
                                                  max_sites = 50000)


## End(Not run)

HarteminkLab/TOP documentation built on June 11, 2025, 5:34 p.m.
