Create_many_Bricks: Create the entire HDF5 structure and load the bintable

View source: R/Brick_functions.R

Create_many_Bricks    R Documentation

Create the entire HDF5 structure and load the bintable

Description

Create_many_Bricks creates the HDF files and returns a BrickContainer.

Usage

Create_many_Bricks(
    BinTable,
    bin_delim = "\t",
    col_index = c(1, 2, 3),
    impose_discontinuity = TRUE,
    hdf_chunksize = NULL,
    output_directory = NA,
    file_prefix = NA,
    remove_existing = FALSE,
    link_existing = FALSE,
    experiment_name = NA,
    resolution = NA,
    type = c("both", "cis", "trans")
)

Arguments

BinTable

Required. A string containing the path to the file to load as the binning table for the Hi-C experiment. The number of entries per chromosome defines the dimensions of the associated Hi-C data matrices. For example, if chr1 contains 250 entries in the binning table, the cis Hi-C data matrix for chr1 is expected to contain 250 rows and 250 columns. Similarly, if the same binning table contains 150 entries for chr2, the trans Hi-C matrix for chr1,chr2 is expected to have 250 rows and 150 columns.

There are no constraints on the bintable format. As long as the table is in a delimited format, the relevant columns can be identified with the bin_delim and col_index parameters. The columns of importance are chr, start and end.
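The relationship between bin counts and matrix dimensions can be sketched in base R. This is only an illustration on a made-up binning table; the data frame, the helper expected_dim and the 100 kb bin width are assumptions, not part of HiCBricks.

```r
# Hypothetical toy binning table: 250 bins on chr1, 150 bins on chr2.
bintable <- data.frame(
    chr = c(rep("chr1", 250), rep("chr2", 150)),
    start = c(seq(1, by = 100000, length.out = 250),
        seq(1, by = 100000, length.out = 150)),
    stringsAsFactors = FALSE
)
bintable$end <- bintable$start + 99999

# Number of bins per chromosome, in order of first appearance.
chr_levels <- unique(bintable$chr)
bins_per_chr <- lengths(split(bintable$chr, factor(bintable$chr, levels = chr_levels)))

# Illustrative helper: expected (rows, cols) of the matrix for a chromosome pair.
expected_dim <- function(chr_a, chr_b) {
    c(rows = bins_per_chr[[chr_a]], cols = bins_per_chr[[chr_b]])
}

cis_dim <- expected_dim("chr1", "chr1")    # chr1 vs chr1
trans_dim <- expected_dim("chr1", "chr2")  # chr1 vs chr2
```

Here cis_dim holds the dimensions expected for the chr1 cis matrix and trans_dim those for the chr1,chr2 trans matrix.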

It is recommended to always use binning tables where the end and start of consecutive ranges are not the same. If they are the same, this may lead to unexpected behaviour when using the GenomicRanges "any" overlap function.

bin_delim

Optional. Defaults to tabs. A character vector of length 1 specifying the delimiter used in the file containing the binning table.

col_index

Optional. Default c(1, 2, 3). A numeric vector of length 3 containing the indexes of the required columns in the binning table. The first index corresponds to the chr column, the second to the start column and the third to the end column.
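To illustrate how bin_delim and col_index describe a table whose columns are not in the default positions, here is a minimal base R sketch. It does not reproduce how the package reads the file internally; the file contents and the variable names raw_table and ranges are assumptions made for the example.

```r
# Hypothetical binning table with an extra leading ID column,
# so chr, start and end sit in columns 2, 3 and 4.
bin_file <- tempfile(fileext = ".bins")
writeLines(c(
    "bin1\tchr1\t1\t100000",
    "bin2\tchr1\t100001\t200000",
    "bin3\tchr2\t1\t100000"
), bin_file)

bin_delim <- "\t"
col_index <- c(2, 3, 4)

# Read the delimited file, then keep only the columns named by col_index,
# in the order chr, start, end.
raw_table <- read.table(bin_file, sep = bin_delim, header = FALSE,
    stringsAsFactors = FALSE)
ranges <- raw_table[, col_index]
colnames(ranges) <- c("chr", "start", "end")
```

With such a file, the equivalent call would pass bin_delim = "\t" and col_index = c(2, 3, 4) to Create_many_Bricks.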

impose_discontinuity

Optional. Default TRUE. If TRUE, a check is run to make sure that, within each chromosome, the end coordinate of one entry and the start coordinate of the next entry are not the same.
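A binning table can be screened before calling the function with a check along these lines. This is a sketch of the idea, not the package's own implementation; the helper name has_shared_boundaries is hypothetical, and it assumes the table is sorted by position within each chromosome.

```r
# Illustrative helper: TRUE if any consecutive pair of bins on the same
# chromosome has a start that is not strictly greater than the previous end
# (this covers identical boundaries as well as larger overlaps).
has_shared_boundaries <- function(chr, start, end) {
    n <- length(chr)
    if (n < 2) {
        return(FALSE)
    }
    same_chr <- chr[-1] == chr[-n]
    any(same_chr & (start[-1] <= end[-n]))
}

chr <- c("chr1", "chr1", "chr2")

# Continuous table: the first bin ends where the second begins.
continuous <- has_shared_boundaries(chr,
    start = c(0, 100000, 0), end = c(100000, 200000, 100000))

# Discontinuous table: the start is shifted by 1 relative to the previous end.
discontinuous <- has_shared_boundaries(chr,
    start = c(1, 100001, 1), end = c(100000, 200000, 100000))
```

The first table would be expected to fail the default check, while the second would pass it.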

hdf_chunksize

Optional. A numeric vector of length 1. If provided, the HDF dataset will use this value as the chunk size, for all matrices. By default, the ChunkSize is set to matrix dimensions/100.

output_directory

Required. A string specifying the location where the HDF files will be created.

file_prefix

Required. A string specifying the prefix that is prepended to the names of the HDF files stored in the output_directory.

remove_existing

Optional. Default FALSE. If TRUE, will remove the HDF file with the same name and create a new one. By default, it will not replace existing files.

link_existing

Optional. Default FALSE. If TRUE, will re-add the HDF file with the same name instead of creating a new one.

experiment_name

Optional. If provided, this will be the experiment name for the BrickContainer.

resolution

Required. A value of length 1 of class character or numeric specifying the resolution of the Hi-C data loaded.

type

Optional. Default "both". One of "both", "cis" or "trans", specifying the type of matrices to load. "both" loads the cis (intra-chromosomal, e.g. chr1 vs chr1) and the trans (inter-chromosomal, e.g. chr1 vs chr2) Hi-C matrices, whereas "cis" and "trans" load only the cis or only the trans Hi-C matrices.
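The distinction between cis and trans pairs can be sketched in base R using the type values shown in the Usage section. The helper list_matrix_pairs is hypothetical, and it enumerates every ordered pair of chromosomes; whether the package stores both orientations of a trans pair is not stated here and is an assumption of the sketch.

```r
# Illustrative helper: list the chromosome pairs selected by each type value.
list_matrix_pairs <- function(chromosomes, type = c("both", "cis", "trans")) {
    type <- match.arg(type)
    pairs <- expand.grid(chr1 = chromosomes, chr2 = chromosomes,
        stringsAsFactors = FALSE)
    # A pair is cis when both members are the same chromosome.
    is_cis <- pairs$chr1 == pairs$chr2
    switch(type,
        both = pairs,
        cis = pairs[is_cis, ],
        trans = pairs[!is_cis, ])
}

chromosomes <- c("chr1", "chr2", "chr3")
both_pairs <- list_matrix_pairs(chromosomes)            # default is "both"
cis_pairs <- list_matrix_pairs(chromosomes, "cis")
trans_pairs <- list_matrix_pairs(chromosomes, "trans")
```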

Details

This function creates the complete HDF data structure, loads the binning table associated with the Hi-C experiment, creates a 2D matrix layout for all specified chromosome pairs and creates a JSON file for the project. At the end, this function returns an S4 object of class BrickContainer. Please note that the binning table must be a discontinuous one (first range end != second range start), because range overlaps of the "any" type will routinely report adjacent ranges with the same end and start as overlapping. This criterion is therefore enforced by default.

The structure of the HDF file is as follows: The structure contains three major groups which are then hierarchically nested with other groups to finally lead to the corresponding datasets.

  • Base.matrices - group, for storing Hi-C matrices

    • chromosome - group

    • chromosome - group

      • attributes - attribute

        • Filename - Name of the file

        • Min - min value of Hi-C matrix

        • Max - max value of Hi-C matrix

        • sparsity - specifies if this is a sparse matrix

        • distance - max distance of data from main diagonal

        • Done - specifies if a matrix has been loaded

      • matrix - dataset - contains the matrix

      • chr1_bin_coverage - dataset - proportion of row cells with values greater than 0

      • chr1_row_sums - dataset - total sum of all values in a row

      • chr2_col_sums - dataset - total sum of all values in a col

      • chr2_bin_coverage - dataset - proportion of col cells with values greater than 0

      • sparsity - dataset - proportion of non-zero cells near the diagonal

  • Base.ranges - group, Ranges tables for quick and easy access. Additional ranges tables are added here under separate group names.

    • Bintable - group - The main binning table associated to a Brick.

      • ranges - dataset - Contains the three main columns chr, start and end.

      • offsets - dataset - first occurrence of any given chromosome in the ranges dataset.

      • lengths - dataset - Number of occurrences of that chromosome

      • chr.names - dataset - What chromosomes are present in the given ranges table.

  • Base.metadata - group, A place to store metadata info

    • chromosomes - dataset - Metadata information specifying the chromosomes present in this particular Brick file.

    • other metadata tables.

Keep in mind that if the end coordinates and start coordinates of adjacent ranges are not separated by at least a value of 1, then impose_discontinuity = TRUE will likely cause an error. This may seem obnoxious, but GenomicRanges by default treats an overlap of 1 bp as an overlap. Therefore, to be certain that ranges which should not be retrieved are not targeted during retrieval operations, a check is run to make sure that adjacent ends and starts do not overlap. To load continuous ranges, use impose_discontinuity = FALSE.

Also note that col_index determines which columns to use for chr, start and end. The original binning table may therefore have 10 or 20 columns, but only the three columns named in col_index are used, in the order chr, start and end.
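The reasoning can be illustrated without GenomicRanges by treating each range as a closed interval, in which both the start and the end coordinate belong to the range. The helper closed_overlap is hypothetical and only mimics the 1 bp overlap behaviour described above.

```r
# Illustrative helper: two closed intervals overlap when each one starts
# no later than the other one ends.
closed_overlap <- function(start_a, end_a, start_b, end_b) {
    start_a <= end_b & start_b <= end_a
}

# Adjacent bins that share coordinate 100000 overlap by 1 bp.
shared <- closed_overlap(1, 100000, 100000, 200000)

# Shifting the second start by 1 removes the overlap.
separated <- closed_overlap(1, 100000, 100001, 200000)
```

With a shared boundary, a query falling in the first bin would also hit the second, which is what the default check guards against.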

Value

This function will generate the target Brick file. Upon completion, the function will return an object of class BrickContainer.

Examples

Bintable.path <- system.file(file.path("extdata", "Bintable_100kb.bins"), 
package = "HiCBricks")
out_dir <- file.path(tempdir(), "Creator_test")
dir.create(out_dir)
My_BrickContainer <- Create_many_Bricks(BinTable = Bintable.path, 
    bin_delim = " ", output_directory = out_dir, file_prefix = "Test", 
    experiment_name = "Vignette Test", resolution = 100000, 
    remove_existing = TRUE)


koustav-pal/HiCBlocks documentation built on Oct. 29, 2022, 8:17 a.m.