shaman_score_hic_track: generate a score hic track based on observed and expected...

shaman_score_hic_track    R Documentation

Generate a score Hi-C track based on observed and expected (shuffled) Hi-C data

Description

shaman_score_hic_track

Usage

shaman_score_hic_track(
  track_db,
  work_dir,
  score_track_nm,
  obs_track_nms,
  exp_track_nms = paste0(obs_track_nms, "_shuffle"),
  points_track_nms = obs_track_nms,
  near_cis = 5000000,
  expand = 2000000,
  k = 100,
  max_jobs = 100
)

Arguments

track_db

Directory of the misha database.

work_dir

Centralized directory to store temporary files.

score_track_nm

Score track that will be created.

obs_track_nms

Names of observed 2D genomic tracks holding the Hi-C data. Pooling of multiple observed tracks is supported.

exp_track_nms

Names of expected (shuffled) 2D genomic tracks. Pooling of multiple expected tracks is supported.

points_track_nms

Names of 2D genomic tracks that contain points on which to compute normalized score. Pooling points from multiple tracks is supported.

near_cis

Size of each matrix in the grid, in genomic coordinates.

expand

Size of the expansion: points within this distance outside the matrix are included so that the score is computed accurately. Note that for each observed point, its k-nearest neighbors must be included in the expanded matrix.

k

The number of neighbor distances used for the score. For higher resolution maps, increase k. For lower resolution maps, decrease k.

max_jobs

Maximal number of qsub jobs.
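
To make the relationship between near_cis and expand more concrete, the sketch below tiles a single chromosome into windows of width near_cis and pads each window by expand on both sides. This is only an illustration under stated assumptions: tile_chrom is a hypothetical helper that is not part of shaman, the chromosome length is invented, and the tiling that shaman actually performs (in 2D, over all chromosomes) may differ.

```r
# Hypothetical helper, NOT part of shaman: tile one chromosome into windows
# of width near_cis and pad each window by expand on both sides, clipped to
# the chromosome bounds.
tile_chrom <- function(chrom_len, near_cis = 5e6, expand = 2e6) {
  starts <- seq(0, chrom_len - 1, by = near_cis)
  ends <- pmin(starts + near_cis, chrom_len)
  data.frame(
    start = starts,                          # window scored by one job
    end = ends,
    exp_start = pmax(starts - expand, 0),    # padded window supplying neighbors
    exp_end = pmin(ends + expand, chrom_len)
  )
}

# Made-up chromosome length of 12 Mb, default near_cis and expand
tiles <- tile_chrom(12e6)
print(tiles)
```

In this picture, a larger near_cis yields fewer, larger jobs, while expand controls how many surrounding points each job can draw neighbors from.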

Details

This function generates a 2D score track based on observed and expected Hi-C data. The score is computed by generating a grid of small matrices spanning all chromosomes and computing the score of each matrix independently. The model for computing the score relies on the KS D statistic computed for each observed point, comparing the distances to its k-nearest neighbors in the observed data with those in the expected data. High scores represent contact enrichment, while low scores depict insulation.

Note that this function requires either sge (qsub) or multicore to compute in a timely manner. These modes are configured via the shaman.sge_support or shaman.mc_support parameters in the shaman.conf file. Score computation on 1 billion reads on a distributed system may take 4-10 hours (with default parameters), depending on the number of cores available.
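
The scoring idea described above can be sketched with base R alone. The example below is a toy illustration and not shaman's implementation: the points are simulated, knn_dist and toy_score are hypothetical helpers, and the sign convention (positive when observed neighbors lie closer than expected ones) is an assumption made here to mirror the statement that high scores represent enrichment.

```r
# Toy sketch, NOT shaman's implementation: base R only, simulated 2D points.
set.seed(1)
obs <- cbind(x = runif(2000, 0, 1e6), y = runif(2000, 0, 1e6))
# Add a dense cluster to the observed points to mimic contact enrichment
obs <- rbind(obs, cbind(x = rnorm(300, 5e5, 1e4), y = rnorm(300, 5e5, 1e4)))
# "Expected" points: same total count, uniformly spread
expd <- cbind(x = runif(2300, 0, 1e6), y = runif(2300, 0, 1e6))

# Euclidean distances from a query location to its k nearest points
knn_dist <- function(pts, query, k) {
  d <- sqrt((pts[, 1] - query[1])^2 + (pts[, 2] - query[2])^2)
  sort(d)[seq_len(k)]
}

# KS D statistic between observed and expected k-nearest-neighbor distances,
# signed by an assumed convention: positive if observed neighbors are closer.
toy_score <- function(query, obs, expd, k = 100) {
  d_obs <- knn_dist(obs, query, k)
  d_exp <- knn_dist(expd, query, k)
  D <- unname(ks.test(d_obs, d_exp)$statistic)
  sign(mean(d_exp) - mean(d_obs)) * D
}

score_in_cluster <- toy_score(c(5e5, 5e5), obs, expd, k = 100)
print(score_in_cluster)
```

A larger k compares more neighbor distances per point, which is consistent with the advice to increase k for higher resolution maps, where more points are available around each location.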

Each step writes the matrix scores to temporary files, which are then joined into a track. The temporary files are deleted once the track is created.

Examples


# The example below runs on the test misha db provided with shaman.
# Note that this is a toy db sampled from K562 ela data -
# scoring based on the observed and expected tracks will not produce the score track,
# as most of the genome is missing (you will see message: number of points in focus interval < 1000)
# options(shaman.sge_support=1) #configuring sge engine mode - preferred
options(shaman.mc_support = 1) # configuring multi-core mode
if (gtrack.exists("hic_score_new")) {
    gtrack.rm("hic_score_new", force = TRUE)
    gdb.reload()
}
ret <- shaman_score_hic_track(shaman_get_test_track_db(),
    work_dir = tempdir(), # tempdir() is suitable only in multi-core mode. For sge mode, work_dir must be accessible by all jobs.
    score_track_nm = "hic_score_new",
    obs_track_nms = "hic_obs",
    exp_track_nms = "hic_exp",
    near_cis = 1e09, # this test db contains very little data, can increase the size of each job
    max_jobs = parallel::detectCores()
) # increase number of jobs for optimal runtime when running in sge mode
gdb.reload()
gtrack.ls("hic_score_new") # lists the new score track, if it was created

tanaylab/shaman documentation built on April 2, 2022, 1:32 a.m.