estimate_distance_parameter | R Documentation |
Function to calculate distance penalty parameter (distance_parameter)
for random genomic windows. Used to choose distance_parameter to pass
to generate_cicero_models.
estimate_distance_parameter(
  cds,
  window = 5e+05,
  maxit = 100,
  s = 0.75,
  sample_num = 100,
  distance_constraint = 250000,
  distance_parameter_convergence = 1e-22,
  max_elements = 200,
  genomic_coords = cicero::human.hg19.genome,
  max_sample_windows = 500
)
cds | A cicero CDS object generated using |
window | Size of the genomic window to query, in base pairs. |
maxit | Maximum number of iterations for distance_parameter estimation. |
s | Power law value. See details for more information. |
sample_num | Number of random windows to calculate |
distance_constraint | Maximum distance of expected connections. Must be smaller than |
distance_parameter_convergence | Convergence step size for |
max_elements | Maximum number of elements per window allowed. Prevents very large models from slowing performance. |
genomic_coords | Either a data frame or a path (character) to a file with chromosome lengths. The file should have two columns: the first is the chromosome name (e.g. "chr1") and the second is the chromosome length in base pairs. See |
max_sample_windows | Maximum number of random windows to screen to find sample_num windows for distance calculation. Default 500. |
The purpose of this function is to calculate the distance scaling
parameter used to adjust the distance-based penalty function used in
Cicero's model calculation. The scaling parameter, in combination with the
power law value s, determines the distance-based penalty.
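As a rough illustration of how the scaling parameter and s combine, the sketch below assumes the penalty form described in the Cicero publication, rho = (1 - (x_min / d)^s) * distance_parameter, with x_min = 1000 bp. Neither this formula nor the helper name distance_penalty appears on this page, so both should be read as assumptions rather than as the package's implementation.

```r
# Hypothetical helper, not part of the cicero API.
# Assumed penalty form: rho = (1 - (x_min / d)^s) * distance_parameter
distance_penalty <- function(d, distance_parameter, s = 0.75, x_min = 1000) {
  rho <- (1 - (x_min / d)^s) * distance_parameter
  rho[!is.finite(rho)] <- 0  # d = 0 yields -Inf; treat as no penalty
  rho[rho < 0] <- 0          # pairs closer than x_min are not penalized
  rho
}

# Distances between pairs of sites, in base pairs.
d <- c(0, 500, 1000, 10000, 250000)
distance_penalty(d, distance_parameter = 1)
```

Under this assumed form the penalty grows with distance and approaches distance_parameter for very distant pairs, so a larger distance_parameter suppresses long-range connections more strongly.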
This function chooses random windows of the genome and calculates a
distance_parameter. The function returns a vector of values
calculated on these random windows. We recommend using the mean value of
this vector moving forward with Cicero analysis.
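A minimal sketch of the recommended averaging step follows. The numeric values are made up for illustration, and the names cicero_cds and sample_genome in the commented call are assumed to exist in the session; only base R is executed.

```r
# Illustrative values only; in practice this would be the output of
# estimate_distance_parameter().
distance_parameters <- list(0.52, 0.61, 0.47, 0.58, 0.55)

# unlist() works whether the result is a list or a plain numeric vector.
mean_distance_parameter <- mean(unlist(distance_parameters))

# The mean would then be passed on, together with the same s, e.g.:
# models <- generate_cicero_models(cicero_cds,
#                                  distance_parameter = mean_distance_parameter,
#                                  s = 0.75,
#                                  genomic_coords = sample_genome)
```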
The function works by finding the minimum distance scaling parameter such
that no more than 5% of pairs of sites at a distance greater than
distance_constraint have non-zero entries after graphical lasso
regularization and such that fewer than 80% of all output entries are
nonzero.
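The acceptance criteria can be sketched as a check on a regularized matrix, reading the two thresholds as 5% (long-range pairs) and 80% (all entries). The helper passes_sparsity_check is hypothetical, and which matrix Cicero actually inspects is an assumption here; w simply stands for the graphical lasso output.

```r
# Hypothetical helper, not part of the cicero API.
# w:        regularized matrix from graphical lasso (sites x sites)
# dist_mat: pairwise distances between sites, in base pairs
passes_sparsity_check <- function(w, dist_mat, distance_constraint = 250000) {
  long_range <- dist_mat > distance_constraint
  if (!any(long_range)) return(NA)  # no long-range pairs to evaluate
  frac_long_nonzero <- sum(w[long_range] != 0) / sum(long_range)
  frac_nonzero <- sum(w != 0) / length(w)
  frac_long_nonzero <= 0.05 && frac_nonzero < 0.80
}

# Toy window with five sites spaced 100 kb apart.
pos <- c(0, 100000, 200000, 300000, 400000)
dist_mat <- as.matrix(dist(pos))

# Toy output: only the diagonal and one short-range pair are non-zero.
w <- diag(5)
w[1, 2] <- w[2, 1] <- 0.3
passes_sparsity_check(w, dist_mat)
```

In this picture, estimation amounts to searching for the smallest distance_parameter whose regularized output still passes such a check.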
If the chosen random window has fewer than 2 or greater than max_elements
sites, the window is skipped. In addition, the random window will be
skipped if there are insufficient long-range comparisons (see below) to be
made. The max_elements parameter exists to prevent very dense windows from
slowing the calculation. If you expect that your data may regularly have
this many sites in a window, you will need to raise this parameter.
Calculating the distance_parameter in a sample window requires peaks in
that window that are at a distance greater than the distance_constraint
parameter. If not enough examples at high distance have been found, the
function will return the warning "Warning: could not calculate sample_num
distance_parameters - see documentation details".
When looking for sample_num example windows, the function will search
max_sample_windows windows. By default this is set at 500, which should be
well beyond the 100 windows that need to be found. However, in very sparse
datasets, increasing max_sample_windows may help avoid the above warning.
Increasing max_sample_windows may slow performance in sparse datasets. If
you are still not able to get enough example windows, even with a large
max_sample_windows parameter, this may mean your window parameter needs to
be larger or your distance_constraint parameter needs to be smaller. A
less likely possibility is that your max_elements parameter needs to be
larger. This would occur if your data is particularly dense.
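The window screening described above can be sketched on simulated peak positions. The helper window_is_usable is hypothetical, and treating a single pair of peaks farther apart than distance_constraint as "sufficient" is an assumption; this page does not state the exact requirement.

```r
# Hypothetical helper, not part of the cicero API.
# peak_pos: peak midpoints (bp) on a single chromosome
window_is_usable <- function(peak_pos, win_start, window = 5e5,
                             distance_constraint = 250000,
                             max_elements = 200) {
  in_win <- peak_pos[peak_pos >= win_start & peak_pos < win_start + window]
  n <- length(in_win)
  # Skip windows with fewer than 2 or more than max_elements sites.
  if (n < 2 || n > max_elements) return(FALSE)
  # Assumed rule: at least one pair must exceed distance_constraint.
  (max(in_win) - min(in_win)) > distance_constraint
}

set.seed(1)
peak_pos <- sort(sample.int(5e6, 300))   # simulated sparse peaks
starts <- sample.int(5e6 - 5e5, 500)     # screen 500 candidate windows
usable <- vapply(starts,
                 function(s0) window_is_usable(peak_pos, s0),
                 logical(1))
sum(usable)  # candidate windows that survive screening
```

With sparser simulated peaks fewer windows survive, which mirrors why screening more windows, enlarging window, or lowering distance_constraint may help avoid the warning.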
The parameter s is a constant that captures the power-law distribution of
contact frequencies between different locations in the genome as a
function of their linear distance. For a complete discussion of the
various polymer models of DNA packed into the nucleus and of justifiable
values for s, we refer readers to Dekker et al. (2013). We use a value of
0.75 by default in Cicero, which corresponds to the “tension globule”
polymer model of DNA (Sanborn et al., 2015). This parameter must be the
same as the s parameter for generate_cicero_models.
Further details are available in the publication that accompanies this
package. Run citation("cicero") for publication details.
A list of results of length sample_num. List members are numeric
distance_parameter values.
Dekker, J., Marti-Renom, M.A., and Mirny, L.A. (2013). Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14, 390–403.
Sanborn, A.L., Rao, S.S.P., Huang, S.-C., Durand, N.C., Huntley, M.H., Jewett, A.I., Bochkov, I.D., Chinnappan, D., Cutkosky, A., Li, J., et al. (2015). Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc. Natl. Acad. Sci. U. S. A. 112, E6456–E6465.
generate_cicero_models
# Load the example data and the hg19 chromosome lengths
data("cicero_data")
data("human.hg19.genome")
# Restrict to chr18 and set its length to 100,000 bp
sample_genome <- subset(human.hg19.genome, V1 == "chr18")
sample_genome$V2[1] <- 100000
# Build the input CDS and reduce dimensions with tSNE
input_cds <- make_atac_cds(cicero_data, binarize = TRUE)
input_cds <- reduceDimension(input_cds, max_components = 2, num_dim = 6,
                             reduction_method = 'tSNE',
                             norm_method = "none")
tsne_coords <- t(reducedDimA(input_cds))
row.names(tsne_coords) <- row.names(pData(input_cds))
# Aggregate cells into a cicero CDS, then estimate on 5 sample windows
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords)
distance_parameters <- estimate_distance_parameter(cicero_cds,
                                                   sample_num = 5,
                                                   genomic_coords = sample_genome)