compute_near_integrations: Scans input matrix to find and merge near integration sites.
In calabrialab/ISAnalytics: Analyze gene therapy vector insertion sites data identified from genomics next generation sequencing reads for clonal tracking studies

View source: R/recalibration-functions.R

Description

This function scans the input integration matrix to detect integration sites that are too "near" to each other and merges them into single integration sites, adjusting their values if needed.

Usage

compute_near_integrations(
  x,
  threshold = 4,
  is_identity_tags = c("chromosome", "is_strand"),
  keep_criteria = c("max_value", "keep_first"),
  value_columns = c("seqCount", "fragmentEstimate"),
  max_value_column = "seqCount",
  sample_id_column = pcr_id_column(),
  additional_agg_lambda = list(.default = default_rec_agg_lambdas()),
  max_workers = 4,
  map_as_file = TRUE,
  file_path = default_report_path(),
  strand_specific = lifecycle::deprecated()
)

Arguments

`x`	An integration matrix
`threshold`	A single integer: the minimum distance, in bases, at which two integrations are considered distinct. For example, if the threshold is set to 3 and the fields `chr` and `strand` are the same, integration sites that have at least 3 bases between them are considered distinct.
`is_identity_tags`	Character vector of tags that identify the integration event as distinct (except for `"locus"`). See details.
`keep_criteria`	Which integration should be kept while scanning. The 2 possible choices are: "max_value": keep the integration site that has the highest value (and collapse other values on that integration). "keep_first": keep the first integration.
`value_columns`	Character vector, contains the names of the numeric experimental columns
`max_value_column`	The column that has to be considered for searching the maximum value
`sample_id_column`	The name of the column containing the sample identifier
`additional_agg_lambda`	A named list containing aggregating functions for additional columns. See details.
`max_workers`	Maximum parallel workers allowed
`map_as_file`	Produce recalibration map as a .tsv file?
`file_path`	String representing the path where the file will be saved. Must be a folder. Relevant only if `map_as_file` is `TRUE`.
`strand_specific`	Deprecated, use `is_identity_tags`

Details

The concept of "near"

An integration event is uniquely identified by all fields specified in the mandatory_IS_vars() look-up table. It is possible to find IS that are formally distinct (different combinations of values in the fields) but that should not be considered distinct in practice, since they represent the same integration event. This may be due to artefacts at the putative locus of the IS in the merging of multiple sequencing libraries.

We say that an integration event IS1 is near to another integration event IS2 if the absolute difference of their loci is strictly lower than the set threshold.
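
The definition above can be written as a one-line predicate. This is a minimal base R sketch for illustration only, not the package's implementation; `is_near` is a hypothetical helper name, and the default of 4 simply mirrors the default `threshold` in the usage section.

```r
# Hypothetical helper (not part of ISAnalytics): TRUE when two loci are
# "near", i.e. their absolute difference is strictly lower than the threshold.
is_near <- function(locus1, locus2, threshold = 4) {
  abs(locus1 - locus2) < threshold
}

is_near(14568, 14567)  # TRUE: a difference of 1 is below the threshold
is_near(14568, 14572)  # FALSE: a difference of 4 is not strictly lower than 4
```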

The IS identity

There is also another aspect to be considered. Since the algorithm is based on a sliding window mechanism, on which groups of IS should we set and slide the window?

By default, we have 3 fields in the mandatory_IS_vars(): chr, integration_locus, strand, and we assume that all the fields contribute to the identity of the IS. This means that IS1 and IS2 can be compared only if they have the same chromosome and the same strand. However, if we would like to exclude the strand of the integration from our considerations then IS1 and IS2 can be selected from all the events that fall on the same chromosome. A practical example:

IS1 = ⁠(chr = "1", strand = "+", integration_locus = 14568)⁠

IS2 = ⁠(chr = "1", strand = "-", integration_locus = 14567)⁠

If is_identity_tags = c("chromosome", "is_strand"), IS1 and IS2 are considered distinct because they differ in strand, so no correction is applied to the locus of either. If is_identity_tags = c("chromosome"), IS1 and IS2 are considered near because the strand is irrelevant, so one of the 2 IS will change locus.
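
The effect of the identity tags on which IS can be compared can be illustrated with a small base R sketch. It assumes a plain data frame with columns named `chr`, `strand` and `integration_locus`, and a hypothetical helper `comparable_groups`; it shows only the grouping idea, not how the package builds its windows.

```r
# Toy data holding the two integration sites from the example above.
is_df <- data.frame(
  chr = c("1", "1"),
  strand = c("+", "-"),
  integration_locus = c(14568, 14567),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split rows into groups that may be compared,
# using the chosen identity columns as the grouping key.
comparable_groups <- function(df, identity_cols) {
  key <- interaction(df[identity_cols], drop = TRUE)
  split(df, key)
}

# Chromosome and strand both contribute to identity: each IS ends up in
# its own group, so IS1 and IS2 are never compared.
by_chr_strand <- comparable_groups(is_df, c("chr", "strand"))
length(by_chr_strand)

# Only the chromosome contributes: both IS share one group, so a window
# can slide over them together.
by_chr <- comparable_groups(is_df, "chr")
length(by_chr)
```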

Aggregating near IS

IS that fall in the same interval are evaluated according to the criterion selected - if recalibration is necessary, rows with the same sample ID are aggregated in a single row with a quantification value that is the sum of all the merged rows.
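
A toy version of this merge step, in base R, might look like the following. The column name `sample_id` and the way the "highest value" is picked (the single largest row) are assumptions made for the sketch; the package's internal ranking is not described here.

```r
# Toy rows that fall in the same window (same chromosome and strand assumed).
near_rows <- data.frame(
  integration_locus = c(100, 102, 102),
  sample_id = c("s1", "s1", "s2"),
  seqCount = c(10, 50, 5)
)

# "max_value"-like choice: keep the locus of the row with the highest value.
kept_locus <- near_rows$integration_locus[which.max(near_rows$seqCount)]

# Recalibrate every row to the kept locus, then sum the quantification
# for rows that share the same sample ID.
near_rows$integration_locus <- kept_locus
merged <- aggregate(
  seqCount ~ integration_locus + sample_id,
  data = near_rows,
  FUN = sum
)
merged
```

In this sketch the two rows of sample `s1` collapse into one row with the summed value, while `s2` keeps its single row at the recalibrated locus.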

If the input integration matrix contains annotation columns, that is, additional columns that are not:

part of the mandatory IS vars (see mandatory_IS_vars())
part of the annotation IS vars (see annotation_IS_vars())
the sample identifier column
the quantification column

it is possible to specify how they should be aggregated. Defaults are provided for each column type (character, integer, numeric...), but custom functions can be specified as a named list, where names are column names in x and values are functions to be applied. NOTE: functions must be purrr-style lambdas and must perform some kind of aggregating operation, that is, they must take a vector as input and return a single value. The type of the output should match the type of the target column. If you specify custom lambdas, provide defaults in the special element .defaults. Example:

list(
  numeric_col = ~ sum(.x),
  char_col = ~ paste0(.x, collapse = ", "),
  .defaults = default_rec_agg_lambdas()
)
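
To check that a custom lambda really aggregates (vector in, single value out, same type), it can help to try it on a small vector first. The sketch below uses base R only: `apply_lambda` is a hypothetical helper that evaluates the right-hand side of a one-sided formula with `.x` bound to the input, which approximates how purrr-style lambdas behave. It is not how ISAnalytics applies them internally.

```r
# Hypothetical helper: evaluate the right-hand side of a one-sided
# formula with `.x` bound to the given vector (base R only).
apply_lambda <- function(lambda, values) {
  eval(lambda[[2]], list(.x = values), environment(lambda))
}

lambdas <- list(
  numeric_col = ~ sum(.x),
  char_col = ~ paste0(.x, collapse = ", ")
)

num_result <- apply_lambda(lambdas$numeric_col, c(1, 2, 3))
chr_result <- apply_lambda(lambdas$char_col, c("a", "b"))

# Each result should be a single value whose type matches the input column.
length(num_result) == 1 && is.numeric(num_result)
length(chr_result) == 1 && is.character(chr_result)
```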

Value

An integration matrix with the same number of rows or fewer

Required tags

The function will explicitly check for the presence of these tags:

chromosome
locus
is_strand
gene_symbol

Note

We recommend using this function in combination with comparison_matrix to automatically perform recalibration on all quantification matrices.

Examples

data("integration_matrices", package = "ISAnalytics")
rec <- compute_near_integrations(
    x = integration_matrices, map_as_file = FALSE
)
head(rec)

calabrialab/ISAnalytics documentation built on Dec. 10, 2024, 10:50 p.m.
