purity_filter: Filter integration sites based on purity.
In calabrialab/ISAnalytics: Analyze gene therapy vector insertion sites data identified from genomics next generation sequencing reads for clonal tracking studies

purity_filter

R Documentation

Description

Filters integration sites that may derive from contamination between cell lines, based on a numeric quantification (typically abundance or sequence count).

Usage

purity_filter(
  x,
  lineages = blood_lineages_default(),
  aggregation_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"),
  group_key = c("CellMarker", "Tissue"),
  selected_groups = NULL,
  join_on = "CellMarker",
  min_value = 3,
  impurity_threshold = 10,
  by_timepoint = TRUE,
  timepoint_column = "TimePoint",
  value_column = "seqCount_sum"
)

Arguments

`x`	An aggregated integration matrix, obtained via `aggregate_values_by_key()`
`lineages`	A data frame containing cell lineages information
`aggregation_key`	The key used for aggregating `x`
`group_key`	A character vector of column names for re-aggregation. Column names must be either in `x` or in `lineages`. See details.
`selected_groups`	Either NULL, a character vector or a data frame for group selection. See details.
`join_on`	Common columns to perform a join operation on
`min_value`	A minimum value to filter the input matrix. Integrations with a value strictly lower than `min_value` are excluded (dropped) from the output.
`impurity_threshold`	The ratio threshold for impurity in groups
`by_timepoint`	Should filtering be applied on each time point? If `FALSE`, all time points are merged together
`timepoint_column`	Column in `x` containing the time point
`value_column`	Column in `x` containing the numeric quantification of interest
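
As a rough illustration of how `min_value`, `impurity_threshold` and `by_timepoint` might interact, the base R sketch below works on made-up data. The `min_value` step follows the argument description above. The ratio step is an assumption (largest value observed for a site divided by the value in each group) that this page does not confirm. `sketch_purity()` and the coordinate columns `chr`, `integration_locus` and `strand` are hypothetical names used only for the example.

```r
# Toy data: one integration site seen in three groups at one time point.
# Column names mirror the defaults in Usage; the values are made up.
toy <- data.frame(
    chr = "1", integration_locus = 1000L, strand = "+",
    CellMarker = c("CD34", "CD14", "CD3"),
    TimePoint = "0030",
    seqCount_sum = c(500, 40, 2),
    stringsAsFactors = FALSE
)

# Hypothetical helper, NOT the package implementation.
sketch_purity <- function(x, min_value = 3, impurity_threshold = 10,
                          by_timepoint = TRUE,
                          timepoint_column = "TimePoint",
                          value_column = "seqCount_sum") {
    # Step 1 (from the argument description): drop rows whose value is
    # strictly lower than min_value
    x <- x[x[[value_column]] >= min_value, , drop = FALSE]
    # Step 2 (assumption): compare each row with the largest value found
    # for the same site, and for the same time point if by_timepoint is TRUE
    is_cols <- c("chr", "integration_locus", "strand")
    if (by_timepoint) is_cols <- c(is_cols, timepoint_column)
    site_id <- do.call(paste, c(unname(as.list(x[is_cols])), sep = "_"))
    site_max <- ave(x[[value_column]], site_id, FUN = max)
    ratio <- site_max / x[[value_column]]
    # Keep only rows whose ratio does not exceed the threshold
    x[ratio <= impurity_threshold, , drop = FALSE]
}

kept <- sketch_purity(toy)
print(kept)
```

In this toy case the CD3 row is dropped by `min_value`, and the CD14 row is dropped because 500 / 40 exceeds the threshold of 10, so only the CD34 row remains.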

Details

Setting input arguments

The input matrix can be re-aggregated using the group_key argument. This key contains the names of the columns to group on (in addition to the columns holding the genomic coordinates of the integration sites), and each column must be present in at least one of the x or lineages data frames. If not all key columns are found in x, a join with the lineages data frame is performed on the common column(s) specified in join_on.
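
The join and re-aggregation described above can be pictured with a small base R sketch. The lineage table here is a made-up stand-in for `blood_lineages_default()`, the coordinate columns are hypothetical, and summing the values per group is an assumption about how the re-aggregation combines rows.

```r
# Toy aggregated matrix: one site observed under three cell markers
toy <- data.frame(
    chr = "1", integration_locus = 1000L, strand = "+",
    CellMarker = c("CD13", "CD14", "CD3"),
    seqCount_sum = c(10, 20, 5),
    stringsAsFactors = FALSE
)
# Hypothetical lineage table (stand-in for blood_lineages_default())
toy_lineages <- data.frame(
    CellMarker = c("CD13", "CD14", "CD3"),
    HematoLineage = c("Myeloid", "Myeloid", "Lymphoid"),
    stringsAsFactors = FALSE
)
group_key <- "HematoLineage"
join_on <- "CellMarker"

# The key is not a column of the matrix, so bring it in from the lineages
if (!all(group_key %in% colnames(toy))) {
    toy <- merge(toy, toy_lineages, by = join_on)
}
# Re-aggregate per integration site and group (sum is an assumption)
regrouped <- aggregate(
    seqCount_sum ~ chr + integration_locus + strand + HematoLineage,
    data = toy, FUN = sum
)
print(regrouped)
```

Here the CD13 and CD14 rows collapse into a single "Myeloid" row, while CD3 becomes the only "Lymphoid" row.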

Group selection

The user can specify which groups the filter logic should be applied to. For example, with group_key = c("HematoLineage") and selected_groups = c("CD34", "Myeloid", "Lymphoid"), an integration is evaluated by the filter only in the groups whose "HematoLineage" value is "CD34", "Myeloid" or "Lymphoid". If the same integration is present in other groups, it is kept as is.

selected_groups can be set to NULL to apply the logic to every group present in the data frame. It can be a simple character vector, as in the example above, if the group key has length 1 (and there is no need to filter on time point). If the group key is longer than 1, the character vector is applied only to the first element of the key.
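
To make the character vector case concrete, here is a minimal base R sketch on made-up data. It only shows how rows might be split into those evaluated by the filter and those kept untouched. It is not the package code, and the names `to_process` and `kept_as_is` are hypothetical.

```r
# Toy data: site 1000 appears in three groups, site 2000 in one
toy <- data.frame(
    integration_locus = c(1000L, 1000L, 1000L, 2000L),
    HematoLineage = c("CD34", "Myeloid", "Erythroid", "Lymphoid"),
    seqCount_sum = c(100, 50, 4, 30),
    stringsAsFactors = FALSE
)
group_key <- c("HematoLineage")
selected_groups <- c("CD34", "Myeloid", "Lymphoid")

if (is.null(selected_groups)) {
    # NULL means every group goes through the filter
    to_process <- toy
    kept_as_is <- toy[0, , drop = FALSE]
} else {
    # A character vector is matched against the first element of the key
    in_selection <- toy[[group_key[1]]] %in% selected_groups
    to_process <- toy[in_selection, , drop = FALSE]
    kept_as_is <- toy[!in_selection, , drop = FALSE]
}
print(to_process)
print(kept_as_is)
```

The "Erythroid" row is outside the selection, so it would be returned unchanged regardless of what the filter decides for the other groups.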

If a more refined selection on groups is needed, a data frame can be provided instead:

group_key = c("CellMarker", "Tissue")
selected_groups = tibble::tribble(
    ~CellMarker, ~Tissue,
    "CD34", "BM",
    "CD14", "BM",
    "CD14", "PB"
)

Columns in the data frame should match the group key (plus, optionally, the time point column). In this example, only the groups identified by the rows of the provided data frame are processed.
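
Selecting groups with a data frame amounts to keeping the rows whose combination of key values appears in the selection. The base R sketch below illustrates that idea on made-up data. The `row_key()` helper is hypothetical, and the package may implement the matching differently (for instance with a join).

```r
# Toy data: one site observed in four CellMarker/Tissue combinations
toy <- data.frame(
    integration_locus = 1000L,
    CellMarker = c("CD34", "CD14", "CD14", "CD3"),
    Tissue = c("BM", "BM", "PB", "PB"),
    seqCount_sum = c(100, 50, 20, 8),
    stringsAsFactors = FALSE
)
# Same selection as in the example above, built with base R
selected_groups <- data.frame(
    CellMarker = c("CD34", "CD14", "CD14"),
    Tissue = c("BM", "BM", "PB"),
    stringsAsFactors = FALSE
)
group_key <- c("CellMarker", "Tissue")

# Hypothetical helper: one composite label per row from the key columns
row_key <- function(df) {
    do.call(paste, c(unname(as.list(df[group_key])), sep = "|"))
}
# Membership test on the composite label, equivalent to a semi-join
in_selection <- row_key(toy) %in% row_key(selected_groups)
to_process <- toy[in_selection, , drop = FALSE]
kept_as_is <- toy[!in_selection, , drop = FALSE]
print(to_process)
```

Only the CD3/PB row falls outside the selection, so it is the one left untouched.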

Value

A data frame

Examples

data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
filtered_by_purity <- purity_filter(
    x = aggreg,
    value_column = "seqCount_sum"
)
head(filtered_by_purity)

calabrialab/ISAnalytics documentation built on Dec. 10, 2024, 10:50 p.m.