outliers_by_pool_fragments: Identify and flag outliers based on pool fragments.

View source: R/outlier-filtering.R

outliers_by_pool_fragments    R Documentation

Identify and flag outliers based on pool fragments.

Description

[Stable] Identify and flag outliers based on expected number of raw reads per pool.

Usage

outliers_by_pool_fragments(
  metadata,
  key = "BARCODE_MUX",
  outlier_p_value_threshold = 0.01,
  normality_test = FALSE,
  normality_p_value_threshold = 0.05,
  transform_log2 = TRUE,
  per_pool_test = TRUE,
  pool_col = "PoolID",
  min_samples_per_pool = 5,
  flag_logic = "AND",
  keep_calc_cols = TRUE,
  report_path = default_report_path()
)

Arguments

metadata

The metadata data frame

key

A character vector of numeric column names

outlier_p_value_threshold

The p-value threshold below which a read is considered an outlier

normality_test

Perform a normality test? Normality is assessed for each column in the key using the Shapiro-Wilk test. If the values do not follow a normal distribution, the other calculations are skipped

normality_p_value_threshold

The p-value threshold for the normality test - relevant only if normality_test = TRUE

transform_log2

Perform a log2 transformation on the values prior to the actual calculations?

per_pool_test

Perform the test for each pool?

pool_col

A character vector of the names of the columns that uniquely identify a pool

min_samples_per_pool

The minimum number of samples that a pool needs to contain in order to be processed - relevant only if per_pool_test = TRUE

flag_logic

A character vector of logic operators used to build a global flag formula - relevant only if the key contains more than one column. Each operator must be one of: AND, OR, XOR, NAND, NOR, XNOR

keep_calc_cols

Keep the calculation columns in the output data frame?

report_path

The path where the report file should be saved. Can be a folder, a file or NULL if no report should be produced. Defaults to {user_home}/ISAnalytics_reports.

Details

Modular structure

The outlier filtering functions are structured in a modular fashion. There are two kinds of functions:

  • Outlier tests - Functions that perform some kind of calculation based on inputs and flag metadata

  • Outlier filter - A function that takes one or more outlier tests, combines all the flags with a given logic and filters out rows that are flagged as outliers

This function is an outlier test and calculates, for each column in the key:

  • The zscore of the values

  • The tstudent of the values

  • The associated p-value (tdist)

Optionally, the test can be performed for each pool, and a normality test can be run prior to the actual calculations. Samples are flagged if this condition is met:

  • tdist < outlier_p_value_threshold & zscore < 0
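The calculation described above can be illustrated with a minimal sketch in base R. It is not the package's implementation: the helpers test_values() and test_by_pool() are hypothetical, the conversion from z-score to t statistic (borrowed from Grubbs' test, with n - 2 degrees of freedom) and the one-sided p-value are assumptions, and the read counts are made up. Only the flagging condition and the handling of min_samples_per_pool follow the text.

```r
# Hypothetical helper (not part of ISAnalytics): test one numeric vector.
test_values <- function(x, p_threshold = 0.01, transform_log2 = TRUE) {
  if (transform_log2) x <- log2(x)
  n <- length(x)
  zscore <- (x - mean(x)) / sd(x)
  # Assumption: z-scores are converted to a t statistic with n - 2 degrees
  # of freedom, as in Grubbs' test; the package may use another formula.
  tstudent <- sqrt(n * (n - 2) * zscore^2 / ((n - 1)^2 - n * zscore^2))
  # Assumption: one-sided upper-tail p-value of the t statistic.
  tdist <- pt(tstudent, df = n - 2, lower.tail = FALSE)
  # Flagging condition as documented: small p-value AND below-average value.
  data.frame(zscore, tstudent, tdist,
             to_remove = tdist < p_threshold & zscore < 0)
}

# Hypothetical helper: apply the test within each pool, leaving pools with
# fewer than min_samples_per_pool samples untested (and unflagged).
test_by_pool <- function(values, pool, min_samples_per_pool = 5, ...) {
  res <- data.frame(zscore = NA_real_, tstudent = NA_real_,
                    tdist = NA_real_, to_remove = rep(FALSE, length(values)))
  for (idx in split(seq_along(values), pool)) {
    if (length(idx) >= min_samples_per_pool) {
      res[idx, ] <- test_values(values[idx], ...)
    }
  }
  res
}

# Made-up read counts: POOL01 has one very low sample, POOL02 is too small.
reads <- c(1000, 1100, 950, 1050, 1020, 980, 10, 500, 520, 5)
pool <- c(rep("POOL01", 7), rep("POOL02", 3))
flags <- test_by_pool(reads, pool)
flags$to_remove # only the 7th sample (10 reads) is flagged
```

Note that the condition zscore < 0 makes the test one-directional: only samples with fewer reads than the pool average can be flagged. The low value in POOL02 is not flagged in this sketch simply because the pool has fewer than five samples.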

If the key contains more than one column, an additional flag logic can be specified to combine the results. For example, suppose the key contains the names of two columns, X and Y:

key = c("X", "Y")

If we specify the argument flag_logic = "AND", the reads will be flagged based on this global condition:

(tdist_X < outlier_p_value_threshold & zscore_X < 0) AND (tdist_Y < outlier_p_value_threshold & zscore_Y < 0)

The user can specify one or more logical operators that will be applied in sequence.
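As a rough illustration of how a sequence of operators might combine per-column flags, the sketch below folds the flags from left to right. The helper combine_flags() and the lookup list logic_ops are hypothetical, and the left-to-right pairing (one operator between each pair of consecutive key columns) is an assumption about what "applied in sequence" means, not a description of the package's internals.

```r
# The six operators named in the flag_logic argument, as base R functions.
logic_ops <- list(
  AND  = function(a, b) a & b,
  OR   = function(a, b) a | b,
  XOR  = function(a, b) xor(a, b),
  NAND = function(a, b) !(a & b),
  NOR  = function(a, b) !(a | b),
  XNOR = function(a, b) !xor(a, b)
)

# Hypothetical helper: combine a list of logical flag vectors (one per key
# column) using length(flags) - 1 operators, applied left to right.
combine_flags <- function(flags, logic) {
  stopifnot(
    length(logic) == length(flags) - 1,
    all(logic %in% names(logic_ops))
  )
  result <- flags[[1]]
  for (i in seq_along(logic)) {
    result <- logic_ops[[logic[i]]](result, flags[[i + 1]])
  }
  result
}

# Made-up per-column flags for three samples and three key columns.
flag_x <- c(TRUE, TRUE, FALSE)
flag_y <- c(TRUE, FALSE, FALSE)
flag_z <- c(FALSE, TRUE, FALSE)

two_cols <- combine_flags(list(flag_x, flag_y), "AND")
three_cols <- combine_flags(list(flag_x, flag_y, flag_z), c("AND", "OR"))
```

Here two_cols corresponds to (flag_x AND flag_y), and three_cols to ((flag_x AND flag_y) OR flag_z).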

Value

A data frame of metadata with the column to_remove

See Also

Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outlier_filter(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()

Examples

data("association_file", package = "ISAnalytics")
flagged <- outliers_by_pool_fragments(association_file,
    report_path = NULL
)
head(flagged)

calabrialab/ISAnalytics documentation built on Dec. 10, 2024, 10:50 p.m.