View source: R/outlier-filtering.R
outliers_by_pool_fragments | R Documentation |
Identify and flag outliers based on expected number of raw reads per pool.
outliers_by_pool_fragments(
  metadata,
  key = "BARCODE_MUX",
  outlier_p_value_threshold = 0.01,
  normality_test = FALSE,
  normality_p_value_threshold = 0.05,
  transform_log2 = TRUE,
  per_pool_test = TRUE,
  pool_col = "PoolID",
  min_samples_per_pool = 5,
  flag_logic = "AND",
  keep_calc_cols = TRUE,
  report_path = default_report_path()
)
metadata |
The metadata data frame |
key |
A character vector of numeric column names |
outlier_p_value_threshold |
The p value threshold for a read to be considered an outlier |
normality_test |
Perform a normality test? Normality is assessed for each column in the key using the Shapiro-Wilk test; if the values do not follow a normal distribution, the other calculations are skipped |
normality_p_value_threshold |
The p-value threshold for the normality test |
transform_log2 |
Perform a log2 transformation on the values prior to the actual calculations? |
per_pool_test |
Perform the test for each pool? |
pool_col |
A character vector of the names of the columns that uniquely identify a pool |
min_samples_per_pool |
The minimum number of samples that a pool
needs to contain in order to be processed - relevant only if
|
flag_logic |
A character vector of logical operators used to build a global flag formula - only relevant if the key is longer than one. All operators must be chosen from: AND, OR, XOR, NAND, NOR, XNOR |
keep_calc_cols |
Keep the calculation columns in the output data frame? |
report_path |
The path where the report file should be saved.
Can be a folder, a file or NULL if no report should be produced.
Defaults to |
The outlier filtering functions are structured in a modular fashion. There are two kinds of functions:
Outlier tests - Functions that perform some kind of calculation based on inputs and flag metadata
Outlier filter - A function that takes one or more outlier tests, combines all the flags with a given logic and filters out rows that are flagged as outliers
This function is an outlier test. For each column in the key it calculates:
The z-score of the values (zscore)
The Student's t statistic of the values (tstudent)
The associated p-value (tdist)
Optionally, the test can be performed for each pool, and a normality test can be run prior to the actual calculations. Samples are flagged if this condition holds:
tdist < outlier_p_value_threshold & zscore < 0
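The calculation above can be sketched in a few lines of base R. This is only an illustration, not the package's implementation: the helper name sketch_flag_outliers is hypothetical, and the exact formulas for tstudent and tdist are assumptions (here, a t statistic derived from the z-score with n - 2 degrees of freedom and a two-sided p-value). Per-pool grouping and the optional normality test are left out.

```r
# Hypothetical helper, NOT part of ISAnalytics: a sketch of the per-column test.
sketch_flag_outliers <- function(x, p_threshold = 0.01, transform_log2 = TRUE) {
  if (transform_log2) {
    x <- log2(x)
  }
  n <- length(x)
  # z-score: distance from the mean in units of the sample standard deviation
  zscore <- (x - mean(x)) / sd(x)
  # Assumed formula: t statistic obtained from the z-score (n - 2 degrees of freedom)
  tstudent <- sqrt(n * (n - 2) * zscore^2 / ((n - 1)^2 - n * zscore^2))
  # Assumed formula: two-sided p-value from the t distribution
  tdist <- 2 * pt(-abs(tstudent), df = n - 2)
  data.frame(
    value = x,
    zscore = zscore,
    tstudent = tstudent,
    tdist = tdist,
    # Flag only values that are significantly BELOW the mean
    to_remove = tdist < p_threshold & zscore < 0
  )
}

# Made-up raw read counts for one pool; the last sample is far below the rest
reads <- c(1200, 1350, 1100, 1280, 1420, 1310, 1190, 15)
res <- sketch_flag_outliers(reads)
res[res$to_remove, ]
```

Because of the zscore < 0 part of the condition, only samples with unusually low values can be flagged; unusually high values are never marked for removal by this test.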
If the key contains more than one column an additional flag logic can be
specified for combining the results.
Example:
Suppose the key contains the names of two columns, X and Y:
key = c("X", "Y")
If we specify the argument flag_logic = "AND", then the reads will be flagged based on this global condition:
(tdist_X < outlier_p_value_threshold & zscore_X < 0) AND
(tdist_Y < outlier_p_value_threshold & zscore_Y < 0)
The user can specify one or more logical operators that will be applied in sequence.
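To make the idea of applying operators in sequence concrete, here is a minimal base R sketch. The names logic_ops and combine_flags are hypothetical, and the sketch assumes that the operators are applied left to right, one operator between each pair of adjacent key columns; the package may handle operator recycling or ordering differently.

```r
# Hypothetical operator table mirroring the allowed flag_logic values
logic_ops <- list(
  AND  = function(a, b) a & b,
  OR   = function(a, b) a | b,
  XOR  = function(a, b) xor(a, b),
  NAND = function(a, b) !(a & b),
  NOR  = function(a, b) !(a | b),
  XNOR = function(a, b) !xor(a, b)
)

# Hypothetical helper: combine per-column flags left to right.
# Assumes one operator between each pair of adjacent columns.
combine_flags <- function(flags, flag_logic) {
  stopifnot(length(flag_logic) == length(flags) - 1)
  stopifnot(all(flag_logic %in% names(logic_ops)))
  result <- flags[[1]]
  for (i in seq_along(flag_logic)) {
    result <- logic_ops[[flag_logic[i]]](result, flags[[i + 1]])
  }
  result
}

# Per-column flags for four samples (made-up values)
flag_X <- c(TRUE, TRUE, FALSE, FALSE)
flag_Y <- c(TRUE, FALSE, TRUE, FALSE)
flag_Z <- c(FALSE, FALSE, TRUE, TRUE)

both <- combine_flags(list(flag_X, flag_Y), "AND")
chained <- combine_flags(list(flag_X, flag_Y, flag_Z), c("AND", "OR"))
```

With "AND", a sample is flagged only when every key column marks it as an outlier, which is the most conservative choice; "OR" flags a sample as soon as any single column does.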
A data frame of metadata with the column to_remove
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outlier_filter(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
data("association_file", package = "ISAnalytics")
flagged <- outliers_by_pool_fragments(association_file,
    report_path = NULL
)
head(flagged)