bootHRT: Calculate Cellwise Flags for Anomaly Detection Using Bayesian...

bootHRT R Documentation

Calculate Cellwise Flags for Anomaly Detection Using Bayesian Bootstrap

Description

The function uses the Bayesian bootstrap to determine whether a data entry is an outlier or not. It takes a long-format data.frame object as input and returns it with two appended vectors. The first vector contains the anomaly scores as numbers between zero and one, and the second vector contains logical values indicating whether the data entry is an outlier (TRUE) or not (FALSE).

Usage

bootHRT(a, contamination = 0.08, boot_max_it = 1000L)

Arguments

a

A long-format data.frame object with survey data. For details see information on the data format.

contamination

A number between zero and one used as a threshold when identifying outliers from the fuzzy scores. By default, the algorithm will identify approximately 8% of the data entries as anomalies.

boot_max_it

An integer determining the number of iterations performed by the Bayesian bootstrap algorithm. It is set to 1000 by default.
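
To give an intuition for how contamination and boot_max_it interact, the following is a minimal, generic sketch of a Bayesian bootstrap in base R. It is not the code used by bootHRT: the scores, the helper weighted_quantile, and the rule "higher score means more anomalous" are all assumptions made for illustration only.

```r
# Generic Bayesian bootstrap sketch (NOT the HRTnomaly implementation).
set.seed(1)
scores <- runif(200)        # hypothetical anomaly scores in [0, 1]
contamination <- 0.08
boot_max_it <- 1000L

# Hypothetical helper: smallest value whose cumulative weight reaches p
weighted_quantile <- function(x, w, p) {
  ord <- order(x)
  cum_w <- cumsum(w[ord])
  x[ord][which(cum_w >= p)[1]]
}

# Each iteration draws Dirichlet(1, ..., 1) weights (normalised Exp(1) draws)
# and recomputes the (1 - contamination) quantile of the scores.
thresholds <- vapply(seq_len(boot_max_it), function(i) {
  g <- rexp(length(scores))
  w <- g / sum(g)
  weighted_quantile(scores, w, 1 - contamination)
}, numeric(1))

# Assumption: an entry is flagged when its score exceeds the mean threshold
flags <- scores > mean(thresholds)
mean(flags)  # share of flagged entries, expected to be close to contamination
```

In this sketch, more iterations give a smoother picture of the uncertainty around the threshold at the cost of a longer run time.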

Details

The argument a is provided as an object of class data.frame. This object is treated as a long-format data.frame, and it must have at least five columns with the following names:

"strata"

a character or factor column containing the information on the stratification.

"unit_id"

a character or factor column containing the ID of the statistical unit in the survey sample.

"master_varname"

a character column containing the name of the observed variable.

"current_value_num"

a numeric column containing the observed value, i.e., a data entry.

"pred_value"

a numeric column containing a value observed in a previous survey for the same variable, if available. If not available, the value can be set to NA or NaN. When working with longitudinal data, the value can be set to a time-series forecast or a filtered value.

The input data.frame object can have more columns, but the extra columns are ignored in the analyses. These extra columns are nevertheless preserved and returned along with the results of the cellwise outlier-detection analysis.
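
As an illustration of the layout described above, the sketch below builds a small long-format data.frame in base R. All values, IDs, variable names, and the extra column note are made up for this example; only the five required column names come from this documentation.

```r
# Hypothetical records: 2 strata, 3 units, 2 variables per unit
a <- data.frame(
  strata            = c("north", "north", "north", "north", "south", "south"),
  unit_id           = c("u01", "u01", "u02", "u02", "u03", "u03"),
  master_varname    = rep(c("acres", "yield"), times = 3),
  current_value_num = c(120, 45.2, 98, 51.7, 300, 12.5),
  pred_value        = c(115, 44.0, NA, 50.1, 310, NA),  # NA: no previous value
  note              = "extra column",  # not required, made up for this sketch
  stringsAsFactors  = FALSE
)

# Check that the five required columns are present before any analysis
required <- c("strata", "unit_id", "master_varname",
              "current_value_num", "pred_value")
stopifnot(all(required %in% names(a)))
str(a)
```

Each row holds one data entry, i.e., one variable observed on one statistical unit.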

The use of the R-packages dplyr, purrr, and tidyr is highly recommended to simplify the conversion of datasets between long and wide formats.
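
For instance, a wide table with one column per variable could be reshaped with tidyr::pivot_longer, roughly as sketched below. The wide table wide_current and its values are hypothetical, and setting pred_value to NA assumes that no previous-survey values are available.

```r
library(tidyr)

# Hypothetical wide-format data: one row per unit, one column per variable
wide_current <- data.frame(
  strata  = c("north", "north", "south"),
  unit_id = c("u01", "u02", "u03"),
  acres   = c(120, 98, 300),
  yield   = c(45.2, 51.7, 12.5)
)

# One row per (unit, variable) pair, using the column names expected by bootHRT
long <- pivot_longer(
  wide_current,
  cols      = c(acres, yield),
  names_to  = "master_varname",
  values_to = "current_value_num"
)

# Assumption: no previous-survey values exist in this toy example
long$pred_value <- NA_real_

# Make sure the result is a plain data.frame
long <- as.data.frame(long)
```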

Value

The long-format data.frame provided as input, returned with extra columns appended, i.e., the anomaly scores and the outlier indicators. The samples from the posterior distribution of the contamination threshold are attached as an attribute vector of length B named "thresholds".
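
A possible way to inspect the returned object is sketched below. It relies only on the 'toy' data shipped with the package and on the attribute name "thresholds" stated above. The names of the appended columns are not assumed; they are recovered by comparing the output with the input.

```r
library(HRTnomaly)
set.seed(2025L)
data(toy)
res <- bootHRT(toy, boot_max_it = 10)

# Columns added by bootHRT, found by comparison with the input
setdiff(names(res), names(toy))

# Posterior samples of the contamination threshold
thr <- attr(res, "thresholds")
summary(thr)
```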

Author(s)

Luca Sartore drwolf85@gmail.com

Examples

# Load the package
library(HRTnomaly)
set.seed(2025L)
# Load the 'toy' data
data(toy)
# Detect cellwise outliers
res <- bootHRT(toy, boot_max_it = 10)

HRTnomaly documentation built on April 3, 2025, 6:17 p.m.