Outliergram for univariate functional datasets

Description

This function performs the outliergram of a univariate functional dataset, possibly with an adjustment of the true positive rate of outliers discovered under the assumption of Gaussianity.

Usage

outliergram(fData, MBD_data = NULL, MEI_data = NULL, q_low = 0,
  q_high = 1, p_check = 0.05, Fvalue = 1.5, adjust = FALSE,
  display = TRUE, xlab = NULL, ylab = NULL, main = NULL, ...)

Arguments

fData

the univariate functional dataset whose outliergram has to be determined.

MBD_data

a vector containing the MBD for each element of the dataset. If missing, MBDs are computed.

MEI_data

a vector containing the MEI for each element of the dataset. If not provided, MEIs are computed.

q_low

parameter used in the part where data are shifted toward the center of the dataset. It indicates the quantile used to compute the target against which functions are compared in the secondary check for outliers. Default is 0, i.e. high-MEI functions (lying at the bottom of the dataset) are compared to the minimum of all the remaining functions.

q_high

parameter used in the part where data are shifted toward the center of the dataset. It indicates the quantile used to compute the target against which functions are compared in the secondary check for outliers. Default is 1, i.e. low-MEI functions (lying at the top of the dataset) are compared to the maximum of all the remaining functions.

p_check

percentage of observations with either low or high MEI to be checked for outliers in the secondary step (shift towards the center of the dataset).

Fvalue

the F value to be used in the procedure that finds the shape outliers by looking at the lower parabolic limit in the outliergram. Default is 1.5. You can also leave the default value and, by providing the parameter adjust, specify that you want Fvalue to be adjusted for the dataset provided in fData.

adjust

either FALSE if you would like the default value for the inflation factor, F = 1.5, to be used, or a list specifying the parameters required by the adjustment.

  • "N_trials": the number of repetitions of the adjustment procedure based on the simulation of a Gaussian population of functional data, each one producing an adjusted value of F, which will lead to the averaged adjusted value \bar{F}. Default is 20;

  • "trial_size": the number of elements in the gaussian population of functional data that will be simulated at each repetition of the adjustment procedure. Default is 5 * fData$N;

  • "TPR": the True Positive Rate of outliers, i.e. the proportion of observations in a dataset without shape outliers that have to be considered outliers. Default is 2 * pnorm( 4 * qnorm( 0.25 ) );

  • "F_min": the minimum value of F, defining the left boundary for the optimisation problem aimed at finding, for a given dataset of simulated gaussian data associated to fData, the optimal value of F. Default is 0.5;

  • "F_max": the maximum value of F, defining the right boundary for the optimisation problem aimed at finding, for a given dataset of simulated gaussian data associated to fData, the optimal value of F. Default is 20;

  • "tol": the tolerance to be used in the optimisation problem aimed at finding, for a given dataset of simulated gaussian data associated to fData, the optimal value of F. Default is 1e-3;

  • "maxiter": the maximum number of iterations to solve the optimisation problem aimed at finding, for a given dataset of simulated gaussian data associated to fData, the optimal value of F. Default is 100;

  • "VERBOSE": a parameter controlling the verbosity of the adjustment process.
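The default TPR can be evaluated directly in base R. The short sketch below computes it and, as an interpretation not stated in this page, checks that it equals the probability that a standard Gaussian observation falls outside the classical boxplot fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The variable names are illustrative.

```r
# Default value of adjust$TPR, as given in the argument description
TPR_default = 2 * pnorm( 4 * qnorm( 0.25 ) )

# Same quantity written as a boxplot rule for a standard Gaussian:
# fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
Q1 = qnorm( 0.25 )
Q3 = qnorm( 0.75 )
IQR_gauss = Q3 - Q1
TPR_boxplot = pnorm( Q1 - 1.5 * IQR_gauss ) +
  pnorm( Q3 + 1.5 * IQR_gauss, lower.tail = FALSE )

print( TPR_default )   # about 0.007, i.e. roughly 0.7% of observations
```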

display

either a logical value indicating whether you want the outliergram to be displayed, or the number of the graphical device where you want the outliergram to be displayed.

xlab

a list of two labels to use on the x axis when displaying the functional dataset and the outliergram.

ylab

a list of two labels to use on the y axis when displaying the functional dataset and the outliergram.

main

a list of two titles to be used on the plot of the functional dataset and the outliergram.

...

additional graphical parameters to be used only in the plot of the functional dataset.
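To illustrate the rule that Fvalue controls, the following base R sketch computes MEI and MBD from pointwise ranks on a toy dataset, evaluates the parabolic bound of Arribas-Gil and Romo (2014), and flags curves whose distance d from the bound exceeds Q3 + Fvalue * IQR. It is a minimal sketch assuming no ties between curves. The toy data and all variable names (X, R, d, ID_flagged) are hypothetical, and the internals of outliergram, including the secondary check governed by q_low, q_high and p_check, may differ.

```r
set.seed( 1 )
N = 50   # number of curves
P = 30   # number of grid points

# Toy data: N noisy, vertically shifted sine curves, one per row
grid = seq( 0, 1, length.out = P )
X = t( replicate( N, sin( 2 * pi * grid ) + rnorm( 1 ) + rnorm( P, sd = 0.1 ) ) )

# Pointwise rank of each curve among the N curves (1 = lowest)
R = apply( X, 2, rank )

# MEI: average proportion of curves lying on or above each curve
MEI = rowMeans( ( N - R + 1 ) / N )

# MBD: average proportion of pairs of curves whose band contains each curve
MBD = rowMeans( ( ( R - 1 ) * ( N - R ) + N - 1 ) / choose( N, 2 ) )

# Parabolic upper bound on MBD as a function of MEI
a0 = -2 / ( N * ( N - 1 ) )
a1 = 2 * ( N + 1 ) / ( N - 1 )
a2 = a0
d = a0 + a1 * MEI + a2 * N^2 * MEI^2 - MBD   # non-negative distances

# Boxplot-style rule on the distances, inflated by Fvalue
Fvalue = 1.5
Q = quantile( d, c( 0.25, 0.75 ) )
ID_flagged = which( d >= Q[ 2 ] + Fvalue * ( Q[ 2 ] - Q[ 1 ] ) )
```

Raising Fvalue moves the threshold on d upwards, so fewer curves are flagged.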

Value

Even when used graphically to plot the outliergram, the function returns a list containing a numeric vector with the IDs of observations in fData that are considered as shape outliers and the value of Fvalue that has been used in determining them.
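The returned list can also be obtained without plotting by setting display = FALSE. The sketch below builds a small Gaussian dataset with the same roahd helpers used in the Examples section and inspects the result with str(). The dataset sizes and the name res are illustrative, and the element names of the list are deliberately not assumed here.

```r
library( roahd )

set.seed( 1618 )

N = 50
P = 50
grid = seq( 0, 1, length.out = P )

# Small Gaussian functional dataset, built as in the Examples section
Cov = exp_cov_function( grid, alpha = 0.2, beta = 0.8 )
Data = generate_gauss_fdata( N, centerline = sin( 4 * pi * grid ), Cov = Cov )
fD = fData( grid, Data )

# Compute the outliergram without opening a plot and inspect the output
res = outliergram( fD, display = FALSE )
str( res )
```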

Adjustment

When the adjustment option is selected, the value of F is optimised for the univariate functional dataset provided with fData. In practice, a synthetic population of size adjust$trial_size, with the same centerline and covariance (robustly estimated from the data) as fData, is simulated without outliers adjust$N_trials times; each time, an optimised value F_i is computed so that a given proportion (adjust$TPR) of observations is flagged as outliers. The final value of F for the outliergram is the average of F_1, F_2, …, F_{N_{trials}}. At each repetition, the optimisation problem is solved using stats::uniroot (Brent's method).
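The structure of this procedure can be sketched in base R on a simplified, univariate analogue. Instead of simulating functional data and applying the outliergram rule, the sketch draws scalar Gaussian samples and uses the classical boxplot fences as the flagging rule; this substitution is made only to keep the example self-contained. The helper flag_rate and all variable names are hypothetical and are not part of roahd.

```r
set.seed( 1618 )

# Illustrative settings mirroring the entries of `adjust`
N_trials = 20
trial_size = 1000
TPR = 2 * pnorm( 4 * qnorm( 0.25 ) )
F_min = 0.5
F_max = 20

# Proportion of points outside the boxplot fences for a given F
# (hypothetical helper standing in for the outliergram flagging rule)
flag_rate = function( Fv, x )
{
  Q = quantile( x, c( 0.25, 0.75 ) )
  IQR_x = Q[ 2 ] - Q[ 1 ]
  mean( x < Q[ 1 ] - Fv * IQR_x | x > Q[ 2 ] + Fv * IQR_x )
}

# One optimised F_i per simulated, outlier-free Gaussian sample
F_trials = sapply( seq_len( N_trials ), function( i )
{
  x = rnorm( trial_size )
  uniroot( function( Fv ) flag_rate( Fv, x ) - TPR,
           interval = c( F_min, F_max ),
           tol = 1e-3, maxiter = 100 )$root
} )

# Final adjusted value: the average of F_1, ..., F_{N_trials}
F_bar = mean( F_trials )
```

Because the flagged proportion is a step function of F, uniroot converges to the point where it crosses the target TPR rather than to an exact zero.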

References

Arribas-Gil, A., and Romo, J. (2014). Shape outlier detection and visualization for functional data: the outliergram, Biostatistics, 15(4), 603-619.

See Also

fData, MEI, MBD, fbplot

Examples

library( roahd )

set.seed( 1618 )

N = 200
P = 200
N_extra = 4

grid = seq( 0, 1, length.out = P )

Cov = exp_cov_function( grid, alpha = 0.2, beta = 0.8 )

Data = generate_gauss_fdata( N,
                             centerline = sin( 4 * pi * grid ),
                             Cov = Cov )

Data_extra = array( 0, dim = c( N_extra, P ) )

Data_extra[ 1, ] = generate_gauss_fdata( 1,
                                         sin( 4 * pi * grid + pi / 2 ),
                                         Cov = Cov )

Data_extra[ 2, ] = generate_gauss_fdata( 1,
                                         sin( 4 * pi * grid - pi / 2 ),
                                         Cov = Cov )

Data_extra[ 3, ] = generate_gauss_fdata( 1,
                                         sin( 4 * pi * grid + pi / 3 ),
                                         Cov = Cov )

Data_extra[ 4, ] = generate_gauss_fdata( 1,
                                         sin( 4 * pi * grid - pi / 3 ),
                                         Cov = Cov )
Data = rbind( Data, Data_extra )

fD = fData( grid, Data )

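# Outliergram with the default inflation factor, F = 1.5
outliergram( fD, display = TRUE )

# Outliergram with a larger inflation factor, i.e. a stricter threshold
outliergram( fD, Fvalue = 10, display = TRUE )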
## Not run: 
outliergram( fD,
             adjust = list( N_trials = 10,
                            trial_size = 5 * nrow( Data ),
                            TPR = 0.01,
                            VERBOSE = FALSE ),
             display = TRUE )

## End(Not run)
