View source: R/iterative_estimators.R
outlier_detection | R Documentation |
outlier_detection
provides different types of outlier detection
algorithms depending on the arguments provided. The decision whether to
classify an observation as an outlier or not is based on its standardised
residual in comparison to some user-specified reference distribution.
The algorithms differ mainly in two ways. First, they can differ in the
choice of initial estimator, i.e. the estimator on which the first
classification as outliers is based. Second, the algorithm can either be
iterated a fixed number of times or until the difference in coefficient
estimates between the most recent model and the previous one is smaller than
some user-specified convergence criterion. The difference is measured by
the L2 norm.
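The stopping rule described above can be sketched in a few lines of base R. This is only an illustration of the L2-norm comparison, not the package's internal code; the helper `l2_diff` and the coefficient values are hypothetical.

```r
# Hypothetical helper: L2 norm of the change in coefficient estimates
# between two consecutive iterations.
l2_diff <- function(beta_new, beta_old) {
  sqrt(sum((beta_new - beta_old)^2))
}

# Made-up coefficient vectors from two consecutive iterations.
beta_old <- c(intercept = 1.00, x1 = 2.00)
beta_new <- c(intercept = 1.03, x1 = 1.96)

# User-specified threshold (plays the role of convergence_criterion).
convergence_criterion <- 0.1

difference <- l2_diff(beta_new, beta_old)
converged <- difference < convergence_criterion
```

With these values the coefficient change is small relative to the threshold, so iteration would stop at this step.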
outlier_detection(
  data,
  formula,
  ref_dist = c("normal"),
  sign_level,
  initial_est = c("robustified", "saturated", "user", "iis"),
  user_model = NULL,
  iterations = 1,
  convergence_criterion = NULL,
  max_iter = NULL,
  shuffle = FALSE,
  shuffle_seed = NULL,
  split = 0.5,
  verbose = FALSE,
  iis_args = NULL
)
data |
A dataframe. |
formula |
A formula for the |
ref_dist |
A character vector that specifies the reference distribution
against which observations are classified as outliers. |
sign_level |
A numeric value between 0 and 1 that determines the cutoff in the reference distribution against which observations are judged as outliers or not. |
initial_est |
A character vector that specifies the initial estimator
for the outlier detection algorithm. |
user_model |
A model object of class ivreg. Only
required if argument |
iterations |
Either an integer >= 0 that specifies how often the outlier
detection algorithm is iterated, or the character vector
|
convergence_criterion |
A numeric value or NULL. The algorithm stops as
soon as the difference in coefficient estimates between the most recent model
and the previous one is smaller than |
max_iter |
A numeric value >= 1 or NULL. If
|
shuffle |
A logical value or |
shuffle_seed |
An integer value that will set the seed for shuffling the
sample or |
split |
A numeric value strictly between 0 and 1 that determines in which proportions the sample will be split. |
verbose |
A logical value whether progress during estimation should be reported. |
iis_args |
A list with named entries corresponding to the arguments for
|
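A minimal call might look like the sketch below. The simulated data and variable names are hypothetical. The package name `robust2sls` is inferred from the class of the returned object, and the ivreg-style formula (regressors before `|`, instruments after) is an assumption, since the formula description above is truncated. The call is guarded so the data-generating part runs even without the package installed.

```r
# Simulated data for illustration only; all variable names are hypothetical.
set.seed(42)
n <- 200
z2 <- rnorm(n)            # outside instrument
x1 <- rnorm(n)            # exogenous regressor
v <- rnorm(n)             # first-stage error
u <- 0.5 * v + rnorm(n)   # structural error, correlated with v
x2 <- 0.8 * z2 + v        # endogenous regressor
y <- 1 + 2 * x1 - x2 + u
dat <- data.frame(y, x1, x2, z2)

# Assumption: the function lives in a package called "robust2sls".
if (requireNamespace("robust2sls", quietly = TRUE)) {
  fit <- robust2sls::outlier_detection(
    data = dat,
    formula = y ~ x1 + x2 | x1 + z2,  # assumed ivreg-style formula
    ref_dist = "normal",
    sign_level = 0.05,
    initial_est = "robustified",
    iterations = 3
  )
  names(fit)  # top-level components of the returned list
}
```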
outlier_detection returns an object of class "robust2sls", which is a list
with the following components:
$cons
A list which stores high-level information about the function call and
some results:
$call is the captured function call,
$formula the formula argument,
$data the original data set,
$reference the chosen reference distribution to classify outliers,
$sign_level the significance level,
$psi the probability that an observation is not classified as an outlier
under the null hypothesis of no outliers,
$cutoff the cutoff used to classify outliers if their standardised
residuals are larger than that value,
$bias_corr a bias correction factor to account for potential false
positives (observations classified as outliers even though they are not).
There are three further elements that are lists themselves.
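To make the relationship between `$sign_level`, `$psi`, and `$cutoff` concrete, the sketch below computes them for a normal reference distribution. It assumes a two-sided rule that compares absolute standardised residuals with the `1 - sign_level / 2` quantile; the page itself does not state this, so treat it as one plausible reading. The residual values are made up.

```r
sign_level <- 0.05

# Probability of not being classified as an outlier under the null.
psi <- 1 - sign_level

# Assumption: two-sided cutoff from the standard normal distribution.
cutoff <- qnorm(1 - sign_level / 2)

# Made-up standardised residuals; NA marks an observation with missing data.
stdres <- c(-2.5, 0.3, 1.2, 2.1, NA)

# Flag observations whose absolute standardised residual exceeds the cutoff.
is_outlier <- abs(stdres) > cutoff
```

Observations with missing standardised residuals stay `NA` rather than being flagged either way.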
$initial
stores settings about the initial estimator:
$estimator is the type of the initial estimator (e.g. robustified or
saturated),
$split how the sample is split (NULL if argument not used),
$shuffle whether the sample is shuffled before splitting (NULL if argument
not used),
$shuffle_seed the value of the random seed (NULL if argument not used).
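The roles of `split`, `shuffle`, and `shuffle_seed` can be illustrated with a small base R sketch. The helper `split_sample` is hypothetical, and the rule of assigning the first `floor(n * split)` observations to the first subsample is an assumption; the package may split differently.

```r
# Hypothetical helper: split observation indices into two subsamples.
split_sample <- function(n, split = 0.5, shuffle = FALSE, shuffle_seed = NULL) {
  idx <- seq_len(n)
  if (shuffle) {
    # Setting the seed makes the shuffled order reproducible.
    if (!is.null(shuffle_seed)) set.seed(shuffle_seed)
    idx <- sample(idx)
  }
  # Assumption: the first floor(n * split) indices form the first subsample.
  n_first <- floor(n * split)
  in_first <- seq_len(n) <= n_first
  list(first = idx[in_first], second = idx[!in_first])
}

parts <- split_sample(n = 10, split = 0.5, shuffle = TRUE, shuffle_seed = 1)
parts_again <- split_sample(n = 10, split = 0.5, shuffle = TRUE, shuffle_seed = 1)
```

Calling the helper twice with the same seed yields the same partition, which is the purpose of a seed argument.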
$convergence
stores information about the convergence of the outlier-detection
algorithm:
$criterion is the user-specified convergence criterion (NULL if argument
not used),
$difference is the L2 norm between the last coefficient estimates and the
previous ones (NULL if argument not used or only initial estimator
calculated),
$converged is a logical value indicating whether the algorithm has
converged, i.e. whether the difference is smaller than the convergence
criterion (NULL if argument not used),
$max_iter is the maximum iteration set by the user (NULL if argument not
used or not set).
$iterations
contains information about the user-specified iterations argument
($setting) and the actual number of iterations that were done ($actual).
The actual number can be lower if the algorithm converged before the
user-specified number of iterations was reached.
$model
A list storing the model objects of class ivreg for each iteration. Each
model is stored under $m0, $m1, ...
$res
A list storing the residuals of all observations for each iteration.
Residuals of observations where any of the y, x, or z variables used in
the 2SLS model are missing are set to NA. Each vector is stored under
$m0, $m1, ...
$stdres
A list storing the standardised residuals of all observations for each
iteration. Standardised residuals of observations where any of the y, x,
or z variables used in the 2SLS model are missing are set to NA.
Standardisation is done by dividing by sigma, which is not adjusted for
degrees of freedom. Each vector is stored under $m0, $m1, ...
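As an illustration of standardisation without a degrees-of-freedom adjustment, the sketch below divides residuals by the square root of the mean squared residual (divisor n rather than n minus the number of parameters). The residuals are made up, and the exact sample over which the package computes sigma, and whether the bias correction enters, is not stated here, so this is only an assumed simplification.

```r
# Made-up residuals; NA marks an observation with missing y, x, or z.
res <- c(0.5, -1.0, NA, 2.0, -0.5)

# Number of observations with a usable residual.
n_obs <- sum(!is.na(res))

# Assumption: sigma uses divisor n_obs, i.e. no degrees-of-freedom adjustment.
sigma <- sqrt(sum(res^2, na.rm = TRUE) / n_obs)

# Missing residuals stay NA after standardisation.
stdres <- res / sigma
```

By construction, the standardised residuals have mean square one over the non-missing observations.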
$sel
A list of logical vectors storing whether an observation is included in
the estimation or not. Observations are excluded (FALSE) if they either
have missing values in any of the x, y, or z variables needed in the model
or when they are classified as outliers based on the model. Each vector is
stored under $m0, $m1, ...
$type
A list of integer vectors indicating whether an observation has any
missing values in x, y, or z (-1), whether it is classified as an outlier
(0) or not (1). Each vector is stored under $m0, $m1, ...
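The coding in `$type` and its relation to `$sel` can be illustrated with a made-up vector. The sketch assumes, following the descriptions above, that an observation is selected exactly when its type is 1; the labels are chosen here for readability and are not part of the returned object.

```r
# Made-up type codes for five observations in one iteration:
# -1 = missing values, 0 = outlier, 1 = non-outlier.
type <- c(1L, 0L, -1L, 1L, 0L)

# An observation enters the estimation only if it is a non-outlier
# with complete data.
sel <- type == 1L

# Readable summary of the classification (labels are illustrative).
type_labelled <- factor(type,
                        levels = c(-1, 0, 1),
                        labels = c("missing", "outlier", "non-outlier"))
counts <- table(type_labelled)
```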
See Jiao (2019) (and a forthcoming working paper) for the conditions that
the initial estimator should satisfy when using initial_est == "user"
(e.g. they have to be Op(1)).
IIS is a generalisation of Saturated 2SLS with multiple block search, but
no asymptotic theory exists for IIS.