mhidetify: Multiple detection asymmetric influential measure for high...


View source: R/mhidetify.R

Description

The function computes the asymmetric influential measure to identify influential observations in high dimensional linear regression using the multiple detection approach.

Usage

mhidetify(
  x, 
  y, 
  number_subset, 
  size_subset, 
  asymvec, 
  ep=0.1, 
  alpha_swamp, 
  alpha_mask, 
  alpha_validate
  )

Arguments

x

Matrix of the predictors.

y

Numeric vector of the response variable.

number_subset

Number of random subsets. The default is 5.

size_subset

Size of the random subsets. The default is half of the initial sample size.

asymvec

Numeric vector of the asymmetric points. Choosing three asymmetric points within the quartiles is suggested, for example c(0.25, 0.5, 0.75) as in the Examples.

ep

Threshold value to ensure that the estimated clean set is not empty. The default value is 0.1.

alpha_swamp

Significance level for the swamping stage.

alpha_mask

Significance level for the masking stage.

alpha_validate

Significance level for the validation stage.

Value

A data frame with two variables.

ind

Index of each subject in the sample.

outlier_ind

Indicator of influential observations: 1 if the observation is influential, 0 otherwise.
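As a minimal sketch of how the returned data frame might be read, the snippet below builds a small stand-in with the two documented columns and extracts the flagged indices. The values are made up for illustration and are not actual output of mhidetify.

```r
# Hypothetical result with the documented structure (ind, outlier_ind).
# These values are invented for illustration, NOT real mhidetify output.
out <- data.frame(
  ind = 1:8,
  outlier_ind = c(1, 0, 0, 1, 0, 0, 0, 1)
)

# Indices of the observations flagged as influential
flagged <- out$ind[out$outlier_ind == 1]
flagged

# Number of flagged observations
n_flagged <- sum(out$outlier_ind)
n_flagged
```

The same two lines should apply unchanged to a real result, assuming it follows the structure described above.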

Author(s)

Amadou Barry barryhafia@gmail.com

References

Barry, A., Bhagwat, N., Misic, B., Poline, J.-B., and Greenwood, C. M. T. (2020). Asymmetric influence measure for high dimensional regression. Communications in Statistics - Theory and Methods.

Barry, A., Bhagwat, N., Misic, B., Poline, J.-B., and Greenwood, C. M. T. (2021). An algorithm-based multiple detection influence measure for high dimensional regression using expectile. arXiv:2105.12286 [stat].

Examples

## Simulate a dataset in which the first 10 observations are influential
library(MASS)      # provides mvrnorm()
library(hidetify)  # provides mhidetify()
# the vector of asymmetric points
asymvec  <- c(0.25,0.5,0.75)

# the parameter of interest
beta_param <- c(3,1.5,0,0,2,rep(0,1000-5))

# the contamination parameter 
gama_param <- c(0,0,1,1,0,rep(1,1000-5))

# Covariance matrix of the predictor distribution
sigmain <- diag(rep(1,1000))
for (i in 1:1000)
{
  for (j in i:1000) 
  {
    sigmain[i,j] <- 0.5^(abs(j-i))
    sigmain[j,i] <- sigmain[i,j]
  }
}

# set the seed
set.seed(13)

# the predictor matrix
x  <- mvrnorm(100, rep(0, 1000), sigmain)

# the error variable
error_var <- rnorm(100)

# the response variable
y  <- x %*% beta_param + error_var
y <- as.numeric(y)

### Generate influential observations
# the contaminated response variable
youtlier <- y
youtlier[1:10] <- x[1:10,] %*% (beta_param +  1.2*gama_param)  + error_var[1:10]
youtlier <- as.numeric(youtlier)


# number of random subsets
number_subset <- 5

# the size of the random subset
size_subset <- 100/2

# the significance level for the swamping stage
alpha_swamp <- 0.1

# the significance level for the masking stage
alpha_mask <- 0.01

# the significance level for the validation stage
alpha_validate <- 0.01

# Threshold value to ensure that the estimated clean set is not empty. 
ep <- 0.1

out <- 
  mhidetify(
    x, 
    youtlier, 
    number_subset, 
    size_subset, 
    asymvec, 
    ep, 
    alpha_swamp, 
    alpha_mask,
    alpha_validate)
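Because the simulation contaminates observations 1 to 10, the result can be compared with that known truth. The sketch below uses a hypothetical helper, detection_summary, which is not part of hidetify, and a made-up indicator vector in place of a real result.

```r
# Hypothetical helper (not part of hidetify): compare the flagged
# observations with the indices known to be contaminated in a simulation.
detection_summary <- function(out, true_idx) {
  flagged   <- out$ind[out$outlier_ind == 1]
  clean_idx <- setdiff(out$ind, true_idx)
  c(
    # share of truly contaminated observations that were flagged
    true_positive_rate  = mean(true_idx %in% flagged),
    # share of clean observations that were flagged by mistake
    false_positive_rate = mean(clean_idx %in% flagged)
  )
}

# Made-up indicator vector for illustration only, NOT real mhidetify output
toy_out <- data.frame(ind = 1:100, outlier_ind = 0)
toy_out$outlier_ind[c(1:8, 50)] <- 1

res <- detection_summary(toy_out, true_idx = 1:10)
res
```

With the simulated data above, the analogous call would be detection_summary(out, true_idx = 1:10), assuming out has the two columns described under Value.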

hidetify documentation built on Aug. 20, 2021, 5:06 p.m.