simdata_contaminated: Generates random variates from K multivariate contaminated...


View source: R/simdata-contaminated-normal.r

Description

We generate n_k observations from the kth of K multivariate contaminated normal distributions (k = 1, …, K). Let N_p(μ, Σ) denote the p-dimensional multivariate normal distribution with mean vector μ and positive-definite covariance matrix Σ. The kth population then has a p-dimensional multivariate contaminated normal distribution, whose mixture form is given under Details.

Usage

  simdata_contaminated(n, mean, cov, epsilon = rep(0, K),
    kappa = rep(1, K), seed = NULL)

Arguments

n

a vector (of length K) of the sample sizes for each population

mean

a vector or a list (of length K) of mean vectors

cov

a symmetric matrix or a list (of length K) of symmetric covariance matrices.

epsilon

a vector (of length K) indicating the probability of sampling a contaminated population (i.e., outlier) for each population

kappa

a vector (of length K) that determines the amount of scale contamination for each population

seed

seed for random number generation (if NULL, no seed is set)

Details

The kth population has the mixture distribution

(1 - ε_k) N_p(μ_k, Σ_k) + ε_k N_p(μ_k, κ_k Σ_k),

where ε_k ∈ [0, 1] is the probability of sampling from the contaminated component (i.e., of drawing a potential outlier) and κ_k ≥ 1 determines the amount of scale contamination. The contaminated normal distribution can be viewed as a mixture of two multivariate normal distributions, where the second has a scaled covariance matrix, which can introduce extreme outliers for sufficiently large κ_k.
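The mixture can be sketched in base R as a two-step draw: first decide, with probability ε_k, whether an observation comes from the scaled component, then draw from the corresponding normal distribution. The helper rcontaminated_sketch below is hypothetical and written only for illustration; it is not the code sortinghat uses, and it handles a single population.

```r
# Hypothetical helper, not part of sortinghat: draws n observations from
# (1 - epsilon) N_p(mu, Sigma) + epsilon N_p(mu, kappa * Sigma).
rcontaminated_sketch <- function(n, mu, Sigma, epsilon = 0, kappa = 1) {
  p <- length(mu)
  # TRUE marks observations drawn from the contaminated (scaled) component.
  contaminated <- runif(n) < epsilon
  # Standard normal draws, one row per observation.
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # chol() returns an upper-triangular R with t(R) %*% R == Sigma, so the
  # rows of z %*% R have covariance Sigma.
  x <- z %*% chol(Sigma)
  # Multiplying a row by sqrt(kappa) scales its covariance by kappa.
  x <- x * ifelse(contaminated, sqrt(kappa), 1)
  # Shift every row by the common mean vector.
  x <- sweep(x, 2, mu, "+")
  list(x = x, contaminated = contaminated)
}

set.seed(1)
draws <- rcontaminated_sketch(n = 1000, mu = c(1, 0), Sigma = diag(2),
                              epsilon = 0.05, kappa = 10)
dim(draws$x)
mean(draws$contaminated)  # observed share of contaminated rows
```

Both components share the mean vector, so contamination here changes only the spread of the draws, never their location.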

The number of populations, K, is determined from the length of the vector of sample sizes, n. The mean vectors and covariance matrices can each be given in a list of length K. If one covariance matrix is given (as a matrix or a list having 1 element), then all populations share this common covariance matrix. The same logic applies to population means.

The contamination probabilities in epsilon can be given as a numeric vector or a single value, in which case the value is replicated K times. The same idea applies to the scale contamination in the kappa argument.
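The recycling rules described above can be illustrated with a short base R sketch. The helpers recycle_scalar and recycle_list are hypothetical names introduced here; how sortinghat performs this expansion internally is not shown in this page, so treat the code as one plausible reading of the documented behaviour.

```r
# Hypothetical helpers, not sortinghat's code.

# Expand a single numeric value (e.g. epsilon or kappa) to length K.
recycle_scalar <- function(value, K) {
  if (length(value) == 1) value <- rep(value, K)
  stopifnot(length(value) == K)
  value
}

# Expand a single mean vector or covariance matrix to a list of length K.
recycle_list <- function(value, K) {
  # A bare vector or matrix is wrapped so it behaves like a 1-element list.
  if (!is.list(value)) value <- list(value)
  if (length(value) == 1) value <- rep(value, K)
  stopifnot(length(value) == K)
  value
}

n <- c(10, 10, 10)
K <- length(n)                         # K is inferred from the sample sizes
recycle_scalar(0.05, K)                # 0.05 0.05 0.05
length(recycle_list(diag(2), K))       # 3
```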

By default, epsilon is a vector of zeros, and kappa is a vector of ones. Hence, no contamination is applied by default.
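Substituting either default into the mixture shows why no contamination results. The last line, the covariance of an observation X_k from the kth population, is a standard property of mixtures whose components share a mean; it is added here for orientation and is not stated in the package documentation.

```latex
\begin{gather*}
\varepsilon_k = 0: \quad
  (1 - 0)\, N_p(\mu_k, \Sigma_k) + 0 \cdot N_p(\mu_k, \kappa_k \Sigma_k)
  = N_p(\mu_k, \Sigma_k), \\
\kappa_k = 1: \quad
  (1 - \varepsilon_k)\, N_p(\mu_k, \Sigma_k) + \varepsilon_k\, N_p(\mu_k, \Sigma_k)
  = N_p(\mu_k, \Sigma_k), \\
\operatorname{Cov}(X_k)
  = (1 - \varepsilon_k)\, \Sigma_k + \varepsilon_k \kappa_k \Sigma_k
  = \bigl(1 + \varepsilon_k (\kappa_k - 1)\bigr)\, \Sigma_k .
\end{gather*}
```

For instance, ε_k = 0.05 and κ_k = 10 inflate the covariance by a factor of 1 + 0.05 × 9 = 1.45, while either default value gives a factor of 1.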

Value

named list containing:

x:

A matrix whose rows are the observations generated and whose columns are the p features (variables)

y:

A vector denoting the population from which the observation in each row was generated.

Examples

# Generates 10 observations from each of two multivariate contaminated normal
# populations with equal covariance matrices. Each population has a
# contamination probability of 0.05 and scale contamination of 10.
mean_list <- list(c(1, 0), c(0, 1))
cov_identity <- diag(2)
data <- simdata_contaminated(n = c(10, 10), mean = mean_list,
                             cov = cov_identity, epsilon = 0.05, kappa = 10,
                             seed = 42)
dim(data$x)
table(data$y)

# Generates 10 observations from each of three multivariate contaminated
# normal populations with unequal covariance matrices. The contamination
# probabilities and scales differ for each population as well.
set.seed(42)
mean_list <- list(c(-3, -3), c(0, 0), c(3, 3))
cov_list <- list(cov_identity, 2 * cov_identity, 3 * cov_identity)
data2 <- simdata_contaminated(n = c(10, 10, 10), mean = mean_list,
                              cov = cov_list, epsilon = c(0.05, 0.1, 0.2),
                              kappa = c(2, 5, 10))
dim(data2$x)
table(data2$y)

Example output

[1] 20  2

 1  2 
10 10 
[1] 30  2

 1  2  3 
10 10 10 

sortinghat documentation built on May 30, 2017, 4:52 a.m.