simdata_contaminated: Generates random variates from K multivariate contaminated...

Description

We generate n_k observations (k = 1, …, K) from each of K multivariate contaminated normal distributions. Let N_p(μ, Σ) denote the p-dimensional multivariate normal distribution with mean vector μ and positive-definite covariance matrix Σ. The kth population then has a p-dimensional multivariate contaminated normal distribution, a two-component mixture whose form is given under Details.

Usage

simdata_contaminated(n, mean, cov, epsilon = rep(0, K), kappa = rep(1, K),
  seed = NULL)

Arguments

n:

a vector (of length K) of the sample sizes for each population

mean:

a vector or a list (of length K) of mean vectors

cov:

a symmetric matrix or a list (of length K) of symmetric covariance matrices

epsilon:

a vector (of length K) indicating the probability of sampling a contaminated population (i.e., outlier) for each population

kappa:

a vector (of length K) that determines the amount of scale contamination for each population

seed:

seed for random number generation (if NULL, the seed is not set)

Details

(1 - ε_k) N_p(μ_k, Σ_k) + ε_k N_p(μ_k, κ_k Σ_k),

where ε_k ∈ [0, 1] is the probability of sampling from a contaminated population (i.e., an outlier) and κ_k ≥ 1 determines the amount of scale contamination. The contaminated normal distribution can be viewed as a mixture of two multivariate normal distributions with a common mean, where the second has a scaled covariance matrix, which can introduce extreme outliers for sufficiently large κ_k.
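The mixture view suggests a direct way to sample from a single contaminated population: flip a coin with probability ε for each observation, then draw from the ordinary or the inflated-covariance normal accordingly. The following is a minimal base-R sketch of that idea. The helper name rcontaminated is hypothetical and this is not necessarily how sortinghat implements the sampling.

```r
# Hypothetical helper, not part of sortinghat: draws n observations from
# (1 - eps) N_p(mu, Sigma) + eps N_p(mu, kappa * Sigma).
rcontaminated <- function(n, mu, Sigma, eps = 0, kappa = 1) {
  p <- length(mu)
  # TRUE marks draws taken from the contaminated (inflated-covariance) component.
  contaminated <- runif(n) < eps
  # Standard normal draws, one row per observation.
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # chol() returns an upper-triangular R with t(R) %*% R == Sigma,
  # so the rows of z %*% R have covariance Sigma.
  x <- z %*% chol(Sigma)
  # Scaling a row by sqrt(kappa) multiplies its covariance by kappa.
  # The length-n vector recycles down columns, so row i gets factor i.
  x <- x * ifelse(contaminated, sqrt(kappa), 1)
  # Shift every row by the mean vector.
  x <- sweep(x, 2, mu, "+")
  list(x = x, contaminated = contaminated)
}

set.seed(42)
draws <- rcontaminated(n = 1000, mu = c(1, 0), Sigma = diag(2),
                       eps = 0.05, kappa = 10)
dim(draws$x)               # 1000 rows, 2 columns
mean(draws$contaminated)   # observed share of contaminated draws
```

Because both components share the mean μ, the covariance of the mixture is (1 - ε + εκ) Σ, so the sample covariance of draws$x should be close to 1.45 times the identity here.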

The number of populations, K, is determined from the length of the vector of sample sizes, n. The mean vectors and covariance matrices can each be given in a list of length K. If one covariance matrix is given (as a matrix or a list having 1 element), then all populations share this common covariance matrix. The same logic applies to population means.

The contamination probabilities in epsilon can be given as a numeric vector or a single value, in which case the value is replicated K times. The same idea applies to the scale contamination in the kappa argument.
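The recycling rules for mean, cov, epsilon, and kappa can be sketched in base R as follows. The helper names recycle_to_list and recycle_to_vector are hypothetical, introduced only for illustration, and the package's internal argument handling may differ.

```r
# Hypothetical helpers illustrating the recycling rules; not sortinghat's code.

# Expand a single vector/matrix (or a one-element list) to a list of length K.
recycle_to_list <- function(arg, K) {
  if (!is.list(arg)) arg <- list(arg)
  if (length(arg) == 1) arg <- rep(arg, K)
  stopifnot(length(arg) == K)
  arg
}

# Expand a single numeric value to a vector of length K.
recycle_to_vector <- function(arg, K) {
  if (length(arg) == 1) arg <- rep(arg, K)
  stopifnot(length(arg) == K)
  arg
}

n <- c(10, 10, 10)   # sample sizes; their count fixes K
K <- length(n)

cov_list <- recycle_to_list(diag(2), K)        # one shared covariance matrix
epsilon  <- recycle_to_vector(0.05, K)         # one shared contamination probability
kappa    <- recycle_to_vector(c(2, 5, 10), K)  # already length K, left unchanged

length(cov_list)
epsilon
```

An argument whose length is neither 1 nor K fails the stopifnot() check in this sketch. Whether sortinghat reports such a mismatch in the same way is not stated in this page.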

By default, epsilon is a vector of zeros, and kappa is a vector of ones. Hence, no contamination is applied by default.
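To see why the defaults amount to no contamination, substitute them into the mixture given under Details. Either default on its own is enough to collapse the mixture to a single normal distribution:

```latex
\begin{align*}
\varepsilon_k = 0:\quad & (1 - 0)\, N_p(\mu_k, \Sigma_k) + 0 \cdot N_p(\mu_k, \kappa_k \Sigma_k) = N_p(\mu_k, \Sigma_k), \\
\kappa_k = 1:\quad & (1 - \varepsilon_k)\, N_p(\mu_k, \Sigma_k) + \varepsilon_k\, N_p(\mu_k, \Sigma_k) = N_p(\mu_k, \Sigma_k).
\end{align*}
```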

Value

A named list containing:

x:

A matrix whose rows are the observations generated and whose columns are the p features (variables)

y:

A vector denoting the population from which the observation in each row was generated.
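As a rough illustration of how the two components fit together, the sketch below computes per-population feature means from a list of this shape. The object sim is a hand-built stand-in, not actual output of simdata_contaminated, and it assumes y holds integer population labels; the page does not state the exact type of y.

```r
# Stand-in with the same shape as the documented return value:
# x is an observations-by-features matrix, y labels each row's population.
set.seed(42)
sim <- list(
  x = rbind(matrix(rnorm(20, mean = 1), nrow = 10, ncol = 2),
            matrix(rnorm(20, mean = 3), nrow = 10, ncol = 2)),
  y = rep(1:2, each = 10)   # assumed label type: integer
)

# Group row indices by population, then average each feature within a group.
idx_by_pop <- split(seq_len(nrow(sim$x)), sim$y)
pop_means <- sapply(idx_by_pop, function(idx) {
  colMeans(sim$x[idx, , drop = FALSE])
})
pop_means   # one column per population, one row per feature
```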

Examples

# Generates 10 observations from each of two multivariate contaminated normal
# populations with equal covariance matrices. Each population has a
# contamination probability of 0.05 and scale contamination of 10.
mean_list <- list(c(1, 0), c(0, 1))
cov_identity <- diag(2)
data <- simdata_contaminated(n = c(10, 10), mean = mean_list,
                             cov = cov_identity, epsilon = 0.05, kappa = 10,
                             seed = 42)
dim(data$x)
table(data$y)

# Generates 10 observations from each of three multivariate contaminated
# normal populations with unequal covariance matrices. The contamination
# probabilities and scales differ for each population as well.
set.seed(42)
mean_list <- list(c(-3, -3), c(0, 0), c(3, 3))
cov_list <- list(cov_identity, 2 * cov_identity, 3 * cov_identity)
data2 <- simdata_contaminated(n = c(10, 10, 10), mean = mean_list,
                              cov = cov_list, epsilon = c(0.05, 0.1, 0.2),
                              kappa = c(2, 5, 10))
dim(data2$x)
table(data2$y)

Example output

[1] 20  2

1  2
10 10
[1] 30  2

1  2  3
10 10 10


sortinghat documentation built on May 30, 2017, 4:52 a.m.