View source: R/simdata-contaminated-normal.r
We generate n_k observations (k = 1, …, K) from each of K multivariate contaminated normal distributions. Let N_p(μ, Σ) denote the p-dimensional multivariate normal distribution with mean vector μ and positive-definite covariance matrix Σ. Then the kth population has the p-dimensional multivariate contaminated normal distribution
(1 - ε_k) N_p(μ_k, Σ_k) + ε_k N_p(μ_k, κ_k Σ_k),
where ε_k ∈ [0, 1] is the probability of sampling from the contaminated component (i.e., of drawing an outlier) and κ_k ≥ 1 determines the amount of scale contamination. The contaminated normal distribution can be viewed as a mixture of two multivariate normal distributions with a common mean, where the second has a scaled covariance matrix, which can introduce extreme outliers for sufficiently large κ_k.
n: a vector (of length K) of the sample sizes for each population
mean: a vector or a list (of length K) of mean vectors
cov: a symmetric matrix or a list (of length K) of symmetric covariance matrices
epsilon: a vector (of length K) indicating the probability of sampling from the contaminated component (i.e., an outlier) for each population
kappa: a vector (of length K) that determines the amount of scale contamination for each population
seed: seed for random number generation (If
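The mixture definition suggests a two-stage way to draw from a single contaminated normal population: first decide whether each observation is contaminated, then draw it from the matching normal component. The following is a minimal base-R sketch of that idea, not the package's implementation; the function name rcontam_sketch and its return value are hypothetical and chosen only for illustration.

```r
# Hypothetical helper, not the package's implementation: draws n observations
# from (1 - epsilon) N_p(mu, Sigma) + epsilon N_p(mu, kappa * Sigma).
rcontam_sketch <- function(n, mu, Sigma, epsilon = 0, kappa = 1) {
  p <- length(mu)
  # Stage 1: flag each observation as contaminated with probability epsilon.
  contaminated <- rbinom(n, size = 1, prob = epsilon) == 1
  # Stage 2: draw N_p(0, Sigma) noise using the Cholesky factor R of Sigma,
  # where t(R) %*% R equals Sigma.
  z <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% chol(Sigma)
  # Scaling a row by sqrt(kappa) multiplies its covariance by kappa.
  z <- z * ifelse(contaminated, sqrt(kappa), 1)
  # Shift every row by the common mean vector.
  x <- sweep(z, 2, mu, "+")
  list(x = x, contaminated = contaminated)
}

set.seed(42)
draw <- rcontam_sketch(n = 1000, mu = c(1, 0), Sigma = diag(2),
                       epsilon = 0.05, kappa = 10)
dim(draw$x)              # rows are observations, columns are features
mean(draw$contaminated)  # empirical contamination rate, close to epsilon
```

Because both components share the mean μ_k, only the spread of the flagged rows changes, which is why a large kappa produces outliers rather than a shifted cluster.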
The number of populations, K, is determined from the length of the vector of sample sizes, n. The mean vectors and covariance matrices can each be given in a list of length K. If one covariance matrix is given (as a matrix or a list having 1 element), then all populations share this common covariance matrix. The same logic applies to population means.
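The sharing rule for means and covariance matrices can be sketched as below. This is an illustration of the described behavior under stated assumptions, not the package's own argument handling; recycle_to_list is a hypothetical helper name.

```r
# Hypothetical helper illustrating the sharing rule; the package's own
# argument handling may differ.
recycle_to_list <- function(arg, K) {
  # Wrap a bare vector or matrix so it is treated like a 1-element list.
  if (!is.list(arg)) arg <- list(arg)
  # A single element is shared by all K populations.
  if (length(arg) == 1) arg <- rep(arg, K)
  stopifnot(length(arg) == K)
  arg
}

n <- c(10, 10, 10)
K <- length(n)  # the number of populations comes from the sample sizes

cov_list  <- recycle_to_list(diag(2), K)  # one matrix, shared by all
mean_list <- recycle_to_list(list(c(-3, -3), c(0, 0), c(3, 3)), K)
length(cov_list)
length(mean_list)
```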
The contamination probabilities in epsilon can be given as a numeric vector or a single value, in which case the value is replicated for each population. The same idea applies to the scale contamination in kappa. By default, epsilon is a vector of zeros, and kappa is a vector of ones. Hence, no contamination is applied unless these arguments are set.
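Substituting the default values ε_k = 0 and κ_k = 1 into the mixture shows why the defaults give an ordinary multivariate normal sample:

```latex
(1 - 0)\, N_p(\mu_k, \Sigma_k) + 0 \cdot N_p(\mu_k, 1 \cdot \Sigma_k) = N_p(\mu_k, \Sigma_k)
```

Either default alone is enough: with ε_k = 0 the second component is never sampled, and with κ_k = 1 the two components are identical.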
named list containing:
x: a matrix whose rows are the observations generated and whose columns are the p features (variables)
y: a vector denoting the population from which the observation in each row was generated
# Generates 10 observations from each of two multivariate contaminated normal
# populations with equal covariance matrices. Each population has a
# contamination probability of 0.05 and scale contamination of 10.
mean_list <- list(c(1, 0), c(0, 1))
cov_identity <- diag(2)
data <- simdata_contaminated(n = c(10, 10), mean = mean_list,
                             cov = cov_identity, epsilon = 0.05, kappa = 10,
                             seed = 42)
dim(data$x)
table(data$y)

# Generates 10 observations from each of three multivariate contaminated
# normal populations with unequal covariance matrices. The contamination
# probabilities and scales differ for each population as well.
set.seed(42)
mean_list <- list(c(-3, -3), c(0, 0), c(3, 3))
cov_list <- list(cov_identity, 2 * cov_identity, 3 * cov_identity)
data2 <- simdata_contaminated(n = c(10, 10, 10), mean = mean_list,
                              cov = cov_list, epsilon = c(0.05, 0.1, 0.2),
                              kappa = c(2, 5, 10))
dim(data2$x)
table(data2$y)
Output of dim(data$x):
20 2
Output of table(data$y):
 1  2
10 10
Output of dim(data2$x):
30 2
Output of table(data2$y):
 1  2  3
10 10 10