View source: R/simdata-contaminated-normal.r
We generate n_k observations (k = 1, …, K) from each of K multivariate contaminated normal distributions. Let N_p(μ, Σ) denote the p-dimensional multivariate normal distribution with mean vector μ and positive-definite covariance matrix Σ. Then, let the kth population have a p-dimensional multivariate contaminated normal distribution:
n | a vector (of length K) of the sample sizes for each population
mean | a vector or a list (of length K) of mean vectors
cov | a symmetric matrix or a list (of length K) of symmetric covariance matrices
epsilon | a vector (of length K) indicating the probability of sampling a contaminated population (i.e., outlier) for each population
kappa | a vector (of length K) that determines the amount of scale contamination for each population
seed | seed for random number generation (If
(1 - ε_k) N_p(μ_k, Σ_k) + ε_k N_p(μ_k, κ_k Σ_k),
where ε_k ∈ [0, 1] is the probability of sampling from a contaminated population (i.e., outlier) and κ_k ≥ 1 determines the amount of scale contamination. The contaminated normal distribution can be viewed as a mixture of two multivariate normal distributions with a common mean, where the second has a scaled covariance matrix, which can introduce extreme outliers for sufficiently large κ_k.
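The mixture view above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name `rcontam_sketch` and its return structure are hypothetical, and a Cholesky factor is used to produce the normal draws.

```r
# Hypothetical helper, not the package's implementation: draws n observations
# from (1 - eps) N_p(mu, Sigma) + eps N_p(mu, kappa * Sigma) using base R only.
rcontam_sketch <- function(n, mu, Sigma, eps = 0, kappa = 1) {
  p <- length(mu)
  # TRUE marks a draw from the contaminated (inflated-covariance) component.
  contaminated <- runif(n) < eps
  # Standard normal draws, transformed by the Cholesky factor of Sigma
  # so that each row has covariance Sigma.
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  x <- z %*% chol(Sigma)
  # Scaling the covariance by kappa scales the deviations by sqrt(kappa);
  # the length-n vector is recycled down each column, i.e. applied per row.
  x <- x * ifelse(contaminated, sqrt(kappa), 1)
  # Shift every row by the mean vector.
  x <- sweep(x, 2, mu, "+")
  list(x = x, contaminated = contaminated)
}

set.seed(1)
draw <- rcontam_sketch(n = 1000, mu = c(1, 0), Sigma = diag(2),
                       eps = 0.05, kappa = 10)
dim(draw$x)
mean(draw$contaminated)
```

Because both components share the mean μ_k, the covariance of the mixture is (1 + ε_k(κ_k - 1)) Σ_k, so with ε_k = 0.05 and κ_k = 10 the overall covariance is 1.45 Σ_k.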
The number of populations, K, is determined from the length of the vector of sample sizes, n. The mean vectors and covariance matrices can each be given in a list of length K. If one covariance matrix is given (as a matrix or a list having 1 element), then all populations share this common covariance matrix. The same logic applies to population means.
The contamination probabilities in epsilon can be given as a numeric vector or a single value, in which case the value is replicated K times. The same idea applies to the scale contamination in the kappa argument.
By default, epsilon is a vector of zeros, and kappa is a vector of ones. Hence, no contamination is applied by default.
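The recycling rules described above can be illustrated with a short sketch. The helper `recycle_to_k` is hypothetical and only mirrors the behaviour described in the text; the package's own argument handling may differ.

```r
# Hypothetical helper illustrating the recycling rule; the package's own
# argument handling may differ.
recycle_to_k <- function(arg, K) {
  # Wrap a bare vector or matrix so everything below works on a list.
  if (!is.list(arg)) arg <- list(arg)
  # One element supplied: every population shares it.
  if (length(arg) == 1) arg <- rep(arg, K)
  stopifnot(length(arg) == K)
  arg
}

n <- c(10, 10, 10)
K <- length(n)                        # number of populations
cov_list <- recycle_to_k(diag(2), K)  # one matrix shared by all K populations
epsilon  <- rep_len(0.05, K)          # single value replicated K times
length(cov_list)
epsilon
```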
The function returns a named list containing:
x | a matrix whose rows are the observations generated and whose columns are the p features (variables)
y | a vector denoting the population from which the observation in each row was generated
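As a rough illustration of working with a result of this shape, the sketch below computes per-population sample means. The list `sim` is a hand-made stand-in with made-up numbers, and the element names `x` and `y` are taken from how the examples access the result (`data$x`, `data$y`); whether the labels are stored as a factor or a plain vector is an assumption left open here.

```r
# Toy stand-in with the same shape as the described return value;
# the numbers are made up for illustration.
sim <- list(
  x = matrix(c( 1.2,  0.1,
                0.8, -0.3,
                0.1,  1.1,
               -0.2,  0.9), ncol = 2, byrow = TRUE),
  y = c(1, 1, 2, 2)
)

# Per-population sample means: one row per population, one column per feature.
# rowsum() adds up the rows within each population; dividing by the group
# sizes turns the sums into means.
pop_means <- rowsum(sim$x, sim$y) / as.vector(table(sim$y))
pop_means
```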
# Generates 10 observations from each of two multivariate contaminated normal
# populations with equal covariance matrices. Each population has a
# contamination probability of 0.05 and scale contamination of 10.
mean_list <- list(c(1, 0), c(0, 1))
cov_identity <- diag(2)
data <- simdata_contaminated(n = c(10, 10), mean = mean_list,
                             cov = cov_identity, epsilon = 0.05, kappa = 10,
                             seed = 42)
dim(data$x)
table(data$y)

# Generates 10 observations from each of three multivariate contaminated
# normal populations with unequal covariance matrices. The contamination
# probabilities and scales differ for each population as well.
set.seed(42)
mean_list <- list(c(-3, -3), c(0, 0), c(3, 3))
cov_list <- list(cov_identity, 2 * cov_identity, 3 * cov_identity)
data2 <- simdata_contaminated(n = c(10, 10, 10), mean = mean_list,
                              cov = cov_list, epsilon = c(0.05, 0.1, 0.2),
                              kappa = c(2, 5, 10))
dim(data2$x)
table(data2$y)
[1] 20 2
1 2
10 10
[1] 30 2
1 2 3
10 10 10