sim_student: Generates random variates from multivariate Student's t...
In ramhiser/clusteval: Evaluation of Clustering Algorithms


We generate n_k observations (k = 1, …, K_0) from each of K_0 multivariate Student's t distributions such that the Euclidean distance between each of the means and the origin is equal and scaled by Δ ≥ 0.

sim_student(n = rep(25, 5), p = 50, df = rep(6, 5), delta = 0,
  Sigma = diag(p), seed = NULL)

`n`	a vector (of length K_0) of the sample sizes for each population
`p`	the dimension of the multivariate Student's t distributions
`df`	a vector (of length K_0) of the degrees of freedom for each population
`delta`	the fixed distance between each population and the origin
`Sigma`	the common covariance matrix
`seed`	seed for random number generation (If NULL, does not set seed)

Let Π_k denote the kth population with a p-dimensional multivariate Student's t distribution, T_p(μ_k, Σ_k, c_k), where μ_k is the population location vector, Σ_k is the positive-definite covariance matrix, and c_k is the degrees of freedom.

Let e_m be the mth standard basis vector (i.e., the mth element is 1 and the remaining elements are 0). Then, we define

μ_k = Δ ∑_{j=1}^{p/K_0} e_{(p/K_0)(k-1) + j}.

Note that p must be divisible by K_0. With the default p = 50 and K_0 = 5, the first 10 elements of μ_1 are set to Δ and all remaining elements to 0, elements 11 through 20 of μ_2 are set to Δ and all remaining elements to 0, and so on.
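The block structure of the location vectors can be sketched in base R as follows. The helper `make_means` is hypothetical and not part of clusteval; it only illustrates the formula above, assuming p is divisible by K_0. Because each μ_k has p/K_0 elements equal to Δ, its Euclidean distance from the origin is Δ · sqrt(p/K_0).

```r
# Hypothetical helper (not part of clusteval): builds the K_0 x p matrix whose
# kth row is mu_k = delta * (sum of the kth block of p / K_0 basis vectors).
make_means <- function(p, K_0, delta) {
  stopifnot(p %% K_0 == 0)                 # p must be divisible by K_0
  block <- p / K_0                         # number of nonzero elements per mean
  mu <- matrix(0, nrow = K_0, ncol = p)
  for (k in seq_len(K_0)) {
    idx <- (block * (k - 1) + 1):(block * k)
    mu[k, idx] <- delta
  }
  mu
}

mu <- make_means(p = 50, K_0 = 5, delta = 2)
which(mu[2, ] != 0)      # elements 11 through 20 of mu_2 are nonzero
sqrt(rowSums(mu^2))      # every mean lies delta * sqrt(p / K_0) from the origin
```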

We use a common covariance matrix Σ_k = Σ for all populations.
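One standard way to draw from T_p(μ, Σ, c) is to divide a multivariate normal draw by the square root of an independent scaled chi-squared variate. The sketch below uses only base R; the helper `r_student` is hypothetical, and whether sim_student generates its variates this way is an assumption, not something stated here. Under this construction Σ acts as a scale matrix, so for c > 2 the covariance of the draws is c / (c - 2) · Σ.

```r
# Hypothetical helper (not part of clusteval): draws n rows from T_p(mu, Sigma, df)
# as mu + z / sqrt(w / df), where z ~ N_p(0, Sigma) and w ~ chi-squared(df).
r_student <- function(n, mu, Sigma, df) {
  p <- length(mu)
  R <- chol(Sigma)                             # upper triangular, Sigma = t(R) %*% R
  z <- matrix(rnorm(n * p), nrow = n) %*% R    # each row is a N_p(0, Sigma) draw
  w <- rchisq(n, df = df)                      # one mixing variate per observation
  sweep(z / sqrt(w / df), 2, mu, "+")          # rescale each row, then shift by mu
}

set.seed(42)
p <- 10
mu_1 <- c(rep(2, 2), rep(0, 8))                # mu_1 for p = 10, K_0 = 5, delta = 2
x <- r_student(n = 25, mu = mu_1, Sigma = diag(p), df = 6)
dim(x)                                         # 25 observations, 10 features
```

Passing the same `Sigma` for every population while varying only `mu` and `df` mirrors the common-covariance setup described above.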

For small values of c_k, the tails are heavier, and, therefore, the average number of outlying observations is increased.
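The effect of the degrees of freedom on the tails can be illustrated with the univariate Student's t distribution. This is a simplified sketch, using a cutoff of 3 chosen purely for illustration: it compares the two-sided tail probability P(|T| > 3) for several values of c with that of a standard normal.

```r
# Two-sided tail probability P(|T| > 3) for a univariate Student's t
df_values <- c(2, 6, 30)
tail_prob <- 2 * pt(-3, df = df_values)
names(tail_prob) <- paste0("df=", df_values)

# Same probability under a standard normal, for reference
tail_normal <- 2 * pnorm(-3)

# Tail probabilities shrink toward the normal value as df grows
round(c(tail_prob, normal = tail_normal), 4)
```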

By default, we let K_0 = 5, Δ = 0, Σ_k = I_p, and c_k = 6, k = 1, …, K_0, where I_p denotes the p × p identity matrix. Furthermore, we generate 25 observations from each population by default.

For Δ = 0 and c_k = c, k = 1, …, K_0, the K_0 populations are equal.

A named list containing:

x: A matrix whose rows are the observations generated and whose columns are the p features (variables).
y: A vector denoting the population from which the observation in each row was generated.

# Unequal sample sizes (10, 20, ..., 50) with a fixed seed for reproducibility
data_generated <- sim_student(n = 10 * seq_len(5), seed = 42)
dim(data_generated$x)
table(data_generated$y)

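# Heavier tails (df = 2) with delta = 2 in p = 10 dimensions
data_generated2 <- sim_student(p = 10, delta = 2, df = rep(2, 5))
table(data_generated2$y)
# Per-population sample means; drop = FALSE keeps x[i, ] a matrix
sample_means <- with(data_generated2,
                     tapply(seq_along(y), y, function(i) {
                            colMeans(x[i, , drop = FALSE])
                     }))
# Stack the per-population means into one matrix and print it
(sample_means <- do.call(rbind, sample_means))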