sim_student: Generates random variates from multivariate Student's t...

Description

We generate n_k observations (k = 1, …, K_0) from each of K_0 multivariate Student's t distributions such that every mean lies at the same Euclidean distance from the origin, with that distance scaled by Δ ≥ 0.

Usage

sim_student(n = rep(25, 5), p = 50, df = rep(6, 5), delta = 0,
  Sigma = diag(p), seed = NULL)

Arguments

n

a vector (of length K_0) of the sample sizes for each population

p

the dimension of the multivariate Student's t distributions

df

a vector (of length K_0) of the degrees of freedom for each population

delta

the scaling factor Δ ≥ 0 that fixes the common distance between each population mean and the origin

Sigma

the common covariance matrix

seed

seed for random number generation (if NULL, the seed is not set)

Details

Let Π_k denote the kth population with a p-dimensional multivariate Student's t distribution, T_p(μ_k, Σ_k, c_k), where μ_k is the population location vector, Σ_k is the positive-definite covariance matrix, and c_k is the degrees of freedom.
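
As an illustration of what a draw from T_p(μ, Σ, c) looks like, the sketch below uses the standard normal/chi-squared mixture: X = μ + Z / sqrt(W / c), with Z ~ N_p(0, Σ) and W ~ χ²_c. The helper name r_mvt_sketch is hypothetical and is not part of clusteval; this page does not say how sim_student draws its samples. Under this construction Σ acts as a scale matrix, and for c > 2 the covariance of X is c / (c - 2) · Σ.

```r
# Hypothetical helper (not part of clusteval): draw n observations
# from T_p(mu, Sigma, df) via the normal / chi-squared mixture.
r_mvt_sketch <- function(n, mu, Sigma, df) {
  p <- length(mu)
  R <- chol(Sigma)                              # upper triangular, t(R) %*% R == Sigma
  Z <- matrix(rnorm(n * p), nrow = n) %*% R     # rows are N_p(0, Sigma) draws
  w <- rchisq(n, df = df) / df                  # one mixing variable per observation
  sweep(Z / sqrt(w), 2, mu, "+")                # rescale row i by w[i], then shift by mu
}

set.seed(1)
x <- r_mvt_sketch(n = 25, mu = rep(0, 4), Sigma = diag(4), df = 6)
dim(x)  # 25 rows (observations), 4 columns (features)
```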

Let e_k be the kth standard basis vector (i.e., the kth element is 1 and the remaining elements are 0). Then, we define

μ_k = Δ ∑_{j=1}^{p/K_0} e_{(p/K_0)(k-1) + j}.

Note that p must be divisible by K_0. By default (p = 50 and K_0 = 5), the first 10 elements of μ_1 are set to Δ with all remaining elements set to 0, elements 11 through 20 of μ_2 are set to Δ with all remaining elements set to 0, and so on.
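
The block structure of the means can be sketched as follows. The helper make_means is hypothetical, written only to mirror the formula for μ_k above, and is not a function exported by clusteval.

```r
# Hypothetical helper (not part of clusteval): build a K0 x p matrix
# whose kth row is mu_k as defined in the formula above.
make_means <- function(p = 50, K0 = 5, delta = 1) {
  stopifnot(p %% K0 == 0)                  # p must be divisible by K0
  block <- p / K0                          # number of nonzero elements per mean
  mu <- matrix(0, nrow = K0, ncol = p)
  for (k in seq_len(K0)) {
    idx <- block * (k - 1) + seq_len(block)
    mu[k, idx] <- delta                    # kth block of elements is set to delta
  }
  mu
}

mu <- make_means(p = 50, K0 = 5, delta = 2)
which(mu[2, ] != 0)    # elements 11 through 20
sqrt(rowSums(mu^2))    # every mean lies delta * sqrt(p / K0) from the origin
```

Each μ_k has p/K_0 elements equal to Δ, so its Euclidean norm is Δ · sqrt(p/K_0): the same for every population and proportional to Δ.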

We use a common covariance matrix Σ_k = Σ for all populations.

Smaller values of c_k give heavier tails and therefore, on average, more outlying observations.
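
To give a rough sense of this effect, the sketch below compares two-sided tail probabilities of a univariate Student's t at an arbitrary cutoff of 3 for several degrees of freedom, with the standard normal as a reference. The cutoff and the helper name tail_prob are illustrative choices, not part of the package.

```r
# Two-sided tail probability P(|T| > cutoff) for a univariate Student's t.
tail_prob <- function(df, cutoff = 3) 2 * pt(-cutoff, df = df)

dfs <- c(2, 6, 30)
probs <- sapply(dfs, tail_prob)
names(probs) <- paste0("df=", dfs)
probs              # decreases as the degrees of freedom grow
2 * pnorm(-3)      # standard normal reference, smaller than all of the above
```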

By default, we let K_0 = 5, Δ = 0, Σ_k = I_p, and c_k = 6, k = 1, …, K_0, where I_p denotes the p × p identity matrix. Furthermore, we generate 25 observations from each population by default.

For Δ = 0 and c_k = c, k = 1, …, K_0, the K_0 populations are identically distributed.

Value

named list containing:

x:

A matrix whose rows are the observations generated and whose columns are the p features (variables)

y:

A vector denoting the population from which the observation in each row was generated.

Examples

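# Unequal sample sizes (10, 20, ..., 50) with a fixed seed for reproducibility
data_generated <- sim_student(n = 10 * seq_len(5), seed = 42)
dim(data_generated$x)
table(data_generated$y)

# Heavier tails (df = 2) and separated means (delta = 2) in p = 10 dimensions
data_generated2 <- sim_student(p = 10, delta = 2, df = rep(2, 5))
table(data_generated2$y)

# Per-population sample means: one row per population, one column per feature
sample_means <- with(data_generated2,
                     tapply(seq_along(y), y, function(i) {
                            colMeans(x[i, , drop = FALSE])
                     }))
(sample_means <- do.call(rbind, sample_means))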

ramhiser/clusteval documentation built on May 26, 2019, 10:07 p.m.