# sim_student: Generates random variates from multivariate Student's t...

In clusteval: Evaluation of Clustering Algorithms

## Description

We generate n_m observations (m = 1, …, M) from each of M multivariate Student's t distributions. The means are placed so that every mean lies at the same Euclidean distance from the origin, and that distance scales with Δ ≥ 0.

## Usage

    sim_student(n = rep(25, 5), p = 50, df = rep(6, 5), delta = 0,
                Sigma = diag(p), seed = NULL)

## Arguments

| Argument | Description |
| --- | --- |
| `n` | a vector (of length M) of the sample sizes for each population |
| `p` | the dimension of the multivariate Student's t distributions |
| `df` | a vector (of length M) of the degrees of freedom for each population |
| `delta` | the fixed distance between each population and the origin |
| `Sigma` | the common covariance matrix |
| `seed` | seed for random number generation (if `NULL`, does not set seed) |

## Details

Let Π_m denote the mth population with a p-dimensional multivariate Student's t distribution, T_p(μ_m, Σ_m, c_m), where μ_m is the population location vector, Σ_m is the positive-definite covariance matrix, and c_m is the degrees of freedom.

Let e_m be the mth standard basis vector (i.e., the mth element is 1 and the remaining values are 0). Then, we define

μ_m = Δ ∑_{j=1}^{p/M} e_{(p/M)(m-1) + j}.

Note that p must be divisible by M. With the default values p = 50 and M = 5, each mean has p/M = 10 nonzero coordinates: the first 10 coordinates of μ_1 are set to delta and the rest to 0, coordinates 11 through 20 of μ_2 are set to delta and the rest to 0, and so on.
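The block structure of the means can be sketched in base R. The helper `make_means` below is hypothetical and is not part of clusteval; it only mirrors the formula for μ_m given above. It also makes the distance to the origin explicit: each μ_m has p/M coordinates equal to Δ, so its Euclidean norm is Δ √(p/M).

```r
# Hypothetical helper (not part of clusteval): build the M location vectors
# mu_1, ..., mu_M as the rows of an M x p matrix.
make_means <- function(M, p, delta) {
  stopifnot(p %% M == 0)            # p must be divisible by M
  block <- p / M                    # number of nonzero coordinates per mean
  mu <- matrix(0, nrow = M, ncol = p)
  for (m in seq_len(M)) {
    idx <- block * (m - 1) + seq_len(block)
    mu[m, idx] <- delta             # mu_m = delta * sum of e_{block*(m-1)+j}
  }
  mu
}

mu <- make_means(M = 5, p = 50, delta = 2)
which(mu[2, ] != 0)                 # coordinates 11 through 20
sqrt(rowSums(mu^2))                 # every norm equals delta * sqrt(p / M)
```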

We use a common covariance matrix Σ_m = Σ for all populations.
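One standard way to draw from T_p(μ, Σ, c) is to scale a multivariate normal draw by an independent chi-squared factor. The sketch below uses only base R; the helper name `rmvt_sketch` is an assumption made for illustration, and the page does not say whether `sim_student` samples this way internally. Under this construction Σ acts as the scale matrix of the distribution, so for c > 2 the covariance of the draws is c/(c - 2) Σ.

```r
# Hypothetical helper (not part of clusteval): draw n observations from a
# p-dimensional Student's t distribution with location mu, matrix Sigma,
# and df degrees of freedom.
rmvt_sketch <- function(n, mu, Sigma, df) {
  p <- length(mu)
  R <- chol(Sigma)                              # Sigma = t(R) %*% R
  z <- matrix(rnorm(n * p), nrow = n) %*% R     # rows are N(0, Sigma) draws
  w <- sqrt(df / rchisq(n, df = df))            # one scale factor per row
  sweep(z * w, 2, mu, "+")                      # scale each row, then shift by mu
}

set.seed(42)
x <- rmvt_sketch(n = 25, mu = c(rep(2, 2), rep(0, 8)), Sigma = diag(10), df = 6)
dim(x)                                          # 25 rows, 10 columns
```

Because `w` has one entry per observation, every coordinate of a given row is inflated or shrunk together, which is what distinguishes a multivariate t draw from p independent univariate t draws.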

For small values of c_m, the tails are heavier, and, therefore, the average number of outlying observations is increased.
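The effect of the degrees of freedom on the tails can be checked in the univariate case with base R's `pt`. This is only an illustration for a single coordinate with unit scale, not a statement about any particular multivariate outlier rule; the threshold of 3 is an arbitrary choice.

```r
# Probability that a univariate Student's t variate exceeds 3 in absolute value
dfs <- c(2, 6, 30)
tail_prob <- 2 * pt(-3, df = dfs)     # P(|T| > 3) for each df
names(tail_prob) <- paste0("df=", dfs)
tail_prob                             # decreases as df grows
2 * pnorm(-3)                         # standard normal value, for comparison
```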

By default, we let M = 5, Δ = 0, Σ_m = I_p, and c_m = 6, m = 1, …, M, where I_p denotes the p × p identity matrix. Furthermore, we generate 25 observations from each population by default.

For Δ = 0 and c_m = c, m = 1, …, M, the M populations are equal.

## Value

A named list containing:

x:

A matrix whose rows are the observations generated and whose columns are the p features (variables)

y:

A vector denoting the population from which the observation in each row was generated.

## Examples

    data_generated <- sim_student(n = 10 * seq_len(5), seed = 42)
    dim(data_generated$x)
    table(data_generated$y)

    data_generated2 <- sim_student(p = 10, delta = 2, df = rep(2, 5))
    table(data_generated2$y)
    sample_means <- with(data_generated2,
                         tapply(seq_along(y), y, function(i) {
                           colMeans(x[i, ])
                         }))
    (sample_means <- do.call(rbind, sample_means))

clusteval documentation built on May 29, 2017, 11:45 p.m.