SimData: SimData
In brodsa/wrsk: Robust (weighted) and sparse k-means clustering

This function generates a synthetic dataset for data clustering in which both outliers and noise variables are present.

SimData(size_grp, p_inf, p_noise = NULL, p_out_inf = NULL, pct_out = 0.1,
  scatter_out = TRUE, p_out_noise = NULL, noise_pct_out = 0.1,
  unif_out_range = NULL, mu_grp_range = NULL, s_out_range = NULL,
  rho_grp_range = NULL)

`size_grp`	A numeric vector containing the group sizes to be generated. The length of the vector corresponds to the number of groups to be generated.
`p_inf`	The number of informative variables describing the generated group structure.
`p_noise`	The number of noise variables which do not contribute to the group separation. If not specified, no noise variables are generated.
`p_out_inf`	The number of informative variables in which observations are contaminated, i.e. replaced by either scatter outliers or uniformly distributed outliers, see `scatter_out`.
`pct_out`	The proportion of observations to be contaminated in the informative variables, default is 0.10.
`scatter_out`	If `TRUE`, scattered outliers are generated with the characteristics specified in `s_out_range`, otherwise uniformly distributed outliers are produced with the specification defined in `unif_out_range`.
`p_out_noise`	The number of noise variables in which the contamination is conducted, see `unif_out_range`.
`noise_pct_out`	The proportion of observations to be contaminated in noise variables by replacing them with uniformly distributed outliers. The contaminated observations differ from those contaminated in informative variables, default is 0.10.
`unif_out_range`	Optional argument. Sets the intervals of the uniform distribution from which outliers are randomly generated, namely `[min1,max1]` or `[min2,max2]`. The specification has to be given as a list, default is `list(min1=-12,max1=-6,min2=6,max2=12)`.
`mu_grp_range`	Optional argument, see references. Default is `list(min1=-6,max1=-3,min2=3, max2=6)`.
`s_out_range`	Optional argument, see references. Default is `list(min=3,max=10)`.
`rho_grp_range`	Optional argument, see references. Default is `list(min=0.1,max=0.9)`.
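
The two-interval format of `unif_out_range` can be illustrated with a small base R sketch. The helper `runif_two_intervals` is hypothetical and not part of the package, and the assumption that each interval is picked with equal probability is ours; the package may sample differently.

```r
# Hypothetical helper, not part of wrsk: draw n values from the union of
# two intervals given in the same list format as `unif_out_range`.
runif_two_intervals <- function(n, range = list(min1 = -12, max1 = -6,
                                                min2 = 6, max2 = 12)) {
  # Assumption: each interval is chosen with probability 1/2.
  use_first <- runif(n) < 0.5
  lower <- ifelse(use_first, range$min1, range$min2)
  upper <- ifelse(use_first, range$max1, range$max2)
  # runif() is vectorized over min and max.
  runif(n, min = lower, max = upper)
}

set.seed(1)
out_vals <- runif_two_intervals(1000)
range(abs(out_vals))  # all values lie between 6 and 12 in absolute value
```

With the default list, the generated values never fall in the interval (-6, 6), so they sit away from the bulk of data centered near the default `mu_grp_range` values.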

Groups are generated in the first `p_inf` variables, which are informative, following Gaussian models with varying characteristics. The groups have different mean vectors and covariance matrices, and each covariance matrix is additionally randomly rotated. If uninformative variables are requested, `p_noise` noise variables are generated from uniform distributions and appended to the informative part. Two types of outliers, scattered and uniformly distributed, are used to contaminate the data. The outliers can be placed in the informative part, the uninformative part, or both.
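
The generation scheme described above can be sketched in base R as follows. This is only an illustration and not the code in R/SimData.R: the equicorrelation structure, the QR-based rotation, the way the means are drawn, and the `[-1, 1]` range of the noise variables are all assumptions, and the helpers `random_rotation` and `gen_group` are hypothetical.

```r
set.seed(42)
size_grp <- c(40, 40, 40)  # group sizes
p_inf <- 5                 # informative variables
p_noise <- 10              # noise variables

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
random_rotation <- function(p) qr.Q(qr(matrix(rnorm(p * p), p, p)))

gen_group <- function(n, p) {
  # Assumption: each mean coordinate lies in [-6,-3] or [3,6],
  # mirroring the default `mu_grp_range`.
  mu <- runif(p, 3, 6) * sample(c(-1, 1), p, replace = TRUE)
  # Assumption: equicorrelation matrix with rho drawn from the
  # default `rho_grp_range`.
  rho <- runif(1, 0.1, 0.9)
  sigma <- matrix(rho, p, p)
  diag(sigma) <- 1
  # Rotate the covariance matrix; it stays positive definite.
  rot <- random_rotation(p)
  sigma_rot <- rot %*% sigma %*% t(rot)
  # Standard normal draws transformed to covariance sigma_rot, then shifted.
  z <- matrix(rnorm(n * p), n, p)
  sweep(z %*% chol(sigma_rot), 2, mu, "+")
}

x_inf <- do.call(rbind, lapply(size_grp, gen_group, p = p_inf))
y <- rep(seq_along(size_grp), times = size_grp)

# Noise variables: uniform, carrying no group information.
x_noise <- matrix(runif(sum(size_grp) * p_noise, -1, 1), ncol = p_noise)
x <- cbind(x_inf, x_noise)
dim(x)  # 120 rows, 15 columns
```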

`x`	A data matrix of a synthetic dataset.
`y`	An integer vector corresponding to a group membership before contamination.
`lb`	An integer vector of group labels in which outliers are labeled 0.
`lbout`	An integer vector of group labels in which outliers in informative variables are labeled 0 and outliers in noise variables are labeled with the number of groups + 1.
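
How the three label vectors relate to each other can be sketched in base R, based on our reading of the descriptions above. The contaminated indices are drawn here purely for illustration, assuming 10% of all observations per outlier type and disjoint index sets; how the package selects them (for example per group) is not stated on this page.

```r
set.seed(7)
size_grp <- c(40, 40, 40)
n <- sum(size_grp)
k <- length(size_grp)

# Group membership before contamination.
y <- rep(seq_len(k), times = size_grp)

# Assumption: 10% of observations contaminated in informative variables,
# and a different 10% contaminated in noise variables.
idx_inf <- sample(n, round(0.1 * n))
idx_noise <- sample(setdiff(seq_len(n), idx_inf), round(0.1 * n))

# lb: every outlier is labeled 0.
lb <- y
lb[c(idx_inf, idx_noise)] <- 0L

# lbout: outliers in informative variables are 0,
# outliers in noise variables are k + 1.
lbout <- y
lbout[idx_inf] <- 0L
lbout[idx_noise] <- k + 1L

table(lb)
table(lbout)
```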

Sarka Brodinova <sarka.brodinova@tuwien.ac.at>

S. Brodinova, P. Filzmoser, T. Ortner, C. Breiteneder, M. Zaharieva. Robust and sparse k-means clustering for high-dimensional data. Submitted for publication, 2017. Available at http://arxiv.org/abs/1709.10012

# Generate 3 groups of equal sizes in the first 50 variables with 10% of
# scatter outliers in all 50 informative variables, and
# 10% of uniformly distributed outliers in 75 noise variables.

d <- SimData(size_grp = c(40, 40, 40), p_inf = 50,
             p_noise = 750, p_out_noise = 75)

# Group membership, with all outliers labeled 0
table(d$lb)

# Scatter outliers labeled 0 and uniformly distributed outliers labeled 4
# (the number of groups + 1)
table(d$lbout)