SimData: SimData



Description

This function generates a synthetic dataset for data clustering in settings where both outliers and noise variables are present.

Usage

SimData(size_grp, p_inf, p_noise = NULL, p_out_inf = NULL, pct_out = 0.1,
  scatter_out = TRUE, p_out_noise = NULL, noise_pct_out = 0.1,
  unif_out_range = NULL, mu_grp_range = NULL, s_out_range = NULL,
  rho_grp_range = NULL)

Arguments

size_grp

A numeric vector containing the group sizes to be generated. The length of the vector corresponds to the number of groups to be generated.

p_inf

The number of informative variables describing the generated group structure.

p_noise

The number of noise variables which do not contribute to the group separation. If not specified, no noise variables are generated.

p_out_inf

The number of informative variables in which observations are contaminated, i.e. replaced by outliers, either scatter outliers or uniformly distributed outliers; see scatter_out.

pct_out

The proportion of observations to be contaminated in the informative variables, default is 0.10.

scatter_out

If TRUE, scattered outliers are generated with the characteristics specified in s_out_range, otherwise uniformly distributed outliers are produced with the specification defined in unif_out_range.

p_out_noise

The number of noise variables in which the contamination is conducted, see unif_out_range.

noise_pct_out

The proportion of observations to be contaminated in the noise variables by replacing them with uniformly distributed outliers; default is 0.10. These observations differ from those contaminated in the informative variables.

unif_out_range

Optional argument. The intervals of the uniform distribution from which outliers are randomly generated, either [min1,max1] or [min2,max2]. The specification has to be given as a list; default is list(min1=-12,max1=-6,min2=6,max2=12).

mu_grp_range

Optional argument, see references. Default is list(min1=-6,max1=-3,min2=3, max2=6).

s_out_range

Optional argument, see references. Default is list(min=3,max=10).

rho_grp_range

Optional argument, see references. Default is list(min=0.1,max=0.9).
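The two-interval specification used by unif_out_range can be illustrated with a small base R sketch. The helper runif_two_intervals is hypothetical and not part of the package, and choosing between the two intervals with equal probability is an assumption; the package may do this differently.

```r
# Hypothetical helper (not part of the package): draw n values uniformly
# from the union of [min1, max1] and [min2, max2].
runif_two_intervals <- function(n, range = list(min1 = -12, max1 = -6,
                                                min2 = 6, max2 = 12)) {
  # Assumption: each draw uses the upper or lower interval with equal probability.
  upper <- runif(n) < 0.5
  lo <- ifelse(upper, range$min2, range$min1)
  hi <- ifelse(upper, range$max2, range$max1)
  runif(n, min = lo, max = hi)
}

set.seed(1)
out <- runif_two_intervals(1000)
range(abs(out))  # absolute values fall between 6 and 12
```

With the default list, every generated value lies well away from zero, so such outliers sit outside the bulk of both the group means and the noise variables.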

Details

Groups are generated in the first p_inf informative variables with various characteristics following Gaussian models. The groups have different mean vectors and covariance matrices, and each covariance matrix is additionally randomly rotated. If uninformative variables are requested, p_noise noise variables are generated from uniform distributions and appended to the informative part. Two types of outliers, scattered and uniformly distributed, are considered for contaminating the data. The outliers can be placed in either the informative or the uninformative part.
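The generation of the clean (uncontaminated) part described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper sim_group, the equicorrelation structure, the way group means are drawn, and the range of the noise variables are all assumptions; the reference gives the actual scheme.

```r
set.seed(42)
size_grp <- c(40, 40, 40)  # group sizes
p_inf    <- 5              # number of informative variables
p_noise  <- 10             # number of noise variables
n        <- sum(size_grp)

# Hypothetical helper: n_g Gaussian observations with mean mu and a
# randomly rotated equicorrelation covariance matrix.
sim_group <- function(n_g, p, mu, rho) {
  sigma <- matrix(rho, p, p)
  diag(sigma) <- 1
  q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))  # random orthogonal matrix
  sigma_rot <- q %*% sigma %*% t(q)          # rotated covariance matrix
  z <- matrix(rnorm(n_g * p), n_g, p)
  sweep(z %*% chol(sigma_rot), 2, mu, "+")   # rows have covariance sigma_rot
}

x_inf <- do.call(rbind, lapply(seq_along(size_grp), function(g) {
  # Assumption: each mean coordinate lies in [-6, -3] or [3, 6].
  mu  <- sample(c(-1, 1), p_inf, replace = TRUE) * runif(p_inf, 3, 6)
  rho <- runif(1, 0.1, 0.9)
  sim_group(size_grp[g], p_inf, mu, rho)
}))

# Assumption: noise variables are uniform on [-3, 3].
x_noise <- matrix(runif(n * p_noise, -3, 3), n, p_noise)

x <- cbind(x_inf, x_noise)                       # informative part first
y <- rep(seq_along(size_grp), times = size_grp)  # group membership
dim(x)
```

Contamination would then replace a fraction of the rows in selected columns of x, which is omitted here.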

Value

x

A data matrix of a synthetic dataset.

y

An integer vector giving the group membership of each observation before contamination.

lb

An integer vector with group labels and outlier labels denoted by 0.

lbout

An integer vector with group labels and labels for outliers in the informative variables (0) and in the noise variables (the number of groups + 1).
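The relationship between the label vectors can be illustrated with toy data. The vectors below are hypothetical values that mimic the documented coding for 3 groups; they are not output of SimData, and the assumption that lb equals lbout with both outlier types collapsed to 0 follows from the descriptions above rather than from the source code.

```r
# Toy label vectors for k = 3 groups (hypothetical, not SimData output).
k     <- 3
y     <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)  # membership before contamination
lbout <- c(1, 0, 1, 2, 2, 4, 3, 0, 3)  # 0: informative outlier, k + 1: noise outlier

# Collapse both outlier types to 0, matching the description of lb.
lb <- ifelse(lbout == k + 1, 0, lbout)

# Cross-tabulate original membership against the outlier-aware labels.
table(group = y, label = lbout)

# Indices of uncontaminated observations, e.g. for evaluating a clustering.
clean <- which(lb != 0)
```

Restricting an evaluation to the clean indices is one common choice when comparing an estimated partition with y.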

Author(s)

Sarka Brodinova <sarka.brodinova@tuwien.ac.at>

References

S. Brodinova, P. Filzmoser, T. Ortner, C. Breiteneder, M. Zaharieva. Robust and sparse k-means clustering for high-dimensional data. Submitted for publication, 2017. Available at http://arxiv.org/abs/1709.10012

Examples

# Generate 3 groups of equal sizes in the first 50 variables with 10% of
# scatter outliers in all 50 informative variables, and
# 10% of uniformly distributed outliers in 75 noise variables.

d <- SimData(size_grp=c(40,40,40),p_inf=50,
p_noise=750,p_out_noise=75)

# group membership with outliers in 0 group
table(d$lb)

# scatter outliers in 0 group and uniformly distributed outliers in 4 group
table(d$lbout)

brodsa/wrsk documentation built on April 7, 2020, 6:12 a.m.