This function generates a synthetic dataset for data clustering in which both outliers and noise variables are present.
size_grp |
A numeric vector containing the group sizes to be generated. The length of the vector corresponds to the number of groups to be generated. |
p_inf |
The number of informative variables describing the generated group structure. |
p_noise |
The number of noise variables which do not contribute to the group separation. If not specified, no noise variables are generated. |
p_out_inf |
The number of informative variables in which the observations are contaminated,
i.e. replaced by either scatter outliers or uniformly distributed outliers, see |
pct_out |
The proportion of observations to be contaminated in the informative variables. Default is 0.10. |
scatter_out |
If |
p_out_noise |
The number of noise variables in which the contamination is conducted,
see |
noise_pct_out |
The proportion of observations to be contaminated in the noise variables by replacing them with uniformly distributed outliers. These observations are different from the ones contaminated in the informative variables. Default is 0.10. |
unif_out_range |
Optional argument. Sets the interval of the uniform distribution
used to randomly generate outliers in |
mu_grp_range |
Optional argument, see references. Default is |
s_out_range |
Optional argument, see references. Default is |
rho_grp_range |
Optional argument, see references. Default is |
Groups are generated in the first p_inf informative variables following
Gaussian models with varying characteristics. The groups have different mean
vectors and covariance matrices, and each covariance matrix is additionally
randomly rotated. If uninformative variables are required, p_noise noise
variables are generated from uniform distributions and appended to the
informative part.
Two types of outliers, scatter outliers and uniformly distributed outliers, are available to contaminate the data.
The outliers can be placed in either the informative or the uninformative part.
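The scheme described above can be illustrated with a short base R sketch. This is not the code behind SimData; it only mimics the stated steps (Gaussian groups with distinct means, a randomly rotated covariance matrix per group, uniform noise variables, uniformly distributed outliers). The helper names random_rotation and gen_group, the group means, the correlations, and both uniform intervals are assumptions chosen for illustration, and scatter outliers are left out.

```r
# Illustrative sketch only; NOT the implementation used by SimData.
# All numeric settings below are assumed values, not package defaults.
set.seed(1)

size_grp <- c(40, 40, 40)  # group sizes
p_inf    <- 5              # number of informative variables
p_noise  <- 10             # number of noise variables
n        <- sum(size_grp)

# Hypothetical helper: random orthogonal matrix via QR of a Gaussian matrix
random_rotation <- function(p) qr.Q(qr(matrix(rnorm(p * p), p, p)))

# Hypothetical helper: one Gaussian group with mean mu and an
# equicorrelated covariance matrix that is randomly rotated
gen_group <- function(n_k, mu, rho) {
  p <- length(mu)
  sigma <- matrix(rho, p, p)
  diag(sigma) <- 1
  q <- random_rotation(p)
  sigma <- q %*% sigma %*% t(q)          # rotated covariance matrix
  z <- matrix(rnorm(n_k * p), n_k, p)
  sweep(z %*% chol(sigma), 2, mu, "+")   # rows follow N(mu, sigma)
}

# Informative part: three groups with different means and correlations
mu_list <- list(rep(-3, p_inf), rep(0, p_inf), rep(3, p_inf))
x_inf <- do.call(rbind, Map(gen_group, size_grp, mu_list, c(0.2, 0.5, 0.8)))
y <- rep(seq_along(size_grp), times = size_grp)

# Noise part: uniform variables appended to the informative part
x_noise <- matrix(runif(n * p_noise, min = -3, max = 3), n, p_noise)
x <- cbind(x_inf, x_noise)

# Replace 10% of the observations in the informative variables
# by uniformly distributed outliers (interval is an assumption)
pct_out <- 0.10
out_idx <- sample(n, size = round(pct_out * n))
x[out_idx, seq_len(p_inf)] <- runif(length(out_idx) * p_inf, min = 10, max = 15)

# Labels: group membership, with contaminated observations set to 0
lb <- y
lb[out_idx] <- 0
```

Here x plays the role of the data matrix, y the group membership before contamination, and lb the labels with outliers coded as 0.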
x |
A data matrix of a synthetic dataset. |
y |
An integer vector corresponding to a group membership before contamination. |
lb |
An integer vector with group labels and outlier labels denoted by 0. |
lbout |
An integer vector with group labels and labels for outliers in informative variables (0) and in noise variables (the number of groups + 1). |
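The label coding of the returned vectors can be illustrated with hand-made values. The vectors below are hypothetical and are not SimData output; the sketch also assumes that lb codes both kinds of outliers as 0, which is how the description of lb reads.

```r
# Hypothetical labels for 3 groups of 4 observations each (not SimData output)
y <- rep(1:3, each = 4)        # group membership before contamination
n_grp <- length(unique(y))     # number of groups, here 3

# lbout: 0 marks outliers in informative variables,
# n_grp + 1 marks outliers in noise variables
lbout <- y
lbout[c(1, 5)] <- 0
lbout[9] <- n_grp + 1

# lb: assumed to mark every outlier with 0, regardless of its type
lb <- lbout
lb[lbout == n_grp + 1] <- 0

# Indices of uncontaminated observations and counts per label
clean_idx <- which(lb != 0)
table(lbout)
```

Subsetting with clean_idx keeps only the observations that were not contaminated, which is convenient when a clustering result should be compared against the original group structure.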
Sarka Brodinova <sarka.brodinova@tuwien.ac.at>
S. Brodinova, P. Filzmoser, T. Ortner, C. Breiteneder, M. Zaharieva. Robust and sparse k-means clustering for high-dimensional data. Submitted for publication, 2017. Available at http://arxiv.org/abs/1709.10012
# Generate 3 groups of equal size in the first 50 variables, with 10% scatter
# outliers in all 50 informative variables and 10% uniformly distributed
# outliers in 75 of the 750 noise variables.
d <- SimData(size_grp = c(40, 40, 40), p_inf = 50,
             p_noise = 750, p_out_noise = 75)
# Group membership, with outliers labelled 0
table(d$lb)
# Scatter outliers labelled 0, uniformly distributed outliers labelled 4
table(d$lbout)