Generate sparse data with outliers using the simulation scheme detailed in Hubert et al. (2016).
m | Number of datasets to generate, default is 100.
n | Number of observations, default is 100.
p | Number of dimensions, default is 10.
a | Numeric vector containing the within-group correlation for each block. The number of useful blocks is thus k = length(a) - 1, which should be at least 2. By default, the correlations are equal to 0.9, 0.5 and 0, respectively.
bLength | Length of the blocks of useful variables, default is 4.
SD | Numeric vector containing the standard deviations of the blocks of variables, default is
eps | Proportion of contamination, should be between 0 and 0.5. Default is 0 (no contamination).
seed | Logical indicating if a seed is used when generating the datasets, default is
Firstly, we generate a correlation matrix that has sparse eigenvectors.
We design the correlation matrix to have length(a) = k+1 groups of variables, with no correlation between variables from different groups. The first k groups consist of bLength variables each. The correlation between different variables within group i is equal to a[i], for i = 1, ..., k. The (k+1)th group contains the remaining p - k \times bLength variables, which we specify to have correlation a[k+1].
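The block structure described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's own code: the names `sizes` and `block` are hypothetical helpers, and the values of `p`, `a` and `bLength` are simply the documented defaults.

```r
# Illustrative inputs (the documented defaults)
p <- 10
a <- c(0.9, 0.5, 0)
bLength <- 4

k <- length(a) - 1                       # number of useful blocks

# Block sizes: k blocks of bLength variables, then the remaining variables
sizes <- c(rep(bLength, k), p - k * bLength)

# Block label of every variable (hypothetical helper vector)
block <- rep(seq_along(a), times = sizes)

# Variables in the same block get correlation a[block]; other pairs get 0
R <- outer(block, block, "==") * a[block]
diag(R) <- 1
```

With these inputs, variables 1 to 4 form the first block, variables 5 to 8 the second, and variables 9 and 10 the remaining group.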
Secondly, the correlation matrix R is transformed into the covariance matrix Σ = V^{0.5} \cdot R \cdot V^{0.5}, where V = diag(SD^2).
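A minimal sketch of this transformation follows. It assumes that each entry of `SD` applies to every variable in the corresponding block, which is one reading of V = diag(SD^2); the `SD` values used here are illustrative only, and `block` and `sdVar` are hypothetical helper names.

```r
# Illustrative inputs; the SD values are chosen for the example only
p <- 10
a <- c(0.9, 0.5, 0)
bLength <- 4
SD <- c(10, 5, 2)

# Rebuild the block-diagonal correlation matrix R
k <- length(a) - 1
block <- rep(seq_along(a), times = c(rep(bLength, k), p - k * bLength))
R <- outer(block, block, "==") * a[block]
diag(R) <- 1

# Assumption: one standard deviation per block, expanded to one per variable
sdVar <- SD[block]
V <- diag(sdVar^2)

# For a diagonal matrix the elementwise square root is the matrix square root
Vhalf <- sqrt(V)
Sigma <- Vhalf %*% R %*% Vhalf
```

Under this assumption, `cov2cor(Sigma)` recovers `R` and the diagonal of `Sigma` holds the per-variable variances.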
Thirdly, the n observations are generated from a p-variate normal distribution with mean the p-variate zero-vector and covariance matrix Σ. Standard normally distributed noise terms are also added to each of the p variables to make the sparse structure of the data harder to detect.
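The sampling step might look like the sketch below, which uses a Cholesky factor and base R only. The small `Sigma` is a stand-in chosen for the example, and adding independent noise directly to the sampled data is an assumption about how the noise enters; the package may do this differently.

```r
set.seed(123)

n <- 100
p <- 4

# Stand-in covariance matrix for the example (not the one built by the package)
Sigma <- matrix(0.9, p, p)
diag(Sigma) <- 1
Sigma <- 25 * Sigma

# chol() returns upper-triangular U with t(U) %*% U = Sigma,
# so the rows of Z %*% U are draws from N(0, Sigma)
Z <- matrix(rnorm(n * p), n, p)
X <- Z %*% chol(Sigma)

# Independent standard normal noise added to every variable
X <- X + matrix(rnorm(n * p), n, p)
```

Because the noise is independent of the signal, the population covariance of `X` in this sketch is `Sigma` plus the identity matrix, which blurs the block structure slightly.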
Finally, (100 \times eps)% of the data points are randomly replaced by outliers. These outliers are generated from a p-variate normal distribution as in Croux et al. (2013).
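The replacement step can be sketched as follows. The outlier mean and standard deviation below are hypothetical placeholders, not the parameters of Croux et al. (2013), and rounding the number of outliers down is also an assumption.

```r
set.seed(42)

n <- 100
p <- 10
eps <- 0.2

# Stand-in for the clean data matrix
X <- matrix(rnorm(n * p), n, p)

# Number of contaminated rows (assumption: rounded down)
nOut <- floor(eps * n)

# Indices of the observations that are replaced
ind <- sort(sample(n, nOut))

# Hypothetical outlier distribution: independent N(10, 1) in every coordinate
X[ind, ] <- matrix(rnorm(nOut * p, mean = 10, sd = 1), nOut, p)
```

The vector `ind` plays the role of the `ind` component in the returned list: it records which rows were contaminated.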
The ith eigenvector of R, for i = 1, ..., k, is a (sparse) vector whose (bLength \times (i-1)+1)th through (bLength \times i)th elements are equal to 1/\sqrt{bLength} and whose other elements are all zero.
See Hubert et al. (2016) for more details.
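The eigenvector claim can be checked numerically with `eigen()`. The sketch below rebuilds R for the documented default values of `p`, `a` and `bLength` and compares the first k eigenvectors with the stated sparse vectors; `block`, `e` and `dev` are hypothetical names. For an equicorrelated block with correlation a[i], the corresponding eigenvalue is 1 + (bLength - 1) * a[i], so the ordering holds when the useful correlations are decreasing, as they are by default.

```r
p <- 10
a <- c(0.9, 0.5, 0)
bLength <- 4

# Rebuild the block-diagonal correlation matrix R
k <- length(a) - 1
block <- rep(seq_along(a), times = c(rep(bLength, k), p - k * bLength))
R <- outer(block, block, "==") * a[block]
diag(R) <- 1

e <- eigen(R, symmetric = TRUE)

# Largest absolute deviation between each computed eigenvector and the
# stated sparse vector (eigenvectors are only defined up to sign)
dev <- sapply(seq_len(k), function(i) {
  v <- numeric(p)
  v[(bLength * (i - 1) + 1):(bLength * i)] <- 1 / sqrt(bLength)
  max(abs(abs(e$vectors[, i]) - v))
})
```

Values of `dev` close to machine precision indicate that the leading eigenvectors have the sparse form described above.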
A list with components:
data | List of length m containing all data matrices.
ind | List of length m containing the numeric vectors with the indices of the contaminated observations.
R | Correlation matrix of the data, a numeric matrix of size p by p.
Sigma | Covariance matrix of the data (Σ), a numeric matrix of size p by p.
Tom Reynkens
Hubert, M., Reynkens, T., Schmitt, E. and Verdonck, T. (2016). “Sparse PCA for High-Dimensional Data with Outliers,” Technometrics, 58, 424–434.
Croux, C., Filzmoser, P. and Fritz, H. (2013). “Robust Sparse Principal Component Analysis,” Technometrics, 55, 202–214.