Generates artificial datasets with outliers


This function generates multivariate normal datasets with several possible types of outliers. It is used in several simulation studies. For a detailed description, see the referenced papers.


generateData(n, d, mu, Sigma, perout, gamma,
             outlierType = "casewise", seed = NULL)



n: The number of observations.


d: The dimension of the data.


mu: The center of the clean data.


Sigma: The covariance matrix of the clean data. It can be obtained from generateCorMat.


outlierType: The type of contamination to be generated. Should be one of:

  • "casewise": Generates point contamination in the direction of the last eigenvector of Sigma (the one with the smallest eigenvalue).

  • "cellwisePlain": Generates cellwise contamination by randomly replacing a number of cells by gamma.

  • "cellwiseStructured": Generates cellwise contamination by first randomly sampling the cells to contaminate. Then, in each row, those cells are replaced by a multiple of the eigenvector belonging to the smallest eigenvalue of the submatrix of Sigma restricted to the dimensions of the contaminated cells.

  • "both": Combines "casewise" and "cellwiseStructured".
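The two simplest contamination types above can be sketched in base R. This is only an illustration of the idea and not the function's actual implementation: the uniform sampling of cells over the whole matrix, the shift of exactly gamma along the unit-length eigenvector, and all variable names are assumptions made for the example.

```r
# Illustrative sketch only; this is not the package's implementation.
set.seed(1)
n <- 100
d <- 5
mu <- rep(0, d)
Sigma <- 0.5^abs(outer(1:d, 1:d, "-"))  # example covariance matrix (assumption)
perout <- 0.1
gamma <- 10

# Clean multivariate normal data via the Cholesky factor of Sigma.
Z <- matrix(rnorm(n * d), nrow = n, ncol = d)
X <- sweep(Z %*% chol(Sigma), 2, mu, "+")

# "cellwisePlain" idea: overwrite randomly chosen cells with gamma.
# Assumption: cells are drawn uniformly over the whole matrix.
n_cells <- round(perout * n * d)
indcells <- sort(sample(n * d, n_cells))
X_cell <- X
X_cell[indcells] <- gamma

# "casewise" idea: shift randomly chosen rows along the last eigenvector.
# Assumption: the shift is gamma times the unit-length eigenvector.
v_last <- eigen(Sigma, symmetric = TRUE)$vectors[, d]
indrows <- sort(sample(n, round(perout * n)))
X_case <- X
X_case[indrows, ] <- sweep(X_case[indrows, , drop = FALSE], 2,
                           gamma * v_last, "+")
```

Here X_cell differs from X only in the sampled cells, and X_case differs from X only in the sampled rows.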


perout: The fraction of generated outliers. For outlierType = "casewise" this is a fraction of rows. For outlierType = "cellwisePlain" or outlierType = "cellwiseStructured", a fraction perout of the cells is replaced by contaminated cells. For outlierType = "both", a fraction 0.5*perout of rowwise outliers is generated, after which the remaining data is contaminated with a fraction 0.5*perout of outlying cells.
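The arithmetic behind these fractions can be made concrete with a small helper. nominalCounts is a hypothetical function written for this illustration and is not part of the package. It reads "the remaining data" as the rows that were not made rowwise outliers, which is an assumption, and it ignores whatever rounding and sampling scheme the function actually uses.

```r
# Hypothetical helper, not part of the package: nominal numbers of
# contaminated rows and cells implied by perout (no rounding applied).
nominalCounts <- function(n, d, perout, outlierType) {
  switch(outlierType,
    casewise = c(rows = perout * n, cells = 0),
    cellwisePlain = ,  # falls through to the next case
    cellwiseStructured = c(rows = 0, cells = perout * n * d),
    both = {
      rows <- 0.5 * perout * n
      # Assumption: cells are contaminated only in the remaining rows.
      c(rows = rows, cells = 0.5 * perout * (n - rows) * d)
    },
    stop("unknown outlierType")
  )
}

nominalCounts(100, 5, 0.1, "casewise")
nominalCounts(100, 5, 0.1, "cellwisePlain")
nominalCounts(100, 5, 0.1, "both")
```

With n = 100, d = 5 and perout = 0.1, the nominal counts are 10 rows for "casewise", 50 cells for the two cellwise types, and 5 rows plus 23.75 cells (before any rounding) for "both".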


gamma: How far outliers are from the center of the distribution.


seed: Seed used to generate the data.


A list with components:

  • X
    The generated data matrix of size n × d.

  • indcells
    A vector with the indices of the contaminated cells.

  • indrows
    A vector with the indices of the rowwise outliers.
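A short sketch of how the returned indices might be used follows. It works on a mock list with the same component names rather than on real output, and it assumes that indcells contains column-major linear indices into X. The index format is not stated above, so treat this as an assumption to verify.

```r
# Mock result with the same component names as the list described above.
# Assumption: indcells holds column-major linear indices into X.
set.seed(1)
out <- list(X = matrix(rnorm(20), nrow = 5, ncol = 4),
            indcells = c(2, 8, 19),
            indrows = integer(0))

# Convert linear cell indices to (row, column) pairs.
pos <- arrayInd(out$indcells, dim(out$X))
colnames(pos) <- c("row", "col")
pos

# Logical mask of contaminated cells, with the same shape as X.
mask <- matrix(FALSE, nrow(out$X), ncol(out$X))
mask[out$indcells] <- TRUE
colSums(mask)  # number of contaminated cells per column
```

A mask of this kind is convenient for comparing the cells flagged by a detection method with the cells that were actually contaminated.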


J. Raymaekers and P.J. Rousseeuw


Agostinelli, C., Leung, A., Yohai, V. J., and Zamar, R. H. (2015). Robust Estimation of Multivariate Location and Scatter in the Presence of Cellwise and Casewise Contamination. Test, 24, 441-461.

Rousseeuw, P. J. and Van den Bossche, W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145.

Raymaekers, J. and Rousseeuw, P. J. (2020). Handling cellwise outliers by sparse regression and robust covariance. arXiv:1912.12446.

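n      <- 100         # number of observations
d      <- 5           # dimension of the data
mu     <- rep(0, d)   # center of the clean data
Sigma  <- diag(d)     # covariance matrix of the clean data
perout <- 0.1         # fraction of outliers
gamma  <- 10          # distance of the outliers from the center
data   <- generateData(n, d, mu, Sigma, perout, gamma,
                       outlierType = "cellwisePlain", seed = 1)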

