unisimrel: Function for data simulation

unisimrelR Documentation

Function for data simulation

Description

Functions for data simulation from a random regression model with one response variable where the data properties can be controlled by a few input parameters. The data simulation is based on the concept of relevant latent components and relevant predictors, and was developed for the purpose of testing methods for variable selection for prediction.

Usage

unisimrel(
  n,
  p,
  q,
  relpos,
  gamma,
  R2,
  ntest = NULL,
  muY = NULL,
  muX = NULL,
  lambda.min = .Machine$double.eps,
  sim = NULL
)

Arguments

n

The number of (training) samples to generate.

p

The total number of predictor variables to generate.

q

The number of relevant predictor variables (as a subset of p).

relpos

A vector indicating the position (between 1 and p) of the m relevant components, e.g. c(1,2) means that the first two latent components should be relevant. The length of relpos must be equal to m.

gamma

A number defining the speed of decline in eigenvalues (variances) of the latent components. The eigenvalues are assumed to decline according to an exponential model. The first eigenvalue is set equal to 1.

R2

The theoretical R-squared according to the true linear model. A number between 0 and 1.

ntest

The number of test samples to be generated (optional).

muY

The true mean of the response variable (optional). Default is muY=NULL.

muX

The p-vector of true means of the predictor variables (optional). Default is muX=NULL.

lambda.min

Lower bound of the eigenvalues. Defaults to .Machine$double.eps.

sim

A fitted simrel object. If this is given, the same regression coefficients will be used to simulate a new data set of requested size. Default is NULL, for which new regression coefficients are sampled.

Details

The data are simulated according to a multivariate normal model for the vector (y, z_1, z_2, z_3, ..., z_p)^t where y is the response variable and z = (z_1,..., z_p)^t is the vector of latent (principal) components. The ordered principal components are uncorrelated variables with declining variances (eigenvalues) defined for component j as e^{-γ * j}/e^{-γ}. Hence, the variance (eigenvalue) of the first principal component is equal to 1, and a large value of γ gives a rapid decline in the variances. The variance of the response variable is by default fixed equal to 1.

Some of the principal components (ordered by their decreasing variances) are assumed to be relevant for the prediction of the response. The indices of the positions of the relevant components are set by the relpos argument. The joint degree of relevance for the relevant components is determined by the population R-squared defined by R2.

In order to obtain predictor variables x = (x_1, x_2, ..., x_p)^t for y, a random rotation of the principal components is performed. Hence, x = R^t*z for some random rotation matrix R. For values of q satisfying m <= q <p only a subspace of dimension q containing the m relevant component(s) is rotated. This facilitates the possibility to generate q relevant predictor variables (x's). The indices of the relevant predictors is randomly selected with the only restriction that the index set contains the indices in relpos. The final index set of the relevant predictors is saved in the output argument relpred. If q=p all p predictor variables are relevant for the prediction of y.

For further details on the simulation approach, please see S<e6>b<f8>, Alm<f8>y and Helland (2015).

Value

A simrel object with list of following items,

call

The call to simrel.

X

The (n x p) simulated predictor matrix.

Y

The n-vector of simulated response values.

beta

The vector of true regression coefficients.

beta0

The true intercept. This is zero if muY=NULL and muX=NULL

muY

The true mean of the response variable.

muX

The p-vector of true means of the predictor variables.

relpred

The index of the true relevant predictors, that is the x-variables with non-zero true regression coefficients.

TESTX

The (ntest x p) matrix of optional test samples.

TESTY

The ntest-vector of responses of the optional test samples.

n

The number of simulated samples.

p

The number of predictor variables.

m

The number of relevant components.

q

The number of relevant predictors.

gamma

The decline parameter in the exponential model for the true eigenvalues.

lambda

The true eigenvalues of the covariance matrix of the p predictor variables.

R2

The true R-squared value of the linear model.

relpos

The positions of the relevant components.

minerror

The minimum achievable prediction error. Also the variance of the noise term in the linear model.

r

The sampled correlations between the principal components and the response.

Sigma

The true covariance matrix of (y,z_1, z_2, ..., z_p)^t.

Rotation

The random rotation matrix which is used to achieve the predictor variables as rotations of the latent components. Equals the transposed of the eigenvector-matrix of the covariance matrix of (x_1,...,x_p)^t.

type

The type of response generated, either "univariate" as returned from simrel, or "bivariate" as returned from simrel2.

Author(s)

Solve S<e6>b<f8> and Kristian H. Liland

References

Helland, I. S. and Alm<f8>y, T., 1994, Comparison of prediction methods when only a few components are relevant, J. Amer. Statist. Ass., 89(426), 583 – 591.

S<e6>b<f8>, S., Alm<f8>y, T. and Helland, I. S., 2015, simrel - A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors, Chemometr. Intell. Lab.(in press),doi:10.1016/j.chemolab.2015.05.012.

Examples


#Linear model data, large n, small p
mydata <- unisimrel(n = 250, p = 20, q = 5, relpos = c(2, 4), gamma = 0.25, R2 = 0.75)

#Estimating model parameters using ordinary least squares
lmfit <- lm(mydata$Y ~ mydata$X)
summary(lmfit)

#Comparing true with estimated regression coefficients
plot(mydata$beta, lmfit$coef[-1], xlab = "True regression coefficients",
  ylab = "Estimated regression coefficients")
abline(0,1)

#Linear model data, small n, large p
mydata <- unisimrel(n = 50, p = 200, q = 25, relpos = c(2, 4), gamma = 0.25, R2 = 0.8 )

#Simulating more samples with identical distribution as previous simulation
mydata2 <- unisimrel(n = 2500, sim = mydata)

#Estimating model parameters using partial least squares regression with
#cross-validation to determine the number of relevant components.
if (requireNamespace("pls", quietly = TRUE)) {
  require(pls)
  plsfit <- plsr(mydata$Y ~ mydata$X, 15, validation = "CV")
 
  #Validation plot and finding the number of relevant components.
  plot(0:15, c(plsfit$validation$PRESS0, plsfit$validation$PRESS),
    type = "b", xlab = "Components", ylab = "PRESS")
  mincomp <- which(plsfit$validation$PRESS == min(plsfit$validation$PRESS))
 
  #Comparing true with estimated regression coefficients
  plot(mydata$beta, plsfit$coef[, 1, mincomp], xlab = "True regression coefficients",
    ylab = "Estimated regression coefficients")
  abline(0, 1)
}

therimalaya/simulatr documentation built on Nov. 20, 2022, 2:49 p.m.