d.spls.simulate: Simulation of data
In dual.spls: Dual Sparse Partial Least Squares Regression

d.spls.simulate

R Documentation

Simulation of data

Description

The function d.spls.simulate simulates G mixtures of nondes Gaussians and uses them to build a data set of predictors X and a response y. X can be divided into G groups, and the values of y depend on the values of X.

Usage

d.spls.simulate(n=200,p=100,nondes=50,sigmaondes=0.05,sigmay=0.5,int.coef=1:5)

Arguments

`n`	a positive integer. `n` is the number of observations. Default value is `200`.
`p`	a numeric vector of length `G`. `p` is the number of variables in each predictors sub matrix. Default value is `100`.
`nondes`	a numeric vector of length `G`. `nondes` is the number of Gaussians in each mixture. Default value is `50`.
`sigmaondes`	a numeric vector of length `G`. `sigmaondes` is the standard deviation of the Gaussians for each group `g`. Default value is `0.05`.
`sigmay`	a real value. `sigmay` is the uncertainty on `y`. Default value is `0.5`.
`int.coef`	a numeric vector of the coefficients of the linear combination used to construct the response vector `y`. Default value is `1:5`.

Details

The predictors matrix X is the concatenation of G predictors sub matrices. Each sub matrix is computed from a mixture of Gaussians, i.e. by summing terms of the following form:

A \exp{(-\frac{(\textrm{xech}-\mu)^2}{2 \sigma^2})}.

Where

A is a numeric vector of random values between 0 and 1,
xech is an element from the sequence of p(g) equally spaced values from 0 to 1. p(g) is the number of variables of the sub matrix g, for g \in \{1, \dots, G\},
\mu is a random value in [0,1] representing the mean of the Gaussians,
\sigma is a positive real value specified by the user and representing the standard deviation of the Gaussians.
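
As a rough illustration of the construction above, the following base R sketch builds a single row of one sub matrix by summing Gaussians on a grid. The helper name `simulate_row` and the use of `runif` to draw A and \mu are assumptions made for this sketch; the package's internal code may differ.

```r
# Hypothetical helper (not part of dual.spls): one simulated row of a sub matrix,
# built as a sum of `nondes` Gaussians evaluated on a grid of `p` points.
simulate_row <- function(p, nondes, sigma) {
  xech <- seq(0, 1, length.out = p)  # p equally spaced values from 0 to 1
  A    <- runif(nondes)              # amplitudes between 0 and 1 (assumed uniform)
  mu   <- runif(nondes)              # means in [0, 1] (assumed uniform)
  row  <- numeric(p)
  for (j in seq_len(nondes)) {
    # add the j-th Gaussian: A * exp(-(xech - mu)^2 / (2 * sigma^2))
    row <- row + A[j] * exp(-(xech - mu[j])^2 / (2 * sigma^2))
  }
  row
}

set.seed(1)
x <- simulate_row(p = 100, nondes = 50, sigma = 0.05)
```

Smaller values of `sigma` give narrower peaks, which is consistent with the sharper curves expected when `sigmaondes` is reduced.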

The response vector y is a linear combination of the predictors plus Gaussian noise of standard deviation sigmay. It is computed as follows:

y_i = \sigma_y \times V_i + \sum_{g=1}^G \sum_{k=1}^K \textrm{int.coef}_k \times \textrm{sum}X^{g}_{ik}

Where

G is the number of predictor sub matrices,
i is the index of the observation,
V is a normally distributed vector with zero mean and unit standard deviation,
K is the length of the vector int.coef,
\textrm{sum}X^{g} is a matrix of n rows and K columns. The columns of X^g are split into K equal parts, and each part feeds one column of \textrm{sum}X^{g}: the value in row i, column k is the sum of row i of X^g over the corresponding part.
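
The formula for y can be sketched in base R for a single sub matrix (G = 1). The helper `block_sums`, the choice of contiguous column blocks, and the stand-in matrix `Xg` are assumptions for illustration only; the package may select the parts differently.

```r
# Hypothetical helper (not part of dual.spls): row-wise sums over K column blocks.
block_sums <- function(Xg, K) {
  p <- ncol(Xg)
  # assign each column to one of K contiguous blocks (assumed layout)
  block <- ceiling(seq_len(p) * K / p)
  # column k of the result is the row-wise sum over block k
  vapply(seq_len(K),
         function(k) rowSums(Xg[, block == k, drop = FALSE]),
         numeric(nrow(Xg)))
}

set.seed(1)
n <- 4
p <- 10
Xg <- matrix(runif(n * p), nrow = n)   # stand-in for one sub matrix X^g
int.coef <- 1:5
sigmay <- 0.5

sumX <- block_sums(Xg, K = length(int.coef))  # n x K matrix
y0 <- as.vector(sumX %*% int.coef)            # noise-free response
y  <- y0 + sigmay * rnorm(n)                  # add noise sigma_y * V_i
```

With several sub matrices, the contribution `sumX %*% int.coef` of each group would be added before the noise term.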

Value

A list with the following elements:

`X`	the concatenated predictors matrix.
`y`	the response vector.
`y0`	the response vector without noise `sigmay`.
`sigmay`	the uncertainty on `y`.
`sigmaondes`	the standard deviation of the Gaussians.
`G`	the number of groups.
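
Because `X` is a concatenation of the G sub matrices, the columns of each group can be recovered from `p`. The sketch below assumes the groups are stored in the order given by `p` and uses a zero-filled stand-in matrix in place of the returned `X`; the variable names `starts`, `ends` and `groups` are illustrative.

```r
p <- c(50, 100)                          # variables per group
X <- matrix(0, nrow = 3, ncol = sum(p))  # stand-in for the X element of the result

ends   <- cumsum(p)       # last column of each group
starts <- ends - p + 1    # first column of each group

# list of G sub matrices, one per group
groups <- lapply(seq_along(p),
                 function(g) X[, starts[g]:ends[g], drop = FALSE])
```

This generalizes manual indexing such as `X[, 1:p[1]]` and `X[, (p[1]+1):(p[1]+p[2])]` to any number of groups.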

Author(s)

Louna Alsouki, François Wahl

Examples

### load the dual.spls library
library(dual.spls)

### one predictors matrix
# parameters
n <- 100
p <- 50
nondes <- 20
sigmaondes <- 0.5
data1 <- d.spls.simulate(n=n, p=p, nondes=nondes, sigmaondes=sigmaondes)

Xa <- data1$X
ya <- data1$y

# plot each observation (row of Xa) as a curve
plot(Xa[1,], type='l', ylim=c(0,max(Xa)), main='Data', ylab='Xa', col=1)
for (i in 2:n){ lines(Xa[i,], col=i) }

### two predictors matrices
# parameters
n <- 100
p <- c(50,100)
nondes <- c(20,30)
sigmaondes <- c(0.05,0.02)
data2 <- d.spls.simulate(n=n, p=p, nondes=nondes, sigmaondes=sigmaondes)

Xb <- data2$X
X1 <- Xb[,(1:p[1])]                 # columns of the first group
X2 <- Xb[,(p[1]+1):(p[1]+p[2])]     # columns of the second group
yb <- data2$y

# plot the concatenated matrix Xb
plot(Xb[1,], type='l', ylim=c(0,max(Xb)), main='Data', ylab='Xb', col=1)
for (i in 2:n){ lines(Xb[i,], col=i) }

# plot the first sub matrix X1
plot(X1[1,], type='l', ylim=c(0,max(X1)), main='Data X1', ylab='X1', col=1)
for (i in 2:n){ lines(X1[i,], col=i) }

# plot the second sub matrix X2
plot(X2[1,], type='l', ylim=c(0,max(X2)), main='Data X2', ylab='X2', col=1)
for (i in 2:n){ lines(X2[i,], col=i) }

dual.spls documentation built on April 19, 2023, 1:07 a.m.