mixture_generator: Gaussian mixtures dataset generator with regression between...
In CorReg: Linear Regression Based on Linear Structure Between Variables

Description Usage Arguments Value Examples

Generates a dataset (with an additional validation sample) made of Gaussian mixtures with some of them generated by sub-regressions on others. A response variable is then added by linear regression. This function is used to generate datasets for simulations using CorReg, or just with Gaussian Mitures.

mixture_generator(n = 130, p = 100, ratio = 0.4, max_compl = 1,
  valid = 1000, positive = 0.6, sigma_Y = 10, sigma_X = NULL,
  R2 = NULL, R2Y = 0.4, meanvar = NULL, sigmavar = NULL, lambda = 3,
  Amax = NULL, lambdapois = 10, gamma = FALSE, gammashape = 1,
  gammascale = 0.5, tp1 = 1, tp2 = 1, tp3 = 1, nonlin = 0,
  pnonlin = 2, scale = TRUE, Z = NULL)

`n`	the number of individuals in the learning dataset
`p`	the number of covariates (without the response)
`ratio`	the ratio of covariates generated by sub-regressions on others
`max_compl`	the number of covariates in each sub-regression
`valid`	the number of individuals in the validation sample
`positive`	the ratio of positive coefficients in both the regression and the sub-regressions
`sigma_Y`	the standard deviation for the noise of the regression
`sigma_X`	the standard deviation for the noise of the sub-regressions (all). ignored if `gamma=TRUE` or if `R2` is not NULL
`R2`	the strength of the sub-regressions (coefficients will be chosen to obtain this value).
`R2Y`	the strength of the main regression (coefficients will be chosen to obtain this value).
`meanvar`	vector of means for the covariates.
`sigmavar`	standard deviation of the covariates.
`lambda`	parameter of the Poisson's law that defines the number of components in Gaussian Mixture models
`Amax`	the maximum number of covariates with non-zero coefficients in the regression
`lambdapois`	parameter used to generate the coefficient in the subregressions. Poisson's distribution.
`gamma`	(boolean) to generate a p-sized vector `sigma_X` gamma-distributed
`gammashape`	shape parameter of the gamma distribution (if needed)
`gammascale`	scale parameter of the gamma distribution (if needed)
`tp1`	the ratio of right-side (explicative) covariates allowed to have a non-zero coefficient in the regression
`tp2`	the ratio of left-side (redundant) covariates allowed to have a non-zero coefficient in the regression
`tp3`	the ratio of strictly independent covariates allowed to have a non-zero coefficient in the regression
`nonlin`	to use non linear structure (squared or log). If not null, it is the proba to use power pnonlin instead of log. The type is drawn for each link between covariates
`pnonlin`	the power used if non linear structure
`scale`	(boolean) to scale X before computing Y
`Z`	the binary squared adjacency matrix (size p) to obtain. If NULL it is randomly generated, based on `ratio` and `max_compl` parameters.

a list that contains:

`X_appr`	matrix of the learning set. `p` covariates following Gaussian Mixtures with some of them generated by sub-regressions on others.
`Y_appr`	Response variable vector (size `n`) generated by linear regression on `X_appr` with coefficients `A` and residual standard deviation `sigma_Y`.
`A`	vector of the of the regression generating `Y_appr`
`B`	Matrix of the coefficients of sub-regressions (first line : the intercepts) then `B[i-1,j]` is the coefficient associated to `X_appr[,i]` in the sub-regression that generates `X_appr[,j]`
`Z`	Binary squared adjacency matrix of size `p` that describes the structure of sub-regressions. `Z[i,j]`=1 if `X_appr[,i]` explains `X_appr[,j]`
`X_test`	validation sample generated the same way as `X_appr`, with `valid` individuals.
`Y_test`	Response vector associated to the validation sample
`sigma_X`	Vector of the standard deviations of the residuals of the sub-regressions (one value for each sub-regression)
`sigma_Y`	Standard deviation of the residual of the regression that generates `Y_appr` and `Y_test`.
`nbcomp`	vector of the number of components for covariates that are not explained by others.

## Not run: 
require(CorReg)
   #dataset generation
   base=mixture_generator(n=1500,p=10,valid=0)
   X_appr=base$X_appr #learning sample
   Y_appr=base$Y_appr#response variable
   for(i in 1:ncol(X_appr)){
   hist(X_appr[,i])
   }
   
## End(Not run)