gen.data: Generate simulated data
In bestridge: A Comprehensive R Package for Best Subset Selection


Generate data for simulations under the generalized linear model and Cox model.

gen.data(
  n,
  p,
  k = NULL,
  rho = 0,
  family = c("gaussian", "binomial", "poisson", "cox"),
  beta = NULL,
  cortype = 1,
  snr = 10,
  censoring = TRUE,
  c = 1,
  scal,
  sigma = 1,
  seed = 1
)

`n`	The number of observations.
`p`	The number of predictors of interest.
`k`	The number of nonzero coefficients in the underlying regression model. Can be omitted if `beta` is supplied.
`rho`	A parameter used to characterize the pairwise correlation in predictors. Default is `0`.
`family`	The distribution of the simulated data. `"gaussian"` for Gaussian data. `"binomial"` for binary data. `"poisson"` for count data. `"cox"` for survival data.
`beta`	The coefficient values in the underlying regression model.
`cortype`	The correlation structure. `cortype = 1` denotes the exponential structure, where the (i,j) entry of the covariance matrix equals rho^{|i-j|}. `cortype = 2` denotes the constant structure, where the (i,j) entry of the covariance matrix is rho for every i \neq j and 1 elsewhere. `cortype = 3` denotes the moving average structure. Details can be found below.
`snr`	A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as the variance of xβ divided by the variance of a Gaussian noise: \frac{Var(xβ)}{σ^2}. The Gaussian noise ε has mean 0 and variance σ^2 and is added to the linear predictor η = xβ. Default is `snr = 10`. This option is not valid for `cortype = 3`.
`censoring`	Whether data is censored or not. Valid only for `family = "cox"`. Default is `TRUE`.
`c`	A parameter controlling the censoring rate: censoring times are drawn from the uniform distribution on [0, c]. Default is `1`.
`scal`	A parameter used in generating survival times based on the Weibull distribution. Only used for the `"cox"` family.
`sigma`	A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance σ^2. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio. Valid only for `cortype = 3`.
`seed`	Seed to be used in generating the random numbers.
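
As a rough illustration of how `snr` relates to the noise level, the base R sketch below solves the definition SNR = Var(xβ)/σ^2 for σ^2 and adds the resulting noise to the linear predictor. It is not the package's internal code; the names `x`, `beta_true`, `eta`, and `sigma2` are hypothetical and chosen only for this example.

```r
set.seed(1)
n <- 200; p <- 20; snr <- 10

x <- matrix(rnorm(n * p), n, p)           # stand-in design matrix (independent columns)
beta_true <- c(rep(1, 5), rep(0, p - 5))  # hypothetical sparse coefficient vector
eta <- drop(x %*% beta_true)              # linear predictor x beta

sigma2 <- var(eta) / snr                  # noise variance implied by the SNR definition
eps <- rnorm(n, mean = 0, sd = sqrt(sigma2))
y <- eta + eps                            # Gaussian response with the requested SNR
```

A larger `snr` gives a smaller `sigma2`, so the response tracks the linear predictor more closely.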

We generate an n \times p random Gaussian matrix X with mean 0 and a covariance matrix with an exponential structure or a constant structure. For the exponential structure, the (i,j) entry of the covariance matrix equals rho^{|i-j|}. For the constant structure, the (i,j) entry of the covariance matrix is rho for every i \neq j and 1 elsewhere. For the moving average structure, we first generate an n \times p random Gaussian matrix \bar{X} whose entries are i.i.d. N(0,1) and then normalize its columns to have length √n. The design matrix X is then generated with X_j = \bar{X}_j + ρ(\bar{X}_{j+1}+\bar{X}_{j-1}) for j=2,…,p-1.
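
The three structures can be sketched in base R as follows. This is an illustrative reconstruction from the description above, not the code used inside `gen.data`; all variable names are hypothetical, and leaving the first and last columns unchanged in the moving average case is an assumption, since the text only specifies j=2,…,p-1.

```r
set.seed(1)
n <- 100; p <- 5; rho <- 0.4

# Exponential structure: (i, j) entry is rho^|i - j|
Sigma_exp <- rho^abs(outer(1:p, 1:p, "-"))

# Constant structure: rho off the diagonal, 1 on the diagonal
Sigma_const <- matrix(rho, p, p)
diag(Sigma_const) <- 1

# Draw rows from N(0, Sigma): chol() returns upper-triangular R with t(R) %*% R = Sigma
x_exp   <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_exp)
x_const <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_const)

# Moving average structure: i.i.d. N(0, 1) entries, columns rescaled to length sqrt(n)
x_bar <- matrix(rnorm(n * p), n, p)
x_bar <- sweep(x_bar, 2, sqrt(colSums(x_bar^2) / n), "/")
x_ma <- x_bar                             # assumption: columns 1 and p are left as they are
for (j in 2:(p - 1)) {
  x_ma[, j] <- x_bar[, j] + rho * (x_bar[, j + 1] + x_bar[, j - 1])
}
```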

For family = "gaussian", the data model is

Y = X β + ε.

The underlying regression coefficient β follows a uniform distribution on [m, 100m], where m = 5 √{2log(p)/n}.
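
A minimal sketch of drawing such coefficients in base R is given below. It assumes that only `k` randomly chosen entries of β are nonzero and that those entries are the ones drawn from [m, 100m]; the names `support` and `beta_true` are hypothetical, and the package may select the support differently.

```r
set.seed(1)
n <- 200; p <- 20; k <- 5

m <- 5 * sqrt(2 * log(p) / n)             # scale used in the coefficient range
support <- sample(p, k)                   # assumption: support chosen uniformly at random
beta_true <- numeric(p)
beta_true[support] <- runif(k, min = m, max = 100 * m)

x <- matrix(rnorm(n * p), n, p)           # stand-in design matrix
eps <- rnorm(n)                           # placeholder noise with unit variance
y <- drop(x %*% beta_true) + eps          # Y = X beta + eps
```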

For family = "binomial", the data model is

Prob(Y = 1) = \exp(X β + ε)/(1 + \exp(X β + ε)).

The underlying regression coefficient β follows a uniform distribution on [2m, 10m], where m = 5σ √{2log(p)/n}.
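
The logistic model above can be illustrated with a few lines of base R. Here `eta` is a hypothetical stand-in for the noisy linear predictor Xβ + ε; the sketch only shows how a binary response follows from the stated probability and is not the package's implementation.

```r
set.seed(1)
n <- 200

eta <- rnorm(n)                           # stand-in for X beta + eps
prob <- plogis(eta)                       # equals exp(eta) / (1 + exp(eta))
y <- rbinom(n, size = 1, prob = prob)     # Bernoulli draws with Prob(Y = 1) = prob
```

Using `plogis` instead of writing out the ratio avoids overflow when `eta` is large.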

For family = "poisson", the data is modeled to have an exponential distribution:

Y = Exp(\exp(X β + ε)).

For family = "cox", the data model is

T = (-\log(S(t))/\exp(X β))^{1/scal}.

The censoring time C is generated from the uniform distribution on [0, c]; we then define the censoring status as δ = I\{T \leq C\} and the observed time as R = \min\{T, C\}. The underlying regression coefficient β follows a uniform distribution on [2m, 10m], where m = 5σ √{2log(p)/n}. In the above models, ε \sim N(0, σ^2), where σ^2 is determined by the `snr`.
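
The survival-time construction can be sketched in base R as shown below. The sketch assumes that S(t) is replaced by a Uniform(0, 1) draw (the usual inverse-transform step) and uses hypothetical names and coefficient values throughout; it is meant to illustrate the formulas, not to reproduce `gen.data` exactly.

```r
set.seed(1)
n <- 200; p <- 5
scal <- 10                                # Weibull-type parameter, as in the `scal` argument
c_max <- 1                                # upper bound of the censoring time, as in `c`

x <- matrix(rnorm(n * p), n, p)           # stand-in design matrix
beta_true <- c(1, 0.5, 0.5, 0, 0)         # hypothetical coefficients

u <- runif(n)                             # plays the role of S(t)
time <- (-log(u) / exp(drop(x %*% beta_true)))^(1 / scal)  # survival time T
cens <- runif(n, min = 0, max = c_max)    # censoring time C ~ U[0, c]

status <- as.integer(time <= cens)        # delta = I{T <= C}: 1 means the event is observed
obs_time <- pmin(time, cens)              # R = min{T, C}
```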

`x`	Design matrix of predictors.
`y`	Response variable.
`Tbeta`	The coefficients used in the underlying regression model.

Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.

bsrr, predict.bsrr.

# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
SNR <- 10
cortype <- 1
seed <- 10
Data <- gen.data(n, p, k, rho, family = "gaussian", cortype = cortype, snr = SNR, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(x, y, method = "pgsection")