gen.data: Generate simulated data
In scrcss319/BeSS: Best Subset Selection in Linear, Logistic and CoxPH Models


Generate data for simulations under the generalized linear model and Cox model.

gen.data(n, p, family, K, rho = 0, sigma = 1, beta = NULL, censoring = TRUE,
         c = 1, scal)

`n`	The number of observations.
`p`	The number of predictors of interest.
`family`	The distribution of the simulated data: "`gaussian`" for Gaussian data, "`binomial`" for binary data, "`cox`" for survival data.
`K`	The number of nonzero coefficients in the underlying regression model.
`rho`	A parameter used to characterize the pairwise correlation in predictors. Default is 0.
`sigma`	A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance σ^2. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio.
`beta`	The coefficient values in the underlying regression model.
`censoring`	Whether the data are censored. Default is TRUE.
`c`	Controls the censoring rate: censoring times are drawn from a uniform distribution on [0, c]. Default is 1.
`scal`	A parameter in generating survival time based on the Weibull distribution. Only used for the "`cox`" family.

For the design matrix X, we first generate an n x p random Gaussian matrix \bar{X} whose entries are i.i.d. N(0,1), and then normalize each of its columns to have length √n. The design matrix X is then built column by column as X_j = \bar{X}_j + ρ(\bar{X}_{j+1} + \bar{X}_{j-1}) for j = 2,…,p-1.
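
The construction above can be sketched in base R. This is an illustrative reading of the description, not the package's own code: the variable names are made up for the example, and leaving the first and last columns equal to the corresponding columns of \bar{X} is an assumption, since the formula only covers j = 2,…,p-1.

```r
set.seed(1)
n <- 100; p <- 8; rho <- 0.2

# i.i.d. N(0, 1) entries, then rescale each column to Euclidean length sqrt(n)
Xbar <- matrix(rnorm(n * p), nrow = n, ncol = p)
Xbar <- sweep(Xbar, 2, sqrt(colSums(Xbar^2)) / sqrt(n), "/")

# Interior columns add rho times both neighbouring columns.
# Assumption: columns 1 and p are copied unchanged.
X <- Xbar
if (p > 2) {
  j <- 2:(p - 1)
  X[, j] <- Xbar[, j] + rho * (Xbar[, j + 1] + Xbar[, j - 1])
}
```

With rho = 0 the loop body adds nothing, so X reduces to the normalized matrix \bar{X} with uncorrelated columns.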

For the "gaussian" family, the data model is

Y = X β + ε, where ε \sim N(0, σ^2 ).

The nonzero entries of the underlying regression coefficient β are drawn from a uniform distribution on [m, 100m], where m = 5 √{2log(p)/n}.
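
A minimal base R sketch of this Gaussian model follows. It makes several assumptions that the text does not state: the K nonzero positions are picked uniformly at random, σ is treated as the standard deviation of ε (matching ε \sim N(0, σ^2)), and a plain Gaussian matrix stands in for the design matrix. All variable names are illustrative.

```r
set.seed(2)
n <- 200; p <- 20; K <- 5; sigma <- 1

X <- matrix(rnorm(n * p), n, p)      # stand-in design matrix
m <- 5 * sqrt(2 * log(p) / n)

# K nonzero coefficients drawn from uniform [m, 100m]; the rest stay zero.
# Assumption: the support is sampled uniformly at random.
beta <- numeric(p)
support <- sample(p, K)
beta[support] <- runif(K, min = m, max = 100 * m)

# Y = X beta + eps, eps ~ N(0, sigma^2)
y <- drop(X %*% beta) + rnorm(n, mean = 0, sd = sigma)
```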

For the "binomial" family, the data model is

Prob(Y = 1) = exp(X β)/(1 + exp(X β))

The nonzero entries of the underlying regression coefficient β are drawn from a uniform distribution on [2m, 10m], where m = 5σ √{2log(p)/n}.
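
The logistic model can be sketched the same way. As before, this is an assumption-laden illustration rather than the package's implementation: the support is sampled at random, a plain Gaussian matrix stands in for X, and the names are hypothetical. `plogis()` is base R's logistic function, exp(η)/(1 + exp(η)).

```r
set.seed(3)
n <- 200; p <- 20; K <- 5; sigma <- 1

X <- matrix(rnorm(n * p), n, p)      # stand-in design matrix
m <- 5 * sigma * sqrt(2 * log(p) / n)

# K nonzero coefficients drawn from uniform [2m, 10m]
beta <- numeric(p)
beta[sample(p, K)] <- runif(K, min = 2 * m, max = 10 * m)

# Prob(Y = 1) = exp(X beta) / (1 + exp(X beta))
prob <- plogis(drop(X %*% beta))
y <- rbinom(n, size = 1, prob = prob)
```

Because m scales with σ here, a larger sigma yields larger coefficients and therefore probabilities closer to 0 or 1, which is one way to read "higher signal-to-noise ratio" for this family.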

For the "cox" family, the data model is

T = (-log(S(t))/exp(X β))^(1/scal).

The censoring time C is generated from a uniform distribution on [0, c]. We then define the censoring status as δ = I{T <= C} and the observed time as R = min{T, C}. The nonzero entries of the underlying regression coefficient β are drawn from a uniform distribution on [2m, 10m], where m = 5σ √{2log(p)/n}.
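
The survival-time construction can be sketched as follows. Two assumptions go beyond the text: S(t) is replaced by a Uniform(0, 1) draw, which is the usual inverse-transform reading of such a formula, and `scal = 10` is an arbitrary illustrative value because no default is documented. The names `c_max`, `u`, `status`, and `obs` are hypothetical.

```r
set.seed(4)
n <- 200; p <- 20; K <- 5; sigma <- 1
scal <- 10     # arbitrary value chosen for illustration
c_max <- 1     # plays the role of the argument c

X <- matrix(rnorm(n * p), n, p)      # stand-in design matrix
m <- 5 * sigma * sqrt(2 * log(p) / n)
beta <- numeric(p)
beta[sample(p, K)] <- runif(K, min = 2 * m, max = 10 * m)

# T = (-log(S(t)) / exp(X beta))^(1 / scal)
# Assumption: S(t) is simulated as u ~ Uniform(0, 1).
u <- runif(n)
time <- (-log(u) / exp(drop(X %*% beta)))^(1 / scal)

# C ~ Uniform[0, c], delta = I{T <= C}, R = min{T, C}
cens <- runif(n, min = 0, max = c_max)
status <- as.integer(time <= cens)
obs <- pmin(time, cens)
```

Here `status` equal to 1 marks an observed event and 0 marks a censored observation, following δ = I{T <= C}.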

A list with the following components: x, y, Tbeta.

`x`	Design matrix of predictors.
`y`	Response variable.
`Tbeta`	The coefficients used in the underlying regression model.

Canhong Wen, Aijun Zhang, Shijie Quan, and Xueqin Wang.

Wen, C., Zhang, A., Quan, S. and Wang, X. (2017). BeSS: an R package for best subset selection in linear, logistic and CoxPH models. arXiv: 1709.06254.

library(BeSS)

# Generate simulated data
n <- 500
p <- 20
K <- 10
sigma <- 1
rho <- 0.2
data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

# Best subset selection
fit <- bess(data$x, data$y, family = "gaussian")