View source: R/Augmented.data.R

Augmented.data    R Documentation

Accommodating missingness in environmental measurements in gene-environment interaction analysis

Description

We consider the scenario with missingness in environmental (E) measurements. Our approach consists of two steps. First, a nonparametric kernel-based data augmentation approach accommodates the missingness. Second, the penalization approach BLMCP is used for regularized estimation and selection of important interactions and main genetic (G) effects, respecting the "main effects-interactions" hierarchical structure. Because E variables are usually preselected and low-dimensional, no selection is conducted on them. A byproduct of the weighting scheme is that the proposed approach has a certain robustness property.
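To make the "main effects-interactions" structure concrete, the following minimal sketch builds the kind of design matrix that a G-E interaction model works with: the E effects, the main G effects, and one product column per (G, E) pair. It is an illustration only, not the package's internal code; the dimensions and the names GE and X are assumptions chosen for the example. Under the hierarchy, an interaction is selected only together with the main effect of the same G factor.

```r
set.seed(1)
n <- 6; p <- 3; q <- 2                 # small illustrative dimensions
G <- matrix(rnorm(n * p), n, p)        # n x p genetic measurements
E <- matrix(rnorm(n * q), n, q)        # n x q environmental factors

# For each G factor j, form the q interaction columns G[, j] * E[, k].
GE <- do.call(cbind, lapply(seq_len(p), function(j) G[, j] * E))

# Full design: q E effects, p main G effects, p * q interactions.
X <- cbind(E, G, GE)
dim(X)
```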

Usage

Augmented.data(G, E, Y, h, family = c("continuous", "survival"), E_type)

Arguments

G

Input matrix of p genetic measurements consisting of n rows. Each row is an observation vector.

E

Input matrix of q environmental risk factors consisting of n rows. Each row is an observation vector. Missing measurements are coded as NA.

Y

Response variable. A quantitative vector for family="continuous". For family="survival", Y should be a two-column matrix whose first column is the log(survival time) and whose second column is the censoring indicator. The indicator is binary, with "1" indicating death (event observed) and "0" indicating right censoring.
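As a minimal sketch of the survival layout described above, the two-column matrix can be assembled with cbind(). The vectors time and status are hypothetical example data and are not part of the package.

```r
# Hypothetical example data: follow-up times and event indicators
# (1 = dead, 0 = right censored).
time   <- c(5.2, 3.1, 8.7, 1.4)
status <- c(1, 0, 1, 1)

# First column: log(survival time); second column: censoring indicator.
Y <- cbind(log(time), status)
```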

h

A vector of two kernel bandwidths: the first element is used for the discrete E factors and the second for the continuous E factors.
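The role of the two bandwidths can be illustrated with a small sketch. The kernels below are assumptions made for illustration (a match/mismatch kernel for discrete factors and a Gaussian kernel for continuous factors); this page does not state which kernels the package uses, so they may differ. The helper names k_discrete and k_continuous and the donors data are hypothetical.

```r
# Illustrative kernels only; the package's own kernel choices may differ.
# h[1] is the discrete-factor bandwidth, h[2] the continuous-factor one,
# following the ordering described for the h argument.
h <- c(0.5, 1)

# Assumed discrete kernel: weight 1 for a match, h for a mismatch.
k_discrete <- function(x, x0, h) ifelse(x == x0, 1, h)

# Assumed continuous kernel: Gaussian density scaled by the bandwidth.
k_continuous <- function(x, x0, h) dnorm((x - x0) / h) / h

# Weights of three hypothetical donors relative to a target subject whose
# observed factors are (discrete = 1, continuous = 0.2).
donors <- data.frame(disc = c(1, 0, 1), cont = c(0.1, 0.3, 2.0))
w <- k_discrete(donors$disc, 1, h[1]) * k_continuous(donors$cont, 0.2, h[2])
w <- w / sum(w)   # normalise so the weights sum to one
```

With these assumed kernels, a smaller h[1] penalizes a mismatch on a discrete factor more heavily, and a smaller h[2] concentrates weight on donors whose continuous factors are close to the target's.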

family

Response type of Y (see above).

E_type

A vector giving the type of each E factor, in column order: "ED" for a discrete E factor and "EC" for a continuous E factor.
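When E has many columns, the type vector can be derived rather than typed by hand. The sketch below uses a hypothetical helper, guess_E_type(), which is not part of GEInter; it assumes a column is discrete when it takes at most max_levels distinct observed values, a heuristic that should be checked against what is known about the variables.

```r
# Hypothetical helper (not part of GEInter): label a column "ED" when it
# takes at most max_levels distinct observed values, otherwise "EC".
guess_E_type <- function(E, max_levels = 2) {
  apply(E, 2, function(col) {
    n_levels <- length(unique(col[!is.na(col)]))
    if (n_levels <= max_levels) "ED" else "EC"
  })
}

set.seed(1)
E <- matrix(rnorm(20 * 3), 20, 3)
E[, 2] <- E[, 2] > 0      # binary factor stored as 0/1
E[3, 1] <- NA             # a missing measurement is ignored when counting
E_type <- guess_E_type(E)
```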

Value

E_w

The augmented data corresponding to E.

G_w

The augmented data corresponding to G.

y_w

The augmented data corresponding to the response Y.

weight

The weights of the augmented observations, which accommodate missingness and, if family="survival", also right censoring.

References

Mengyun Wu, Yangguang Zang, Sanguo Zhang, Jian Huang, and Shuangge Ma. Accommodating missingness in environmental measurements in gene-environment interaction analysis. Genetic Epidemiology, 41(6):523-554, 2017.
Jin Liu, Jian Huang, Yawei Zhang, Qing Lan, Nathaniel Rothman, Tongzhang Zheng, and Shuangge Ma. Identification of gene-environment interactions in cancer studies using penalization. Genomics, 102(4):189-194, 2013.

Examples

library(GEInter)
set.seed(100)
sigmaG=AR(0.3,50)
G=MASS::mvrnorm(100,rep(0,50),sigmaG)
E=matrix(rnorm(100*5),100,5)
E[,2]=E[,2]>0
E[,3]=E[,3]>0
alpha=runif(5,2,3)
beta=matrix(0,5+1,50)
beta[1,1:7]=runif(7,2,3)
beta[2:4,1]=runif(3,2,3)
beta[2:3,2]=runif(2,2,3)
beta[5,3]=runif(1,2,3)

# continuous with Normal error N(0,4)
y1=simulated_data(G=G,E=E,alpha=alpha,beta=beta,error=rnorm(100,0,4),family="continuous")

# survival with Normal error N(0,1)
y2=simulated_data(G,E,alpha,beta,rnorm(100,0,1),family="survival",0.7,0.9)

# generate E measurements with missingness
miss_label1=c(2,6,8,15)
miss_label2=c(4,6,8,16)
E1=E2=E
E1[miss_label1,1]=NA
E2[miss_label2,1]=NA

# continuous
data_new1<-Augmented.data(G,E1,y1,h=c(0.5,1),family="continuous",
                          E_type=c("EC","ED","ED","EC","EC"))
fit1<-BLMCP(data_new1$G_w,data_new1$E_w,data_new1$y_w,data_new1$weight,
            lambda1=0.025,lambda2=0.06,gamma1=3,gamma2=3,max_iter=200)
coef1=coef(fit1)
y1_hat=predict(fit1,E[c(1,2),],G[c(1,2),])
plot(fit1)

# survival
data_new2<-Augmented.data(G,E2,y2,h=c(0.5,1),family="survival",
                          E_type=c("EC","ED","ED","EC","EC"))
fit2<-BLMCP(data_new2$G_w,data_new2$E_w,data_new2$y_w,data_new2$weight,
            lambda1=0.04,lambda2=0.05,gamma1=3,gamma2=3,max_iter=200)
coef2=coef(fit2)
y2_hat=predict(fit2,E[c(1,2),],G[c(1,2),])
plot(fit2)

GEInter documentation built on May 20, 2022, 1:17 a.m.