
`gim` is used to fit generalized integration models, which assume a linear or logistic regression model on (internal) individual-level data while integrating auxiliary or summary information on relevant variables that is estimated from external data, on which different working models can be assumed. `gim` can work even if only partial information from the working models is available. Compared to a conventional regression model, e.g., `glm`, that is based on internal data alone, the `gim` estimate gains additional power by making maximum use of all available data.


| Argument | Description |
| --- | --- |
| `formula` | an object of class " |
| `family` | a character. |
| `data` | a data frame containing all variables that are specified in |
| `model` | a list describing auxiliary information and working models that are used to generate such information. See 'Details' and 'Examples' for more details. |
| `nsample` | a matrix specifying the number of samples shared in datasets that are used to fit the working models given in |
| `ncase` | a matrix specifying the number of cases shared in datasets that are used to fit the working models given in |
| `nctrl` | a matrix specifying the number of controls shared in datasets that are used to fit the working models given in |
| `ref` | a data frame containing the covariates specified in |
| `...` | for test purpose, use its default value. |

`formula`

`formula` is the model that would be used to fit a conventional regression model if no additional information were available. It can be very general, as long as it is acceptable to the `glm` or `lm` functions. It can eliminate the intercept, e.g., `y ~ .-1`, involve arithmetic expressions, e.g., `log(x)`, or use other operators such as `*` for interactions, e.g., `as.factor(x1)*I(x2 > 0)`.

** model **
Summary information is calculated on data from external studies, whose raw data we cannot access. Instead, estimates from working models fitted on the external data are given (e.g., reported in the literature). The argument `model` is a list in which each component contains the information of one working model. Specifically, each component is itself a list of two entries, `form` and `info`, where `form` is a formula representing the fitted working model, and `info` is a data frame with two columns, `var` and `bet`, holding the names of the variables and their estimates from the working model, respectively. The intercept estimate of a working model is usually unavailable because it is fitted but not reported. If the user is able to provide such an estimate, its name in column `var` must be `"(Intercept)"`. See below for an example.

Note that multiple working models can be fitted on the same external data. In that case, the summary information of each working model should be given in `model` separately. For example, if two models `y ~ x1` and `y ~ x2` are fitted on one external dataset, then the estimates of `x1` and `x2` should be given as two components in `model`. This happens because many research groups can study the same dataset from different angles.
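
As a minimal sketch of the structure described above, the following builds a `model` list for two working models fitted on the same external dataset. The estimates are made-up numbers used only for illustration, and the formulas are passed as character strings, as in the Examples section.

```r
# Hypothetical reported estimates; the numbers are invented for illustration.
model <- list(
  # working model 1: y ~ x1, only the slope was reported
  list(form = 'y ~ x1',
       info = data.frame(var = 'x1',
                         bet = 0.25,
                         stringsAsFactors = FALSE)),
  # working model 2: y ~ x2, intercept and slope were both reported;
  # the intercept must be named "(Intercept)" in column var
  list(form = 'y ~ x2',
       info = data.frame(var = c('(Intercept)', 'x2'),
                         bet = c(1.10, -0.40),
                         stringsAsFactors = FALSE))
)

# inspect the nested structure
str(model)
```

Each working model gets its own component, even though both were fitted on the same data; the sharing of samples is described separately through `nsample`.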

`data`

`gim` requires an internal dataset `data` in which individual-level samples are available. Statistically, this dataset is critical because it provides information on the correlation between covariates. It is also known as the reference data in the literature. Since general formulas are supported in `gim`, it is important to provide variables in `data` so that `R` can find columns for all variables parsed from the formulas in `formula` and `model`. Read the vignettes (upcoming) for more examples of how to create a proper `data` for `gim`. We will also release a function to help users with this. `gim` discards incomplete rows in `data`.
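
The check below is a rough sketch of how one might verify, before calling `gim`, that every variable used in `formula` and in the working-model formulas has a column in `data`. It relies only on base R (`all.vars`, `as.formula`, `complete.cases`); the toy data, the formulas, and the helper names `form`, `model.forms`, `vars`, `missing.vars` and `n.complete` are assumptions made for illustration, and the row count only approximates what `gim` itself treats as incomplete.

```r
# Toy internal data; column names and values are hypothetical.
data <- data.frame(y  = c(1.2, 0.7, NA, 2.1),
                   x1 = c(-0.5, 0.3, 1.1, 0.8),
                   x2 = c(0.4, -1.0, 0.2, NA))

# formula of the internal model and formulas of two working models
form <- 'y ~ I(x1 < 0) + log(x2 + 2)'
model.forms <- c('y ~ x1', 'y ~ I(x2 > 0)')

# collect every variable name that appears in any of the formulas
vars <- unique(unlist(lapply(c(form, model.forms),
                             function(f) all.vars(as.formula(f)))))

# variables without a matching column in data (empty if all are present)
missing.vars <- setdiff(vars, names(data))
missing.vars

# number of rows with no missing value in the variables that are used
n.complete <- sum(complete.cases(data[, vars, drop = FALSE]))
n.complete
```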

** nsample **
Some summary information may be calculated from datasets that share samples. Ignoring this leads to underestimated standard errors. For example, if one dataset is analyzed with two different models, the estimates from the two models are not independent but highly correlated. This correlation must therefore be properly handled when calculating the standard error of the `gim` estimate, on which hypothesis testing is based. `nsample` is a square matrix of dimension `p`, where `p` is the length of `model`. The (i,i) entry of `nsample` is the number of samples used to fit the working model specified in `model[[i]]$form`, while the (i,k) entry is the number of samples shared between the datasets used to fit the working models `model[[i]]$form` and `model[[k]]$form`. For example, if two working models, e.g., `y ~ x1` and `y ~ x2`, are fitted on the same dataset of 100 samples, then `nsample` is a 2 x 2 matrix with every entry equal to 100. See the example below and the vignettes (upcoming) for more examples.
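
The two-model case described above can be written down directly. The second matrix is a hypothetical extension, not taken from the package, that adds a third working model fitted on a separate dataset of 250 samples sharing no samples with the first.

```r
# two working models fitted on the same external dataset of 100 samples:
# every entry, diagonal and off-diagonal, equals 100
nsample <- matrix(100, nrow = 2, ncol = 2)
nsample

# hypothetical extension: a third working model fitted on an independent
# dataset of 250 samples, so it shares 0 samples with models 1 and 2
nsample3 <- matrix(c(100, 100,   0,
                     100, 100,   0,
                       0,   0, 250),
                   nrow = 3, byrow = TRUE)

# the (i,k) and (k,i) entries count the same shared samples,
# so the matrix should be symmetric
isSymmetric(nsample3)
```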
** ncase ** and ** nctrl **
See `nsample` for their formats.
** ref **
By default, `ref` is `NULL` if it is not specified explicitly. This assumes that the internal and external populations are the same, and `gim` implicitly assigns `data` to `ref`. If this assumption holds and you have additional covariate data (without outcome), e.g., `add.ref`, that also come from the internal population, you can specify `ref` as `rbind(data, add.ref)`, where the missing outcome column in `add.ref` is set to `NA`. Alternatively, you can rbind `data` and `add.ref` after deleting the outcome column from `data`. If the external population differs from the internal population, you have to assign separate reference data, representative of the external population, to `ref` (without binding it to `data`).
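
A small sketch of the `rbind(data, add.ref)` construction, using toy data frames whose column names and values are assumptions made for illustration:

```r
# Toy internal data with outcome y; values are hypothetical.
data <- data.frame(y  = c(1.0, 2.0, 1.5),
                   x1 = c(0.1, 0.4, -0.3),
                   x2 = c('a', 'b', 'a'),
                   stringsAsFactors = FALSE)

# additional covariate-only data from the same (internal) population
add.ref <- data.frame(x1 = c(0.9, -1.2),
                      x2 = c('b', 'a'),
                      stringsAsFactors = FALSE)

# the outcome is unavailable in add.ref, so fill it with NA
add.ref$y <- NA_real_

# rbind on data frames matches columns by name, so the column order
# of add.ref does not need to agree with that of data
ref <- rbind(data, add.ref)
ref
```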

`gim` returns an object of class "`gim`". The function `summary` can be used to print a summary of the results. We will support the use of `anova` in later versions.

The generic accessor functions `coef`, `confint`, and `vcov` can be used to extract the coefficients, confidence intervals, and variance-covariance matrix of the estimates from the object returned by `gim`.

An object of class "`gim`" is a list containing the following components:

| Component | Description |
| --- | --- |
| `coefficients` | a named vector of coefficients |
| `vcov` | the variance-covariance matrix of estimates, including the intercept |
| `sigma2` | estimated variance of error term in a linear model. Only available for the |
| `call` | the matched call |
| `V.bet` | the variance-covariance matrix of external estimate |

Han Zhang

Zhang, H., Deng, L., Schiffman, M., Qin, J., Yu, K. (2020) Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika. asaa014, https://doi.org/10.1093/biomet/asaa014

```
## An artificial dataset is lazyloaded to illustrate the concept of GIM method
## It contains:
## A continuous outcome y.
## Four covariates x1, x2, x3, x4 (character).
## A binary outcome d
library(gim)  # provides gim() and the lazy-loaded dataset dat
head(dat)
## internal data of 500 samples
dat0 <- dat[1:500, ]
## three external datasets.
## dat2 and dat3 share some samples
dat1 <- dat[501:1500, c('y', 'x1', 'x2')]
dat2 <- dat[1501:2500, c('y', 'x1', 'x3', 'x4')]
dat3 <- dat[2001:3000, c('y', 'x3', 'x4')]
## four working models are fitted
form1 <- 'y ~ I(x1 < 0) + I(x2 > 0)'
form2 <- 'y ~ x3 + x4'
form3 <- 'y ~ I(x4 == "a")'
form4 <- 'y ~ sqrt(x3)'
## two working models are fitted on dat3
## thus nsample is a 4x4 matrix
nsample <- matrix(c(1000,    0,    0,    0,
                       0, 1000,  500,  500,
                       0,  500, 1000, 1000,
                       0,  500, 1000, 1000),
                  4, 4)
fit1 <- summary(lm(form1, dat1))$coef
fit2 <- summary(lm(form2, dat2))$coef
fit3 <- summary(lm(form3, dat3))$coef ## <-- dat3 is used twice
fit4 <- summary(lm(form4, dat3))$coef ## <-- dat3 is used twice
options(stringsAsFactors = FALSE)
model <- list()
## only partial information is available
model[[1]] <- list(form = form1,
                   info = data.frame(var = rownames(fit1)[2],
                                     bet = fit1[2, 1]))
## the intercept is provided, but the estimate of one covariate is missing
model[[2]] <- list(form = form2,
                   info = data.frame(var = rownames(fit2)[1:2],
                                     bet = fit2[1:2, 1]))
model[[3]] <- list(form = form3,
                   info = data.frame(var = rownames(fit3)[2],
                                     bet = fit3[2, 1]))
model[[4]] <- list(form = form4,
                   info = data.frame(var = rownames(fit4)[2],
                                     bet = fit4[2, 1]))
form <- 'y ~ I(x1 < 0) + I(x1 > 1) + x2 * x4 + log(x3) - 1'
fit <- gim(form, 'gaussian', dat0, model, nsample)
summary(fit)
coef(fit)
confint(fit)
# one can compare the gim estimates with those estimated from internal data
fit0 <- lm(form, dat0)
summary(fit0)
# by default, the covariates in dat0 are used as the reference in gim,
# which assumes that the external and internal populations are the same
fit1 <- gim(form, 'gaussian', dat0, model, nsample, ref = dat0)
all(coef(fit) == coef(fit1)) # TRUE
# if additional reference is available,
# and it comes from the internal population from which dat is sampled
# gim can use it
add.ref <- dat[3001:3500, ]
add.ref$y <- NA ## <-- outcome is unavailable in reference
ref <- rbind(dat0, add.ref)
fit2 <- gim(form, 'gaussian', dat0, model, nsample, ref = ref)
# if the external population is different from the internal population
# then reference for summary data specified in model needs to be provided
ext.ref <- dat[3501:4000, ] ## <-- as an example, assume ext.ref is different
## from dat0
fit3 <- gim(form, 'gaussian', dat0, model, nsample, ref = ext.ref)
```
