ds.glm.b: ds.glm calling glmDS1, glmDS2
In datashield/dsBetaTestClient5: New DataSHIELD Client Functions

Description Usage Arguments Details Value Author(s) See Also Examples

Fits a generalized linear model (glm) on data from a single or multiple sources

1
2
3

ds.glm.b(formula = NULL, data = NULL, family = NULL, offset = NULL,
  weights = NULL, checks = FALSE, maxit = 15, CI = 0.95,
  viewIter = FALSE, datasources = NULL)

`formula`	Denotes an object of class formula which is a character string which describes the model to be fitted. Most shortcut notation allowed by R's standard glm() function is also allowed by ds.glm. Many glms can be fitted very simply using a formula like: "y~a+b+c+d" which simply means fit a glm with y as the outcome variable with a, b, c and d as covariates. By default all such models also include an intercept (regression constant) term. If all you need to fit are straightforward models such as these, you do not need to read the remainder of this information about "formula". But if you need to fit a more complex model in a customised way, the next paragraph gives a few additional pointers. As an example, the formula: "EVENT~1+TID+SEXFAGE.60" denotes fit a glm with the variable "EVENT" as its outcome with covariates TID (in this case a 6 level factor [categorical] variable denoting "time period" with values between 1 and 6), SEXF (also a factor variable denoting sex and AGE.60 (a quantitative variable representing age-60 in years). The term "1" forces the model to include an intercept term which it would also have done by default (see above) but using "1" may usefully be contrasted with using "0" (as explained below). The "" between SEXF and AGE.60 means fit all possible main effects and interactions for and between those two covariates. As SEXF is a factor this is equivalent to writing SEXF+AGE.60+SEXF1:AGE.60 (the last element being) the interaction term representing the product of SEXF level 1 (in this case female) and AGE.60. If the formula had instead been written as : "EVENT~0+TID+SEXF*AGE.60" the 0 would mean do NOT fit an intercept term and, because TID happens to be a six level factor this would mean that the first six model parameters which were originally intercept+TID2+TID3+TID4+TID5+TID6 using the first formula will now become TID1+TID2+TID3+TID4+TID5+TID6. Conveniently, this means that the effect of each time period may now be estimated directly. For example, the effect of time period 3 is now obtained directly as TID3 rather than intercept+TID3 as was the case using the original formula.
`data`	A character string specifying the name of an (optional) dataframe that contains all of the variables in the glm formula. This avoids you having to specify the name of the dataframe in front of each covariate in the formula e.g. if the dataframe is called 'DataFrame' you avoid having to write: "DataFrame$y~DataFrame$a+DataFrame$b+DataFrame$c+DataFrame$d" Processing stops if a non existing data frame is indicated.
`family`	This argument identifies the error distribution function to use in the model. At present ds.glm has been written to fit family="gaussian" (i.e. a conventional linear model, family="binomial" (i.e. a conventional [unconditional] logistic regression model), and family = "poisson" (i.e. a Poisson regression model - of which perhaps the most commonly used application is for survival analysis using Piecewise Exponential Regression (PER) which typically closely approximates Cox regression in its main estimates and standard errors. More information about PER can be found in the help folder for the ds.lexis function which sets up the data structure for a PER. At present the gaussian family is automatically coupled with an 'identity' link function, the binomial family with a 'logistic' link function and the poisson family with a 'log' link function. For the majority of applications typically encountered in epidemiology and medical statistics, one these three classes of models will generally be what you need. However, if a particular user wishes us to implement an alternative family (e.g. 'gamma') or an alternative family/link combination (e.g. binomial with probit) we can discuss how best to meet that request: it will almost certainly be possible, but we may seek a small amount of funding or practical programming support from the user in order to ensure that it can be carried out in a timely manner
`offset`	A character string specifying the name of a variable to be used as an offset (effectively a component of the glm which has a known coefficient a-priori and so does not need to be estimated by the model). As an example, an offset is needed to fit a piecewise exponential regression model. Unlike the standard glm() function in R, ds.glm() only allows an offset to be set using the offset= argument, it CANNOT be included directly in the formula via notation such as "y~a+b+c+d+offset(offset.vector.name)". In ds.glm this must be specified as: formula="y~a+b+c+d", ..., offset="offset.vector.name" and ds.glm then incorporates it appropriately into the formula itself.
`weights`	A character string specifying the name of a variable containing prior regression weights for the fitting process. Like offset, ds.glm does not allow a weights vector to be written directly into the glm formula. Using weights provides an alternative way to fit PER models if you want to avoid using an offset, but this approach may be viewed as less 'elegant'
`checks`	This argument is a boolean. If TRUE ds.glm then undertakes a series of checks of the structural integrity of the model that can take several minutes. Specifically it verifies that the variables in the model are all defined (exist) on the server site at every study and that they have the correct characteristics required to fit a GLM. The default value is FALSE if an unexplained problem in the model fit is encountered.
`maxit`	A numeric scalar denoting the maximum number of iterations that are permitted before ds.glm declares that the model has failed to converge. Logistic regression and Poisson regression models can require many iterations, particularly if the starting value of the regression constant is far away from its actual value that the glm is trying to estimate. In consequence we often set maxit=30 - but depending on the nature of the models you wish to fit, you may wish to be alerted much more quickly than this if there is a delay in convergence, or you may wish to all MORE iterations.
`CI`	a numeric, the confidence interval.
`viewIter`	a boolean, tells whether the results of the intermediate iterations should be printed on screen or not. Default is FALSE (i.e. only final results are shown).
`datasources`	a list of opal object(s) obtained after login to opal servers; these objects also hold the data assigned to R, as a `dataframe`, from opal datasources.
`startBetas`	A vector specifying starting values for the regression coefficients which (formally) make up the linear predictor. By default, a vector of zeros (see above)

Fits a glm on data from a single source or from multiple sources. In the latter case the data are co-analysed (when using ds.glm) by using an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional glm() function in R. In this situation marked heterogeneity between sources should be corrected (where possible) with fixed effects. e.g. if each study in a (binary) logistic regression analysis has an independent intercept, it is equivalent to allowing each study to have a different baseline risk of disease. This may also be viewed as being an IP (individual person) meta-analysis with fixed effects. An alternative function ds.glmREMA (ds.glm for Random Effects MetaAnalysis) is available if the co-analysis is to be realised by separate fitting on each study followed by random effects meta-analysis of the results.

Privacy protected iterative fitting of a glm is explained here:

(1) Begin with a guess for the coefficient vector to start iteration 1 (let's call it beta.vector[1]). Using beta.vector[1], each source calculates the score vector (and information matrix) generated by its data - given beta.vector[1] - as the sum of the score vector components (and the sum of the components of the information matrix) derived from each individual data record in that source. NB in most models the starting values in beta.vector[1] are set to be zero for all parameters.

(2) Transmit the resultant score vector and information matrix from each source back to the clientside server (CS) at the analysis centre. Let's denote SCORE[1][j] and INFORMATION.MATRIX[1][j] as the score vector and information matrix generated by study j at the end of the 1st iteration.

(3) CS sums the score vectors, and equivalently the information matrices, across all studies (i.e. j=1:S, where S is the number of studies). Note that, given beta.vector[1], this gives precisely the same final sums for the score vectors and information matrices as would have been obtained if all data had been in one central warehoused database and the overall score vector and information matrix at the end of the first iteration had been calculated (as is standard) by simply summing across all individuals. The only difference is that instead of directly adding all values across all individuals, we first sum across all individuals in each data source and then sum those study totals across all studies - i.e. this generates EXACTLY the same ultimate sums

(4) CS then calculates sum(SCORES) viewed as being "the sum of the score vectors divided (NB 'matrix division') by the sum of the information matrices". If one uses the conventional algorithm (IRLS) to update generalized linear models from iteration to iteration this quantity happens to be precisely the vector to be added to the current value of beta.vector (i.e. beta.vector[1]) to obtain beta.vector[2] which is the improved estimate of the beta.vector to be used in iteration 2. This updating algorithm is often called the IRLS (Iterative Reweighted Least Squares) algorithm - which is closely related to the Newton Raphson approach but uses the expected information rather than the observed information.

(5) Repeat steps 2-4 until the model converges (using the standard R convergence criterion)

NB An alternative way to coherently pool the glm across multiple sources is to fit each glm to completion (i.e. multiple iterations until convergence) in each source and then return the final parameter estimates and standard errors to the CS where they can be pooled using study-level meta-analysis. The alternative function ds.glm.REMA fits the glms to completion in each source and returns the final estimates and standard errors (rather than score vectors and information matrices) and we currently use the function ds.metaFor derived from functions in the package metafor (Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 148. http://www.jstatsoft.org/v36/i03/)

The main elements of the output returned by ds.glm are listed (below) as Example 1 under 'examples'.

Burton,P;Gaye,A;Laflamme,P

ds.glmREMA

ds.lexis for survival analysis using piecewise exponential regression

ds.gee for generalized estimating equation models

{
#  Example 1:
#The components of the standard output from ds.glm are listed below. Additional explanatory material for some
#components appears in square brackets in CAPITALIZED font. The model fitted here is a poisson regression model -
#a piecewise exponential regression model. It regresses a 1/0 ([died,failed]/[survived,censored]) numeric outcome held in the variable 'EVENT'
#on SEXF (a factor with male=0 and female=1) and AGE.60 ([age - 60] in years). The '*' in the formula denotes inclusion of both main effects
#and interactions for these covariates. When this formula is interpreted by ds.glm itself it separates these out for clarity
#when constructing the list of regression coefficients in the model. So, e.g., the final coefficient is named 'SEXF1:AGE.60' where
#the notation ':' indicates the interaction between SEXF and AGE.60. TID.f is a six level factor that allows the baseline hazard
#to vary across six different time periods that are precisely the same in each study: in this particular example the periods
#are 0-2 years, 2-3 years, 3-6 yrs, 6-6.5 years, 6.5-8 years and 8-10 years. The '1' in the formula explicitly forces the model
#to include an intercept, which means that the baseline (log) rate of death in each period may be obtained as the sum of
#the coefficient for the intercept + the coefficient for the corresponding period. So, for example, the baseline (log)rate
#of death in period 1 is approximately -0.988 + (-1.114) = -2.102. If this is exponentiated one obtains the value 0.1222
#an estimated event rate of 0.1222 events per annum. If the notation '0' had been used instead of '1', no intercept would
#have been fitted and the coefficient for period 2 would have been -2.102 directly estimating the baseline log rate in
#period 2.
#
#$Nvalid [TOTAL NUMBER OF VALID OBSERVATIONAL UNITS ACROSS ALL STUDIES]
#[1] 6254
#
#$Nmissing [TOTAL NUMBER OF OBSERVATIONAL UNITS ACROSS ALL STUDIES WITH AT LEAST ONE DATA ITEM MISSING]
#[1] 134
#
#$Ntotal [TOTAL OBSERVATIONAL UNITS ACROSS ALL STUDIES - SUM OF VALID AND MISSING UNITS]
#[1] 6388
#
#$disclosure.risk
#       RISK OF DISCLOSURE [THE VALUE 1 INDICATES THAT ONE OF THE DISCLOSURE TRAPS HAS BEEN TRIGGERED IN THAT STUDY]
#study1                  0
#study2                  0
#study3                  0
#
#$errorMessage
#       ERROR MESSAGES [EXPLANATION FOR ANY ERRORS OR DISCLOSURE RISKS IDENTIFIED]
#study1 "No errors"
#study2 "No errors"
#study3 "No errors"
#
#$nsubs [TOTAL NUMBER OF OBSERVATIONAL UNITS USED BY ds.glm - NB USUALLY SAME AS $Nvalid]
#[1] 6254
#
#$iter [TOTAL NUMBER OF ITERATIONS BEFORE CONVERGENCE ACHIEVED]
#[1] 9
#
#$family [ERROR FAMILY AND LINK FUNCTION]
#Family: poisson
#Link function: log
#
#
#$formula [MODEL FORMULA]
#[1] "EVENT ~ 1 + TID.f + SEXF * AGE.60 + offset(LOG.SURV)"
#
#
#[ESTIMATED COEFFICIENTS AND STANDARD ERRORS etc EXPANDED WITH ESTIMATED CONFIDENCE INTERVALS
#WITH % COVERAGE SPECIFIED BY CI= ARGUMENT. FOR POISSON MODEL, OUTPUT IS GENERATED ON SCALE OF LINEAR PREDICTOR (LOG RATES AND LOG RATE RATIOS)
#AND ON THE NATURAL SCALE AFTER EXPONENTIATION (RATES AND RATE RATIOS)]
#$coefficients
#                  Estimate  Std. Error      z-value       p-value low0.95CI.LP high0.95CI.LP EXPONENTIATED RR low0.95CI.EXP high0.95CI.EXP
#(Intercept)  -0.9883668593 0.040709331 -24.27863194 3.296958e-130 -1.068155681  -0.908578037       0.37218402    0.34364172      0.4030970
#TID.f2       -1.1135769332 0.113156582  -9.84102663  7.494215e-23 -1.335359758  -0.891794109       0.32838226    0.26306352      0.4099197
#TID.f3       -2.4132187489 0.145110041 -16.63026714  4.207143e-62 -2.697629203  -2.128808294       0.08952667    0.06736503      0.1189790
#TID.f4       -0.9834247291 0.202670688  -4.85232836  1.220204e-06 -1.380651979  -0.586197480       0.37402796    0.25141458      0.5564391
#TID.f5       -1.2752840562 0.152704468  -8.35132119  6.750358e-17 -1.574579313  -0.975988799       0.27935161    0.20709466      0.3768196
#TID.f6       -1.0459413608 0.141295558  -7.40250701  1.336371e-13 -1.322875566  -0.769007156       0.35136091    0.26636824      0.4634730
#SEXF1        -0.6239830856 0.060097042 -10.38292508  2.965473e-25 -0.741771124  -0.506195048       0.53580602    0.47626964      0.6027848
#AGE.60        0.0428295263 0.002671504  16.03198917  7.639759e-58  0.037593474   0.048065578       1.04375995    1.03830905      1.0492395
#SEXF1:AGE.60 -0.0003408707 0.004111513  -0.08290639  9.339260e-01 -0.008399288   0.007717547       0.99965919    0.99163589      1.0077474
#
#$dev [RESIDUAL DEVIANCE]
#[1] 5266.551
#
#$df [RESIDUAL DEGREES OF FREEDOM - NB RESIDUAL DEGREES OF FREEDOM + NUMBER OF PARAMETERS IN MODEL = $nsubs]
#[1] 6245
#
#$output.information [REMINDER TO USER THAT THERE IS MORE INFORMATION AT THE TOP OF THE OUTPUT]
#[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
#
#  Example 2:
#EXAMPLE 2 DIFFERS FROM EXAMPLE 1 PRIMARILY IN HAVING '0' RATHER THAN '1' IN THE MODEL FORMULA. THIS FORCES THE MODEL
#NOT TO INCLUDE A REGRESSION INTERCEPT. BECAUSE THE SECOND COVARIATE TID.f IS A FACTOR, REMOVAL OF THE INTERCEPT IN
#THIS WAY MEANS THAT THE REGRESSION PARAMETERS ASSOCIATED WITH THE TID.F COVARIATE EACH DIRECTLY ESTIMATE THE LOG RATE IN
#EACH TIME PERIOD. SO, FOR EXAMPLE, TID.f2 DIRECTLY ESTIMATES THE LOG RATE IN PERIOD 2, I.E. 2.1019 WHICH IS PRECISELY
#EQUIVALENT TO THE ESTIMATE DERIVED FROM THE SUM OF INTERCEPT AND TID.f2 IN EXAMPLE 1 (ABOVE).
#
#
#$nsubs
#[1] 6254
#
#$iter
#[1] 9
#
#$family
#Family: poisson
#Link function: log
#
#
#$formula
#[1] "EVENT ~ 0 + TID.f + SEXF * AGE.60 + offset(LOG.SURV)"
#
#$coefficients
#                  Estimate  Std. Error      z-value       p-value low0.95CI.LP high0.95CI.LP EXPONENTIATED RR low0.95CI.EXP high0.95CI.EXP
#TID.f1       -0.9883668593 0.040709331 -24.27863194 3.296958e-130 -1.068155681  -0.908578037       0.37218402    0.34364172     0.40309701
#TID.f2       -2.1019437924 0.111730815 -18.81257014  5.957806e-79 -2.320932166  -1.882955419       0.12221863    0.09818202     0.15213980
#TID.f3       -3.4015856082 0.144060220 -23.61224772 2.884936e-123 -3.683938452  -3.119232765       0.03332039    0.02512383     0.04419106
#TID.f4       -1.9717915884 0.201948314  -9.76384278  1.609388e-22 -2.367603011  -1.575980166       0.13920723    0.09370507     0.20680475
#TID.f5       -2.2636509154 0.151818841 -14.91021073  2.828585e-50 -2.561210376  -1.966091454       0.10397020    0.07721123     0.14000300
#TID.f6       -2.0343082201 0.140582544 -14.47056064  1.859503e-47 -2.309844942  -1.758771498       0.13077092    0.09927664     0.17225635
#SEXF1        -0.6239830856 0.060097042 -10.38292508  2.965473e-25 -0.741771124  -0.506195048       0.53580602    0.47626964     0.60278479
#AGE.60        0.0428295263 0.002671504  16.03198917  7.639759e-58  0.037593474   0.048065578       1.04375995    1.03830905     1.04923946
#SEXF1:AGE.60 -0.0003408707 0.004111513  -0.08290639  9.339260e-01 -0.008399288   0.007717547       0.99965919    0.99163589     1.00774740
#
#$dev
#[1] 5266.551
#
#$df
#[1] 6245
#
#$output.information
#[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
#
#  Example 3:
#EXAMPLE 3 DIFFERS FROM EXAMPLE 2 PRIMARILY BECAUSE IT INCLUDES A REGRESSION WEIGHT VECTOR. IN THIS PARTICULAR
#CASE THE VECTOR WEIGHTS4 IS SIMPLY A VECTOR OF 4s AND, FROM A THEORETICAL PERSPECTIVE, THIS SHOULD SIMPLY MAKE EACH SINGLE
#OBSERVATION EQUIVALENT TO 4 OBSERVATIONS. THIS SHOULD INCREASE THE DEVIANCE AND RESIDUAL DEVIANCE BY A FACTOR OF FOUR
#WHICH IT DOES (5266.551*4=21066.2). FURTHERMORE, BECAUSE THE STANDARD ERROR OF PARAMETER ESTIMATES IS INVERSELY RELATED TO THE
#SQUARE ROOT OF THE NUMBER OF OBSERVATIONS, EACH STANDARD ERROR SHOULD BE HALVED (1/sqrt(4) = 0.5) WHICH CAN AGAIN BE SEEN
#TO BE TRUE, AND THE ESTIMATED VALUE OF z-STATISTICS DOUBLED.
#
#$iter
#[1] 9
#
#$family
#
#Family: poisson
#Link function: log
#
#
#$formula
#[1] "EVENT ~ 0 + TID.f + SEXF * AGE.60 + offset(LOG.SURV) + weights(WEIGHTS4)"
#
#$coefficients
#                  Estimate  Std. Error     z-value       p-value low0.95CI.LP high0.95CI.LP EXPONENTIATED RR low0.95CI.EXP high0.95CI.EXP
#TID.f1       -0.9883668593 0.020354665 -48.5572639  0.000000e+00  -1.02826127  -0.948472448       0.37218402    0.35762824     0.38733224
#TID.f2       -2.1019437924 0.055865407 -37.6251403  0.000000e+00  -2.21143798  -1.992449606       0.12221863    0.10954301     0.13636099
#TID.f3       -3.4015856082 0.072030110 -47.2244954  0.000000e+00  -3.54276203  -3.260409187       0.03332039    0.02893330     0.03837269
#TID.f4       -1.9717915884 0.100974157 -19.5276856  6.386915e-85  -2.16969730  -1.773885877       0.13920723    0.11421218     0.16967238
#TID.f5       -2.2636509154 0.075909421 -29.8204215 2.123826e-195  -2.41243065  -2.114871185       0.10397020    0.08959725     0.12064883
#TID.f6       -2.0343082201 0.070291272 -28.9411213 3.629743e-184  -2.17207658  -1.896539859       0.13077092    0.11394076     0.15008704
#SEXF1        -0.6239830856 0.030048521 -20.7658502  8.815359e-96  -0.68287710  -0.565089067       0.53580602    0.50516150     0.56830953
#AGE.60        0.0428295263 0.001335752  32.0639783 1.401856e-225   0.04021150   0.045447552       1.04375995    1.04103093     1.04649612
#SEXF1:AGE.60 -0.0003408707 0.002055757  -0.1658128  8.683043e-01  -0.00437008   0.003688338       0.99965919    0.99563946     1.00369515
#
#$dev
#[1] 21066.2
#
#$df
#[1] 6245
#
#$output.information
#[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
#
# load the file that contains the login details
#data(glmLoginData)
#
# login and assign all the variables to R
#opals <- datashield.login(logins=glmLoginData, assign=TRUE)
#
# EXAMPLE 4: RUN A LOGISTIC REGRESSION WITHOUT INTERACTION (E.G. DIABETES PREDICTION USING BMI AND HDL LEVELS AND GENDER)
#mod <- ds.glm(formula='D$DIS_DIAB~D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
#mod
#
# EXAMPLE 5: RUN THE SAME GLM MODEL WITHOUT AN INTERCEPT
# (produces separate baseline estimates for Male and Female)
#mod <- ds.glm(formula='D$DIS_DIAB~0+D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
#mod
#
# EXAMPLE 6: RUN THE SAME GLM MODEL WITH AN INTERACTION BETWEEN GENDER AND PM_BMI_CONTINUOUS
#mod <- ds.glm(formula='D$DIS_DIAB~D$GENDER*D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
#mod
#
# Example 7: FIT A STANDARD GAUSSIAN LINEAR MODEL WITH AN INTERACTION
#mod <- ds.glm(formula='D$PM_BMI_CONTINUOUS~D$DIS_DIAB*D$GENDER+D$LAB_HDL', family='gaussian')
#mod
#
# clear the Datashield R sessions and logout
#datashield.logout(opals)
}