R/ds.glm.b.R

#' 
#' @title ds.glm calling glmDS1, glmDS2
#' @description Fits a generalized linear model (glm) on data from a single or multiple sources
#' @details Fits a glm on data from a single source or from multiple sources. In the latter case the data
#' are co-analysed (when using ds.glm) by using an approach that is mathematically equivalent
#' to placing all individual-level
#' data from all sources in one central warehouse and analysing those data using the conventional glm()
#' function in R. In this situation marked heterogeneity between sources should be corrected
#' (where possible) with fixed effects. e.g. if each study in a (binary) logistic regression analysis
#' has an independent intercept, it is equivalent to allowing each study to have a different
#' baseline risk of disease. This may also be viewed as being an IP (individual person) meta-analysis
#' with fixed effects. An alternative function ds.glmREMA (ds.glm for Random Effects MetaAnalysis) is
#' available if the co-analysis is to be realised by separate fitting on each study followed by
#' random effects meta-analysis of the results.
#'
#' Privacy protected iterative fitting of a glm is explained here:
#'
#' (1) Begin with a guess for the coefficient vector to start iteration 1 (let's call it beta.vector[1]).
#' Using beta.vector[1], each source calculates the score vector (and information matrix) generated
#' by its data - given beta.vector[1] - as the sum of the score vector components (and the sum of the components
#' of the information matrix) derived from each individual data record in that source. NB in most models
#' the starting values in beta.vector[1] are set to be zero for all parameters.
#'
#' (2) Transmit the resultant score vector and information matrix from each source back to the clientside
#' server (CS) at the analysis centre. Let's denote SCORE[1][j] and INFORMATION.MATRIX[1][j] as the
#' score vector and information matrix generated by study j at the end of the 1st iteration.
#' 
#' (3) CS sums the score vectors, and equivalently the information matrices, across all studies
#' (i.e. j=1:S, where S is the number of studies). Note that,
#' given beta.vector[1], this gives precisely the same final sums
#' for the score vectors and information matrices as would have been obtained if all data had been in one
#' central warehoused database and the overall score vector and information matrix at the end of
#' the first iteration had been calculated
#' (as is standard) by simply summing across all individuals. The only difference is that
#' instead of directly adding all values across
#' all individuals, we first sum across all individuals in each data source and then sum those study
#' totals across all studies - i.e. this generates EXACTLY the same ultimate sums
#'
#' (4) CS then calculates sum(SCORES)%*%inverse(sum(INFORMATION.MATRICES)) - heuristically this may be
#' viewed as being "the sum of the score vectors divided (NB 'matrix division') by the sum of the
#' information matrices". If one uses the conventional algorithm (IRLS) 
#' to update generalized linear models from iteration to iteration this quantity happens to be
#' precisely the vector to be added to the current value of beta.vector (i.e. beta.vector[1]) to obtain
#' beta.vector[2] which is the improved estimate of the beta.vector to be used in iteration 2.
#' This updating algorithm is often  called the IRLS (Iterative Reweighted Least Squares) algorithm 
#' - which is closely related to the Newton Raphson approach but uses the expected information rather than
#' the observed information.
#'
#' (5) Repeat steps 2-4 until the model converges (using the standard R convergence criterion)
#'
#' NB An alternative way to coherently pool the glm across multiple sources is to fit each
#' glm to completion (i.e. multiple iterations until convergence) in each source and then return
#' the final parameter estimates and standard errors to the CS where they can be pooled using
#' study-level meta-analysis. The alternative function ds.glm.REMA fits the glms to completion
#' in each source and returns the final estimates and standard errors (rather than score vectors
#' and information matrices) and we currently use the function ds.metaFor derived from functions in the
#' package metafor (Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package.
#' Journal of Statistical Software, 36(3), 1–48. http://www.jstatsoft.org/v36/i03/)
#'
#' @param formula Denotes an object of class formula which is a character string which describes
#' the model to be fitted. Most shortcut notation allowed by R's standard glm() function is
#' also allowed by ds.glm. Many glms can be fitted very simply using a formula like:
#' "y~a+b+c+d" which simply means fit a glm with y as the outcome variable with a, b, c and d as
#' covariates. By default all such models also include an intercept (regression constant) term.
#' If all you need to fit are straightforward models such as these, you do not need to read the
#' remainder of this information about "formula". But if you need to fit a more complex model in a
#' customised way, the next paragraph gives a few additional pointers.
#'
#' As an example, the formula: "EVENT~1+TID+SEXF*AGE.60" denotes fit a glm with the variable "EVENT" as
#' its outcome with covariates TID (in this case a 6 level factor [categorical] variable denoting
#' "time period" with values between 1 and 6), SEXF (also a factor variable denoting sex
#' and AGE.60 (a quantitative variable representing age-60 in years). The term "1" forces
#' the model to include an intercept term which it would also have done by default (see above)
#' but using "1" may usefully be contrasted with using "0" (as explained below).
#' The "*" between SEXF and AGE.60
#' means fit all possible main effects and interactions for and between those two covariates.
#' As SEXF is a factor this is equivalent to writing SEXF+AGE.60+SEXF1:AGE.60 (the last element being)
#' the interaction term representing the product of SEXF level 1 (in this case female) and AGE.60.
#' If the formula had instead been written as : "EVENT~0+TID+SEXF*AGE.60" the 0 would mean do NOT fit
#' an intercept term and, because TID happens to be a six level factor this would mean
#' that the first six model parameters which were originally intercept+TID2+TID3+TID4+TID5+TID6
#' using the first formula will now become TID1+TID2+TID3+TID4+TID5+TID6. Conveniently, this means
#' that the effect of each
#' time period may now be estimated directly. For example, the effect of time period 3 is now obtained
#' directly as TID3 rather than intercept+TID3 as was the case using the original formula. 
#' @param family This argument identifies the error distribution function to use in the model. At present
#' ds.glm has been written to fit family="gaussian" (i.e. a conventional linear model, family="binomial"
#' (i.e. a conventional [unconditional] logistic regression model), and family = "poisson" (i.e. a
#' Poisson regression model - of which perhaps the most commonly used application is for survival analysis
#' using Piecewise Exponential Regression (PER) which typically closely approximates Cox regression in its
#' main estimates and standard errors. More information about PER can be found in the help folder for
#' the ds.lexis function which sets up the data structure for a PER. At present the gaussian family is
#' automatically coupled with an 'identity' link function, the binomial family with a 'logistic' link
#' function and the poisson family with a 'log' link function. For the majority of applications typically
#' encountered in epidemiology and medical statistics, one  these three classes of models will
#' generally be what you need. However, if a particular user wishes us to implement an alternative family
#' (e.g. 'gamma') or an alternative family/link combination (e.g. binomial with probit) we can discuss
#' how best to meet that request: it will almost certainly be possible, but we may seek a small amount
#' of funding or practical programming support from the user in order to ensure that it can be carried out
#' in a timely manner 
#' @param startBetas A vector specifying starting values for the regression coefficients which (formally)
#' make up the linear predictor. By default, a vector of zeros (see above) 
#' @param offset  A character string specifying the name of a variable to be used as an offset (effectively
#' a component of the glm which has a known coefficient a-priori and so does not need to be 
#' estimated by the model). As an example, an offset is needed to fit a piecewise
#' exponential regression model. Unlike the standard glm() function in R, ds.glm() only allows an offset
#' to be set using the offset= argument, it CANNOT be included directly in the formula via notation
#' such as  "y~a+b+c+d+offset(offset.vector.name)". In ds.glm this must be specified as:
#' formula="y~a+b+c+d", ..., offset="offset.vector.name" and ds.glm then incorporates it appropriately
#' into the formula itself.
#' @param weights A character string specifying the name of a variable containing prior regression
#' weights for the fitting process. Like offset, ds.glm does not allow a weights vector to be
#' written directly into the glm formula. Using weights provides an alternative way to fit PER models
#' if you want to avoid using an offset, but this approach may be viewed as less 'elegant'
#' @param data A character string specifying the name of an (optional) dataframe that contains
#' all of the variables in the glm formula. This avoids you having to specify the name of the
#' dataframe in front of each covariate in the formula e.g. if the dataframe is called 'DataFrame' you 
#' avoid having to write: "DataFrame$y~DataFrame$a+DataFrame$b+DataFrame$c+DataFrame$d"
#' Processing stops if a non existing data frame is indicated. 
#' @param checks This argument is a boolean. If TRUE ds.glm then undertakes a series of checks of
#' the structural integrity of the model that can take several minutes. Specifically it verifies that the 
#' variables in the model are all defined (exist) on the server site at every study
#' and that they have the correct characteristics required to fit a GLM. The default value is FALSE
#  because the checks markedly increase and so it is suggested that they are only made TRUE
#' if an unexplained problem in the model fit is encountered.
#' @param maxit A numeric scalar denoting the maximum number of iterations that are permitted
#' before ds.glm declares that the model has failed to converge. Logistic regression and Poisson regression
#' models can require many iterations, particularly if the starting value of the regression constant is
#' far away from its actual value that the glm is trying to estimate. In consequence we often set
#' maxit=30 - but depending on the nature of the models you wish to fit, you may wish to be alerted
#' much more quickly than this if there is a delay in convergence, or you may wish to all MORE iterations.
#' @param CI a numeric, the confidence interval.
#' @param viewIter a boolean, tells whether the results of the intermediate iterations
#' should be printed on screen or not. Default is FALSE (i.e. only final results are shown).
#' @param datasources a list of opal object(s) obtained after login to opal servers;
#' these objects also hold the data assigned to R, as a \code{dataframe}, from opal datasources.
#' @return The main elements of the output returned by ds.glm are
#' listed (below) as Example 1 under 'examples'. 
#' 
#' @author Burton,P;Gaye,A;Laflamme,P
#' @seealso \link{ds.glmREMA} 
#' @seealso \link{ds.lexis} for survival analysis using piecewise exponential regression
#' @seealso \link{ds.gee} for generalized estimating equation models
#' @export
#' @examples {
#' #  Example 1:
#' #The components of the standard output from ds.glm are listed below. Additional explanatory material for some
#' #components appears in square brackets in CAPITALIZED font. The model fitted here is a poisson regression model -
#' #a piecewise exponential regression model. It regresses a 1/0 ([died,failed]/[survived,censored]) numeric outcome held in the variable 'EVENT'
#' #on SEXF (a factor with male=0 and female=1) and AGE.60 ([age - 60] in years). The '*' in the formula denotes inclusion of both main effects
#' #and interactions for these covariates. When this formula is interpreted by ds.glm itself it separates these out for clarity
#' #when constructing the list of regression coefficients in the model. So, e.g., the final coefficient is named 'SEXF1:AGE.60' where
#' #the notation ':' indicates the interaction between SEXF and AGE.60. TID.f is a six level factor that allows the baseline hazard
#' #to vary across six different time periods that are precisely the same in each study: in this particular example the periods
#' #are 0-2 years, 2-3 years, 3-6 yrs, 6-6.5 years, 6.5-8 years and 8-10 years. The '1' in the formula explicitly forces the model
#' #to include an intercept, which means that the baseline (log) rate of death in each period may be obtained as the sum of
#' #the coefficient for the intercept + the coefficient for the corresponding period. So, for example, the baseline (log)rate
#' #of death in period 1 is approximately -0.988 + (-1.114) = -2.102. If this is exponentiated one obtains the value 0.1222
#' #an estimated event rate of 0.1222 events per annum. If the notation '0' had been used instead of '1', no intercept would 
#' #have been fitted and the coefficient for period 2 would have been -2.102 directly estimating the baseline log rate in
#' #period 2.
#' #
#' #$Nvalid [TOTAL NUMBER OF VALID OBSERVATIONAL UNITS ACROSS ALL STUDIES]
#' #[1] 6254
#' #
#' #$Nmissing [TOTAL NUMBER OF OBSERVATIONAL UNITS ACROSS ALL STUDIES WITH AT LEAST ONE DATA ITEM MISSING]
#' #[1] 134
#' #
#' #$Ntotal [TOTAL OBSERVATIONAL UNITS ACROSS ALL STUDIES - SUM OF VALID AND MISSING UNITS]
#' #[1] 6388
#' #
#' #$disclosure.risk
#' #       RISK OF DISCLOSURE [THE VALUE 1 INDICATES THAT ONE OF THE DISCLOSURE TRAPS HAS BEEN TRIGGERED IN THAT STUDY]
#' #study1                  0
#' #study2                  0
#' #study3                  0
#' #
#' #$errorMessage
#' #       ERROR MESSAGES [EXPLANATION FOR ANY ERRORS OR DISCLOSURE RISKS IDENTIFIED]
#' #study1 "No errors"   
#' #study2 "No errors"   
#' #study3 "No errors"   
#' #
#' #$nsubs [TOTAL NUMBER OF OBSERVATIONAL UNITS USED BY ds.glm - NB USUALLY SAME AS $Nvalid]
#' #[1] 6254
#' #
#' #$iter [TOTAL NUMBER OF ITERATIONS BEFORE CONVERGENCE ACHIEVED]
#' #[1] 9
#' #
#' #$family [ERROR FAMILY AND LINK FUNCTION]
#' #Family: poisson
#' #Link function: log 
#' #
#' #
#' #$formula [MODEL FORMULA]
#' #[1] "EVENT ~ 1 + TID.f + SEXF * AGE.60 + offset(LOG.SURV)"
#' #
#' #
#' #[ESTIMATED COEFFICIENTS AND STANDARD ERRORS etc EXPANDED WITH ESTIMATED CONFIDENCE INTERVALS
#' #WITH % COVERAGE SPECIFIED BY CI= ARGUMENT. FOR POISSON MODEL, OUTPUT IS GENERATED ON SCALE OF LINEAR PREDICTOR (LOG RATES AND LOG RATE RATIOS)
#' #AND ON THE NATURAL SCALE AFTER EXPONENTIATION (RATES AND RATE RATIOS)]
#' #$coefficients 
#' #                  Estimate  Std. Error      z-value       p-value low0.95CI.LP high0.95CI.LP EXPONENTIATED RR low0.95CI.EXP high0.95CI.EXP
#' #(Intercept)  -0.9883668593 0.040709331 -24.27863194 3.296958e-130 -1.068155681  -0.908578037       0.37218402    0.34364172      0.4030970
#' #TID.f2       -1.1135769332 0.113156582  -9.84102663  7.494215e-23 -1.335359758  -0.891794109       0.32838226    0.26306352      0.4099197
#' #TID.f3       -2.4132187489 0.145110041 -16.63026714  4.207143e-62 -2.697629203  -2.128808294       0.08952667    0.06736503      0.1189790
#' #TID.f4       -0.9834247291 0.202670688  -4.85232836  1.220204e-06 -1.380651979  -0.586197480       0.37402796    0.25141458      0.5564391
#' #TID.f5       -1.2752840562 0.152704468  -8.35132119  6.750358e-17 -1.574579313  -0.975988799       0.27935161    0.20709466      0.3768196
#' #TID.f6       -1.0459413608 0.141295558  -7.40250701  1.336371e-13 -1.322875566  -0.769007156       0.35136091    0.26636824      0.4634730
#' #SEXF1        -0.6239830856 0.060097042 -10.38292508  2.965473e-25 -0.741771124  -0.506195048       0.53580602    0.47626964      0.6027848
#' #AGE.60        0.0428295263 0.002671504  16.03198917  7.639759e-58  0.037593474   0.048065578       1.04375995    1.03830905      1.0492395
#' #SEXF1:AGE.60 -0.0003408707 0.004111513  -0.08290639  9.339260e-01 -0.008399288   0.007717547       0.99965919    0.99163589      1.0077474
#' #
#' #$dev [RESIDUAL DEVIANCE]
#' #[1] 5266.551 
#' #
#' #$df [RESIDUAL DEGREES OF FREEDOM - NB RESIDUAL DEGREES OF FREEDOM + NUMBER OF PARAMETERS IN MODEL = $nsubs]
#' #[1] 6245
#' #
#' #$output.information [REMINDER TO USER THAT THERE IS MORE INFORMATION AT THE TOP OF THE OUTPUT]
#' #[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
#' #
#' #  Example 2:
#' #EXAMPLE 2 DIFFERS FROM EXAMPLE 1 PRIMARILY IN HAVING '0' RATHER THAN '1' IN THE MODEL FORMULA. THIS FORCES THE MODEL
#' #NOT TO INCLUDE A REGRESSION INTERCEPT. BECAUSE THE SECOND COVARIATE TID.f IS A FACTOR, REMOVAL OF THE INTERCEPT IN
#' #THIS WAY MEANS THAT THE REGRESSION PARAMETERS ASSOCIATED WITH THE TID.F COVARIATE EACH DIRECTLY ESTIMATE THE LOG RATE IN
#' #EACH TIME PERIOD. SO, FOR EXAMPLE, TID.f2 DIRECTLY ESTIMATES THE LOG RATE IN PERIOD 2, I.E. 2.1019 WHICH IS PRECISELY
#' #EQUIVALENT TO THE ESTIMATE DERIVED FROM THE SUM OF INTERCEPT AND TID.f2 IN EXAMPLE 1 (ABOVE).
#' #
#' #
#' #$nsubs
#' #[1] 6254
#' #
#' #$iter
#' #[1] 9
#' #
#' #$family
#' #Family: poisson 
#' #Link function: log 
#' #
#' #
#' #$formula
#' #[1] "EVENT ~ 0 + TID.f + SEXF * AGE.60 + offset(LOG.SURV)"
#' #
#' #$coefficients
#' #                  Estimate  Std. Error      z-value       p-value low0.95CI.LP high0.95CI.LP EXPONENTIATED RR low0.95CI.EXP high0.95CI.EXP
#' #TID.f1       -0.9883668593 0.040709331 -24.27863194 3.296958e-130 -1.068155681  -0.908578037       0.37218402    0.34364172     0.40309701
#' #TID.f2       -2.1019437924 0.111730815 -18.81257014  5.957806e-79 -2.320932166  -1.882955419       0.12221863    0.09818202     0.15213980
#' #TID.f3       -3.4015856082 0.144060220 -23.61224772 2.884936e-123 -3.683938452  -3.119232765       0.03332039    0.02512383     0.04419106
#' #TID.f4       -1.9717915884 0.201948314  -9.76384278  1.609388e-22 -2.367603011  -1.575980166       0.13920723    0.09370507     0.20680475
#' #TID.f5       -2.2636509154 0.151818841 -14.91021073  2.828585e-50 -2.561210376  -1.966091454       0.10397020    0.07721123     0.14000300
#' #TID.f6       -2.0343082201 0.140582544 -14.47056064  1.859503e-47 -2.309844942  -1.758771498       0.13077092    0.09927664     0.17225635
#' #SEXF1        -0.6239830856 0.060097042 -10.38292508  2.965473e-25 -0.741771124  -0.506195048       0.53580602    0.47626964     0.60278479
#' #AGE.60        0.0428295263 0.002671504  16.03198917  7.639759e-58  0.037593474   0.048065578       1.04375995    1.03830905     1.04923946
#' #SEXF1:AGE.60 -0.0003408707 0.004111513  -0.08290639  9.339260e-01 -0.008399288   0.007717547       0.99965919    0.99163589     1.00774740
#' #
#' #$dev
#' #[1] 5266.551
#' #
#' #$df
#' #[1] 6245
#' #
#' #$output.information
#' #[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
#' #
#' #  Example 3:
#' #EXAMPLE 3 DIFFERS FROM EXAMPLE 2 PRIMARILY BECAUSE IT INCLUDES A REGRESSION WEIGHT VECTOR. IN THIS PARTICULAR
#' #CASE THE VECTOR WEIGHTS4 IS SIMPLY A VECTOR OF 4s AND, FROM A THEORETICAL PERSPECTIVE, THIS SHOULD SIMPLY MAKE EACH SINGLE
#' #OBSERVATION EQUIVALENT TO 4 OBSERVATIONS. THIS SHOULD INCREASE THE DEVIANCE AND RESIDUAL DEVIANCE BY A FACTOR OF FOUR
#' #WHICH IT DOES (5266.551*4=21066.2). FURTHERMORE, BECAUSE THE STANDARD ERROR OF PARAMETER ESTIMATES IS INVERSELY RELATED TO THE
#' #SQUARE ROOT OF THE NUMBER OF OBSERVATIONS, EACH STANDARD ERROR SHOULD BE HALVED (1/sqrt(4) = 0.5) WHICH CAN AGAIN BE SEEN
#' #TO BE TRUE, AND THE ESTIMATED VALUE OF z-STATISTICS DOUBLED.
#' #
#' #$iter
#' #[1] 9
#' #
#' #$family
#' #
#' #Family: poisson 
#' #Link function: log 
#' #
#' #
#' #$formula
#' #[1] "EVENT ~ 0 + TID.f + SEXF * AGE.60 + offset(LOG.SURV) + weights(WEIGHTS4)"
#' #
#' #$coefficients
#' #                  Estimate  Std. Error     z-value       p-value low0.95CI.LP high0.95CI.LP EXPONENTIATED RR low0.95CI.EXP high0.95CI.EXP
#' #TID.f1       -0.9883668593 0.020354665 -48.5572639  0.000000e+00  -1.02826127  -0.948472448       0.37218402    0.35762824     0.38733224
#' #TID.f2       -2.1019437924 0.055865407 -37.6251403  0.000000e+00  -2.21143798  -1.992449606       0.12221863    0.10954301     0.13636099
#' #TID.f3       -3.4015856082 0.072030110 -47.2244954  0.000000e+00  -3.54276203  -3.260409187       0.03332039    0.02893330     0.03837269
#' #TID.f4       -1.9717915884 0.100974157 -19.5276856  6.386915e-85  -2.16969730  -1.773885877       0.13920723    0.11421218     0.16967238
#' #TID.f5       -2.2636509154 0.075909421 -29.8204215 2.123826e-195  -2.41243065  -2.114871185       0.10397020    0.08959725     0.12064883
#' #TID.f6       -2.0343082201 0.070291272 -28.9411213 3.629743e-184  -2.17207658  -1.896539859       0.13077092    0.11394076     0.15008704
#' #SEXF1        -0.6239830856 0.030048521 -20.7658502  8.815359e-96  -0.68287710  -0.565089067       0.53580602    0.50516150     0.56830953
#' #AGE.60        0.0428295263 0.001335752  32.0639783 1.401856e-225   0.04021150   0.045447552       1.04375995    1.04103093     1.04649612
#' #SEXF1:AGE.60 -0.0003408707 0.002055757  -0.1658128  8.683043e-01  -0.00437008   0.003688338       0.99965919    0.99563946     1.00369515
#' #
#' #$dev
#' #[1] 21066.2
#' #
#' #$df
#' #[1] 6245
#' #
#' #$output.information
#' #[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
#' #
#' # load the file that contains the login details
#' #data(glmLoginData)
#' #
#' # login and assign all the variables to R
#' #opals <- datashield.login(logins=glmLoginData, assign=TRUE)
#' #
#' # EXAMPLE 4: RUN A LOGISTIC REGRESSION WITHOUT INTERACTION (E.G. DIABETES PREDICTION USING BMI AND HDL LEVELS AND GENDER)
#' #mod <- ds.glm(formula='D$DIS_DIAB~D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
#' #mod
#' #
#' # EXAMPLE 5: RUN THE SAME GLM MODEL WITHOUT AN INTERCEPT
#' # (produces separate baseline estimates for Male and Female)
#' #mod <- ds.glm(formula='D$DIS_DIAB~0+D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
#' #mod
#' #
#' # EXAMPLE 6: RUN THE SAME GLM MODEL WITH AN INTERACTION BETWEEN GENDER AND PM_BMI_CONTINUOUS
#' #mod <- ds.glm(formula='D$DIS_DIAB~D$GENDER*D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
#' #mod
#' #
#' # Example 7: FIT A STANDARD GAUSSIAN LINEAR MODEL WITH AN INTERACTION
#' #mod <- ds.glm(formula='D$PM_BMI_CONTINUOUS~D$DIS_DIAB*D$GENDER+D$LAB_HDL', family='gaussian')
#' #mod
#' #
#' # clear the Datashield R sessions and logout
#' #datashield.logout(opals) 
#' }
#'
ds.glm.b <- function(formula=NULL, data=NULL, family=NULL, offset=NULL, weights=NULL, checks=FALSE, maxit=15, CI=0.95, viewIter=FALSE, datasources=NULL) {
  
 # if no opal login details are provided look for 'opal' objects in the environment
  if(is.null(datasources)){
    datasources <- findLoginObjects()
  }
  
  # verify that 'formula' was set
  if(is.null(formula)){
    stop(" Please provide a valid regression formula!", call.=FALSE)
  }

# check if user gave offset or weights directly in formula, if so the argument 'offset' or 'weights'
# to provide name of offset or weights variable
    if(sum(as.numeric(grepl('offset', formula, ignore.case=TRUE)))>0 ||
       sum(as.numeric(grepl('weights', formula, ignore.case=TRUE)))>0)
	{
       cat("\n\n WARNING: you may have specified an offset or regression weights")
       cat("\n as part of the model formula. In ds.glm (unlike the usual glm in R)")
       cat("\n you must specify an offset or weights separately from the formula")
       cat("\n using the offset or weights argument.\n\n")
	}

  formula <- as.formula(formula)

  
  # check that 'family' was set
  if(is.null(family)){
    stop(" Please provide a valid 'family' argument!", call.=FALSE)
  }
  
  # if the argument 'data' is set, check that the data frame is defined (i.e. exists) on the server site
  if(!(is.null(data))){
    defined <- isDefined(datasources, data)
  }
  
  # beginning of optional checks - the process stops if any of these checks fails #
  if(checks){
    message(" -- Verifying the variables in the model")
    # call the function that checks the variables in the formula are defined (exist) on the server site and are not missing at complete
 
   glmChecks(formula, data, offset, weights, datasources)
  }else{
    #message("WARNING:'checks' is set to FALSE; variables in the model are not checked and error messages may not be intelligible!")
  }

#MOVE ITERATION COUNT BEFORE ASSIGNMENT OF beta.vect.next  
#Iterations need to be counted. Start off with the count at 0
  #and increment by 1 at each new iteration
  iteration.count<-0

  # number of 'valid' studies (those that passed the checks) and vector of beta values
  numstudies <- length(datasources)
 

#ARBITRARY LENGTH FOR START BETAs AT THIS STAGE BUT IN LEGAL TRANSMISSION FORMAT ("0,0,0,0,0")
 beta.vect.next <- rep(0,5)
 beta.vect.temp <- paste0(as.character(beta.vect.next), collapse=",")
  

#IDENTIFY THE CORRECT DIMENSION FOR START BETAs VIA CALLING FIRST COMPONENT OF glmDS
 
   cally1 <- call('glmDS1.b', formula, family, weights, data)
   
   study.summary.0 <- datashield.aggregate(datasources, cally1)
   num.par.glm<-study.summary.0[[1]][[1]][[2]]

coef.names<-study.summary.0[[1]][[2]]

y.invalid<-NULL
Xpar.invalid<-NULL
w.invalid<-NULL
glm.saturation.invalid<-NULL
errorMessage<-NULL

	for(ss in 1:numstudies)
	{
   	y.invalid<-c(y.invalid,study.summary.0[[ss]][[3]])
	Xpar.invalid<-rbind(Xpar.invalid,study.summary.0[[ss]][[4]])
   	w.invalid<-c(w.invalid,study.summary.0[[ss]][[5]])
      glm.saturation.invalid <-c(glm.saturation.invalid,study.summary.0[[ss]][[6]])
      errorMessage<-c(errorMessage,study.summary.0[[ss]][[7]])
	}

y.invalid<-as.matrix(y.invalid)
sum.y.invalid<-sum(y.invalid)
dimnames(y.invalid)<-list(names(datasources),"Y VECTOR")

Xpar.invalid<-as.matrix(Xpar.invalid)
sum.Xpar.invalid<-sum(Xpar.invalid)
dimnames(Xpar.invalid)<-list(names(datasources),coef.names)


w.invalid<-as.matrix(w.invalid)
sum.w.invalid<-sum(w.invalid)
dimnames(w.invalid)<-list(names(datasources),"WEIGHT VECTOR")

glm.saturation.invalid<-as.matrix(glm.saturation.invalid)
sum.glm.saturation.invalid<-sum(glm.saturation.invalid)
dimnames(glm.saturation.invalid)<-list(names(datasources),"MODEL OVERPARAMETERIZED")

errorMessage<-as.matrix(errorMessage)
dimnames(errorMessage)<-list(names(datasources),"ERROR MESSAGES")



output.blocked.information.1<-"MODEL FITTING TERMINATED AT FIRST ITERATION:"
output.blocked.information.2<-"ANY VALUES OF 1 IN THE FOLLOWING TABLES"
output.blocked.information.3<-"DENOTE POTENTIAL DISCLOSURE RISKS"


if(sum.y.invalid>0||sum.Xpar.invalid>0||sum.w.invalid>0||sum.glm.saturation.invalid>0){
    return(list(
		    output.blocked.information.1,
		    output.blocked.information.2,
		    output.blocked.information.3,
		    y.vector.error=y.invalid,
                X.matrix.error=Xpar.invalid,
                weight.vector.error=w.invalid,
                glm.overparamterized=glm.saturation.invalid,
	    	    errorMessage=errorMessage
                ))
   }

   

   beta.vect.next <- rep(0,num.par.glm)
   beta.vect.temp <- paste0(as.character(beta.vect.next), collapse=",")
 

  #Provide arbitrary starting value for deviance to enable subsequent calculation of the
  #change in deviance between iterations
  dev.old<-9.99e+99
  
  #Convergence state needs to be monitored.
  converge.state<-FALSE
  
  #Define a convergence criterion. This value of epsilon corresponds to that used
  #by default for GLMs in R (see section S3 for details)
  epsilon<-1.0e-08
  
  f<-NULL


  
  while(!converge.state && iteration.count < maxit) {
    
    iteration.count<-iteration.count+1
    
    message("Iteration ", iteration.count, "...")

#NOW CALL SECOND COMPONENT OF glmDS TO GENERATE SCORE VECTORS AND INFORMATION MATRICES
    cally2 <- call('glmDS2.b', formula, family, beta.vect=beta.vect.temp, offset, weights, data)

      study.summary <- datashield.aggregate(datasources, cally2)

  

#INTEGRATE RETURNED OUTPUT
      .select <- function(l, field) {
      lapply(l, function(obj) {obj[[field]]})
    }

    disclosure.risk.total<-Reduce(f="+", .select(study.summary, 'disclosure.risk'))

	disclosure.risk<-NULL
      errorMessage2<-NULL

	for(ss2 in 1:numstudies){
         disclosure.risk<-c(disclosure.risk,study.summary[[ss]][[9]])
         errorMessage2<-c(errorMessage2,study.summary[[ss]][[10]])
	}

    disclosure.risk<-as.matrix(disclosure.risk)
    dimnames(disclosure.risk)<-list(names(datasources),"RISK OF DISCLOSURE")


    errorMessage2<-as.matrix(errorMessage2)
    dimnames(errorMessage2)<-list(names(datasources),"ERROR MESSAGES")

    if(disclosure.risk.total>0){
	message("DISCLOSURE RISK IN y.vect, X.mat OR w.vect \nOR MODEL OVERPARAMETERIZED IN AT LEAST ONE STUDY")
	output.blocked.information.1<-"POTENTIAL DISCLOSURE RISK IN y.vect, X.mat OR w.vect"
	output.blocked.information.2<-"OR MODEL OVERPARAMETERIZED IN AT LEAST ONE STUDY."
	output.blocked.information.3<-"FURTHERMORE, CLIENTSIDE FUNCTION MAY HAVE BEEN MODIFIED"
	output.blocked.information.4<-"SO SCORE VECTOR AND INFORMATION MATRIX DESTROYED IN ALL AT-RISK STUDIES"

   	return(list(output.blocked.information.1,
			output.blocked.information.2,
			output.blocked.information.3,
			output.blocked.information.4,
	            ))
	}

   
    info.matrix.total<-Reduce(f="+", .select(study.summary, 'info.matrix'))
    score.vect.total<-Reduce(f="+", .select(study.summary, 'score.vect'))
    dev.total<-Reduce(f="+", .select(study.summary, 'dev'))
    Nvalid.total<-Reduce(f="+", .select(study.summary, 'Nvalid'))
    Nmissing.total<-Reduce(f="+", .select(study.summary, 'Nmissing'))
    Ntotal.total<-Reduce(f="+", .select(study.summary, 'Ntotal'))
 
    message("CURRENT DEVIANCE:      ", dev.total)
    
    if(iteration.count==1) {
      # Sum participants only during first iteration.
      nsubs.total<-Reduce(f="+", .select(study.summary, 'numsubs'))
      # Save family
      f <- study.summary[[1]]$family
    }
    
    #Create variance covariance matrix as inverse of information matrix
    variance.covariance.matrix.total<-solve(info.matrix.total)
    
    # Create beta vector update terms
    beta.update.vect<-variance.covariance.matrix.total %*% score.vect.total
    
    #Add update terms to current beta vector to obtain new beta vector for next iteration
	if(iteration.count==1)
	{
	 beta.vect.next<-rep(0,length(beta.update.vect))
	}
	
	beta.vect.next<-beta.vect.next+beta.update.vect

 	beta.vect.temp <- paste0(as.character(beta.vect.next), collapse=",")


    
    #Calculate value of convergence statistic and test whether meets convergence criterion
    converge.value<-abs(dev.total-dev.old)/(abs(dev.total)+0.1)
    if(converge.value<=epsilon)converge.state<-TRUE
    if(converge.value>epsilon)dev.old<-dev.total
    
    if(viewIter){
      #For ALL iterations summarise model state after current iteration
      message("SUMMARY OF MODEL STATE after iteration ", iteration.count)
      message("Current deviance ", dev.total," on ",(nsubs.total-length(beta.vect.next)), " degrees of freedom")
      message("Convergence criterion ",converge.state," (", converge.value,")")
      
      message("\nbeta: ", paste(as.vector(beta.vect.next), collapse=" "))
      
      message("\nInformation matrix overall:")
      message(paste(capture.output(info.matrix.total), collapse="\n"))
      
      message("\nScore vector overall:")
      message(paste(capture.output(score.vect.total), collapse="\n"))
      
      message("\nCurrent deviance: ", dev.total, "\n")
    }
  }
  if(!viewIter){
    #For ALL iterations summarise model state after current iteration
    message("SUMMARY OF MODEL STATE after iteration ", iteration.count)
    message("Current deviance ", dev.total," on ",(nsubs.total-length(beta.vect.next)), " degrees of freedom")
    message("Convergence criterion ",converge.state," (", converge.value,")")
    
    message("\nbeta: ", paste(as.vector(beta.vect.next), collapse=" "))
    
    message("\nInformation matrix overall:")
    message(paste(capture.output(info.matrix.total), collapse="\n"))
    
    message("\nScore vector overall:")
    message(paste(capture.output(score.vect.total), collapse="\n"))
    
    message("\nCurrent deviance: ", dev.total, "\n")
  }
  
  #If convergence has been obtained, declare final (maximum likelihood) beta vector,
  #and calculate the corresponding standard errors, z scores and p values
  #(the latter two to be consistent with the output of a standard GLM analysis)
  #Then print out final model summary
  if(converge.state)
  {
    family.identified<-0
    
    beta.vect.final<-beta.vect.next
    
    scale.par <- 1
    if(f$family== 'gaussian') {
      scale.par <- dev.total / (nsubs.total-length(beta.vect.next))
    }
    
    family.identified<-1
    se.vect.final <- sqrt(diag(variance.covariance.matrix.total)) * sqrt(scale.par)
    z.vect.final<-beta.vect.final/se.vect.final
    pval.vect.final<-2*pnorm(-abs(z.vect.final))
    parameter.names<-names(score.vect.total[,1])
    model.parameters<-cbind(beta.vect.final,se.vect.final,z.vect.final,pval.vect.final)
    dimnames(model.parameters)<-list(parameter.names,c("Estimate","Std. Error","z-value","p-value"))
    
    if(CI > 0)
    {
      ci.mult <- qnorm(1-(1-CI)/2)
      low.ci.lp <- model.parameters[,1]-ci.mult*model.parameters[,2]
      hi.ci.lp <- model.parameters[,1]+ci.mult*model.parameters[,2]
      estimate.lp <- model.parameters[,1]
      
      
      
      if(family=="gaussian"){
        estimate.natural <- estimate.lp
        low.ci.natural <- low.ci.lp
        hi.ci.natural <- hi.ci.lp
        name1 <- paste0("low",CI,"CI")
        name2 <- paste0("high",CI,"CI")
        ci.mat <- cbind(low.ci.lp,hi.ci.lp)
        dimnames(ci.mat) <- list(NULL,c(name1,name2))   
      }
      
      if(family=="binomial"){
        family.identified  <-  1
        num.parms <- length(low.ci.lp)
          name1 <- paste0("low",CI,"CI.LP")
          name2 <- paste0("high",CI,"CI.LP")
          name3 <- paste0("P_OR")
          name4 <- paste0("low",CI,"CI.P_OR")
          name5 <- paste0("high",CI,"CI.P_OR")
         estimate.natural <- exp(estimate.lp)/(1+exp(estimate.lp))
        low.ci.natural <- exp(low.ci.lp)/(1+exp(low.ci.lp))
        hi.ci.natural <- exp(hi.ci.lp)/(1+exp(hi.ci.lp))
        if(num.parms > 1){
          estimate.natural[2:num.parms] <- exp(estimate.lp[2:num.parms])
          low.ci.natural[2:num.parms] <- exp(low.ci.lp[2:num.parms])
          hi.ci.natural[2:num.parms] <- exp(hi.ci.lp[2:num.parms])
       }       
        ci.mat <- cbind(low.ci.lp,hi.ci.lp,estimate.natural,low.ci.natural,hi.ci.natural)
        dimnames(ci.mat) <- list(NULL,c(name1,name2,name3,name4,name5))
        
      }
      
      if(family=="poisson"){
        family.identified <- 1
        num.parms <- length(low.ci.lp)
        estimate.natural <- exp(estimate.lp)
        low.ci.natural <- exp(low.ci.lp)
        hi.ci.natural <- exp(hi.ci.lp)
        name1 <- paste0("low",CI,"CI.LP")
        name2 <- paste0("high",CI,"CI.LP")
        name3 <- paste0("EXPONENTIATED RR")
        name4 <- paste0("low",CI,"CI.EXP")
        name5 <- paste0("high",CI,"CI.EXP")
        ci.mat <- cbind(low.ci.lp,hi.ci.lp,estimate.natural,low.ci.natural,hi.ci.natural)
        dimnames(ci.mat) <- list(NULL,c(name1,name2,name3,name4,name5))        
      }
      
      if(family.identified==0)
      {
        estimate.natural <- estimate.lp
        low.ci.natural <- low.ci.lp
        hi.ci.natural <- hi.ci.lp
        name1 <- paste0("low",CI,"CI")
        name2 <- paste0("high",CI,"CI")
        ci.mat <- cbind(low.ci.lp,hi.ci.lp)
        dimnames(ci.mat) <- list(NULL,c(name1,name2))   
      }
      
    }
    
    model.parameters<-cbind(model.parameters,ci.mat)
    
    if(!is.null(offset)&&!is.null(weights)){
       formulatext <- paste0(Reduce(paste, deparse(formula)), paste0(" + offset(", offset, ")"), paste0(" + weights(", weights, ")"))
       }
    if(!is.null(offset)&&is.null(weights)){
       formulatext <- paste0(Reduce(paste, deparse(formula)), paste0(" + offset(", offset, ")"))
       }
    if(is.null(offset)&&!is.null(weights)){
       formulatext <- paste0(Reduce(paste, deparse(formula)), paste0(" + weights(", weights, ")"))
       }

    if(is.null(offset)&&is.null(weights)){
       formulatext <- Reduce(paste, deparse(formula))
       }
    
    
      glmds <- list(
			Nvalid=Nvalid.total,
			Nmissing=Nmissing.total,
			Ntotal=Ntotal.total,
			disclosure.risk=disclosure.risk,
                  errorMessage=errorMessage2,
	  		nsubs=nsubs.total,
 			iter=iteration.count,
	  		family=f,
      		formula=formulatext,
    		      coefficients=model.parameters,
      		dev=dev.total,
      		df=(nsubs.total-length(beta.vect.next)),
 
			output.information="SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
    			)
    
    class(glmds) <- 'glmds'
                                                                           
    return(glmds)
  } else {
    warning(paste("Did not converge after", maxit, "iterations. Increase maxit parameter as necessary."))
    return(NULL)
  }
  
}
#ds.glm.b
datashield/dsBetaTestClient5 documentation built on May 14, 2019, 7:49 p.m.