
distribglm: Distributed Computing Model from a Synced Folder


The goal of distribglm is to provide an example of a Distributed Computing Model from a Synced Folder. In this case, “synced folder” usually means Dropbox or Box Sync.

Installation

You can install the released version of distribglm from CRAN with:

install.packages("distribglm")

or the development version from GitHub with:

remotes::install_github("muschellij2/distribglm")

Example

This is a basic example which shows you how to solve a common problem:

library(distribglm)
## basic example code

To set up a model, run setup_model.R. Currently, it has only been tested on logistic regression. This script ensures the correct folders are created in the synced folder. Once the folders are synced, each hospital/site should run the secondary.R file.
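
As a rough sketch, setup_model.R might look like the following. The setup_model() arguments shown here are assumptions about the interface rather than a confirmed signature, and the model name, formula columns, and site names are hypothetical.

library(distribglm)

# Hypothetical contents of setup_model.R: register a logistic regression
# model on the synced folder. Argument names are assumed, not confirmed.
setup_model(
  model_name     = "logistic_example",                      # hypothetical model name
  synced_folder  = "~/Dropbox/Projects/distributed_model",  # shared synced path
  formula        = y ~ age + treatment,                     # columns must exist at every site
  family         = binomial(),                              # logistic regression
  all_site_names = c("site1", "site2")
)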

Hospital/Site Configuration

For each site, a few things need to be specified. First, the site_name must be set in the script; this can be something like site1 or my_hospital_name. Second, the site must specify the synced_folder path. Although this folder is synced across the sites and the computing site, it may live at a different path on each machine, for example ~/Dropbox/Projects/distributed_model, C:/My Documents/distributed_model, etc.

Additionally, the site may need to specify all_site_names, the names of all the sites computing this model. This requirement could be relaxed, but it is currently there so that each site can see what is going on: whether it is waiting for another site to finish computing or for the computing site to update the beta coefficients.

Third, the site needs to specify the data_source. In the cases above, we use a CSV for the data. The important point is that this data stays in a secure location: it is used for computation but is never placed in the synced folder. Lastly, the site needs to specify the model_name to indicate which model is being analyzed. This could be relaxed so that once a model is finalized/converged, all iterations and estimates are zipped up and the formula file is removed; if a site then finds that the formula no longer exists, its code will exit.

All sites need their data structured so that the column names match exactly across sites for the formula specification to work. An initial data-check step could be added as well.
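
Putting these settings together, the top of a site's secondary.R might look like the sketch below. The variable names mirror the settings described above; the paths, site names, and model name are hypothetical.

site_name      = "site1"                                   # this site's label
synced_folder  = "~/Dropbox/Projects/distributed_model"    # local path to the synced folder
all_site_names = c("site1", "site2")                       # every participating site
model_name     = "logistic_example"                        # model being analyzed

# data_source points at secure local data that is never placed in the synced folder
data_source = "~/secure_data/site1_data.csv"               # hypothetical path
data = read.csv(data_source)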

Computing Site Configuration

The computing/master site first needs to set up the model as described above (perhaps all site names should be specified here too). Like the sites, the computing site needs to specify the model_name indicating which model is being analyzed, the synced_folder path, and all_site_names. That should be all it takes to start the process and run it until the model converges.
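
The computing site's configuration is a subset of the site-side settings; this sketch reuses the same hypothetical names as above.

model_name     = "logistic_example"
synced_folder  = "~/Dropbox/Projects/distributed_model"
all_site_names = c("site1", "site2")
# From here the computing site loops: read each site's gradient file,
# average the gradients, write the updated beta, repeat until convergence.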

What actually happens?

Currently, the process is as follows (a worked sketch appears after the list):

  1. The data is read into R, and the model design matrix X is created.
  2. Current β estimates are pulled for the specified model. If no estimates exist, β is initialized to a vector of zeros.
  3. The fitted values ŷ = g⁻¹(Xβ) are computed, along with the residuals ε̂ = y − ŷ. Then δ̂ = Xᵀε̂ is created, and the pooled average δ̄ is used as the gradient vector to update β. Each site returns δ̂ and its sample size so that these can be combined into δ̄.
  4. The compute site gathers each site’s δ̂ and sample size, computes δ̄, and writes the updated β.
  5. The compute site checks whether δ̄ is below some tolerance level to determine convergence.
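
To make the steps concrete, here is a self-contained sketch of one iteration for logistic regression with two simulated sites. All data and helper names are invented for illustration and do not use the package’s internals.

set.seed(1)
expit = function(eta) 1 / (1 + exp(-eta))

# simulated local data at two hypothetical sites
X1 = cbind(1, rnorm(100)); y1 = rbinom(100, 1, 0.5)
X2 = cbind(1, rnorm(150)); y2 = rbinom(150, 1, 0.5)

beta = c(0, 0)                           # step 2: initialize beta to zeros

site_gradient = function(X, y, beta) {   # step 3, run independently at each site
  y_hat = expit(X %*% beta)              # fitted values g^{-1}(X beta)
  eps   = y - y_hat                      # residuals
  list(delta = crossprod(X, eps),        # delta-hat = X' eps
       n     = nrow(X))
}
g1 = site_gradient(X1, y1, beta)
g2 = site_gradient(X2, y2, beta)

# step 4, at the computing site: pool the gradients and update beta
delta_bar = (g1$delta + g2$delta) / (g1$n + g2$n)
beta      = beta + drop(delta_bar)       # unit step size for simplicity

# step 5: convergence when the pooled gradient is small
converged = max(abs(delta_bar)) < 1e-9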

Outstanding issues

  1. Standard errors
  2. Better diagnostics
  3. Use Fisher scoring instead of gradient descent for better convergence rates (https://statmath.wu.ac.at/courses/heather_turner/glmCourse_001.pdf); a sketch follows this list. This would address the standard-errors item as well.
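
For reference, here is a sketch of the Fisher scoring step that item 3 refers to, written for logistic regression; the function is illustrative and not part of the package. The inverse of the information matrix XᵀWX also supplies the standard errors from item 1.

fisher_step = function(X, y, beta) {
  mu    = drop(1 / (1 + exp(-X %*% beta)))      # fitted probabilities
  W     = mu * (1 - mu)                         # variance weights
  info  = crossprod(X, W * X)                   # Fisher information X' W X
  score = crossprod(X, y - mu)                  # score vector X' (y - mu)
  list(beta = beta + drop(solve(info, score)),  # one scoring step
       se   = sqrt(diag(solve(info))))          # standard errors from the information
}

In the distributed setting, each site would return its local XᵀWX and Xᵀ(y − μ̂), and the computing site would sum them across sites before taking the step.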

