In difuture-lmu/dsROCGLM: ROC and calibration analysis for DataSHIELD

options(width = 80)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "Readme_files/"
)

pkgs = c("here", "opalr", "DSI", "DSOpal", "dsBaseClient")
for (pkg in pkgs) {
  if (! requireNamespace(pkg, quietly = TRUE))
    install.packages(pkg, repos = c(getOption("repos"), "https://cran.obiba.org"))
}
devtools::install(quiet = TRUE, upgrade = "always")

## Install packages on the DataSHIELD test machine:
surl     = "https://opal-demo.obiba.org/"
username = "administrator"
password = "password"

opal = opalr::opal.login(username = username, password = password, url = surl)

pkgs = c("dsPredictBase", "dsROCGLM")
for (pkg in pkgs) {
  check1 = opalr::dsadmin.install_github_package(opal = opal, pkg = pkg, username = "difuture-lmu")
  if (! check1)
    stop("[", Sys.time(), "] Was not able to install ", pkg, "!")

  check2 = opalr::dsadmin.publish_package(opal = opal, pkg = pkg)
  if (! check2)
    stop("[", Sys.time(), "] Was not able to publish methods of ", pkg, "!")
}

ROC-GLM for DataSHIELD

The package provides functionality to conduct and visualize ROC analysis on decentralized data. The basis is the DataSHIELD](https://www.datashield.org/) infrastructure for distributed computing. This package provides the calculation of the ROC-GLM as well as AUC confidence intervals. In order to calculate the ROC-GLM it is necessry to push models and predict them at the servers. This is done automatically by the base package dsPredictBase. Note that DataSHIELD uses an option datashield.privacyLevel to indicate the minimal amount of numbers required to be allowed to share an aggregated value of these numbers. Instead of setting the option, we directly retrieve the privacy level from the DESCRIPTION file each time a function calls for it. This options is set to 5 by default.

Installation

At the moment, there is no CRAN version available. Install the development version from GitHub:

remotes::install_github("difuture-lmu/dsROCGLM")

Register methods

It is necessary to register the assign and aggregate methods in the OPAL administration. These methods are registered automatically when publishing the package on OPAL (see DESCRIPTION).

Note that the package needs to be installed at both locations, the server and the analysts machine.

Usage

A more sophisticated example is available here.

library(DSI)
library(DSOpal)
library(dsBaseClient)

library(dsPredictBase)
library(dsROCGLM)

Log into DataSHIELD server

builder = newDSLoginBuilder()

surl     = "https://opal-demo.obiba.org/"
username = "administrator"
password = "password"

builder$append(
  server   = "ds1",
  url      = surl,
  user     = username,
  password = password
)
builder$append(
  server   = "ds2",
  url      = surl,
  user     = username,
  password = password
)

connections = datashield.login(logins = builder$build(), assign = TRUE)

Assign iris and validation vector at DataSHIELD (just for testing)

datashield.assign(connections, "iris", quote(iris))
datashield.assign(connections, "y", quote(c(rep(1, 50), rep(0, 100))))

Load test model, push to DataSHIELD, and calculate predictions

# Model predicts if species of iris is setosa or not.
iris$y = ifelse(iris$Species == "setosa", 1, 0)
mod = glm(y ~ Sepal.Length, data = iris, family = binomial())

# Push the model to the DataSHIELD servers using `dsPredictBase`:
pushObject(connections, mod)

# Calculate scores and save at the servers using `dsPredictBase`:
pfun =  "predict(mod, newdata = D, type = 'response')"
predictModel(connections, mod, "pred", "iris", predict_fun = pfun)

datashield.symbols(connections)

Calculate l2-sensitivity

# In order to securely calculate the ROC-GLM, we have to assess the
# l2-sensitivity to set the privacy parameters of differential
# privacy adequately:
l2s = dsL2Sens(connections, "iris", "pred")
l2s

# Due to the results presented in https://arxiv.org/abs/2203.10828, we set the privacy parameters to
# - epsilon = 0.2, delta = 0.1 if i      l2s <= 0.01
# - epsilon = 0.3, delta = 0.4 if 0.01 < l2s <= 0.03
# - epsilon = 0.5, delta = 0.3 if 0.03 < l2s <= 0.05
# - epsilon = 0.5, delta = 0.5 if 0.05 < l2s <= 0.07
# - epsilon = 0.5, delta = 0.5 if 0.07 < l2s BUT results may be not good!

Calculate ROC-GLM

roc_glm = dsROCGLM(connections, truth_name = "y", pred_name = "pred",
  dat_name = "iris", seed_object = "y")
roc_glm

plot(roc_glm)

Deploy information:

Build by r Sys.info()[["login"]] (r Sys.info()[["sysname"]]) on r as.character(Sys.time()).

This readme is built automatically after each push to the repository. Hence, it also is a test if the functionality of the package works also on the DataSHIELD servers. We also test these functionality in tests/testthat/test_on_active_server.R. The system information of the local and remote servers are as followed:

ri_l  = sessionInfo()
ri_ds = datashield.aggregate(connections, quote(getDataSHIELDInfo()))
client_pkgs = c("DSI", "DSOpal", "dsBaseClient", "dsPredictBase", "dsROCGLM")
remote_pkgs = c("dsBase", "resourcer", "dsPredictBase", "dsROCGLM")

Local machine:
- R version: r ri_l$R.version$version.string
- Version of DataSHELD client packages:

dfv = installed.packages()[client_pkgs, ]
dfv = data.frame(Package = rownames(dfv), Version = unname(dfv[, "Version"]))
knitr::kable(dfv)

Remote DataSHIELD machines:
- R version of r names(ri_ds)[1]: r ri_ds[[1]]$session$R.version$version.string
- R version of r names(ri_ds)[2]: r ri_ds[[2]]$session$R.version$version.string
- Version of server packages:

dfv = do.call(cbind, lapply(names(ri_ds), function(nm) {
  out = ri_ds[[nm]]$pcks[remote_pkgs, "Version", drop = FALSE]
  colnames(out) = paste0(nm, ": ", colnames(out))
  as.data.frame(out)
}))
dfv = cbind(Package = rownames(dfv), dfv)
rownames(dfv) = NULL
knitr::kable(dfv)