README.md

SSL Validation

Overview

Phenotyping algorithms based on Electronic Health Records (EHR) aim to identify a patient’s disease status using the information in the health record.

Algorithm evaluation in EHR data is often based on a small number of gold-standard labeled data. Two main issues are the difficulty in selecting the appropriate cut and the high variance in the accuracy parameter estimates.

This package provides a semi-supervised learning approach that incorporates unlabeled data into the estimation of the receiver operating characteristic (ROC) curve parameters to address these issues.

The data consists of :

The main function roc.semi.superv takes as arguments S and Y, where Y contains a large amount of missing values, and provides a semi-supervised estimation of the ROC parameters.

The function roc.superv takes as arguments S and Y from the small validation set only and provides a classic supervised method to estimate the ROC parameters.

Installation

Install development version from GitHub.

# install.packages("remotes")
remotes::install_github("celehs/SSL.validation")

Load the package into R.

library(SSL.validation)

Simulated Example

set.seed(1234)
dat <- read.csv("https://raw.githubusercontent.com/celehs/SSL.validation/master/data-raw/data.csv")
p.0 <- mean(dat$Y , na.rm = TRUE)
id.v <- which(is.na(dat$Y) != 1)
dat.v <- dat[id.v, ] # Labeled Data 

Semi-Supervised Learning (SSL)

system.time(res.ssl <- roc.semi.superv(dat$S,dat$Y))
## [1] 263

##    user  system elapsed 
##  21.446   0.553  23.255
auc.ssl <- res.ssl$auc
roc.ssl <- res.ssl$roc
auc.ssl
## [1] 0.7467197
tail(roc.ssl)
##          cut  p.pos  fpr       tpr       ppv       npv
## [94,] 0.0471 0.9614 0.94 0.9909063 0.2591625 0.9518820
## [95,] 0.0386 0.9688 0.95 0.9930972 0.2574391 0.9554319
## [96,] 0.0312 0.9785 0.96 0.9949573 0.2559512 0.9597191
## [97,] 0.0215 0.9885 0.97 0.9973425 0.2540214 0.9691953
## [98,] 0.0115 1.0000 0.98 0.9997318 0.2520541 0.9941873
## [99,] 0.0115 1.0000 0.98 0.9997318 0.2520541 0.9941873

Supervised Learning (SL)

system.time(res.sl <- roc.superv(dat.v$S,dat.v$Y))
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

##    user  system elapsed 
##   0.012   0.000   0.012
auc.sl <- res.sl$auc
roc.sl <- res.sl$roc
colnames(roc.sl) <- colnames(roc.ssl)
auc.sl
## [1] 0.7016632
tail(roc.sl)
##               cut p.pos       fpr tpr       ppv npv
## [94,] 0.015460938  0.96 0.9459459   1 0.2708333   1
## [95,] 0.014389420  0.96 0.9459459   1 0.2708333   1
## [96,] 0.013239677  0.98 0.9729730   1 0.2653061   1
## [97,] 0.010720983  0.98 0.9729730   1 0.2653061   1
## [98,] 0.008202289  0.98 0.9729730   1 0.2653061   1
## [99,] 0.006568542  0.99 0.9864865   1 0.2626263   1


celehs/SSL.validation documentation built on Oct. 28, 2020, 8:15 a.m.