PDN Vignette"

knitr::opts_chunk$set(echo = TRUE)

Introduction

A Personalized Disease Network (PDN) is a connected graph that represents the disease evolution of an individual patient. We use diagnoses, medical interventions, and their dates to build a PDN. Each node in the PDN represents an event (e.g. Heart failure) and each node is connected to another by a directed edge if the dates fulfill a certain criterion. PDN were introduced by Javier Cabrera and proved to be effective for improving goodness of fit of survival models, dominating individual comorbidity variables. (Cabrera et. al. 2016)

Installation

Like many other R packages, the simplest way to obtain glmnet is to install it directly from CRAN. Type the following command in R console:

install.packages("PDN", repos = "https://cran.us.r-project.org")

Users may change the repos options depending on their locations and preferences. Other options such as the directories where to install the packages can be altered in the command. For more details, see help(install.packages). Here the R package has been downloaded and installed to the default directories. Alternatively, users can download the package source at "https://cran.r-project.org/package=PDN" and type Unix commands to install it to the desired location.

Quick Overview

The purpose of this section is to give users a general sense of the package, including the components, what they do and some basic usage. We will briefly go over the main functions, see the basic operations and have a look at the outputs. Users may have a better idea after this section what functions are available, which one to choose, or at least where to seek help. More details are given in later sections.

First, we load the glmnet package:

library(PDN)

The package contains five functions: buildnetworks, datecut, draw.PDN, draw.PDN.circle and draw.PDN.cluster, and three sample data sets: comorbidity_data, demographic_data and survival_data.

You can visit the help page of the package using the following command:

help(package = "PDN")

For comorbidity_data, its columns represent different diagnoses, its rows represent different patients, and the entries represent the time difference between the respective event and January 1st 1970. NA means the patient does not have the respective diagnosis.

summary(comorbidity_data)

For demographic_data, it contains the demographic information of each patients It has five columns which are sex, race, hispan, dshyr and prime.

summary(demographic_data)

For survival_data, it contains information of survival days from admission to death, as well as censor information regarding whether the patient is dead or not.

summary(survival_data)

Datecuts

In order to create a PDN based on comorbidity_data, we need to find the criterion for each pair of diagnoses that define the connection of the network.

Generally, it is important to choose an appropriate cutoff point such that it captures information of the interaction between the two diagnoses.

For CMTHY and HTN, the distribution of the time difference is shown below:

abc = comorbidity_data$CMTHY - comorbidity_data$HTN
abc = abc[!is.na(abc)]
bins=seq(floor(min(abc)),ceiling(max(abc)+1000),500)
hist(abc,breaks=bins,
     yaxt="n",
     xaxt="n",
     main = "",
     ylab="Number of Patients",                         
     xlab="Time difference between CMTHY and HTN in days", 
     col="grey"                                      
)
axis(side=2, at=axTicks(2), labels=formatC(axTicks(2), format="d", big.mark=','))
axis(side=1, at=axTicks(1), labels=formatC(axTicks(1), format="d", big.mark=','))
box()

We used Cox PH regression to find the best criterion for cutoff point following the rule proposed by (Cabrera et. al. 2016).

k1 = datecut(comorbidity_data,survival_data[,1],survival_data[,2])
k1[1:20]

Build Networks

By using function buildnetworks, we past comorbidity_data and criterion information we get from the datecut into the function, then we will get the adjacency matrix for the network shown below:

a = buildnetworks(comorbidity_data,k1)
a[1:10,1:10]

Visualize Networks

In our case, the nodes have chronological order, and we opt to incorporate this order in the network visualization for a more informative and clean result. Below left is the example PDN from function draw.PDN.circle.

PDN for individual patient:

datark = t(apply(comorbidity_data,1,rank))
dak = sort(datark[1,])
draw.PDN.circle(a[1,],dak)
title("PDN for Patient 1")

Here is the example using the functioin draw.PDN, which utilize network and ggplot2 for better visualization:

nn = names(comorbidity_data)
draw.PDN(a[7,],nn)

By looping each patient through draw.PDN.circle, we obtain the visualizations for the set of patients:

par(mfrow=c(4,5))
for(i in 1 : 20){
  dak = apply(datark,2,sort)
  draw.PDN.circle(a[i,],dak[i,])
  title(main=paste("Patient",i))
}

Here is an example of using draw.PDN.cluster to visulaize network of cluster, we use different colors to represent weighted connection of each nodes.

##Clustering Example
k1 = datecut(comorbidity_data,survival_data[,1],survival_data[,2])
a = buildnetworks(comorbidity_data,k1)
datark = t(apply(comorbidity_data,1,rank))
require(survival)
zsq = NULL
for(i in 1:ncol(a)){
  a1 = (summary(coxph(Surv(as.numeric(survival_data[,1]),survival_data[,2]) ~ a[,i],
  data=as.data.frame(a)))$coefficient[,4])^2
  zsq = cbind(zsq,a1)
}
zsq = as.numeric(zsq)
wi=zsq/sum(zsq,na.rm=TRUE)
wi[wi<10^ -3]=10^ -3
wi[is.na(wi)]=10^ -3
#weighted matrix
wa = NULL
for(i in 1:ncol(a)){
 wa = cbind(wa,a[,i]*wi[i])
}
#PCA
pr.out=prcomp(wa)
x.svd=svd(wa)
x.score1 <- wa %*% x.svd$v
x.score2 <- x.svd$u %*% diag(x.svd$d)
##HC cluster using the first 8 PCA scores
dp<-dist(x.score2[,1:8])
hcp<-hclust(dp, method="ward.D")
##Plot Network
s1=rev(sort(apply(a[cutree(hcp,3)==3,],2,mean)))[1:50]
dak = sort(apply(datark[cutree(hcp,3)==3,],2,mean,na.rm=TRUE))
names(dak) = unlist(strsplit(names(dak),"DAT"))
draw.PDN.cluster(s1,dak)

Analyze Networks

In the below example using PDN network information to perform regression, we combine comorbidity information and the adjacency matrix from buildnetworks as our design matrix which is shown below:

x = cbind(!is.na(comorbidity_data),a)
x[1:10,1:10]

We then use the glmnet package to perform k-fold cross-validation for the Cox model. We extract two optimal lambdas, that is lambda.min: the ?? at which the minimal MSE is achieved, and lambda.1se: the largest ?? at which the MSE is within one standard error of the minimal MSE. We can check the active covariates in our model and see their coefficients based on the lambda we have chosen.

require(survival)
require(glmnet)
y=Surv(survival_data[,1],survival_data[,2])
glm1 = cv.glmnet(x,y,family="cox",alpha=0.8)
b = glmnet(x,y,family="cox",alpha=0.8,lambda=c(glm1$lambda.min,glm1$lambda.1se))$beta
b[1:20,]

Through stepwise model selection based on the AIC criterion, we can see that the network variables dominate individual comorbidity variables. We conclude that the network information is significant for improving the model compared to just comorbidity information.

isel = b[,2]!=0
bcox = coxph(y~.,data=data.frame(x[,isel]))
step1 = step(bcox)
step1$anova


Try the PDN package in your browser

Any scripts or data that you put into this service are public.

PDN documentation built on May 2, 2019, 3:10 p.m.