In elincho/ihclust: Iterative Hierarchical Clustering (IHC)

Introduction

Drug utilization data have been widely explored by researchers as the patterns of the data are associated with various factors such as health outcomes and disparities, regulatory policies, and social issues. Despite the depth of the data and potential public health impact, drug utilization data are often analyzed using simple descriptive analyses (e.g., presentation of monthly utilization trend at the national level). Methods and software to explore the time trend of drug utilization across granular geographic units have been limited.

Therefore, we introduce ihclust, an R package to:
i) identify geographic areas with significant change over time in drug utilization, and
ii) characterize common change over time patterns among the time series for multiple geographic areas.

Methods

Step 1: Method to Identify Areas with Change Over Time

We assume that prescription dispensing data are discrete observations from a smooth underlying function,
denoted as $X_{i}(t)$, where $i$ indicates a geographic area.
We identify a subset of geographic locations that change significantly with respect to time,
using hypothesis tests: $H_{0}:X_{i}(t)=0$ vs. $H_{a}:X_{i}(t)\neq0.$
For each geographic location, p-values are obtained using a permutation test.
P-values are adjusted by using Benjamini-Hochberg method.

Step 2: Method to Characterize Temporal Change

We obtain clusters of geogrpahic areas that share similar temporal change patterns using Iterative Hiearchical Clustering (IHC).

IHC implements the following three steps to the estimated smooth function, $\hat{X}_{i}(t)$:

1) Initialization: The initial clusters are determined by hiearchical clustering.
2) Merging: Clusters are re-assessed by examining the cluster centers. Any clusters with pairwise correlations greather than a pre-specified threshold $\alpha$ are merged.
3) Pruning: Clustering membership is re-examined. Any cluster member with correlations between the cluster center and member less than $\alpha$ is removed from each existing cluster and assigned into a single-element cluster.
4) Steps 2-3 are repeated until a pre-defined convergence criteria is satisfied.

knitr::opts_chunk$set(echo = TRUE)

```{css, echo=FALSE} .scroll-300 { max-height: 300px; overflow-y: auto; background-color: inherit; }

## Installation

ihclust can be installed using `install.packages("ihclust")`.

The development version from GitHub can be installed using the following code:

```r
# install.packages("devtools")
devtools::install_github("https://github.com/elincho/ihclust")

Example

Following is an example illustrating usage of two main functions testchange and ihclust. A simulated data will be used in this example. An additional example using real data of opioid dispensings in the United States is provied in the Appendix at the end of this page.

library(ihclust)

Data

We will use a simulated data generated by a function simcurve from the ihclust package. The following code randomly generates curves representing weekly prescription drug dispensing data from 900 areas over 52 weeks. p=0.05 represents a proportion of areas that had no significant change over time.

set.seed(1)
mydata <- simcurve(numareas = c(300, 300, 300), p=0.05, type="random")

# create a function for plotting
plotsim <- function(curve){
  if (curve$parameters[3] == "random"){
    mysimdata <- curve$data

    par(mfrow=c(1,3))
    time=seq(1,52,1)
    plot(time, mysimdata[1,],type='l', lwd = 0.2, ylim=c(-3.5,3.5),
         ylab="Standardized number of prescriptions dispensed",
         xlab="",
         main="All Areas Over Time \n (n=900)",
         cex.lab=1.2, cex.axis=0.8, cex.main=1, col = "red")
    for(i in 2:45){lines(time,mysimdata[i,],ylim=c(-3.5,3.5), col = "red")}
    for(i in 46:900){lines(time,mysimdata[i,],ylim=c(-3.5,3.5), col = "deepskyblue")}

    plot(time, mysimdata[1,],type='l',ylim=c(-3.5,3.5),
         ylab="",
         xlab="Time (Weeks)",
         main="Areas with Change Over Time \n (n=45)",
         cex.lab=1.2, cex.axis=0.8, cex.main=1, col = "red")
    for(i in 2:45){lines(time,mysimdata[i,],ylim=c(-3.5,3.5), col = "red")}

    plot(time, mysimdata[46,],type='l',ylim=c(-3.5,3.5),
         ylab="",
         xlab="",
         main="Areas with No Change \n (n=855)",
         cex.lab=1.2, cex.axis=0.8, cex.main=1, col = "deepskyblue")
    for(i in 47:900){lines(time,mysimdata[i,],ylim=c(-3.5,3.5), col = "deepskyblue")}
  }
  else if (curve$parameters[3] == "fixed"){
    simulated_clusters <- curve$data

    par(mfrow=c(1,3))
    time <- seq(1,52,1)
    plot(time, simulated_clusters[1,],type='l',ylim=c(-3.5,3.5),
         ylab="Standardized number of prescriptions dispensed",
         xlab = "",
         main="Simulated data for cluster 1",
         cex.lab=0.8, cex.axis=0.8, cex.main=1)
    for(i in 2:300){lines(time,simulated_clusters[i,],ylim=c(-3.5,3.5))}

    plot(time, simulated_clusters[301,],type='l',ylim=c(-3.5,3.5),
         ylab = "",
         xlab="Time (Weeks)",
         main="Simulated data for cluster 2",
         cex.lab=0.8, cex.axis=0.8, cex.main=1)
    for(i in 301:600){lines(time,simulated_clusters[i,],ylim=c(-3.5,3.5))}

    plot(time, simulated_clusters[601,],type='l',ylim=c(-3.5,3.5),
         ylab = "",
         xlab = "",
         main="Simulated data for cluster 3",
         cex.lab=0.8, cex.axis=0.8, cex.main=1)
    for(i in 601:900){lines(time,simulated_clusters[i,],ylim=c(-3.5,3.5))}
  }}

plotsim(mydata)

In the figure above, the red lines represent the areas that have disensing data with significant change over time. The blue lines represent the areas that have dispenisng data with no significant change over time. There are 45 areas with change, and 855 areas with no change.

Step 1: Identifying Areas with Change over Time

In the first step, testchange function below runs 100 permutation tests (nperm = 100) for each geograpghic area. The option time = seq(1,52,1) specifies the number of time points in the data, representing 52 weeks in this example.

The number of $\geq10000$ permutations is recommended, but we recommend starting with a smaller size initially as it can be time-consuming to run permutation tests.

Adjusted p-values are measured against the significant level of 0.05 to sort out the areas that have significant change over time patterns in dispensing data. Among 900 areas, 40 areas were identified as having significant change (40 out of 45 areas correctly identified).

sim_change_results <- testchange(data=mydata$data,perm=TRUE, nperm=100,time=seq(1,52,1)) 
sim_change <- mydata$data[sim_change_results$p.adjusted<.05, ]
nrow(sim_change) # 40 selected

Step 2: Clustering Common Change over Time Patterns among the Time-series for Areas

In the second step, ihclust function performs Iterative Hiearchical Clustering to generate multiple clusters, each containing members with similar change over time patterns. The pre-specified correlation criteria is set as 0.75. The maximum number of iteration can be set as 100 to prevent the possible situation where merging and pruning steps keep repeating endlessly with two different final results.

sim_clust_results <- ihclust(data=sim_change, smooth = TRUE, cor_criteria = 0.75, max_iteration = 100, verbose = TRUE)

The numbers below represent clustering memberships of prescription drug dispensing data with change over time patterns for 40 geographic areas.

sim_clust_results[["Cluster_Label"]]

Clustering Results

A total of 6 clusters were identified by the IHC algorithm. The largest cluster (Cluster 2) identified 13 areas sharing similar change over time patterns. The second largest cluster (Cluster 3) identified 12 areas. All clusters are shown by figures below.

# center the data for plotting
sim_centered <- matrix(rep(NA,nrow(sim_change)*52),ncol=52)

for(i in 1:nrow(sim_centered)){
   obsData.centered <- as.numeric(scale(sim_change[i,], center = TRUE, scale = TRUE))
   sim_centered[i,] <- obsData.centered
   rm(obsData.centered)
}

# merge centered data and clustering memberships
memb_sim <- sim_clust_results$Cluster_Label
clusters_sim <- cbind(sim_centered,memb_sim)

# sort clusters in decreasing order
memb_sim <- sort(table(memb_sim),decreasing=T)
# check number of members in each cluster
memb_sim

par(mfrow=c(2,3))
  for(i in 1:length(memb_sim)){
    cluster_c <- clusters_sim[which(clusters_sim[,53]==names(memb_sim)[i]),]
    time <- seq(1,52,1)
    if(is.null(ncol(cluster_c))==F){
      plot.title <- paste("Cluster: ",names(memb_sim)[i]," (",memb_sim[i]," Areas)",sep="")
      plot(time, cluster_c[1,1:52],type='l',ylim=c(-5,5), xlab="Week",col='lightgrey',
           ylab="",
           cex.axis=0.8, main=plot.title,cex.lab=1,cex.main=1)
      title(ylab="Standardized number of\nprescriptions dispensed", line=2, cex.lab=1)
      for(i in 2:nrow(cluster_c)){lines(time,cluster_c[i,1:52],ylim=c(-5,5),col='lightgrey')}
      lines(time,colMeans(cluster_c[,1:52]),col='red',lwd=2)
      }
    else{
      plot.title <- paste("Cluster: ",names(memb_sim)[i]," (",memb_sim[i]," Area)",sep="")
      plot(time, cluster_c[1:52],type='l',ylim=c(-5,5),xlab="Week",col='black',
           ylab="",
           cex.axis=0.8, main=plot.title,cex.lab=1,cex.main=1)
      title(ylab="Standardized number of\nprescriptions dispensed", line=2, cex.lab=1)
    }
  }

The red lines are cluster centers obtained by averaging the patterns of prescription dispensings in member areas, and the gray lines represent the patterns of prescription dispensings in areas within each cluster.

Appendix

Example using a dataset `opioidData` within the package

The package includes a sample dataset from the CDC U.S.Opioid Dispensing Rate Maps (https://www.cdc.gov/drugoverdose/rxrate-maps/index.html). The data is publicly available and includes retail opioid prescription dispensing data at the state and county level from 2006 –2020. The data are obtained as rates of opioid dispensing per 100 persons. Further details about the data source are available from the CDC data resource page.

data(opioidData)
opioid_data_noNA <- opioidData[complete.cases(opioidData), ] 
opioid_data <- as.matrix(opioid_data_noNA[,4:18])

Step 1: Identifying Areas with Change over Time

opioid_change_results <- testchange(data=opioid_data,perm=TRUE,nperm=100,time=seq(1,15,1)) 
opioid_change <- opioid_data[opioid_change_results$p.adjusted<.05, ]

Step 2: Clustering Common Change over Time Patterns among the Time-series for Areas

opioid_clust_results <- ihclust(data=opioid_change, smooth = TRUE, cor_criteria = 0.75, max_iteration = 100, verbose = TRUE)

Clustering Results

# center the data for plotting
opioid_centered <- matrix(rep(NA,nrow(opioid_change)*15),ncol=15)

for(i in 1:nrow(opioid_centered)){
   obsData.centered <- as.numeric(scale(opioid_change[i,], center = TRUE, scale = TRUE))
   opioid_centered[i,] <- obsData.centered
   rm(obsData.centered)
}

# merge centered data and clustering memberships
memb_opioid <- opioid_clust_results$Cluster_Label
clusters_opioid <- cbind(opioid_centered,memb_opioid)

# sort clusters in decreasing order
memb_opioid_sorted <- sort(table(memb_opioid),decreasing=T)
# check number of members in each cluster
memb_opioid_sorted

par(mfrow=c(3,5))
  for(i in 1:length(memb_opioid_sorted)){
    cluster_c <- clusters_opioid[which(clusters_opioid[,16]==names(memb_opioid_sorted)[i]),]
    time <- seq(2006,2020,1)
    if(is.null(ncol(cluster_c))==F){
      plot.title <- paste("Cluster: ",names(memb_opioid_sorted)[i],"\n(",memb_opioid_sorted[i]," Counties)",sep="")
      plot(time, cluster_c[1,1:15],type='l',ylim=c(-5,5), xlab="Week",col='lightgrey',
           ylab="",
           cex.axis=0.7, main=plot.title,cex.lab=1,cex.main=0.9,las=2)
      title(ylab="Standardized Opioid\nDispensing Rate", line=2, cex.lab=0.7)
      for(i in 2:nrow(cluster_c)){lines(time,cluster_c[i,1:15],ylim=c(-5,5),col='lightgrey')}
      lines(time,colMeans(cluster_c[,1:15]),col='red',lwd=2)
      }
    else{
      plot.title <- paste("Cluster: ",names(memb_opioid_sorted)[i],"\n(",memb_opioid_sorted[i]," County)",sep="")
      plot(time, cluster_c[1:15],type='l',ylim=c(-5,5),xlab="Week",col='black',
           ylab="",
           cex.axis=0.7, main=plot.title,cex.lab=1,cex.main=0.9,las=2)
      title(ylab="Standardized Opioid\nDispensing Rate", line=2, cex.lab=0.7)
    }
  }

The red lines are cluster centers obtained by averaging the patterns of prescription dispensing rates in member counties, and the gray lines represent the patterns of prescription dispensing rates in counties within each cluster.

List of counties in each cluster

opioid_county <- opioid_data_noNA[,3:18]
opioid_change_county <- opioid_county[opioid_change_results$p.adjusted<.05, ]
memb_county <- cbind(opioid_change_county,memb_opioid)

# make a list
clusters_list <- list()
for(i in 1:length(memb_opioid_sorted)){
  county <- memb_county$County[which(memb_county$memb_opioid==names(memb_opioid_sorted)[i])]
  clusters_list[[i]] <- county
  names(clusters_list)[i] <- paste0("Cluster",names(memb_opioid_sorted)[i])
}
clusters_list

elincho/ihclust documentation built on July 2, 2022, 1:18 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

elincho/ihclust
Iterative Hierarchical Clustering (IHC)

In elincho/ihclust: Iterative Hierarchical Clustering (IHC)

Introduction

Methods

Example

Data

Step 1: Identifying Areas with Change over Time

Step 2: Clustering Common Change over Time Patterns among the Time-series for Areas

Clustering Results

Appendix

Example using a dataset `opioidData` within the package

Step 1: Identifying Areas with Change over Time

Step 2: Clustering Common Change over Time Patterns among the Time-series for Areas

Clustering Results

List of counties in each cluster

R Package Documentation

Browse R Packages

We want your feedback!

elincho/ihclust Iterative Hierarchical Clustering (IHC)

In elincho/ihclust: Iterative Hierarchical Clustering (IHC)

Introduction

Methods

Example

Data

Step 1: Identifying Areas with Change over Time

Step 2: Clustering Common Change over Time Patterns among the Time-series for Areas

Clustering Results

Appendix

Example using a dataset opioidData within the package

Step 1: Identifying Areas with Change over Time

Step 2: Clustering Common Change over Time Patterns among the Time-series for Areas

Clustering Results

List of counties in each cluster

R Package Documentation

Browse R Packages

We want your feedback!

elincho/ihclust
Iterative Hierarchical Clustering (IHC)

Example using a dataset `opioidData` within the package