Drug utilization data have been widely explored by researchers as the patterns of the data are associated with various factors such as health outcomes and disparities, regulatory policies, and social issues. Despite the depth of the data and potential public health impact, drug utilization data are often analyzed using simple descriptive analyses (e.g., presentation of monthly utilization trend at the national level). Methods and software to explore the time trend of drug utilization across granular geographic units have been limited.
Therefore, we introduce ihclust
, an R package to:
i) identify geographic areas with significant change over time in drug utilization, and
ii) characterize common change over time patterns among the time series for multiple geographic areas.
Step 1: Method to Identify Areas with Change Over Time
Step 2: Method to Characterize Temporal Change
We obtain clusters of geogrpahic areas that share similar temporal change patterns using Iterative Hiearchical Clustering (IHC).
IHC implements the following three steps to the estimated smooth function, $\hat{X}_{i}(t)$:
1) Initialization: The initial clusters are determined by hiearchical clustering.
2) Merging: Clusters are re-assessed by examining the cluster centers. Any clusters with pairwise correlations greather than a pre-specified threshold $\alpha$ are merged.
3) Pruning: Clustering membership is re-examined. Any cluster member with correlations between the cluster center and member less than $\alpha$ is removed from each existing cluster and assigned into a single-element cluster.
4) Steps 2-3 are repeated until a pre-defined convergence criteria is satisfied.
knitr::opts_chunk$set(echo = TRUE)
```{css, echo=FALSE} .scroll-300 { max-height: 300px; overflow-y: auto; background-color: inherit; }
## Installation ihclust can be installed using `install.packages("ihclust")`. The development version from GitHub can be installed using the following code: ```r # install.packages("devtools") devtools::install_github("https://github.com/elincho/ihclust")
Following is an example illustrating usage of two main functions testchange
and ihclust
. A simulated data will be used in this example. An additional example using real data of opioid dispensings in the United States is provied in the Appendix at the end of this page.
library(ihclust)
We will use a simulated data generated by a function simcurve
from the ihclust
package. The following code randomly generates curves representing weekly prescription drug dispensing data from 900 areas over 52 weeks. p=0.05
represents a proportion of areas that had no significant change over time.
set.seed(1) mydata <- simcurve(numareas = c(300, 300, 300), p=0.05, type="random")
# create a function for plotting plotsim <- function(curve){ if (curve$parameters[3] == "random"){ mysimdata <- curve$data par(mfrow=c(1,3)) time=seq(1,52,1) plot(time, mysimdata[1,],type='l', lwd = 0.2, ylim=c(-3.5,3.5), ylab="Standardized number of prescriptions dispensed", xlab="", main="All Areas Over Time \n (n=900)", cex.lab=1.2, cex.axis=0.8, cex.main=1, col = "red") for(i in 2:45){lines(time,mysimdata[i,],ylim=c(-3.5,3.5), col = "red")} for(i in 46:900){lines(time,mysimdata[i,],ylim=c(-3.5,3.5), col = "deepskyblue")} plot(time, mysimdata[1,],type='l',ylim=c(-3.5,3.5), ylab="", xlab="Time (Weeks)", main="Areas with Change Over Time \n (n=45)", cex.lab=1.2, cex.axis=0.8, cex.main=1, col = "red") for(i in 2:45){lines(time,mysimdata[i,],ylim=c(-3.5,3.5), col = "red")} plot(time, mysimdata[46,],type='l',ylim=c(-3.5,3.5), ylab="", xlab="", main="Areas with No Change \n (n=855)", cex.lab=1.2, cex.axis=0.8, cex.main=1, col = "deepskyblue") for(i in 47:900){lines(time,mysimdata[i,],ylim=c(-3.5,3.5), col = "deepskyblue")} } else if (curve$parameters[3] == "fixed"){ simulated_clusters <- curve$data par(mfrow=c(1,3)) time <- seq(1,52,1) plot(time, simulated_clusters[1,],type='l',ylim=c(-3.5,3.5), ylab="Standardized number of prescriptions dispensed", xlab = "", main="Simulated data for cluster 1", cex.lab=0.8, cex.axis=0.8, cex.main=1) for(i in 2:300){lines(time,simulated_clusters[i,],ylim=c(-3.5,3.5))} plot(time, simulated_clusters[301,],type='l',ylim=c(-3.5,3.5), ylab = "", xlab="Time (Weeks)", main="Simulated data for cluster 2", cex.lab=0.8, cex.axis=0.8, cex.main=1) for(i in 301:600){lines(time,simulated_clusters[i,],ylim=c(-3.5,3.5))} plot(time, simulated_clusters[601,],type='l',ylim=c(-3.5,3.5), ylab = "", xlab = "", main="Simulated data for cluster 3", cex.lab=0.8, cex.axis=0.8, cex.main=1) for(i in 601:900){lines(time,simulated_clusters[i,],ylim=c(-3.5,3.5))} }} plotsim(mydata)
In the figure above, the red lines represent the areas that have disensing data with significant change over time. The blue lines represent the areas that have dispenisng data with no significant change over time. There are 45 areas with change, and 855 areas with no change.
In the first step, testchange
function below runs 100 permutation tests (nperm = 100
) for each geograpghic area. The option time = seq(1,52,1)
specifies the number of time points in the data, representing 52 weeks in this example.
The number of $\geq10000$ permutations is recommended, but we recommend starting with a smaller size initially as it can be time-consuming to run permutation tests.
Adjusted p-values are measured against the significant level of 0.05 to sort out the areas that have significant change over time patterns in dispensing data. Among 900 areas, 40 areas were identified as having significant change (40 out of 45 areas correctly identified).
sim_change_results <- testchange(data=mydata$data,perm=TRUE, nperm=100,time=seq(1,52,1)) sim_change <- mydata$data[sim_change_results$p.adjusted<.05, ] nrow(sim_change) # 40 selected
In the second step, ihclust
function performs Iterative Hiearchical Clustering to generate multiple clusters, each containing members with similar change over time patterns. The pre-specified correlation criteria is set as 0.75. The maximum number of iteration can be set as 100 to prevent the possible situation where merging and pruning steps keep repeating endlessly with two different final results.
sim_clust_results <- ihclust(data=sim_change, smooth = TRUE, cor_criteria = 0.75, max_iteration = 100, verbose = TRUE)
The numbers below represent clustering memberships of prescription drug dispensing data with change over time patterns for 40 geographic areas.
sim_clust_results[["Cluster_Label"]]
A total of 6 clusters were identified by the IHC algorithm. The largest cluster (Cluster 2) identified 13 areas sharing similar change over time patterns. The second largest cluster (Cluster 3) identified 12 areas. All clusters are shown by figures below.
# center the data for plotting sim_centered <- matrix(rep(NA,nrow(sim_change)*52),ncol=52) for(i in 1:nrow(sim_centered)){ obsData.centered <- as.numeric(scale(sim_change[i,], center = TRUE, scale = TRUE)) sim_centered[i,] <- obsData.centered rm(obsData.centered) } # merge centered data and clustering memberships memb_sim <- sim_clust_results$Cluster_Label clusters_sim <- cbind(sim_centered,memb_sim) # sort clusters in decreasing order memb_sim <- sort(table(memb_sim),decreasing=T) # check number of members in each cluster memb_sim
par(mfrow=c(2,3)) for(i in 1:length(memb_sim)){ cluster_c <- clusters_sim[which(clusters_sim[,53]==names(memb_sim)[i]),] time <- seq(1,52,1) if(is.null(ncol(cluster_c))==F){ plot.title <- paste("Cluster: ",names(memb_sim)[i]," (",memb_sim[i]," Areas)",sep="") plot(time, cluster_c[1,1:52],type='l',ylim=c(-5,5), xlab="Week",col='lightgrey', ylab="", cex.axis=0.8, main=plot.title,cex.lab=1,cex.main=1) title(ylab="Standardized number of\nprescriptions dispensed", line=2, cex.lab=1) for(i in 2:nrow(cluster_c)){lines(time,cluster_c[i,1:52],ylim=c(-5,5),col='lightgrey')} lines(time,colMeans(cluster_c[,1:52]),col='red',lwd=2) } else{ plot.title <- paste("Cluster: ",names(memb_sim)[i]," (",memb_sim[i]," Area)",sep="") plot(time, cluster_c[1:52],type='l',ylim=c(-5,5),xlab="Week",col='black', ylab="", cex.axis=0.8, main=plot.title,cex.lab=1,cex.main=1) title(ylab="Standardized number of\nprescriptions dispensed", line=2, cex.lab=1) } }
The red lines are cluster centers obtained by averaging the patterns of prescription dispensings in member areas, and the gray lines represent the patterns of prescription dispensings in areas within each cluster.
opioidData
within the packageThe package includes a sample dataset from the CDC U.S.Opioid Dispensing Rate Maps (https://www.cdc.gov/drugoverdose/rxrate-maps/index.html). The data is publicly available and includes retail opioid prescription dispensing data at the state and county level from 2006 –2020. The data are obtained as rates of opioid dispensing per 100 persons. Further details about the data source are available from the CDC data resource page.
data(opioidData) opioid_data_noNA <- opioidData[complete.cases(opioidData), ] opioid_data <- as.matrix(opioid_data_noNA[,4:18])
opioid_change_results <- testchange(data=opioid_data,perm=TRUE,nperm=100,time=seq(1,15,1)) opioid_change <- opioid_data[opioid_change_results$p.adjusted<.05, ]
opioid_clust_results <- ihclust(data=opioid_change, smooth = TRUE, cor_criteria = 0.75, max_iteration = 100, verbose = TRUE)
# center the data for plotting opioid_centered <- matrix(rep(NA,nrow(opioid_change)*15),ncol=15) for(i in 1:nrow(opioid_centered)){ obsData.centered <- as.numeric(scale(opioid_change[i,], center = TRUE, scale = TRUE)) opioid_centered[i,] <- obsData.centered rm(obsData.centered) } # merge centered data and clustering memberships memb_opioid <- opioid_clust_results$Cluster_Label clusters_opioid <- cbind(opioid_centered,memb_opioid) # sort clusters in decreasing order memb_opioid_sorted <- sort(table(memb_opioid),decreasing=T) # check number of members in each cluster memb_opioid_sorted
par(mfrow=c(3,5)) for(i in 1:length(memb_opioid_sorted)){ cluster_c <- clusters_opioid[which(clusters_opioid[,16]==names(memb_opioid_sorted)[i]),] time <- seq(2006,2020,1) if(is.null(ncol(cluster_c))==F){ plot.title <- paste("Cluster: ",names(memb_opioid_sorted)[i],"\n(",memb_opioid_sorted[i]," Counties)",sep="") plot(time, cluster_c[1,1:15],type='l',ylim=c(-5,5), xlab="Week",col='lightgrey', ylab="", cex.axis=0.7, main=plot.title,cex.lab=1,cex.main=0.9,las=2) title(ylab="Standardized Opioid\nDispensing Rate", line=2, cex.lab=0.7) for(i in 2:nrow(cluster_c)){lines(time,cluster_c[i,1:15],ylim=c(-5,5),col='lightgrey')} lines(time,colMeans(cluster_c[,1:15]),col='red',lwd=2) } else{ plot.title <- paste("Cluster: ",names(memb_opioid_sorted)[i],"\n(",memb_opioid_sorted[i]," County)",sep="") plot(time, cluster_c[1:15],type='l',ylim=c(-5,5),xlab="Week",col='black', ylab="", cex.axis=0.7, main=plot.title,cex.lab=1,cex.main=0.9,las=2) title(ylab="Standardized Opioid\nDispensing Rate", line=2, cex.lab=0.7) } }
The red lines are cluster centers obtained by averaging the patterns of prescription dispensing rates in member counties, and the gray lines represent the patterns of prescription dispensing rates in counties within each cluster.
opioid_county <- opioid_data_noNA[,3:18] opioid_change_county <- opioid_county[opioid_change_results$p.adjusted<.05, ] memb_county <- cbind(opioid_change_county,memb_opioid) # make a list clusters_list <- list() for(i in 1:length(memb_opioid_sorted)){ county <- memb_county$County[which(memb_county$memb_opioid==names(memb_opioid_sorted)[i])] clusters_list[[i]] <- county names(clusters_list)[i] <- paste0("Cluster",names(memb_opioid_sorted)[i]) } clusters_list
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.