In npeters1322/hospEpi: Perform Basic Hospital Epidemiology Tasks

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction/Background

The main goal of hospEpi is to provide useful functions for epidemiological analyses in a hospital setting. While many think of disease as coming from the outside to the inside of a hospital, some diseases (hospital-acquired infections) actually occur during patient stays, making it important for there to be teams who are ready to analyze and stop the spread before more are infected.

The motivation behind this package came during my time as an intern for the Indiana University Health Infection Prevention team, which is made up of healthcare workers who sort of play the roles of hospital epidemiologists (among many other things). My supervisor, Josh Sadowski, is one of the data analysts on the team, so many of these functions can aid him in the future (and hopefully many others, too!). Some examples of what this package does are that it allows people to easily study patient location history/network data and also disease-exposure/disease-risk factor data.

As of now, there are multiple other epidemiology R packages, like epiR and epitools. While those are very useful for epidemiologists, this package works a little differently, as it focuses more on applying tools for analyses in a hospital setting. The functions could certainly be adapted for use outside of a hospital, but the functions might be most useful in a hospital. It also addresses a few challenges that make it stand out as its own project. Those are:

It provides multiple plot functions to quickly visualize some of your data. In other packages for epidemiology, plotting seems to mostly be up to the user to do. However, this package provides multiple plotting functions. For example, you can visualize the network of patient location history data or see how many of those with a certain exposure/risk factor developed a disease. This functionality sets this package apart from others made for epidemiology.
It provides multiple cleaning functions, which can help you get your data in the correct format before using the rest of the functions. Oftentimes, packages are made to work with data that is already in a specific format. While coming in with clean data is the best case scenario, some cleaning usually needs to be done, and the cleaning functions in this package could help depending on what the data is like. Providing cleaning functions sets this package apart from others, as the user might not have to clean their data perfectly before using the rest of the package.
It can easily assist in analyzing multiple exposures/risk factors against a disease, and can quickly report multiple statistics for all of those combinations. In a hospital, you might often look at multiple exposures/risk factors against a disease, like having diabetes, having heart disease, or having surgery during a stay against developing a hospital-acquired disease. Rather than only allowing someone to look at one disease-exposure combination at a time, this package has built-in functions that let one study all combinations at once and get many statistics all at once. As a result, comparing all your exposures against each other to see which seem the most associated with the disease/outcome is easy and makes this package stand out.

As seen, while there are other packages for epidemiological analyses, this package adds other functionality that might be helpful. While initially created for working with hospital data, it could easily be used for other epidemiological analyses. It is not an all-encompassing package, so other epidemiology packages should be checked out, but this package does provide a couple useful tools that are not provided elsewhere.

How to Use hospEpi

After installing the package, you can load it like this:

library(hospEpi)

As of now, there are two main use cases for this package: analyzing patient location/hospital network data and analyzing disease-exposure/disease-risk factor data. Below is a walk-through of the process of doing each.

Working With Patient Location/Hospital Network Data

A good way to start analyzing hospital network data using this package is to look at the example dataset that comes with the package and to go through the examples using it to see if you need to manipulate your data before using the functions. The example data can be loaded in like so:

hn_data <- hosp_network_data
head(hn_data)

To use many of the functions, the data needs to have columns corresponding to the patient's starting room and next room or starting unit and next unit. Looking at the data above, we do not have those columns yet, so let's use the cleaning function to get everything in the right format.

cleaned_hn_data <- clean_hosp_network(data = hn_data, uniqueID = UniqueEncountID, startDate = BeginDate, 
                                      endDate = EndDate, unitName = UnitName, 
                                      roomNum = RoomNumber)

head(cleaned_hn_data[,4:9])

Now that we have clean data, we can move on to creating an object of class hosp_network, which will allow the plotting and summary functions to be used. To create the object, you will use the below function.

hn_object <- hosp_network(x = cleaned_hn_data, fromUnit = UnitName, toUnit = 
                            next_unit, fromRoom = RoomNumber, toRoom = next_room)
class(hn_object)

If you do not have one of the two types of data, room or unit, then you can exclude it like so, as to use the rest of the functions, you just need to make sure to have at least one of the two:

hn_object2 <- hosp_network(x = cleaned_hn_data, fromUnit = UnitName, toUnit = next_unit)

Now, you can move on to plotting the data. This package creates network graphs using the igraph package, which, for example, would allow you to see the network of your patients and how they have moved throughout the hospital. Two examples are:

plot(hn_object, by = "room", type = "simple")
plot(hn_object, by = "unit", type = "hub score", vertex.color = "red", vertex.shape = "square")

They might be hard to read right now, but that is only because they are plotted next to each other, so they do not have as much space to take up. In the example on the left, the data is plotted by the room data and all points are the same size. In the example on the right, the specifications were changed to using unit data and making the size of the points be based off the hub score (statistic related to graphs, search if wanting to know more). More arguments were also added, which are arguments that can be passed to plot.igraph.

Plotting this type of data could be helpful if you want to visualize how patients have moved throughout the hospital. Perhaps patients have recently been contracting a specific disease in the hospital, and you want to find if there are any common rooms or units among them. If that is the case, these plots can help to visualize that.

You can also summarize your data.

hn_summ <- summary(hn_object, by = "room")
hn_summ[5:6]

This provides many different statistics related to network graphs, which might be helpful if you are interested in knowing more about your network. You can also change it to work with your unit data, and you can also add more arguments that can be passed to hub_score and authority_score.

hn_summ2 <- summary(hn_object, by = "unit", scale = FALSE)
hn_summ2[5:6]

While network statistics might not be useful to everyone, they can provide numerical information about your network.

Overall, this half of the package provides useful functions for working with patient location history/patient network data.

Working With Disease-Exposure/Disease-Risk Factor Data

To begin analyzing disease-exposure/disease-risk factor data, it might be helpful to look at the example dataset that comes with the package and go through the below examples with it. The dataset can be loaded by doing the following:

de_data <- disease_expose_data
head(de_data)

To use most of the functions, your data must be made of binary columns (0s and 1s). As seen above, the data does not follow that right now, which is why a cleaner function is provided to help, which might also be helpful for you.

cleaned_de_data <- clean_disease_expose(data = de_data, disease = "disease", 
                                        noDisease = "No", 
                                        exposures = c("exposure1", "exposure2", "exposure3"))
head(cleaned_de_data)

As seen above, all columns are now binary variables, and we are ready to move on to the rest of the functions. Note that although all columns are binary, you might not want every column created by the cleaner function. For example, you might only want one of the two exposure2_... columns, as they are just opposites of each other. For now, we will just keep them all, but you can subset your data on your own in the future.

The next step is creating an object of class disease_expose, which will allow you to plot and summarize your data very easily. The best way to do this is by calling the helper function to create the object.

de_object <- disease_expose(cleaned_de_data)

This pulls up a Shiny gadget and allows you to select your disease column and any exposure/risk factor columns you want to include from your data. After providing your input, an object of class disease_expose will be created, barring any errors (most likely that there are columns that are not binary). If Shiny gadgets are not your thing and you would rather manually type in everything you want, you can do that with the constructor below.

de_object <- new_disease_expose(cleaned_de_data, disease = 1, exposures = 2:8)
class(de_object)

Now that you have an object of class disease_expose, you can move on to plotting and summarizing the data, simply by calling plot and summary and including the object. After the object is created, it is best to not edit it in any way; it should already be in a good format, so no changes are necessary.

Here is how to plot the data:

#subset used so viewing is better on document
plot(de_object[1:5])

As you can see, it produces bar charts for each disease-exposure combination in your data. If you want a little more customization, you can also add arguments that can be used in geom_bar. One that might look good is position = 'dodge'. Making these plots allows you to visualize how the diseased and non-diseased individuals are distributed across exposure status.

#subset used so viewing is better on document
plot(de_object[1:5], position = 'dodge')

In addition to plotting the data, you can also summarize the data like so:

summary(de_object)

As seen above, you now have many different statistics available for each disease-exposure combination in your data, like odds ratios, incidence in the exposed and unexposed, and confidence intervals for various statistics. Having these statistics available so quickly would be beneficial when seeing if there are associations between any exposures and the disease.

While the disease_expose functions are not overly sophisticated, they do provide some simple and quick tools to analyze disease-exposure data that you might have.

As seen above, all the functions have multiple use cases, as they can help to visualize and summarize patient location history/networks and also visualize and summarize disease-exposure/disease-risk factor data, which could both be useful in a hospital setting.

Future Work and Plans

Multiple ideas to improve the package in the future are already being drawn up. Those are:

To create printing methods for the summary methods of the classes. Right now, when you print the output of the summary methods, they are not the most pleasing to the eye, as they are just a data.frame and a list. Therefore, the plan is to create a printing method which makes viewing the results quick and easy to understand, like when viewing the summary output of an lm object. The plan is to start with the disease_expose class, as it should be easier to customize output from a data.frame. It will most likely resemble the data.frame already created, but one idea is to highlight the rows with the most significant results (like the highest odds ratio or another statistic). Also, for the hosp_network class, the summary output is quite long, so the plan is to just select small details that might be the most important when wanting a small overview of the summary. By creating printing methods for both of the summary methods in the package so far, the summary output will look cleaner and more understandable when printed, which will be an upgrade.
To write functions to help with survival/time-to-event analysis. These types of analyses could be helpful in a hospital setting, as you might want to study how long a certain hospital-acquired infection takes to infect patients, for example. To do this, some research into survival/time-to-event analysis will be necessary to gain more knowledge of the subject to make functions better. Additionally, collaborating with others will be needed because it will allow for a better understanding of what to include in functions related to this topic. Most likely, similar things compared to the other classes will be done, as plotting and summary functions will be created.
To provide a more useful plotting function for the disease_expose class. Right now, the plotting function plots simple bar charts that show the number of diseased and non-diseased individuals by exposure status. It is a very quick and easy way to visualize disease-exposure data, but it might not be the most useful. Later on, the plan is to research common plots that individuals make when studying this type of data, as it might be more useful to use other types of plots. Researching ideas will be the best option to learn more from people who are experts in the field. If nothing is found while researching, the function will be kept as is, but if ideas are found, the plan will be to implement them into a plotting function, while also keeping the simple bar chart plotting function.