In davidbaniadam/rispacs: Functions for Working with RISPACS Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

# options(tibble.print_min = 5, tibble.print_max = 5)

rispacs package is under development

Most of the functions in this package are general function and can be used with any data set and use data masking (basically one can refer to column names without using quotation marks) but the defaults are the names from RIS-PACS data.

There are some rules to the prifixes used: The prefixes dog_ and cat_, are the general functions that can be used with any dataset. Functions that start with dog_ are for preproccesing a data frame and data manipulation, e.g. creating new columns, grouped counts etc. and functions that start with cat_ are plotting functions. A few functions are desinged only for RIS-PACS data and they have the prefix ris_.

Getting started

Since this package is higly dependent on the tidyverse packages, I recomment installing the following packages from CRAN first:

install.packages("tidyverse")
install.packages("lubridate")
install.packages("devtools")

When you have these packages then install this package with the following code:

devtools::install_github("davidbaniadam/rispacs")

Motivation

The purpose of this README (originaly a vignette) is to summarize how one should work with RIS-PACS data and restrict the analysis to different areas within Radiology or Clinical Physiology and Nuclear Medicine. We will also see some of the functions in this package.

Joining RIS and PACS

RIS data is in long format i.e. one Accession number ("the bar code of a scanning") can have multiple rows, and each row contains only one billing code. RIS can be compared to "procedurertabel" in DRG or LPR. This form is optimal if one wants to count billing codes. Bellow is an example:

library(rispacs)
library(dplyr)
ris_test

We se that every accession_number can have multiple rows, because every billing_code has one row. With real data we might have NA in e.g. Procedure_priority. We should see if it is a real missing value or caused by the structure of the data.

The structure of PACS data (for the above RIS example) is as follows:

pacs_test

PACS registration are from the machines and not manually registered. This is why the timestamps from PACS are more accurate than the timestamps in RIS. We can join RIS with PACS by accession_number. The most apporpriate join is a LEFT JOIN with RIS being the LEFTtable.

As with any join, care most be taken:

The variable which we join by, must be a primary key in at least one of the tabels.
Although possible, keeping observations where accession number is NULL in RIS does not make sense.

So in order to to join RIS and PACS we need to:

Remove Accession numbers in RIS that are NA.
Make sure that the Accession numbers are unique in PACS. If not, we can choose to keep the Accession number with the most recent timestamp.

The PACS data set should only contain one row per Accession number, although this is not always the case and a few Accesssion numbers migth have two rows in PACS (very rare). In this scenario we only keep the row with the newest timestamp, since keeping both will lead to dublicates when we join by Accession number. We could examine why we have a case like this, but for now we are forced to only keep one row and I choose the row with the latest Study_date_time_PACS.

One can either choose to subset the data for the analysis with RIS and then join with PACS or join RIS with PACS and then subset the data.

What can we find in RIS-PACS?

RIS-PACS data contains 5 areas of radiology:

CT (UXC)
MR (UXM)
Ultrasound (UXU)
X-ray (UXR)
Angiographies although I have not seen any yet (UXA).

And also Clinical Physiology and Nuclear Medicine.

Bellow is a list of the importatant columns that we know exist in RIS and PACS: RIS:

patient_status_1
procedure_name
billing_code
billing_code_modifier
accession_number
procedure_priority
latest_ris_status
preliminary_report_date_time
first_final_report_date_time
order_business_key
medical_record_number

PACS:

Accession_number
Study_date_time_PACS
Actual_Room_number
Source
Modality_type

Subsetting the data for our analysis

We subset the data with RIS data and there is not difference if we have joined with PACS or not.

The best way to subset the data to the area we want to examine, is to use billing_code. E.g. if we want to look at CT we can use a filter on all Billing codes starting with UXC. The next is billing_code_modifier which includes information that contains supplementary information for the billing_code, such as contrast, left side or right side etc. For CT and MR we are only interested in the rows where billing_code_modifier starts with UXZ.

With ultrasound and X-ray things are more complicated since we might have billing codes for a accession number that start with "K". The easiest way to subset the data in this case is to use a subquery in SQL.

Bellow we will create a cookbooks on how to subset for different areas.

Ultrasound (UXU)

An ultrasound examinatin has other billing_codes than UXU, e.g. billing_codes that start with K. So we need to subset for ultrasound like this:

```{sql, eval= FALSE} SELECT * FROM RIS WHERE accession_number IN (SELECT accession_number FROM RIS WHERE billing_code LIKE 'UXU%')

So this query keeps all the rows where the accession number has had a billing_code that started with UXU.

### CT
To make RIS data ready for analysis for CT we must subset RIS as follows:

```{sql, eval=FALSE}

SELECT *
FROM RIS
WHERE billing_code LIKE 'UXC%' AND billing_code_modifier LIKE 'UXZ%'

Basically we exclude codes TUL from billing_code_modifier as they will create dublicates for billing_code.

MR

MR is as CT: ```{sql, eval=FALSE}

SELECT * FROM RIS WHERE billing_code LIKE 'UXM%' AND billing_code_modifier LIKE 'UXZ%'

## Preparing RIS-PACS data with the rispacs package 
The main steps in preparing RIS-PACS data are the following:

* After the left join we create the definition of "a scan" or "a visit" from Accession numbers. 
* Spread billing_code and billing_code modifier and join them (billing_code_modifier is not included here). 
* Make the appropriate joins so we have one row for every scan or visit. 

Include: Harcode missing values with informative text.  

## Joining RIS with PACS: Example


```r
pacs_clean <- pacs_test %>% 
  arrange(Accession_number, Study_date_time_PACS) %>% 
  distinct(Accession_number, .keep_all = TRUE) # Nothing happens here because we don't have duplicates. 

pacs_clean

We are now ready to join the tables:

rispacs_left <- left_join(ris_test, pacs_clean, by = "Accession_number") %>% 
  arrange(Accession_number, Procedure_priority)

rispacs_left

New definition of a scan from Accession numbers

We can create a new definition af a "scan" based on Accession number. If the same person has had a new Accession number within a specified time period (e.g. 30 minuttes) we can combine those Accession numbers. As an example person X gets a Accession number at 11:05 and a new Accession number at 11:25. Since the two Accession numbers are within 30 minuttes from eachother we can say they comprise of one scan, which I call as Accession number2. To give an analogy with a supermarket: An Accession number is an item with a barcode and "a scan" is when we are inline; usually we only buy one item but we can buy more that one item.

For this we use the function dog_combine_accession_numbers() on rispacs_left

 # Create new def for a scan data is still in long format.  
a <- dog_combine_accession_numbers(rispacs_left, CPR = Patient_RK) 
a

# Spread the column whihc causes data to be in long format. 
b <- dog_spread_SKS(a, Billing_code = Billing_code, Accession_number = Accession_number2, col_name = SKS)

b

# Join a and b to transfer rest on columns to the b data frame. I.e. this should contain all the 
# original rows but the data is now one row for every accession number.  
one_row <- left_join(a, b, by = "Accession_number2") %>% 
  select(Accession_number2, SKS, Procedure_priority, Patient_RK, 
         Study_date_time_PACS, Source) %>% 
  distinct(Accession_number2, .keep_all = TRUE)  %>% 
  mutate(Procedure_priority = if_else(is.na(Procedure_priority), "Priority missing", Procedure_priority),
         Source = if_else(is.na(Source), "Scanner ID missing", Source))
one_row

rm(a, b)