R interface to clinicaltrials.gov

Clinicaltrials.gov ClinicalTrials.gov is a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world. Users can search for information about and results from those trials. This package provides a set of functions to interact with the search and download features. Results are downloaded to temporary directories and returned as R objects.

Installation

The package is available on CRAN and can be installed as usual. To install the latest version from github, use devtools::install_github(), as follows:

install.packages("devtools")
library(devtools)
install_github("sachsmc/rclinicaltrials")

Basic usage

The main function is clinicaltrials_search(). Here's an example of its use:

library(rclinicaltrials)
library(ggplot2)
library(dplyr)

z <- clinicaltrials_search(query = 'lime+disease')
str(z)

This gives you basic information about the trials. Before searching or downloading, you can determine how many results will be returned using the clinicaltrials_count() function:

clinicaltrials_count(query = "myeloma")
clinicaltrials_count(query = "29485tksrw@")

The query can be a single string which will be passed to the "search terms" field on clinicaltrials.gov. Terms can be combined using the logical operators AND, OR, and NOT. Advanced searches can be performed by passing a vector of key=value pairs as strings. For example, to search for cancer interventional studies,

clinicaltrials_count(query = c("type=Intr", "cond=cancer"))

The possible advance search terms are included in the advanced_search_terms data frame which comes with the package. The data frame has the keys, description, and a link to the help webpage which will explain the possible values of the search terms. To open the help page for cond, for instance, run browseURL(advanced_search_terms["cond", "help"]).

head(advanced_search_terms)

To download detailed study information, including results, use clinicaltrials_download():

y <- clinicaltrials_download(query = 'myeloma', count = 10, include_results = TRUE)
str(y)

This returns a list of dataframes that have a common key variable: nct_id. Optionally, you can get the long text fields and/or study results (if available). Study results are also returned as a list of dataframes, contained within the list.

How to use the results

The data come from a relational database with lots of text fields, so it may take some effort to get the data into a flat format for analysis. For that reason, results come back from the clinicaltrials_download function as a list of dataframes. Each dataframe has a common key variable: nct_id. To merge dataframes, use this key. Otherwise, you can analyze the dataframes separately. They are organized into study information, locations, outcomes, interventions, results, and textblocks. Results, where available, is itself a list with three dataframes: participant flow, baseline data, and outcome data.

Results tables are stored in long format, so there are often multiple rows per study, each corresponding to a different group or outcome. Let's look at an example, the cumulative enrollment of men and women in phase III, melanoma, interventional studies over time. We can also pass the query as a list of named items.

melanom <- clinicaltrials_search(query = c("cond=melanoma", "phase=2", 
                                           "type=Intr", "rslt=With"), 
                                 count = 1e6)
nrow(melanom)
table(melanom$status.text)

melanom2 <- clinicaltrials_search(query = list(cond = "melanoma", phase = "2", 
                                           type = "Intr", rslt = "With"), 
                                 count = 1e6)
nrow(melanom)

Now to download the data and summarize it:

melanom_information <- clinicaltrials_download(query = c("cond=melanoma", "phase=2", 
                                                         "type=Intr", "rslt=With"), 
                                               count = 1e6, include_results = TRUE)
load("melanom_info.RData")
summary(melanom_information$study_results$baseline_data)

gend_data <- subset(melanom_information$study_results$baseline_data, 
                    title == "Gender" & arm != "Total")

gender_counts <- gend_data %>% group_by(nct_id, subtitle) %>% 
  do( data.frame(
    count = sum(as.numeric(paste(.$value)), na.rm = TRUE)
    ))

dates <- melanom_information$study_information$study_info[, c("nct_id", "start_date")]
dates$year <- sapply(strsplit(paste(dates$start_date), " "), function(d) as.numeric(d[2]))

counts <- merge(gender_counts, dates, by = "nct_id")

cts <- counts %>% group_by(year, subtitle) %>%
  summarize(count = sum(count))
colnames(cts)[2] <- "Gender"

ggplot(cts, aes(x = year, y = cumsum(count), color = Gender)) + 
  geom_line() + geom_point() + 
  labs(title = "Cumulative enrollment into Phase III, \n interventional trials in Melanoma, by gender") + 
  scale_y_continuous("Cumulative Enrollment") 


sachsmc/rclinicaltrials documentation built on May 29, 2019, 12:55 p.m.