knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 7, fig.align = "center" )
As part of the dystonia
dataset, there are measurements of 37 different biomarkers in 40 different people, 10 of whom have cervical dystonia, and 10 of whom have a combination of cervical dystonia and hypothyroidism. These data were provided by Larua Scorr, MD, at EMory.
library(JLmisc) library(tidyverse) library(cowplot) dystonia
The dataset is arranged in "long" format, with a column for:
We will get into grouping in more detail later. Note that there are typically two measurements of each biomarker for each participant. This duplication is performed as standard protocol at the Emory University core facility that analyzed the data, and ideally the two observations should be averaged.
The eventual point of the exercise is to identify biomarkers that are abnormally elevated or decreased in people with CD. However, it may be useful to inspect plots of each biomarker for outliers prior to formal analysis.
To do that, we can use ggplot
. In the call data = dystonia
, you are telling ggplot to use the dystonia
dataset as the variable it internally calls "data
." (It is smart enough to figure this out, so after a few times you will probably want to omit data =
.)
The other really important thing is the call mapping = aes(x = `Biomarker Level`)
. That tells ggplot to take each point in the dataset, and map the Biomarker Level
variable to the x, or horizontal, dimension. (The backticks let ggplot know that "Biomarker Level" is one single variable name.)
The other calls aren't that important to consider right now.
ggplot( data = dystonia, mapping = aes(x = `Biomarker Level`) ) + geom_histogram() + theme_minimal() + theme( panel.background = element_blank(), axis.text = element_blank() ) + labs( x = "Biomarker Level", y = "Count" ) + facet_wrap(~Assay, ncol = 4, scales = "free")
This is a lot of biomarkers, but a few things do look a little suspect. There are formal ways to assess fit to univariate distributions, but really the answer here is to collect more data to know whether these outlier points really represent outliers, or represent increased dispersion due to the presence of disease. (We could calculate and sort on maximum dispersion from median values, or something like that.)
Have a look at Eotaxin-3:
tempPlot = function(data = dystonia %>% filter(Assay %in% "Eotaxin-3")) ggplot( data = data, mapping = aes(x = `Biomarker Level`)) + geom_histogram() + theme_minimal_hgrid() + theme(panel.background = element_blank()) + labs(x = "Level", y = "Count") + facet_wrap(~Assay, ncol = 4, scales = "free") + geom_hline(yintercept = 1:5)+ scale_y_continuous(breaks = c(0, 1, 2, 3, 4, 5, 10, 15, 20, 30, 40)) tempPlot() tempPlot(dystonia %>% filter(Assay %in% "ICAM-1")) tempPlot(dystonia %>% filter(Assay %in% "MCP-1"))
In each of these histograms, there appear to be one or two single measurements fairly far outside of the range of the remainder of the data. Do these datapoints come from the same patient/subject? If so, it might make sense to average them.
To do that, let's make a point plot that codes each participant as a color. Let's standardize recordings of each biomarker to z-scores beforehand, as well.
To do the standardization, we can use the standardize
function from the psycho
package. We use the pipe operator %>%
to chain operations.
There are also several extra tables created here to enable ordering and sorting the biomarkers.
We will visualize this two ways:
1. With color designating the presence of CD. In this visualization, we can check whether the outliers tend to be CD.
1. With color designating each individual subject. In this visualization, we can check whether pairs of outlier values tend to be from the same subject, or whether they might represent anomalies on a given plate.
# create a sample number variable; this will make visualization easier. sampleNumberTable = dystonia %>% filter(Assay == "bFGF") %>% select(Group, Sample) %>% distinct() %>% arrange(Group) %>% group_by(Group) %>% mutate(SampleNumber = row_number()) %>% ungroup() %>% mutate(SampleNumber = ifelse(Group == "Cervical Dystonia (CD)", SampleNumber + 100, ifelse(Group == "CD/Hypothyroid", SampleNumber + 200, SampleNumber))) # standardize each biomarker dystoniaZ = dystonia %>% group_by(Assay) %>% psycho::standardize() %>% ungroup() %>% left_join(sampleNumberTable %>% select(Sample, SampleNumber), by = "Sample") # calculate maximum observed value of each biomarker dystoniaZMax = dystoniaZ %>% group_by(Assay) %>% summarize(maxVal = max(`Biomarker Level`, na.rm = TRUE)) %>% ungroup() %>% arrange(desc(maxVal)) # reorder the assay variable dystoniaZOrdered = dystoniaZ %>% mutate(Assay = ordered(Assay, levels = dystoniaZMax$Assay)) %>% mutate(SampleCode = numform::f_pad_zero(SampleNumber, width = 3)) # plot, arrange in descending order of maximum value ggplot( dystoniaZOrdered, aes(x = Assay, y = `Biomarker Level`) ) + geom_jitter( aes(shape = Group, color = CD), width = 0.05 ) + theme_minimal_hgrid() + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())
Note that the markers at the top, by color (a two-level representation), tend to be mostly, but not entirely the CD group. If this were not the case there might be a concern raised about sampling of some kind. By shape, a three-level representation, we see similar information. So overall there is not a lot of raw data suggesting that the extreme values are only from one group. The extreme values look like they come in pairs, which we would hope to be the case given the duplication procedures in use in the laboratory. To confirm this, we can code each subject by color:
ggplot( dystoniaZOrdered, aes(x = Assay, y = `Biomarker Level`) ) + geom_jitter( aes(shape = Group, color = SampleCode), width = 0.05 ) + theme_minimal_hgrid() + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())
Next, let's average the data of each biomarker of each subject and add the averages to a copy of the previous plot.
In this way we can make sure that the averaging process is not pulling in outlier data.
Additionally, we can call the viridis
color package, which looks nicer and is easier to see.
# calculate averages for each participant dystoniaZAverages = dystoniaZOrdered %>% group_by(Assay, SampleCode) %>% summarize(`Biomarker Level` = mean(`Biomarker Level`, na.rm = TRUE)) %>% ungroup() %>% left_join(dystoniaZOrdered %>% select(Group, SampleCode)) ggplot( dystoniaZOrdered, aes(x = Assay, y = `Biomarker Level`) ) + geom_point( aes(color = Group)) + theme_minimal_hgrid() + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank()) + viridis::scale_color_viridis(discrete = TRUE, option = "D") + geom_point(data = dystoniaZAverages, shape = 3, aes(color = Group) )
Overall it doesn't look like there are any outlier artifacts polluting these averages, although there are some extreme values. Zoom in to make sure.
# calculate averages for each participant dystoniaZAverages = dystoniaZOrdered %>% group_by(Assay, SampleCode) %>% summarize(`Biomarker Level` = mean(`Biomarker Level`, na.rm = TRUE)) %>% ungroup() %>% left_join(dystoniaZOrdered %>% select(Group, SampleCode)) p = ggplot( dystoniaZOrdered, aes(x = Assay, y = `Biomarker Level`) ) + geom_point( aes(color = Group)) + theme_minimal_hgrid() + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank()) + viridis::scale_color_viridis(discrete = TRUE, option = "D") + geom_point(data = dystoniaZAverages, shape = 3, aes(color = Group) ) p + scale_y_continuous(limits = c(2, 7))
Overall there are a few biomarkers with some wide spread, but the means appear representative. A study with appropriate sample sizes (>30/group) should be able to accomodate this spread in most cases.
In this vignette we have examined the dystonia
dataset.
Overall there are some strong outlier participants, in terms of them having extreme values of individual biomarkers (>4-6 SD from mean values of the overall sample of healthy and dystonia / dystonia/hypothyroid patients), but there appear to be no individual plate measurements.
This suggests that these data should be retained in preliminary analyses until a larger number of people can be enrolled.
In the next vignette, biomarkers-of-dystonia, we'll do some analysis on these data.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.