In jlucasmckay/JLmisc: A Package of Miscellaneous R Functions

knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 7, fig.align = "center" )

Load dystonia dataset

As part of the dystonia dataset, there are measurements of 37 different biomarkers in 40 different people, 10 of whom have cervical dystonia, and 10 of whom have a combination of cervical dystonia and hypothyroidism. These data were provided by Larua Scorr, MD, at EMory.

library(JLmisc)
library(tidyverse)
library(cowplot)

dystonia

The dataset is arranged in "long" format, with a column for:

CD, a two-level character variable coding for the presence of Cervical Dystonia (CD)
Group, a three-level character variable coding for Control vs. CD vs. CD and hypothyroid
Assay, a 37-level character variable coding for the biomarker
Sample, a 40-level character variable coding for the participant
Biomarker Level, the measured biomarker level.

We will get into grouping in more detail later. Note that there are typically two measurements of each biomarker for each participant. This duplication is performed as standard protocol at the Emory University core facility that analyzed the data, and ideally the two observations should be averaged.

Inspect biomarker distributions

The eventual point of the exercise is to identify biomarkers that are abnormally elevated or decreased in people with CD. However, it may be useful to inspect plots of each biomarker for outliers prior to formal analysis.

To do that, we can use ggplot. In the call data = dystonia, you are telling ggplot to use the dystonia dataset as the variable it internally calls "data." (It is smart enough to figure this out, so after a few times you will probably want to omit data =.)

The other really important thing is the call mapping = aes(x = `Biomarker Level`). That tells ggplot to take each point in the dataset, and map the Biomarker Level variable to the x, or horizontal, dimension. (The backticks let ggplot know that "Biomarker Level" is one single variable name.) The other calls aren't that important to consider right now.

ggplot(
  data = dystonia,
  mapping = aes(x = `Biomarker Level`)
  ) +
  geom_histogram() +
  theme_minimal() +
  theme(
    panel.background = element_blank(),
    axis.text = element_blank()
  ) + 
  labs(
    x = "Biomarker Level",
    y = "Count"
  ) + 
  facet_wrap(~Assay, ncol = 4, scales = "free")

Inspect a few biomarkers with outliers

This is a lot of biomarkers, but a few things do look a little suspect. There are formal ways to assess fit to univariate distributions, but really the answer here is to collect more data to know whether these outlier points really represent outliers, or represent increased dispersion due to the presence of disease. (We could calculate and sort on maximum dispersion from median values, or something like that.)

Have a look at Eotaxin-3:

tempPlot = function(data = dystonia %>% filter(Assay %in% "Eotaxin-3")) ggplot( data = data, mapping = aes(x = `Biomarker Level`)) + geom_histogram() + theme_minimal_hgrid() + theme(panel.background = element_blank()) + labs(x = "Level", y = "Count") + facet_wrap(~Assay, ncol = 4, scales = "free") + geom_hline(yintercept = 1:5)+ scale_y_continuous(breaks = c(0, 1, 2, 3, 4, 5, 10, 15, 20, 30, 40))

tempPlot()
tempPlot(dystonia %>% filter(Assay %in% "ICAM-1"))
tempPlot(dystonia %>% filter(Assay %in% "MCP-1"))

In each of these histograms, there appear to be one or two single measurements fairly far outside of the range of the remainder of the data. Do these datapoints come from the same patient/subject? If so, it might make sense to average them.

Inspect patterns of outlier occurrence across biomarkers

To do that, let's make a point plot that codes each participant as a color. Let's standardize recordings of each biomarker to z-scores beforehand, as well. To do the standardization, we can use the standardize function from the psycho package. We use the pipe operator %>% to chain operations. There are also several extra tables created here to enable ordering and sorting the biomarkers. We will visualize this two ways: 1. With color designating the presence of CD. In this visualization, we can check whether the outliers tend to be CD. 1. With color designating each individual subject. In this visualization, we can check whether pairs of outlier values tend to be from the same subject, or whether they might represent anomalies on a given plate.

# create a sample number variable; this will make visualization easier.
sampleNumberTable = dystonia %>% filter(Assay == "bFGF") %>% select(Group, Sample) %>% distinct() %>% arrange(Group) %>% group_by(Group) %>% mutate(SampleNumber = row_number()) %>% ungroup() %>%
  mutate(SampleNumber = ifelse(Group == "Cervical Dystonia (CD)", SampleNumber + 100, ifelse(Group == "CD/Hypothyroid", SampleNumber + 200, SampleNumber)))

# standardize each biomarker
dystoniaZ = dystonia %>% group_by(Assay) %>% psycho::standardize() %>% ungroup() %>% left_join(sampleNumberTable %>% select(Sample, SampleNumber), by = "Sample")

# calculate maximum observed value of each biomarker
dystoniaZMax = dystoniaZ %>% group_by(Assay) %>% summarize(maxVal = max(`Biomarker Level`, na.rm = TRUE)) %>% ungroup() %>% arrange(desc(maxVal))

# reorder the assay variable
dystoniaZOrdered = dystoniaZ %>% mutate(Assay = ordered(Assay, levels = dystoniaZMax$Assay)) %>% mutate(SampleCode = numform::f_pad_zero(SampleNumber, width = 3))


# plot, arrange in descending order of maximum value
ggplot(
  dystoniaZOrdered,
  aes(x = Assay, y = `Biomarker Level`)
  ) +
  geom_jitter( aes(shape = Group, color = CD), width = 0.05 ) +
                theme_minimal_hgrid() +
                theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())

Note that the markers at the top, by color (a two-level representation), tend to be mostly, but not entirely the CD group. If this were not the case there might be a concern raised about sampling of some kind. By shape, a three-level representation, we see similar information. So overall there is not a lot of raw data suggesting that the extreme values are only from one group. The extreme values look like they come in pairs, which we would hope to be the case given the duplication procedures in use in the laboratory. To confirm this, we can code each subject by color:

ggplot(
  dystoniaZOrdered,
  aes(x = Assay, y = `Biomarker Level`)
  ) +
  geom_jitter( aes(shape = Group, color = SampleCode), width = 0.05 ) +
                theme_minimal_hgrid() +
                theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())

Inspect averaging of outlier scores

Next, let's average the data of each biomarker of each subject and add the averages to a copy of the previous plot. In this way we can make sure that the averaging process is not pulling in outlier data. Additionally, we can call the viridis color package, which looks nicer and is easier to see.

# calculate averages for each participant
dystoniaZAverages = dystoniaZOrdered %>% group_by(Assay, SampleCode) %>% summarize(`Biomarker Level` = mean(`Biomarker Level`, na.rm = TRUE)) %>% ungroup() %>% left_join(dystoniaZOrdered %>% select(Group, SampleCode))
ggplot(
  dystoniaZOrdered,
  aes(x = Assay, y = `Biomarker Level`)
  ) +
  geom_point( aes(color = Group)) +
                theme_minimal_hgrid() +
                theme(axis.ticks.x = element_blank(), axis.text.x = element_blank()) +
  viridis::scale_color_viridis(discrete = TRUE, option = "D") + 
  geom_point(data = dystoniaZAverages, shape = 3, aes(color = Group) )

Overall it doesn't look like there are any outlier artifacts polluting these averages, although there are some extreme values. Zoom in to make sure.

# calculate averages for each participant
dystoniaZAverages = dystoniaZOrdered %>% group_by(Assay, SampleCode) %>% summarize(`Biomarker Level` = mean(`Biomarker Level`, na.rm = TRUE)) %>% ungroup() %>% left_join(dystoniaZOrdered %>% select(Group, SampleCode))
p = ggplot(
  dystoniaZOrdered,
  aes(x = Assay, y = `Biomarker Level`)
  ) +
  geom_point( aes(color = Group)) +
                theme_minimal_hgrid() +
                theme(axis.ticks.x = element_blank(), axis.text.x = element_blank()) +
  viridis::scale_color_viridis(discrete = TRUE, option = "D") + 
  geom_point(data = dystoniaZAverages, shape = 3, aes(color = Group) )
p + scale_y_continuous(limits = c(2, 7))

Overall there are a few biomarkers with some wide spread, but the means appear representative. A study with appropriate sample sizes (>30/group) should be able to accomodate this spread in most cases.

Conclusion

In this vignette we have examined the dystonia dataset. Overall there are some strong outlier participants, in terms of them having extreme values of individual biomarkers (>4-6 SD from mean values of the overall sample of healthy and dystonia / dystonia/hypothyroid patients), but there appear to be no individual plate measurements. This suggests that these data should be retained in preliminary analyses until a larger number of people can be enrolled.

In the next vignette, biomarkers-of-dystonia, we'll do some analysis on these data.

jlucasmckay/JLmisc documentation built on Dec. 14, 2019, 9:33 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com