1. Introduction

In this vignette, we provide a guide to using the replicateOutliers package. Please see {CITE PAPER} for more details and derivations.^1^

The setting for these statistical methods is replicated (duplicate, triplicate, etc.) positive data, called "replicates," where we want to determine which sets of replicates are too different from each other to be useful. We developed these methods in the context of biological data obtained at Fox Chase Cancer Center: inhibition of kinases by different ligands, proteomics, and RNA-seq. We can easily imagine other contexts. In public health, physicians and epidemiologists running a screening program should repeat measurements on each participant, and participants whose measurements differ significantly from each other should be flagged for another, conclusive measurement. If an engineer is building a measurement device, they should test that the device returns consistent values when measuring the same quantity multiple times. In any application, assuming the observed variables follow certain positive distributions, we can get a sense of the replicability of the measurement technique by seeing how many sets of replicates in this "well-behaved" set get flagged as outliers.

We will reproduce the data from the paper {CITE PAPER} that we generated using generalized gamma (GG) distributions, which I included in this package as the Sim_GG data.frame.^1^ We created the data this way because it looks like a volcano plot (turned on its side) of the kind usually seen when comparing fold-change to significance. We then look for outliers in Sim_GG using the joint exponential and marginal exponential methods and describe how we would use the remaining $3$ methods. We lastly discuss briefly how we would analyze data with more than $2$ replicates.

2. Generating the Data

In any context, we assume we have two repeated (duplicated) measurements that are i.i.d.: $X_1$ and $X_2$. The useful variables we want for examining their difference are their absolute difference: $$ \Delta = X_1 - X_2 $$ and their relative difference: $$ Z = \frac{\sqrt{2}|X_1-X_2|}{X_1+X_2}. $$ The variable $Z$ (uppercase zeta) is the coefficient of variation ($CV$) between the two replicates, which is equal to their standard deviation divided by their mean. Because $0 \leq |X_1-X_2| \leq X_1+X_2$ for $X_1 \geq 0, \ X_2 \geq 0,$ we know $Z \in [0,\sqrt{2}]$.
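
For example, if $X_1 = 4$ and $X_2 = 5$, then $\Delta = -1$ and $Z = \frac{\sqrt{2}\cdot|4-5|}{4+5} = \frac{\sqrt{2}}{9} \approx 0.157$.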

When we look for outliers among duplicated data, we want to look at the plot of $\Delta$ and $Z$ between the two replicates. Such a plot often looks like a volcano plot seen in biology. I created the Sim_GG data to look like a sideways volcano plot by combining GG distributions with different parameters and then fiddling with the parameters until the $(\Delta,Z)$-plot looked right. I will repeat the steps here, but the plot may look different depending on the starting values.

rm(list=ls())
set.seed(1729)
# Set the seed so we can replicate the random results, using Ramanujan's number for the seed
# I removed some of the steps I used in making the original Sim_GG data.frame, so this seed will not
# exactly match the one used to make Sim_GG

library(replicateOutliers) #For the outlier-detection functions used below
library(flexsurv) #For GG functions
library(ggplot2) #For ggplot plotting functions

n_total <- 10000 #Number of data points. Doing full simulation
df <- data.frame(
  section=c(rep(0, 0.2*n_total), rep(1, 0.39*n_total), rep(2, 0.39*n_total),
            rep(3, 0.01*n_total), rep(4, 0.01*n_total)),
  X_1=c(rexp(0.2*n_total, 1/5), rgengamma.orig(0.39*n_total, shape=1.1, scale=1.6, k=120),
        rgengamma.orig(0.39*n_total, shape=0.95, scale=2/3, k=200),
        rgengamma.orig(0.01*n_total, shape=0.898, scale=2/3, k=100),
        rgengamma.orig(0.01*n_total, shape=0.89, scale=1.6, k=125)),
  X_2=c(rexp(0.2*n_total, 1/5), rgengamma.orig(0.39*n_total, shape=0.95, scale=2/3, k=200),
        rgengamma.orig(0.39*n_total, shape=1.1, scale=1.6, k=120),
        rgengamma.orig(0.01*n_total, shape=0.89, scale=1.6, k=125),
        rgengamma.orig(0.01*n_total, shape=0.898, scale=2/3, k=100))
)
df$D <- df$X_1 - df$X_2
df$Z <- sqrt(2) * abs(df$X_1 - df$X_2) / (df$X_1 + df$X_2)

head(df)

ggplot(df, aes(x=D, y=Z, color=section)) + geom_point() +
  ggtitle("Volcano Plot Using Generalized Gamma Distribution\n(New Simulated Data)")

In this plot, we have labelled by color the "sections" of the data: $0$ corresponds to the middle band, $1$ and $2$ correspond to the "wings," and $3$ and $4$ correspond to the points with large $\Delta$ and large $Z$. We would expect our methods, if working correctly, to identify sections $3$ and $4$ as the outliers.

For comparison, here is a similar plot for the Sim_GG data.frame included in this package:

data(Sim_GG)
head(Sim_GG)
ggplot(Sim_GG, aes(x=D, y=Z, color=section)) + geom_point() +
  ggtitle("Volcano Plot Using Generalized Gamma Distribution\n(Sim_GG Data Frame)")

The plots are almost (or maybe even exactly) identical. Notice also that Sim_GG includes more columns than the ones we created here.

3. The Function q_exp_joint_DZ

The first function we can use to find outliers in Sim_GG is q_exp_joint_DZ, which is my implementation of the joint exponential method. In this method, we assume $X_1$ and $X_2$ are independent exponential random variables. We fit $\Delta = X_1 - X_2$ to an asymmetric Laplace distribution using the VGAM package. If $D \sim \mathcal{AL^*}(\theta,\kappa,\sigma)$ is an asymmetric Laplace distribution, it has CDF

$$F_D(d) = \left\{ \begin{array}{cl} \frac{\kappa^2}{1+\kappa^2}e^{\frac{\sqrt{2}}{\sigma\kappa}(d-\theta)} & \textrm{if }d < \theta \\ 1 - \frac{1}{1+\kappa^2}e^{-\frac{\sqrt{2}\kappa}{\sigma}(d-\theta)} & \textrm{if }d \geq \theta. \end{array}\right.$$

The star in $\mathcal{AL^*}$ comes from the choice of parameterization used by Kotz et al.^2^ If $\Delta = X_1 - X_2$, $X_1 \sim Exp(\lambda_1)$, and $X_2 \sim Exp(\lambda_2)$, then $\Delta \sim \mathcal{AL^*}(0, \sqrt{\frac{\lambda_1}{\lambda_2}}, \sqrt{\frac{2}{\lambda_1\lambda_2}}).$

The vglm function in VGAM allows us to get maximum likelihood estimates (MLEs) and $95\%$-confidence intervals for the parameters. If the confidence interval for $\log\kappa$ includes $0$, then we set $\kappa = 1$ and $\lambda_1 = \lambda_2 = \lambda = \frac{\sqrt{2}}{\sigma}$, using the MLE for $\sigma$. We call this the symmetric case. If the $95\%$-confidence interval does not contain $0$, then we let $\lambda_1 \neq \lambda_2$; we call this the asymmetric case.
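
As a minimal sketch of this decision (mirroring the vglm calls that appear later in outlier_DZ, and assuming df$D holds the observed differences), we could run:

library(VGAM) #For vglm and the alaplace3 family

# Fit Delta to an asymmetric Laplace distribution;
# Coef() returns (theta, sigma, kappa) for alaplace3
fit_ala <- vglm(as.matrix(df$D) ~ 1, alaplace3, crit="c",
                control=vglm.control(maxit=50))
# Approximate normal 95%-confidence interval for log(kappa)
CI_log_kappa <- log(Coef(fit_ala)[3]) +
  summary(fit_ala)@coef3[3,2] * qnorm(0.975) * c(-1, 1)

if(CI_log_kappa[1] <= 0 && CI_log_kappa[2] >= 0){ #Symmetric case
  lambda <- sqrt(2) / Coef(fit_ala)[2] #lambda_1 = lambda_2 = sqrt(2)/sigma
}else{ #Asymmetric case: invert kappa = sqrt(l1/l2) and sigma = sqrt(2/(l1*l2))
  kappa_hat <- Coef(fit_ala)[3]; sigma_hat <- Coef(fit_ala)[2]
  lambda_1 <- kappa_hat * sqrt(2) / sigma_hat
  lambda_2 <- sqrt(2) / (sigma_hat * kappa_hat)
}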

Once we know what case we have, we use the estimates for $(\lambda_1,\lambda_2)$ or $\lambda$ calculated from the MLEs of $\kappa$ and $\sigma$ as parameters for the joint distribution of $\Delta$ and $Z$. If $X_1, X_2 \sim Exp(\lambda)$ in the symmetric case, then $(\Delta,Z)$ has the joint cumulative distribution function (CDF)

$$ F_{\Delta Z}(\delta, \zeta) = \left\{ \begin{array}{cl} \frac{1-\gamma}{2(1+\gamma)}e^{\frac{\lambda(1+\gamma)\delta}{1-\gamma}} & \textrm{if }\delta \leq 0, \ 0 \leq \zeta \leq \sqrt{2} \\ \frac{1}{2}e^{\lambda \delta} & \textrm{if }\delta \leq 0, \ \zeta > \sqrt{2} \\ \frac{1}{2}\lambda \delta e^{-\lambda \frac{\sqrt{2}\delta}{\zeta}} + \left(\frac{1}{2} - \frac{1}{\gamma+1}\right)\left(-1 + e^{-\lambda \frac{\sqrt{2}\delta}{\zeta}} + \lambda \frac{\sqrt{2}\delta}{\zeta} e^{-\lambda \frac{\sqrt{2}\delta}{\zeta}}\right) + \frac{1-\gamma}{2(1+\gamma)} & \textrm{if }\delta > 0, \ 0 \leq \zeta \leq \sqrt{2} \\ 1 - \frac{1}{2}e^{-\lambda \delta} & \textrm{if }\delta > 0, \ \zeta > \sqrt{2} \\ 0 & \textrm{if }\zeta < 0, \end{array}\right. $$

where $\gamma = \frac{\sqrt{2}-\zeta}{\sqrt{2}+\zeta}$. If $X_1 \sim Exp(\lambda_1)$ and $X_2 \sim Exp(\lambda_2)$ with $\lambda_1 \neq \lambda_2$ in the asymmetric case, then $(\Delta,Z)$ has the joint CDF

$$ F_{\Delta Z}(\delta, \zeta) = \left\{ \begin{array}{cl} \frac{\lambda_1\lambda_2(1-\gamma)e^{\frac{(\lambda_1\gamma+\lambda_2)\delta}{1-\gamma}}}{(\lambda_1\gamma+\lambda_2)(\lambda_1+\lambda_2)} & \textrm{if }\delta \leq 0, \ 0 \leq \zeta \leq \sqrt{2} \\ \frac{\lambda_1e^{\lambda_2\delta}}{\lambda_1+\lambda_2} & \textrm{if }\delta \leq 0, \ \zeta > \sqrt{2} \\ -\frac{2\lambda_1\lambda_2}{\lambda_1^2-\lambda_2^2}e^{-\left(\left(1+\frac{\sqrt{2}}{\zeta}\right)\lambda_1+\left(\frac{\sqrt{2}}{\zeta}-1\right)\lambda_2\right)\frac{\delta}{2}} - \frac{\lambda_1\lambda_2(1+\gamma)}{(\lambda_1+\lambda_2\gamma)(\lambda_2-\lambda_1)}e^{-\frac{(\lambda_1+\lambda_2\gamma)\sqrt{2}\delta}{\zeta(\gamma+1)}} + \frac{\lambda_1\lambda_2(1-\gamma^2)}{(\lambda_1+\lambda_2\gamma)(\lambda_1\gamma+\lambda_2)} & \textrm{if }\delta > 0, \ 0 \leq \zeta \leq \sqrt{2} \\ 1 - \frac{\lambda_2}{\lambda_1+\lambda_2}e^{-\lambda_1\delta} & \textrm{if }\delta > 0, \ \zeta > \sqrt{2} \\ 0 & \textrm{if }\zeta < 0. \end{array}\right. $$

The joint CDFs represent $F_{\Delta Z}(\delta,\zeta) = \mathbb{P}(\Delta \leq \delta, \ Z \leq \zeta)$. We define the outlier probability $q$ (for the joint exponential method) to be

$$ q(\delta, \zeta) = \left\{ \begin{array}{rcll} \mathbb{P}(\Delta \leq \delta, \ Z \geq \zeta) & = & F_{\Delta}(\delta) - F_{\Delta Z}(\delta,\zeta) & \textrm{if }\delta \leq 0 \\ \mathbb{P}(\Delta \geq \delta, \ Z \geq \zeta) & = & 1 - F_{\Delta}(\delta) - F_Z(\zeta) + F_{\Delta Z}(\delta,\zeta) & \textrm{if }\delta > 0, \end{array} \right. $$

so we also need to know the marginal CDFs $F_{\Delta}(\delta)$ and $F_Z(\zeta)$. The marginal CDFs are related to the joint CDF by $F_{\Delta}(\delta) = \lim_{\zeta\to\infty} F_{\Delta Z}(\delta,\zeta)$ and $F_Z(\zeta) = \lim_{\delta\to\infty} F_{\Delta Z}(\delta,\zeta)$.^3^ For the symmetric case, the marginal CDFs are

$$ F_{\Delta}(\delta) = \left\{ \begin{array}{cl} \frac{1}{2}e^{\lambda \delta} & \textrm{if }\delta \leq 0 \\ 1 - \frac{1}{2}e^{-\lambda \delta} & \textrm{if }\delta > 0 \end{array}\right. $$

and

$$ F_Z(\zeta) = \left\{ \begin{array}{cl} 0 & \textrm{if }\zeta < 0 \\ \frac{1-\gamma}{1+\gamma} = \frac{\zeta}{\sqrt{2}} & \textrm{if }0 \leq \zeta \leq \sqrt{2} \\ 1 & \textrm{if }\zeta > \sqrt{2}. \end{array}\right. $$

In the asymmetric case, the marginal CDFs are

$$ F_{\Delta}(\delta) = \left\{ \begin{array}{cl} \frac{\lambda_1}{\lambda_1+\lambda_2}e^{\lambda_2\delta} & \textrm{if }\delta \leq 0 \\ 1 - \frac{\lambda_2}{\lambda_1+\lambda_2}e^{-\lambda_1\delta} & \textrm{if }\delta > 0, \end{array}\right. $$

and

$$ F_Z(\zeta) = \left\{ \begin{array}{cl} 0 & \textrm{if }\zeta < 0 \\ \frac{\lambda_1\lambda_2(1-\gamma^2)}{(\lambda_1+\lambda_2\gamma)(\lambda_1\gamma+\lambda_2)} & \textrm{if }0 \leq \zeta \leq \sqrt{2} \\ 1 & \textrm{if }\zeta > \sqrt{2}. \end{array}\right. $$

Notice that $\Delta$ is marginally distributed as a symmetric (asymmetric) Laplace random variable in the symmetric (asymmetric) case, as we would expect. We now know that $Z$ is marginally uniformly distributed over $[0,\sqrt{2}]$ in the symmetric case, and that its marginal distribution in the asymmetric case appears to be a new beast.
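
To make these formulas concrete, here is a minimal sketch (not the package's internal code) of $q(\delta,\zeta)$ in the symmetric case, assuming a known $\lambda$:

# Sketch of the symmetric-case q(delta, zeta) for a single point, given lambda
F_DZ_sym <- function(delta, zeta, lambda){
  if(zeta < 0) return(0)
  g <- (sqrt(2) - zeta) / (sqrt(2) + zeta) #gamma
  if(delta <= 0 && zeta <= sqrt(2)){
    (1 - g) / (2 * (1 + g)) * exp(lambda * (1 + g) * delta / (1 - g))
  }else if(delta <= 0){
    exp(lambda * delta) / 2
  }else if(zeta <= sqrt(2)){
    u <- lambda * sqrt(2) * delta / zeta
    lambda * delta * exp(-u) / 2 +
      (1/2 - 1/(g + 1)) * (-1 + exp(-u) + u * exp(-u)) +
      (1 - g) / (2 * (1 + g))
  }else{
    1 - exp(-lambda * delta) / 2
  }
}
F_D_sym <- function(delta, lambda){
  if(delta <= 0) exp(lambda * delta) / 2 else 1 - exp(-lambda * delta) / 2
}
F_Z_sym <- function(zeta) min(max(zeta / sqrt(2), 0), 1)

q_sym <- function(delta, zeta, lambda){
  if(delta <= 0) F_D_sym(delta, lambda) - F_DZ_sym(delta, zeta, lambda)
  else 1 - F_D_sym(delta, lambda) - F_Z_sym(zeta) + F_DZ_sym(delta, zeta, lambda)
}

q_sym(3, 1.2, 1/5) #q for a point with large Delta and Z; approximately 0.065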

The function q_exp_joint_DZ performs all the steps above to calculate $q(\delta,\zeta)$ for each point of $(\Delta,Z)$-data: fitting the absolute difference $\Delta$ to an asymmetric Laplace distribution, determining whether we are in the symmetric or asymmetric case, calculating the MLE for $\lambda$ or $(\lambda_1,\lambda_2)$, and plugging these MLEs into the joint and marginal CDFs to get $q(\delta,\zeta)$ for every point. It also checks whether $\Delta$ is offset by some parameter $\theta \in \mathbb{R}$, again using a $95\%$-confidence interval to test whether $\theta \neq 0$ is significant. There are parameters for q_exp_joint_DZ to change the $p$-values for these tests, but we can leave them at the default values of $0.05$.

Let's apply q_exp_joint_DZ to our simulated data and color the plot by the $q$-value. The color scheme comes from the package viridis and is supposed to be easier for people who are colorblind to decipher.

library(viridis) #For scale_colour_viridis in ggplot
df$q_exp_joint <- q_exp_joint_DZ(df$D, df$Z)

ggplot(df, aes(x=D, y=Z, color=q_exp_joint)) + scale_colour_viridis() +
  geom_point() + labs(color="q-Value") + ggtitle("Joint Exponential Method")

To find the outliers, we define a cutoff value $q^*$ for probability and consider any points with $q < q^*$ to be outliers. We do so because these points have such a low probability of occurring, given our assumption of independently distributed exponential data, that something must be wrong with them.

We do not know how many outliers we should be looking for, but it is useful to try different cutoffs and see where the outliers are on the $(\Delta,Z)$-plot and how many there are. You can place a bunch of ggplot objects next to each other nicely using the grid.arrange function in the package gridExtra.

library(gridExtra) #For grid.arrange
ej_1 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.05))) +
  geom_point() + labs(color=expression(q<0.05)) + ggtitle("Joint Exponential")
ej_2 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.01))) +
  geom_point() + labs(color=expression(q<0.01)) + ggtitle("Joint Exponential")
ej_3 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.005))) +
  geom_point() + labs(color=expression(q<0.005)) + ggtitle("Joint Exponential")
ej_4 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.001))) +
  geom_point() + labs(color=expression(q<0.001)) + ggtitle("Joint Exponential")
ej_5 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.0005))) +
  geom_point() + labs(color=expression(q<5%*%10^-4)) + ggtitle("Joint Exponential")
ej_6 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.0001))) +
  geom_point() + labs(color=expression(q<1%*%10^-4)) + ggtitle("Joint Exponential")
grid.arrange(ej_1, ej_2, ej_3, ej_4, ej_5, ej_6, nrow=3, ncol=2)


sum(df$q_exp_joint < 0.05) / n_total #0.0415
sum(df$q_exp_joint < 0.01) / n_total #0.0232
sum(df$q_exp_joint < 0.005) / n_total #0.0182
sum(df$q_exp_joint < 0.001) / n_total #0.0035
sum(df$q_exp_joint < 5e-4) / n_total #7e-04
sum(df$q_exp_joint < 1e-4) / n_total #0

If we prioritize removing points in the regions with large $\Delta$ and $Z$ while still keeping the tops of the middle bands, then we should use a cutoff between $q^* = 5\times10^{-4}$ and $q^* = 0.001$, giving us between $7$ ($0.07\%$) and $35$ ($0.35\%$) outliers. We could be stricter with the non-outliers if we want and use a larger $q^*$, but then some points get marked as outliers at the top of the middle band.

4. The Function q_exp_marg_DZ

The function q_exp_marg_DZ is my implementation of the marginal exponential method. Instead of using the joint CDF of $\Delta$ and $Z$, we use each one's marginal CDF independently.

This method starts out the same way as the joint exponential method. We assume $X_1$ and $X_2$ are independent exponential random variables, possibly with the same parameter. We fit the absolute difference $\Delta = X_1 - X_2$ to an asymmetric Laplace distribution using the vglm function in VGAM, create confidence intervals for $\kappa$ and $\theta$, determine whether the data are asymmetric and/or shifted by a constant $\theta$ for $\Delta$, and use the resulting MLEs to calculate the fitted parameters $\lambda$ or $(\lambda_1,\lambda_2)$. From here, this method diverges.

If $D \sim \mathcal{AL^*}(\theta,\kappa,\sigma)$, then the expected value of $D$ is $E[D] = \mu + \theta$, where $\mu = \frac{\sigma}{\sqrt{2}}(\frac{1}{\kappa} - \kappa)$, and its variance is $Var(D) = \sigma^2 + \mu^2$.^2^ Using our MLEs for $(\theta,\kappa,\sigma)$, we determine all the points that are within $k$ ($k = 1$ by default) standard deviations of the expected mean (not the expected mode, which is just $\theta$). We consider these points to be in the middle band and not outliers by definition, regardless of their $Z$-values. For consistency, we assign them $q = 1$.
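
As a small numeric illustration of the band calculation (the fitted values here are invented for the example):

# Hypothetical fitted values, only to illustrate the middle-band interval
theta_hat <- 0; kappa_hat <- 1.2; sigma_hat <- 2; k <- 1
mu_hat <- sigma_hat * (1/kappa_hat - kappa_hat) / sqrt(2)
sd_hat <- sqrt(sigma_hat^2 + mu_hat^2)
band <- theta_hat + mu_hat + k * sd_hat * c(-1, 1)
band #Points with D inside this interval get q = 1; roughly (-2.58, 1.55) here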

For the remaining points, we use our MLE for $\lambda$ or $(\lambda_1,\lambda_2)$ as parameters for the marginal CDF of $Z$. We then assign

$$q(\zeta) = \mathbb{P}(Z \geq \zeta) = 1 - F_Z(\zeta)$$

for each point, regardless of their $\Delta$-values (as long as they are outside the middle band). This method, as opposed to the joint exponential method, prioritizes preserving the middle band and only uses $Z$ to find outliers outside that band, which can be useful if there is a known tolerance of (absolute) experimental error in the measurements.

We can apply this method to the simulated data too, using the same procedure as before to see how we should pick outliers. The default parameters are $p$-values of $0.05$ for testing $\log\kappa = 0$ and $\theta = 0$ and $k = 1$ standard deviations away from the mean to define the middle band. We will not change them in this example.

df$q_exp_marg <- q_exp_marg_DZ(df$D, df$Z)

ggplot(df, aes(x=D, y=Z, color=q_exp_marg)) + scale_colour_viridis() +
  geom_point() + labs(color="q-Value") + ggtitle("Marginal Exponential Method")

mj_1 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_marg<0.7))) +
  geom_point() + labs(color=expression(q<0.7)) + ggtitle("Marginal Exponential")
mj_2 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_marg<0.6))) +
  geom_point() + labs(color=expression(q<0.6)) + ggtitle("Marginal Exponential")
mj_3 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_marg<0.5))) +
  geom_point() + labs(color=expression(q<0.5)) + ggtitle("Marginal Exponential")
mj_4 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_marg<0.4))) +
  geom_point() + labs(color=expression(q<0.4)) + ggtitle("Marginal Exponential")
grid.arrange(mj_1, mj_2, mj_3, mj_4, nrow=2, ncol=2)

sum(df$q_exp_marg < 0.7) / n_total #0.0263
sum(df$q_exp_marg < 0.6) / n_total #0.0199
sum(df$q_exp_marg < 0.5) / n_total #0.0131
sum(df$q_exp_marg < 0.4) / n_total #0.0012

We should use cutoffs that only leave outliers in the regions with high $Z$, not in the wings. We therefore choose $q^*$ between $0.4$ and $0.5$, for between $12$ ($0.12\%$) and $131$ ($1.31\%$) outliers.

5. The Remaining Functions

There are $2$ remaining main functions in this package. They are generalizations of the exponential methods to GG-distributed data. The GG distribution is a broad family of distributions that includes the exponential, gamma, Weibull, $\chi^2$, $\chi$, half-normal, Rayleigh, and Maxwell-Boltzmann distributions for different parameter values, among many others.^3,4,5,6,7^ This family is useful to us because, with so many parameters to fit, we can capture many different shapes of $(\Delta,Z)$-plots. What we lose with the GG methods, however, is computational ease. Both of them required parallelized computations on the Fox Chase Cancer Center High-Performance Cluster for the original paper.^1^ As a result, we only describe them here instead of running them.

This package has both a joint GG method and a marginal GG method, implemented in the functions q_gg_joint_DZ and q_gg_marg_DZ, respectively. For the joint method, we fit $X_1$ and $X_2$ individually (not $\Delta$ like we did before) to GG distributions using the flexsurvreg function in the package flexsurv. Using the parameters for each one, we get the joint probability density function (PDF) for $(\Delta,Z)$ as a variable transform of the one for $(X_1,X_2)$:^3,6,7,8^

$$ \begin{array}{rcl} f_{\Delta Z}(\delta, \zeta|\alpha_1,\beta_1,c_1,\alpha_2,\beta_2,c_2) & = & f_{X_1X_2}\left(\Phi{\delta \choose \zeta}\middle|\alpha_1,\beta_1,c_1,\alpha_2,\beta_2,c_2\right)\left|\det\left(\mathbf{D}\Phi{\delta \choose \zeta}\right)\right| \\ & = & \left\{ \begin{array}{cl} \frac{c_1}{\beta_1^{c_1\alpha_1}\Gamma(\alpha_1)}\left(\left(\frac{\sqrt{2}}{\zeta}+1\right)\frac{\delta}{2}\right)^{c_1\alpha_1-1}\exp\left(-\left(\left(\frac{\sqrt{2}}{\zeta}+1\right)\frac{\delta}{2\beta_1}\right)^{c_1}\right)\times & \\ \quad\frac{c_2}{\beta_2^{c_2\alpha_2}\Gamma(\alpha_2)}\left(\left(\frac{\sqrt{2}}{\zeta}-1\right)\frac{\delta}{2}\right)^{c_2\alpha_2-1}\exp\left(-\left(\left(\frac{\sqrt{2}}{\zeta}-1\right)\frac{\delta}{2\beta_2}\right)^{c_2}\right)\times\frac{\sqrt{2}\delta}{2\zeta^2} & \textrm{if }\delta \geq 0, \ 0 \leq \zeta \leq \sqrt{2} \\ \frac{c_1}{\beta_1^{c_1\alpha_1}\Gamma(\alpha_1)}\left(\left(-\frac{\sqrt{2}}{\zeta}+1\right)\frac{\delta}{2}\right)^{c_1\alpha_1-1}\exp\left(-\left(\left(-\frac{\sqrt{2}}{\zeta}+1\right)\frac{\delta}{2\beta_1}\right)^{c_1}\right)\times & \\ \quad\frac{c_2}{\beta_2^{c_2\alpha_2}\Gamma(\alpha_2)}\left(\left(-\frac{\sqrt{2}}{\zeta}-1\right)\frac{\delta}{2}\right)^{c_2\alpha_2-1}\exp\left(-\left(\left(-\frac{\sqrt{2}}{\zeta}-1\right)\frac{\delta}{2\beta_2}\right)^{c_2}\right)\times\left(-\frac{\sqrt{2}\delta}{2\zeta^2}\right) & \textrm{if }\delta < 0, \ 0 \leq \zeta \leq \sqrt{2} \\ 0 & \textrm{else.} \end{array}\right. \end{array} $$

Please see the original paper for more details on this PDF.^1^ For the joint method, we use numerical integration to find $$ q(\delta, \zeta) = \left\{ \begin{array}{rcll} \mathbb{P}(\Delta \leq \delta, \ Z \geq \zeta) & = & \int_{-\infty}^{\delta}\int_{\zeta}^{\sqrt{2}}f_{\Delta Z}(a,b|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2})\textrm{d}b\textrm{d}a & \textrm{if }\delta < 0 \\ \mathbb{P}(\Delta \geq \delta, \ Z \geq \zeta) & = & \int_{\delta}^{\infty}\int_{\zeta}^{\sqrt{2}}f_{\Delta Z}(a,b|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2})\textrm{d}b\textrm{d}a & \textrm{if }\delta \geq 0, \end{array}\right. $$

using the package cubature for numerical integration and parallel for running this procedure in parallel. There are no parameters to specify for q_gg_joint_DZ besides the number of processor cores to use for parallel, which I set to be $1$ less than the total number of cores available.
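
Although we do not run the GG methods here, the following sketch shows the kind of calculation involved for a single point with $\delta \geq 0$. The parameter values are hypothetical stand-ins for the MLEs, and the infinite upper limit of integration is truncated for the numerical integrator:

library(cubature) #For adaptIntegrate

# Joint GG density of (Delta, Z), written directly from the PDF above
f_DZ_gg <- function(x, a1, b1, c1, a2, b2, c2){
  d <- x[1]; z <- x[2]
  if(z <= 0 || z > sqrt(2)) return(0)
  s <- if(d >= 0) 1 else -1 #Covers both the delta >= 0 and delta < 0 branches
  u1 <- (s * sqrt(2) / z + 1) * d / 2
  u2 <- (s * sqrt(2) / z - 1) * d / 2
  gg_dens <- function(u, a, b, cc){ #GG density factor in the original parameterization
    cc / (b^(cc * a) * gamma(a)) * u^(cc * a - 1) * exp(-(u / b)^cc)
  }
  gg_dens(u1, a1, b1, c1) * gg_dens(u2, a2, b2, c2) * s * sqrt(2) * d / (2 * z^2)
}

# q(delta, zeta) = P(Delta >= delta, Z >= zeta) for delta = 2, zeta = 0.8,
# truncating the upper limit of Delta at 50 as a stand-in for infinity
q_point <- adaptIntegrate(f_DZ_gg, lowerLimit=c(2, 0.8),
                          upperLimit=c(50, sqrt(2)),
                          a1=2, b1=1, c1=1.5, a2=2, b2=1, c2=1.5)$integral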

The marginal GG method also fits $X_1$ and $X_2$ independently to GG distributions, but it also fits $\Delta = X_1 - X_2$ to an asymmetric Laplace distribution in the same way the marginal exponential method did. It creates the same confidence intervals to test $\log\kappa = 0$ and $\theta = 0$ as the marginal exponential method, and similarly sets $q = 1$ for all of the points within $k$ (default of $1$) standard deviations of the expected mean of $\Delta$. For the remaining points, we use the maximum likelihood estimates of the GG distributions' parameters to numerically integrate

$$ \begin{array}{rcl} q(\zeta) & = & \mathbb{P}(Z \geq \zeta) \\ & = & \int_{\zeta}^{\sqrt{2}}f_Z(b|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2})\textrm{d}b \\ & = & \int_{\zeta}^{\sqrt{2}}\int_{-\infty}^{\infty} f_{\Delta Z}(a,b|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2})\textrm{d}a\textrm{d}b, \end{array} $$

using^3^

$$ f_Z(\zeta|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2}) = \int_{-\infty}^\infty f_{\Delta Z}(a,\zeta|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2}) \textrm{d}a. $$

This function has parameters for the $p$-value used to test the significance of $\log\kappa$ and $\theta$, for $k$, and for the number of cores for parallel to use, with the same default values as before. We can choose between the joint and marginal GG methods using the same goals or preferences that we used when choosing between the two exponential methods. The choice of either GG method over either exponential method comes when we desire more parameters to fit messier data and have the computational resources for the task.
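
Continuing the sketch from the joint method, the marginal density of $Z$ (and then $q(\zeta)$) could be obtained with two nested one-dimensional integrations over the same hypothetical f_DZ_gg, again with truncated limits standing in for infinite ones:

# Marginal density of Z, integrating Delta out of f_DZ_gg from the sketch above
f_Z_gg <- function(z, a1, b1, c1, a2, b2, c2){
  integrate(function(d) sapply(d, function(dd)
    f_DZ_gg(c(dd, z), a1, b1, c1, a2, b2, c2)),
    lower=-50, upper=50)$value #Truncated stand-in for (-Inf, Inf)
}

# q(zeta) = P(Z >= zeta) for zeta = 0.8, with the same hypothetical parameters
q_zeta <- integrate(function(z) sapply(z, f_Z_gg,
                                       a1=2, b1=1, c1=1.5, a2=2, b2=1, c2=1.5),
                    lower=0.8, upper=sqrt(2))$value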

In the original paper, we also presented a simpler method that fits an asymmetric Laplace distribution to $\Delta$ and either a Weibull or log-normal distribution to $Z$ for the points outside of the middle band. We can call this method the "original" or "asymmetric Laplace-Weibull" method. I wrote a function for this procedure called outlier_DZ, but it depends on the getOutliersI function in the extremevalues package. That package in turn depends on tcltk, which is now part of base R and cannot be properly installed as a dependency of extremevalues. I was not able to get replicateOutliers to work on other computers or on Travis-CI with this dependency, so I removed outlier_DZ altogether. Here it is, along with all of its roxygen2 documentation:

#' Original Asymmetric Laplace-Weibull Method for Outlier Detection
#' 
#' @description This function implements the original method for outlier detection among replicated data. It first fits the absolute difference Delta between two replicates to an Asymmetric Laplace Distribution using MLE. Then among the points outside of some central band, we fit their values for the coefficient of variation Zeta to a Weibull (or lognormal) distribution (also using MLE). The points from this set with significantly large Zeta values are outliers.
#' @param D The absolute difference (Delta) between two vectors of (positive) replicated data: D = X_1 - X_2
#' @param Z The coefficient of variation (Zeta) between two vectors of (positive) replicated data: Z = sqrt(2) * abs(X_1 - X_2) / (X_1 + X_2)
#' @param type The distribution to which we fit Zeta. By default this value is "weibull", but we can also set it to "lognormal"
#' @param p_theta We use the (1-p_theta)*100\% two-sided confidence interval for theta in Delta = X_1 - X_2 + theta to determine if there is a significant translation of the absolute difference Delta. If this interval contains 0, then we set theta = 0. We set p_theta = 0.05 by default
#' @param p_kappa We use the (1-p_kappa)*100\% two-sided confidence interval for the asymmetry parameter kappa in the Asymmetric Laplace Distribution to which we fit Delta. If this interval for log(kappa) contains 0, then we set kappa = 1 and use a Symmetric Laplace Distribution for Delta. We set p_kappa = 0.05 by default
#' @param k The number of standard deviations about the center (mean) of the Asymmetric Laplace Distribution for Delta that we use to define the "central band." We set k = 1 by default
#' @param t We remove the top t*100\% of points when fitting the coefficient of variation Zeta values to a Weibull or lognormal distribution, so that these points do not cause problems when fitting the distribution. We set t = 0.01 by default
#' @param b We remove the bottom b*100\% of points when fitting the coefficient of variation Zeta values to a Weibull or lognormal distribution, so that these points do not cause problems when fitting the distribution. We set b = 0.01 by default
#' @param N_u After we fit the Zeta values for the correct subset of points to a Weibull or lognormal distribution, we return as outliers all the points whose Zeta values are greater than the level above which we would not expect to find N_u*100\% of the points. This procedure is how the underlying function getOutliersI in the package extremevalues works. We set N_u = 0.01 by default
#' @param N_l After we fit the Zeta values for the correct subset of points to a Weibull or lognormal distribution, we return as outliers all the points whose Zeta values are less than the level below which we would not expect to find N_l*100\% of the points. This procedure is how the underlying function getOutliersI in the package extremevalues works. We set N_l = 0.01 by default
#' @return A numerical vector of equal length to the input D and Z vectors, with values 0 for points in the central band, 1 for points in the "wings" with small Zeta, and 2 for outliers
#' @import VGAM
#' @import extremevalues
#' @examples
#' # Assume X_1 and X_2 are positive data vectors of the same length. These are the replicates
#' df <- data.frame(X_1=Sim_GG$X_1, X_2=Sim_GG$X_2)
#' df$D <- df$X_1 - df$X_2
#' df$Z <- sqrt(2) * abs(df$X_1 - df$X_2) / (df$X_1 + df$X_2)
#' df$outlier_status <- outlier_DZ(df$D, df$Z)
#' @export
outlier_DZ <- function(D, Z, type="weibull", p_theta=0.05,
                       p_kappa=0.05, k=1, t=0.01, b=0.01, N_u=0.01, N_l=0.01){
  if(length(D) != length(Z)){
    stop("The lengths of D and Z must be equal.")
  }

  # First fit D to Asymmetric Laplace
  fit_ala <- vglm(as.matrix(D) ~ 1, alaplace3, trace=TRUE, crit="c",
                  control = vglm.control(maxit=50))

  CI_theta <- Coef(fit_ala)[1] + summary(fit_ala)@coef3[1,2] * qnorm(1-p_theta/2) *
    c(-1,1) #95% confidence interval for theta using approximate normal for MLE
  CI_kappa <- log(Coef(fit_ala)[3]) + summary(fit_ala)@coef3[3,2] *
    qnorm(1-p_kappa/2) * c(-1,1)
  # 95% confidence interval for log(kappa) using approximate normal for MLE

  if(CI_theta[1] > 0 || CI_theta[2] < 0){ #If there IS a significant location parameter
    theta_est <- Coef(fit_ala)[1] #Make sure to subtract this later when plugging data into F_exp
  }else{
    theta_est <- 0 #This way we always have a value to subtract from D
  }
  # Estimate kappa, mu, and standard deviation
  if(CI_kappa[1] > 0 || CI_kappa[2] < 0){ #If there IS significant asymmetry
    kappa_est <- Coef(fit_ala)[3]
    mu_est <- Coef(fit_ala)[2] * (1/kappa_est - kappa_est) / sqrt(2)
    sd_est <- sqrt(Coef(fit_ala)[2]^2 + mu_est^2)
  }else{
    kappa_est <- 1 #This way we always have a value of kappa, even if we won't use it
    mu_est <- 0 #Set equal to 0 to avoid numerical imprecision
    sd_est <- Coef(fit_ala)[2] #Set equal to this to avoid numerical imprecision
  }

  lap_range <- mu_est + theta_est + k * sd_est * c(-1,1) #Range of values to use for D

  df_temp <- data.frame(D=D, Z=Z, D_out=0, Z_out=0) #Initialize
  row.names(df_temp) <- 1:nrow(df_temp) #Explicitly label the rows
  df_temp$D_out <- as.numeric( #1 for TRUE and 0 for FALSE
    df_temp$D < lap_range[1] | df_temp$D > lap_range[2]
  )

  df_sub <- df_temp[df_temp$D_out==1,] #Subset of points with outlying D

  if(type=="weibull"){ #If using a Weibull distribution for Z
    ext_out <- getOutliersI(y=df_sub$Z, rho=c(N_u,N_l),
                            FLim=c(b,1-t), distribution="weibull")
  }else if(type=="lognormal"){
    ext_out <- getOutliersI(y=df_sub$Z, rho=c(N_u,N_l), 
                            FLim=c(b,1-t), distribution="lognormal")
  }else{
    stop("Distribution for Z must be \"weibull\" or \"lognormal\".")
  }

  df_sub$ext_out <- FALSE #Initialize
  df_sub$ext_out[ext_out$iLeft] <- TRUE #Small values of Z
  df_sub$ext_out[ext_out$iRight] <- TRUE #Large values of Z

  df_temp$Z_out[as.numeric(rownames(df_sub[df_sub$ext_out,]))] <- 1

  df_temp$D_out + df_temp$Z_out #Return combined measure of outlier status
}

6. More Than $2$ Replicates

The original paper presents adaptations of these methods for when there are more than $2$ replicates.^1^ We use the same $(\Delta,Z)$-methods from this package for every pairwise comparison of replicates. Since we want to be conservative with keeping data, we should be liberal with removing outliers. To every set of $n$ replicated data points, we assign the minimum $q$ of the ${n\choose2}$ pairwise comparisons. We consider an entire set of replicates to be an outlier if this minimum $q_{min}$ is below some cutoff $q^*$, which is the same as saying that the entire set is an outlier if any $2$ measurements are significantly different from each other. We can adjust for these multiple comparisons in a similar way that we would for $p$-values in significance testing: by using a smaller cutoff for significance or outlier status. For example, if we would use $q^*$ for a single pairwise comparison, we should try $\frac{q^*}{n\choose2}$ as a cutoff for $q_{min}$ for each set of replicates, which is analogous to a Bonferroni correction to a $p$-value.^3^
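
As a sketch of this procedure for triplicates (df3 is a hypothetical data.frame with positive columns X_1, X_2, and X_3):

# Hypothetical triplicates: df3 has columns X_1, X_2, X_3
pair_idx <- combn(3, 2) #The 3-choose-2 pairwise comparisons
q_mat <- apply(pair_idx, 2, function(idx){
  X_a <- df3[[paste0("X_", idx[1])]]
  X_b <- df3[[paste0("X_", idx[2])]]
  q_exp_joint_DZ(X_a - X_b, sqrt(2) * abs(X_a - X_b) / (X_a + X_b))
})
df3$q_min <- apply(q_mat, 1, min) #Minimum q over the pairwise comparisons
q_star <- 0.001 #Cutoff we would use for a single comparison
df3$outlier <- df3$q_min < q_star / ncol(pair_idx) #Bonferroni-style adjustment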

7. Citations

1. Smith MS, Devarajan K?

2. Kotz S, Kozubowski TJ, Podgorski K. The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Boston: Birkhäuser; 2001.

3. Rice JA. Mathematical Statistics and Data Analysis: Third Edition. Belmont, CA: Thomson Higher Education; 2007.

4. Kotz S, Johnson NL. Distributions in Statistics Volume 2: Continuous Univariate Distributions-1. New York: Wiley; 1970.

5. Kotz S, Johnson NL. Distributions in Statistics Volume 3: Continuous Univariate Distributions-2. New York: Wiley; 1970.

6. Rodríguez G. Parametric Survival Models. Princeton University; 2010. http://data.princeton.edu/pop509/ParametricSurvival.pdf.

7. Stacy EW. A Generalization of the Gamma Distribution. The Annals of Mathematical Statistics; 1962. https://www.jstor.org/stable/2237889.

8. Hubbard JH, Hubbard BB. Vector Calculus, Linear Algebra, and Differential Forms: 4th Edition. Ithaca, NY: Matrix Editions; 2009.

8. R Packages Used

citation("devtools")

citation("flexsurv")

citation("ggplot2")

citation("VGAM")

citation("viridis")

citation("gridExtra")

library(cubature)
citation("cubature")

library(parallel)
citation("parallel")

