In this vignette, we give a guide for using the replicateOutliers
package. Please see {CITE PAPER}
for more details and derivations.^1^
The setting for these statistical methods is when we have replicated (duplicate, triplcate, etc.) positive data, called "replicates," and want to determine which sets of replicates are too different from each other to be useful. We developed these methods in the context of biological data obtained at Fox Chase Cancer Center: inhibition of kinases by different ligands, proteomics, and RNA-seq. We can easily imagine other contexts. In public health, physicians and epidemiologists running a screening program should repeat measurements on each participant, and participants whose measurements are significantly different from each other should be flagged for another, conclusive measurement. If an engineer is building a measurement device, they should test that the device can measure the same values multiple times. In any application, assuming the observed variables follow certain positive distributions, we can get a sense of the replicability of the measurement technique by seeing how many sets of replicates in this "well-behaved" set get flagged as outliers.
We will reproduce the data from the paper {CITE PAPER} that we generated using generalized gamma (GG)
distributions, which I included in this package as the Sim_GG
data.frame
.^1^ We created this data in
this way because it looks like a volcano plot (turned on its side) usually seen when comparing
fold-change to significance. We then look for outliers in Sim_GG
using the joint exponential and
marginal exponential methods and describe how we would use the remaining $3$ methods. We lastly discuss
briefly how we would analyze data with multiple replicates.
In any context, we assume we have two repeated (duplicated) measurements that are i.i.d.: $X_1$ and $X_2$. The useful variables we want for examining their difference are their absoulte difference: $$ \Delta = X_1 - X_2 $$ and their relative difference: $$ Z = \frac{\sqrt{2}|X_1-X_2|}{X_1+X_2}. $$ The variable $Z$ (uppercase zeta) is the coefficient of variation ($CV$) between the two replicates, which is equal to their mean divided by their standard deviation. Because $0 \leq |X_1-X_2| \leq X_1+X_2$ for $X_1 \geq 0, \ X_2 \geq 0,$ we know $Z \in [0,1]$.
When we look for outliers among duplicated data, we want to look at the plot of $\Delta$ and $Z$ between
the two replicates. Such a plot often looks like a volcano plot seen in biology. I created the Sim_GG
data to look like a sideways volcano plot by adding GG distributions with different parameters and then
fiddling with the parameters until the $(\Delta,Z)$-plot looked right. I will repeat the steps here, but
the plot may look different depending on the starting values.
rm(list=ls()) set.seed(1729) # Set the seed so we can replicate the random results, using Ramanujan's number for the seed # I removed some of the steps I used in making the original Sim_GG data.frame, so this seed will not # exactly match the one used to make Sim_GG # library(devtools) #For document # document() # library(replicateOutliers) #To get the functions for these methods library(flexsurv) #For GG functions library(ggplot2) #For ggplot plotting functions n_total <- 10000 #Number of data points. Doing full simulation df <- data.frame( section=c(rep(0, 0.2*n_total), rep(1, 0.39*n_total), rep(2, 0.39*n_total), rep(3, 0.01*n_total), rep(4, 0.01*n_total)), X_1=c(rexp(0.2*n_total, 1/5), rgengamma.orig(0.39*n_total, shape=1.1, scale=1.6, k=120), rgengamma.orig(0.39*n_total, shape=0.95, scale=2/3, k=200), rgengamma.orig(0.01*n_total, shape=0.898, scale=2/3, k=100), rgengamma.orig(0.01*n_total, shape=0.89, scale=1.6, k=125)), X_2=c(rexp(0.2*n_total, 1/5), rgengamma.orig(0.39*n_total, shape=0.95, scale=2/3, k=200), rgengamma.orig(0.39*n_total, shape=1.1, scale=1.6, k=120), rgengamma.orig(0.01*n_total, shape=0.89, scale=1.6, k=125), rgengamma.orig(0.01*n_total, shape=0.898, scale=2/3, k=100)) ) df$D <- df$X_1 - df$X_2 df$Z <- sqrt(2) * abs(df$X_1 - df$X_2) / (df$X_1 + df$X_2) head(df) ggplot(df, aes(x=D, y=Z, color=section)) + geom_point() + ggtitle("Volcano Plot Using Generalized Gamma Distribution\n(New Simulated Data)")
In this plot, we have labelled by color the "sections" of the data: $0$ corresponds to the middle band, $1$ and $2$ correspond to the "wings," and $3$ and $4$ correspond to the points with large $\Delta$ and large $Z$. We would expect our methods, if working correctly, to identify sections $3$ and $4$ as the outliers.
For comparison, here is a similar plot for the Sim_GG
data.frame
included in this package:
data(Sim_GG) head(Sim_GG) ggplot(df, aes(x=D, y=Z, color=section)) + geom_point() + ggtitle("Volcano Plot Using Generalized Gamma Distribution\n(Sim_GG Data Frame)")
The plots seem to be almost (or maybe even exactly) identical. Notice also that I included more
information in Sim_GG
besides what we made here.
q_exp_joint_DZ
The first function we can use to find outliers in Sim_GG
is q_exp_joint_DZ
, which is my
implementation of the joint exponential method. In this method, we assume $X_1$ and $X_2$ are
independent exponential random variables. We fit $\Delta = X_1 - X_2$ to an asymmetric Laplace
distribution using the VGAM
package. If $D \sim \mathcal{AL^*}(\theta,\kappa,\sigma)$ is an
asymmetric Laplace distribution, it has CDF
$$F_D(d) = \left{ \begin{array}{cl} \frac{\kappa^2}{1+\kappa^2}e^{\frac{\sqrt{2}}{\sigma\kappa}(d-\theta)} & \textrm{if }d < \theta\ 1 - \frac{1}{1+\kappa^2}e^{-\frac{\sqrt{2}\kappa}{\sigma}(d-\theta)} & \textrm{if }d \geq \theta.\ \end{array}\right.$$
The star in $\mathcal{AL^}$ comes from the choice of parameterization used by Kotz et al.^2^ If $\Delta = X_1 - X_2$, $X_1 \sim Exp(\lambda_1)$, and $X_2 \sim Exp(\lambda_2)$, then $\Delta \sim \mathcal{AL^}(0, \sqrt{\frac{\lambda_1}{\lambda_2}}, \sqrt{\frac{2}{\lambda_1\lambda_2}}).$
The vglm
function in VGAM
allows us to get a maximum likelihood estimate (MLE) and a
$95\%$-confidence interval for $\kappa$. If the confidence interval of $\log\kappa$ includes $0$, then
we can let $\kappa = 1$ and $\lambda_1 = \lambda_2 = \lambda = \frac{\sqrt{2}}{\sigma}$, after getting
the MLE for $\sigma$. We call this the symmetric case. If the $95\%$-confidence interval does not
contain $0$, then we let $\lambda_1 \neq \lambda_2$, called the asymmetric case.
Once we know what case we have, we use the estimates for $(\lambda_1,\lambda_2)$ or $\lambda$ calculated from the MLE of $\kappa$ and $\sigma$ as parameters for the joint distribution of $\Delta$ and $Z$. If $X_1, X_2 \sim Exp(\lambda)$ in the symmetric case, then $(\Delta,Z)$ has the joint cumulative density function (CDF)
$$ F_{\Delta Z}(\delta, \zeta) = \left{ \begin{array}{cl} \frac{1-\gamma}{2(1+\gamma)}e^{\frac{\lambda(1+\gamma)\delta}{1-\gamma}} & \textrm{if }\delta \leq 0, 0 \leq \zeta \leq \sqrt{2}\ & \ \frac{1}{2}e^{\lambda \delta} & \textrm{if }\delta \leq 0, \zeta > \sqrt{2}\ & \ \frac{1}{2}\lambda \delta e^{-\lambda \frac{\sqrt{2}\delta}{\zeta}} + (\frac{1}{2} - \frac{1}{\gamma+1})(-1 + e^{-\lambda \frac{\sqrt{2}\delta}{\zeta}} + \lambda \frac{\sqrt{2}\delta}{\zeta} e^{-\lambda \frac{\sqrt{2}\delta}{\zeta}}) + \frac{1-\gamma}{2(1+\gamma)} & \textrm{if }\delta > 0, 0 \leq \zeta \leq \sqrt{2}\ & \ 1 - \frac{1}{2}e^{-\lambda \delta} & \textrm{if }\delta > 0, \zeta > \sqrt{2}\ & \ 0 & \textrm{if }\zeta < 0,\ \end{array}\right. $$
where $\gamma = \frac{\sqrt{2}-\zeta}{\sqrt{2}+\zeta}$. If $X_1 \sim Exp(\lambda_1)$ and $X_2 \sim Exp(\lambda_2)$ with $\lambda_1 \neq \lambda_2$ in the asymmetric case, then $(\Delta,Z)$ has the joint CDF
$$ F_{\Delta Z}(\delta, \zeta) = \left{ \begin{array}{cl} \frac{\lambda_1\lambda_2(1-\gamma)e^{\frac{(\lambda_1\gamma+\lambda_2)\delta}{1-\gamma}}}{(\lambda_1\gamma+\lambda_2)(\lambda_1+\lambda_2)} & \textrm{if }\delta \leq 0, 0 \leq \zeta \leq \sqrt{2}\ & \ \frac{\lambda_1e^{\lambda_2\delta}}{\lambda_1+\lambda_2} & \textrm{if }\delta \leq 0, \zeta > \sqrt{2}\ & \ -\frac{2\lambda_1\lambda_2}{\lambda_1^2-\lambda_2^2}e^{-((1+\frac{\sqrt{2}}{\zeta})\lambda_1+(\frac{\sqrt{2}}{\zeta}-1)\lambda_2)\frac{\delta}{2}} - \frac{\lambda_1\lambda_2(1+\gamma)}{(\lambda_1+\lambda_2\gamma)(\lambda_2-\lambda_1)}e^{-\frac{(\lambda_1+\lambda_2\gamma)\sqrt{2}\delta}{\zeta(\gamma+1)}}\ +\ \frac{\lambda_1\lambda_2(1-\gamma^2)}{(\lambda_1+\lambda_2\gamma)(\lambda_1\gamma+\lambda_2)} & \textrm{if }\delta > 0, 0 \leq \zeta \leq \sqrt{2}\ & \ 1 - \frac{\lambda_2}{\lambda_1+\lambda_2}e^{-\lambda_1\delta} & \textrm{if }\delta > 0, \zeta > \sqrt{2}\ & \ 0 & \textrm{if }\zeta < 0.\ \end{array}\right. $$
The joint CDFs represent $F_{\Delta Z}(\delta,\zeta) = \mathbb{P}(\Delta \leq \delta, \ Z \leq \zeta)$. We define the outlier probability $q$ (for the joint exponential method) to be
$$ q(\delta, \zeta) = \left{ \begin{array}{rcll} \mathbb{P}(\Delta \leq \delta, \ Z \geq \zeta) & = & F_{\Delta}(\delta) - F_{\Delta Z}(\delta,\zeta) & \textrm{if }\delta \leq 0\ \mathbb{P}(\Delta \geq \delta, \ Z \geq \zeta) & = & 1 - F_{\Delta}(\delta) - F_Z(\zeta) + F_{\Delta Z}(\delta,\zeta) & \textrm{if }\delta > 0\ \end{array} \right., $$
so we also need to know the marginal CDFs $F_{\Delta}(\delta)$ and $F_Z(\zeta)$. The marginal CDFs are related to the joint CDF by $F_{\Delta}(\delta) = \lim_{\zeta\to\infty} F_{\Delta Z}(\delta,\zeta)$ and $F_Z(\zeta) = \lim_{\delta\to\infty} F_{\Delta Z}(\delta,\zeta)$.^3^ For the symmetric case, the marginal CDFs are
$$ F_{\Delta}(\delta) = \left{ \begin{array}{cl} \frac{1}{2}e^{\lambda \delta} & \textrm{if }\delta \leq 0\ 1 - \frac{1}{2}e^{-\lambda \delta} & \textrm{if }\delta > 0\ \end{array}\right. $$
and
$$ F_Z(\zeta) = \left{ \begin{array}{cl} 0 & \textrm{if }\zeta < 0\ \frac{1-\gamma}{1+\gamma} = \frac{\zeta}{\sqrt{2}} & \textrm{if }0 \leq \zeta \leq \sqrt{2}\ 1 & \textrm{if }\zeta > \sqrt{2}.\ \end{array}\right. $$
In the asymmetric case, the marginal CDFs are
$$ F_{\Delta}(\delta) = \left{ \begin{array}{cl} \frac{\lambda_1}{\lambda_1+\lambda_2}e^{\lambda_2\delta} & \textrm{if }\delta \leq 0\ 1 - \frac{\lambda_2}{\lambda_1+\lambda_2}e^{-\lambda_1\delta} & \textrm{if }\delta > 0,\ \end{array}\right. $$
and
$$ F_Z(\zeta) = \left{ \begin{array}{cl} 0 & \textrm{if }\zeta < 0\ \frac{\lambda_1\lambda_2(1-\gamma^2)}{(\lambda_1+\lambda_2\gamma)(\lambda_1\gamma+\lambda_2)} & \textrm{if }0 \leq \zeta \leq \sqrt{2}\ 1 & \textrm{if }\zeta > \sqrt{2}.\ \end{array}\right. $$
Notice that $\Delta$ is marginally distributed as a symmetric (asymmetric) Laplace random variable in the symmetric (asymmetric) case, as we would expect. We now know that $Z$ is marginally uniformly distributed over $[0,\sqrt{2}]$ in the symmetric case, and that its marginal distribution in the asymmetric case appears to be a new beast.
The function q_exp_joint_DZ
does all the steps above to calculate $q(\delta,\zeta)$ for each point of
$(\Delta,Z)$-data: fitting the absolute difference $\Delta$ to an asymmetric Laplace distribution,
determining whether we are in the symmetric or asymmetric case, calculating the MLE for $\lambda$ or
$(\lambda_1,\lambda_2)$, and plugging these MLE into the joint and marginal CDFs to get
$q(\delta,\zeta)$ for every point. It also checks to see if there is an offset to $\Delta$ by some
parameter $\theta \in \mathbb{R}$, again using the $95\%$-confidence interval to test whether
$\theta \neq 0$ is significant. There are parameters for q_exp_joint_DZ
to change the $p$-values for
these tests, but we can leave them as the default values of $0.05$.
Let's apply q_exp_joint_DZ
to our simulated data and color the plot by the $q$-value. The color scheme
comes from the package viridis
and is supposed to be easier for people who are colorblind to decipher.
library(viridis) #For scale_colour_viridis in ggplot df$q_exp_joint <- q_exp_joint_DZ(df$D, df$Z) ggplot(df, aes(x=D, y=Z, color=q_exp_joint)) + scale_colour_viridis() + geom_point() + labs(color="q-Value") + ggtitle("Joint Exponential Method")
To find the outliers, we define a cutoff value $q^$ for probability and consider any points with $q<q^$ to be outliers. We do so because these points have such low probability of occurring given our assumption of independently distributed exponential data that something must be wrong with them.
We do not know how many outliers we should be looking for, but it is useful to try different cutoffs and
see where the outliers are on the $(\Delta,Z)$-plot and how many there are. You can place a bunch of
ggplot
objects next to each other nicely using the grid.arrange
function in the package gridExtra
.
library(gridExtra) #For grid.arrange ej_1 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.05))) + geom_point() + labs(color=expression(q<0.05)) + ggtitle("Joint Exponential") ej_2 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.01))) + geom_point() + labs(color=expression(q<0.01)) + ggtitle("Joint Exponential") ej_3 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.005))) + geom_point() + labs(color=expression(q<0.005)) + ggtitle("Joint Exponential") ej_4 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.001))) + geom_point() + labs(color=expression(q<0.001)) + ggtitle("Joint Exponential") ej_5 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.0005))) + geom_point() + labs(color=expression(q<5%*%10^-4)) + ggtitle("Joint Exponential") ej_6 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_joint<0.0001))) + geom_point() + labs(color=expression(q<1%*%10^-4)) + ggtitle("Joint Exponential") grid.arrange(ej_1, ej_2, ej_3, ej_4, ej_5, ej_6, nrow=3, ncol=2) sum(df$q_exp_joint < 0.05) / n_total #0.0415 sum(df$q_exp_joint < 0.01) / n_total #0.0232 sum(df$q_exp_joint < 0.005) / n_total #0.0182 sum(df$q_exp_joint < 0.001) / n_total #0.0035 sum(df$q_exp_joint < 5e-4) / n_total #7e-04 sum(df$q_exp_joint < 1e-4) / n_total #0
If we prioritize removing points in the regions with large $\Delta$ and $Z$ while still keepking the tops of the middle bands, then we should use a cutoff between $q^ = 510^{-4}$ and $q^ = 0.001$, giving us between $7$ ($0.07\%$) and $35$ ($0.35\%$) outliers. We could be stricter with the non-outliers if we want and use a larger $q^$, but then some points get marked as outliers at the top of the middle band.
q_exp_marg_DZ
The function q_exp_marg_DZ
is my implementation of the marginal exponential method. Instead of
using the joint CDF of $\Delta$ and $Z$, we use each one's marginal CDF independently.
This method starts out the same way as the joint exponential method did. We assume $X_1$ and $X_2$ are
independent exponential random variables, possibly with the same parameter. We fit the absolute
difference $\Delta = X_1 - X_2$ to an asymmetric Laplace distribution using the vglm
function in
VGAM
, create confidence intervals for $\kappa$ and $\theta$, determine whether the data is asymmetric
and/or shifted by a constant $\theta$ for $\Delta$, and use our resulting MLEs to calculate the fitted
parameters $\lambda$ or $(\lambda_1,\lambda2$). From here, this method diverges.
If $D \sim \mathcal{AL^*}(\theta,\kappa,\sigma)$, then the expected value of $D$ is $E[D] = \mu + \theta$, where $\mu = \frac{\sigma}{\sqrt{2}}(\frac{1}{\kappa} - \kappa)$, and its variance is $Var(D) = \sigma^2 + \mu^2$.^2^ Using our MLE for $(\theta,\kappa,\sigma)$, we determine all the points that are within $k$ ($k = 1$ by default) standard deviations away from the expected mean (not the expected mode, which is just $\theta$). We consider these points to be in the middle band and not outliers by definitiion, regardless of what their $Z$-values are. For consistency, we can assign them $q = 1$.
For the remaining points, we use our MLE for $\lambda$ or $(\lambda_1,\lambda_2)$ as parameters for the marginal CDF of $Z$. We then assign
$$q(\zeta) = \mathbb{P}(Z \geq \zeta) = 1 - F_Z(\zeta)$$
for each point, regardless of what their $\Delta$-values are (as long as they are outside the middle band). This method, as opposed to the joint exponential method, prioritizes preserving the middle band and only uses $Z$ for finding outliers outside of there, which can be useful if there is a known tolerance of (absolute) experimental error in the measurements.
We can apply this method to the simulated data too, using the same procedure as before to see how we should pick outliers. The default parameters are $p$-values of $0.05$ for testing $\log\kappa = 0$ and $\theta = 0$ and $k = 1$ standard deviations away from the mean to define the middle band. We will not change them in this example.
df$q_exp_marg <- q_exp_marg_DZ(df$D, df$Z) ggplot(df, aes(x=D, y=Z, color=q_exp_marg)) + scale_colour_viridis() + geom_point() + labs(color="q-Value") + ggtitle("Marginal Exponential Method") mj_1 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_marg<0.7))) + geom_point() + labs(color=expression(q<0.7)) + ggtitle("Marginal Exponential") mj_2 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_marg<0.6))) + geom_point() + labs(color=expression(q<0.6)) + ggtitle("Marginal Exponential") mj_3 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_marg<0.5))) + geom_point() + labs(color=expression(q<0.5)) + ggtitle("Marginal Exponential") mj_4 <- ggplot(df, aes(x=D, y=Z, color=(q_exp_marg<0.4))) + geom_point() + labs(color=expression(q<0.4)) + ggtitle("Marginal Exponential") grid.arrange(mj_1, mj_2, mj_3, mj_4, nrow=2, ncol=2) sum(df$q_exp_marg < 0.7) / n_total #0.0263 sum(df$q_exp_marg < 0.6) / n_total #0.0199 sum(df$q_exp_marg < 0.5) / n_total #0.0131 sum(df$q_exp_marg < 0.4) / n_total #0.0012
We should use cutoffs that only leave outliers in the regions with high $Z$, not in the wings. We therefore choose $q^*$ between $0.4$ and $0.5$, for $131$ ($1.31\%$) or $12$ ($0.12\%$) outliers.
There are $2$ remaining main functions in this package. They are generalizations of the exponential methods to GG-distributed data. The GG distribution a broad family of distributions that includes the exponential, gamma, Weibull, $\chi^2$, $\chi$, half-normal, Rayleigh, and Maxwell-Boltzmann distributions for different parameter values, among many others.^3,4,5,6,7^ This family is useful to us because with so many parameters to fit, we can capture many different shapes of $(\Delta,Z)$-plots. What we lose with the GG methods, however, is computational ease. Both of them required parallelized computations on the Fox Chase Cancer Center High-Performance Cluster to be used in the original paper.^1^ As a result, we only describe them here, instead of running them.
This package has both a joint GG method and a marginal GG method, implemented in the functions
q_gg_joint_DZ
and q_gg_marg_DZ
, respectively. For the joint method, we fit $X_1$ and $X_2$
individually (not $\Delta$ like we did before) to GG distributions using the flexsurvreg
function in
the package flexsurv
. Using the parameters for each one, we get the
joint probability density function (PDF) for $(\Delta,Z)$ as a variable transform of the one for
$(X_1,X_2)$:^3,6,7,8^
$$ \begin{array}{rcll} f_{\Delta Z}(\delta, \zeta|\alpha_1,\beta_1,c_1,\alpha_2,\beta_2,c_2) & = & f_{X_1X_2}(\Phi{\delta \choose \zeta}|\alpha_1,\beta_1,c_1,\alpha_2,\beta_2,c_2)|\det([\mathbf{D}\Phi{\delta \choose \zeta}])|\ & = & \left{ \begin{array}{cl} \frac{c_1}{\beta_1^{c_1\alpha_1}\Gamma(\alpha_1)}((\frac{\sqrt{2}}{\zeta}+1)\frac{\delta}{2})^{c_1\alpha_1-1}\exp(-((\frac{\sqrt{2}}{\zeta}+1)\frac{\delta}{2\beta_1})^{c_1})\times\ \frac{c_2}{\beta_2^{c_2\alpha_2}\Gamma(\alpha_2)}((\frac{\sqrt{2}}{\zeta}-1)\frac{\delta}{2})^{c_2\alpha_2-1}\exp(-((\frac{\sqrt{2}}{\zeta}-1)\frac{\delta}{2\beta_2})^{c_2})\times\ \frac{\sqrt{2}\delta}{2\zeta^2}\ \textrm{if }\delta \geq 0, 0 \leq \zeta \leq \sqrt{2}\ \ \ \frac{c_1}{\beta_1^{c_1\alpha_1}\Gamma(\alpha_1)}((-\frac{\sqrt{2}}{\zeta}+1)\frac{\delta}{2})^{c_1\alpha_1-1}\exp(-((-\frac{\sqrt{2}}{\zeta}+1)\frac{\delta}{2\beta_1})^{c_1})\times\ \frac{c_2}{\beta_2^{c_2\alpha_2}\Gamma(\alpha_2)}((-\frac{\sqrt{2}}{\zeta}-1)\frac{\delta}{2})^{c_2\alpha_2-1}\exp(-((-\frac{\sqrt{2}}{\zeta}-1)\frac{\delta}{2\beta_2})^{c_2})\times\ (-\frac{\sqrt{2}\delta}{2\zeta^2})\ \textrm{if }\delta < 0, 0 \leq \zeta \leq \sqrt{2}\ & \ & \ 0\ \textrm{else.}\ \end{array}\right. \end{array} $$
Please see the original paper for more details on this PDF.^1^ For the joint method, we use numerical integration to find $$ q(\delta, \zeta) = \left{ \begin{array}{rcll} \mathbb{P}(\Delta \leq \delta, Z \geq \zeta) & = & \int_{-\infty}^{\delta}\int_{\zeta}^{\sqrt{2}}f_{\Delta Z}(a,b|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2})\textrm{d}b\textrm{d}a & \textrm{if }\delta < 0\ \mathbb{P}(\Delta \geq \delta, Z \geq \zeta) & = & \int_{\delta}^{\infty}\int_{\zeta}^{\sqrt{2}}f_{\Delta Z}(a,b|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2})\textrm{d}b\textrm{d}a & \textrm{if }\delta \geq 0,\ \end{array}\right. $$
using the package cubature
for numerical integration and parallel
for running this procedure in
parallel. There are no parameters to specify for q_gg_joint_DZ
besides the number of processor cores
to use for parallel
, which I set to be $1$ less than the total number of cores available.
The marginal GG method also fits $X_1$ and $X_2$ independently to GG distributions, but it also fits $\Delta = X_1 - X_2$ to an asymmetric Laplace distribution in the same way the marginal exponential method did. It does the creates the same confidence intervals to test $\log\kappa = 0$ and $\theta = 0$ as the marginal exponential method did, and similarly sets $q = 1$ for all of the points within $k$ (default of $1$) standard deviations from the expected mean of $\Delta$. For the remaining points, we use the maximum likelihood estimates of the GG distributions's parameters to numerically integrate
$$ \begin{array}{rcll} q(\zeta) & = & \mathbb{P}(Z \geq \zeta)\ & = & \int_{\zeta}^{\sqrt{2}}f_Z(b|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2})\textrm{d}b\ & = & \int_{\zeta}^{\sqrt{2}}\int_{-\infty}^{\infty} f_{\Delta Z}(a,b|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2})\textrm{d}a\textrm{d}b, \end{array} $$
using^3^
$$ f_Z(\zeta|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2}) = \int_{-\infty}^\infty f_{\Delta Z}(a,\zeta|\hat{\alpha_1},\hat{\beta_1},\hat{c_1},\hat{\alpha_2},\hat{\beta_2},\hat{c_2}) \textrm{d}a. $$
This function has the parameters of a $p$-value for testing significance of $\log\kappa$ and $\theta$,
$k$, and the number of cores for parallel
to use, with the same default values as before. We can
choose between using the joint and marginal GG methods using the same goals or preferences that we used
when choosing between the two types of exponential methods. The choice of using either of the GG methods
over either of the exponential methods comes when we desire more parameters to fit messier data and have
the computational resources for the task.
In the original paper, we also presented a simpler method for fitting an asymmetric Laplace distribution
to $\Delta$ and either a Weibull or log-normal distribution to $Z$ for the points outside of the middle
band. We can call this method the "original" or "asymmetric Laplace-Weibull" method. I wrote a
function for this procedure called outlier_DZ
, but it depends on the getOutliersI
function in the
extremevalues
package. That package depends on the package tcltk
, which is now part of base R
and
cannot be properly installed with extremevalues
. I was not able to get replicateOutliers
to work on
other computers or Travis-CI with this dependency, so I removed outlier_DZ
altogether. Here it is,
along with all the roxygen2
documentation:
#' Original Asymmetric Laplace-Weibull Method for Outlier Detection #' #' @description This function implements the original method for outlier detection among replicated data. It first fits the aboslute difference Delta between two replicates to an Asymmetric Laplace Distribution using MLE. Then among the points outside of some central band, we fit their values for the coefficient of variation Zeta to a Weibull (or lognormal) distribution (also using MLE). The points from this set with significantly large Zeta values are outliers. #' @param D The absolute difference (Delta) between two vectors of (positive) replicated data: D = X_1 - X_2 #' @param Z The coefficient of variation (Zeta) between two vectors of (positive) replicated data: Z = sqrt(2) * abs(X_1 - X_2) / (X_1 + X_2) #' @param type The distribution to which we fit Zeta. By default this value is "weibull", but we can also set it to "lognormal" #' @param p_theta We use the (1-p_theta)*100\% two-sided confidence interval for theta in Delta = X_1 - X_2 + theta to determine if there is a significant translation of the absolute difference Delta. If this interval contains 0, then we set theta = 0. We set p_theta = 0.05 by default #' @param p_kappa We use the (1-p_kappa)*100\% two-sided confidence interval for the asymmetry parameter kappa in the Asymmetric Laplace Distribution to which we fit Delta. If this interval for log(kappa) contains 0, then we set kappa = 0 and use a Symmetric Laplace Distribution for Delta. We set p_kappa = 0.05 by default #' @param k The number of standard deviations about the center (mean) of the Asymmetric Laplace Distribution for Delta that we use to define the "central band." We set k = 1 by default #' @param t We remove the top t*100\% of points when fitting the coefficient of variation Zeta values to a Weibull or lognormal distribution, so that these points do not cause problems when fitting the distribution. We set t = 0.01 by default #' @param b We remove the bottom b*100\% of points when fitting the coefficient of variation Zeta values to a Weibull or lognormal distribution, so that these points do not cause problems when fitting the distribution. We set b = 0.01 by default #' @param N_u After we fit the Zeta values for the correct subset of points to a Weibull or lognormal distribution, we return as outliers all the points whose Zeta values are greater than the level above which we would not expect to find N_u*100\% of the points. This procedure is how the underlying function getOutliersI in the package extremevalues works. We set N_u = 1 by default #' @param N_l After we fit the Zeta values for the correct subset of points to a Weibull or lognormal distribution, we return as outliers all the points whose Zeta values are less than the level below which we would not expect to find N_l*100\% of the points. This procedure is how the underlying function getOutliersI in the package extremevalues works. We set N_l = 1 by default #' @return A numerical vector of equal length to the input D and Z vectors, with values 0 for points in the central band, 1 for points in the "wings" with small Zeta, and 2 for outliers #' @import VGAM #' @import extremevalues #' @examples #' # Assume X_1 and X_2 are positive data vectors of the same length. These are the replicates #' df <- data.frame(X_1=Sim_GG$X_1, X_2=Sim_GG$X_2) #' df$D <- df$X_1 - df$X_2 #' df$Z <- sqrt(2) * abs(df$X_1 - df$X_2) / (df$X_1 + df$X_2) #' df$outlier_status <- outlier_DZ(df$D, df$Z) #' @export outlier_DZ <- function(D, Z, type="weibull", p_theta=0.05, p_kappa=0.05, k=1, t=0.01, b=0.01, N_u=0.01, N_l=0.01){ if(length(D) != length(Z)){ stop("The lengths of D and Z must be equal.") } # First fit D to Asymmetric Laplace fit_ala <- vglm(as.matrix(D) ~ 1, alaplace3, trace=TRUE, crit="c", control = vglm.control(maxit=50)) CI_theta <- Coef(fit_ala)[1] + summary(fit_ala)@coef3[1,2] * qnorm(1-p_theta/2) * c(-1,1) #95% confidence interval for theta using approximate normal for MLE CI_kappa <- log(Coef(fit_ala)[3]) + summary(fit_ala)@coef3[3,2] * qnorm(1-p_kappa/2) * c(-1,1) # 95% confidence interval for log(kappa) using approximate normal for MLE if(CI_theta[1] > 0 || CI_theta[2] < 0){ #If there IS a significant location parameter theta_est <- Coef(fit_ala)[1] #Make sure to subtract this later when plugging data into F_exp }else{ theta_est <- 0 #This way we always have a value to subtract from D } # Estimate kappa, mu, and standard deviation if(CI_kappa[1] > 0 || CI_kappa[2] < 0){ #If there IS significant asymmetry kappa_est <- Coef(fit_ala)[3] mu_est <- Coef(fit_ala)[2] * (1/kappa_est - kappa_est) / sqrt(2) sd_est <- sqrt(Coef(fit_ala)[2]^2 + mu_est^2) }else{ kappa_est <- 1 #This way we always have a value of kappa, even if we won't use it mu_est <- 0 #Set equal to 0 to avoid numerical imprecision sd_est <- Coef(fit_ala)[2] #Set equal to this to avoid numerical imprecision } lap_range <- mu_est + theta_est + k * sd_est * c(-1,1) #Range of values to use for D df_temp <- data.frame(D=D, Z=Z, D_out=0, Z_out=0) #Initialize row.names(df_temp) <- 1:nrow(df_temp) #Explicitly label the rows df_temp$D_out <- as.numeric( #1 for TRUE and 0 for FALSE df_temp$D < lap_range[1] | df_temp$D > lap_range[2] ) df_sub <- df_temp[df_temp$D_out==1,] #Subset of points with outlying D if(type=="weibull"){ #If using a Weibull distribution for Z ext_out <- getOutliersI(y=df_sub$Z, rho=c(N_u,N_l), FLim=c(b,1-t), distribution="weibull") }else if(type=="lognormal"){ ext_out <- getOutliersI(y=df_sub$Z, rho=c(N_u,N_l), FLim=c(b,1-t), distribution="lognormal") }else{ stop("Distribution for Z must be \"weibull\" or \"lognormal\".") } df_sub$ext_out <- FALSE #Initialize df_sub$ext_out[ext_out$iLeft] <- TRUE #Small values of Z df_sub$ext_out[ext_out$iRight] <- TRUE #Large values of Z df_temp$Z_out[as.numeric(rownames(df_sub[df_sub$ext_out,]))] <- 1 df_temp$D_out + df_temp$Z_out #Return combined measure of outlier status }
The original paper presents adaptations of this method when there are more than $2$ replicates.^1^ We use the same $(\Delta,Z)$-methods from this package for every pairwise comparison of replicates. Since we want to be conservative with keeping data, we should be liberal with removing outliers. To every set of $p$ data points, we assign the minimum $q$ of the $n\choose2$ pairwise comparisons. We consider an entire set of replicates to be an outlier if this minimum $q_{min}$ is below some cutoff $q^$, which is the same as saying that the entire set is an outlier if any $2$ measurements are significantly different from each other. We can adjust for these multiple comparisons in a similar way that we would for $p$-values in significance testing: by using a smaller cutoff for significance or outlier status. For example, if we would use $q^$ for a single pairwise comparison, we should try $\frac{q^*}{n\choose2}$ as a cutoff for $q_{min}$ for each set of replicates, which is analogous to a Bonferroni Correction to a $p$-value.^3^
1. Smith MS, Devarajan K?
2. Kotz S, Kozubowski TJ, Podgorski K. The Laplace Distribution and Generalization: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Boston: Birkhäuser, 2001.
3. Rice JA. Mathematical Statistics and Data Analysis: Third Edition. Belmont, CA: Thomson Higher Education; 2007.
4. Kotz S, Johnson NL. Distributions in Statistics Volume 2: Continuous Univariate Distributions-1. New York: Wiley; 1970.
5. Kotz S, Johnson NL. Distributions in Statistics Volume 3: Continuous Univariate Distributions-2. New York: Wiley; 1970.
6. German, R. Parametric Survival Models. Princeton: 2010. http://data.princeton.edu/pop509/ParametricSurvival.pdf.
7. Stacy EW. A Generalization of the Gamma Distribution. The Annals of Mathematical Statistics: 1962. https://www.jstor.org/stable/2237889.
8. Hubbard JH, Hubbard BB. Vector Calculus, Linear Algebra, and Differential Forms: 4th Edition. Ithaca, NY: Matrix Editions; 2009.
R
Packages Usedcitation("devtools") citation("flexsurv") citation("ggplot2") citation("VGAM") citation("viridis") citation("gridExtra") library(cubature) citation("cubature") library(parallel) citation("parallel")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.