This vignette walks through an analysis of some simulated data using JADE via the jadeTF
package. For more details about the method see
Morrison, J., Witten, D., & Simon, N. (2016). Joint Adaptive Differential Estimation: A tool for comparative analysis of smooth genomic data types.
In this example we compare the profiles of two groups. Running JADE and cross validating the tuning parameters can be computationally burdensome so the functions in this package are designed to allow opportunities to parallelize tasks. The basic outline of a JADE analysis is
gamma=0
. The smoothing parameters f $\lambda$ will be chosen via cross validation if they aren't provided.gamma
values.gamma
.There are a few sets of smooth profiles built into the package. prof_1
is a matrix containing mean values at 100 evenly spaced positions.
library(jadeTF) library(ggplot2) library(IRanges)
ggplot(data.frame(prof_1), aes(x=1:100)) + geom_line(aes(y=X1)) + geom_line(aes(y=X2), color="blue") + labs(x="Position", y="Profile")
First we will simulate some data from these profiles. In one group we have 10 samples, in the other we have 15. JADE will only require the average value at each position.
set.seed(10000) #Make some data from each group y1 <- replicate(10, rnorm(100, prof_1[,1], 0.5)) y2 <- replicate(15, rnorm(100, prof_1[,2], 0.5)) y <- cbind(rowMeans(y1), rowMeans(y2))
JADE will choose the smoothing parameters using cross validation if they aren't provided
fit0 <- jade_admm(y, gamma=0, ord=2, sample.size = c(10, 15)) ## 1 ..2 ..3 ..4 ..5 .. <-------- Output from running (5-fold) cross validation ## 205.1597
Plot the results to see smooth curves with no fusion:
data(fit0, package = "jadeTF") df <- data.frame(y, fit0$fits); names(df)=c("Y1", "Y2", "F1", "F2") ggplot(df, aes(x=1:100))+ geom_line(aes(y=F1)) + geom_line(aes(y=F2), color="blue") + labs(x="Position", y="Estimated Profile")
Here we've chosen an intermediate value of $\gamma$ which results in some fused and some separated regions. Usually at this point you would have no way of knowing what an interesting value of $\gamma$ might be.
fit1 <- jade_admm(y, gamma=10^(-0.12), ord=2, sample.size = c(10, 15), lambda=fit0$lambda)
The jade_admm
function is the workhorse of the jadeTF
package and executes the ADMM algorithm described in Morrison et al 2016. The parameters theta0
, u.alpha0
, rho.alpha
, u.beta0
, and rho.beta
can be used to start the ADMM algorithm off at the solution for a nearby value of gamma
. If you provide u.alpha0
or u.beta0
be sure to provide the corresponding step size. This ''warm start'' techinique can reduce the number of iterations to reach convergence and will be used when we run a sequence of fits using the jade_path_planned
or jade_path
functions.
The function get_separated_regions
can be used to produce a table listing regions with separation. The function provides two tables - one merged over partition types and one with partition types separated. In the case of $K=2$ these tables are thes same. The tables automatically include a "Chrom" and "Window" collumn which can be useful for large problems that are broken into many windows. The tables can also be written out to text files as .bed files.
res.tab <- get_separated_regions(fit = fit1, new.tol=1e-2) res.tab$separated
The results table can be passed to the plotting funciton plot_jade
to illustrate the separated regions (in blue) and the estimated fits.
plot_jade(fits = fit1$fits, pos = 1:100, y = y, maxwidth = 0.5, sep.tab = res.tab$separated)
In most cases we don't just want the solution for a single value of gamma
. We want a series of solutions for increasing values of gamma
from very small to the point when all the profiles are fused. There are two functions which can be used to do this
jade_path_planned
jade_path
Both are initiated with a fit at gamma=0
. jade_path_planned
will run a series of fits from a specified starting point and increasing by a given step size on the log scale until the path is finished. It might be difficult to know where to start or what step size to use. path_guess
doesn't require this information and will try to fit a specified number of fits which are which have approximately evenly spaced values of $\sum_{i=1}^{K}\Vert \theta_{i}\Vert_{1}$. This can be faster or slower than jade_path_planned
and is a more consistent way to guarantee a desired density of fits along the path. It is especially useful if you are fitting many windows and don't want to individually tinker around with each one to try to find good starting places.
Running a full path can take a while. Partial paths are saved to temp files that can be specified and the full path is always saved to an output file as well as returned.
path <- jade_path(fit0 = fit0, n.fits = 100, out.file = "example_path.RData")
The function jade_path
doesn't require any knowledge about what the critical values of gamma
at which the curves are totally fused at the high end or just begin to fuse at the low end. jade_path
can be slower due to fitting more than the target number of fits - for example the above code takes 11.3 minutes and results in 108 fits rather than 100.
There is an alternate function jade_path_planned
which requres inputting a starting and approximate stopping point but can be faster if the points provided are good guesses. The code below takes 10.6 minutes and fits JADE 101 times but has the advantage of knowing where to start and stop.
path.planned <- jade_path_planned(fit0 = fit0, log.gamma.start = -1, log.gamma.stop= 0.1, n.fits = 100, out.file = "example_path_planned.RData")
Here we use $k$-fold cross validation to select a value of $\gamma$. This involves repeating the previous steps on data sets with some values set to missing. Predictions at the missing values will be used to calculate the cross validation error and select a value of $\gamma$
The cv_fit0
function uses a previous fit on the entire data at gamma=0
and performs an analogus fit on the same data set with missing values introduced. To fit the first fold when we are planning on five fold cross validaiont we would use
fit0.1 <- cv_fit0(orig.fit=fit0, n.folds=5, which.fold=1)
To fit all the folds sequentially which.fold
could be set to 1:5
but if multiple processors are available these can be run simultaneously. The following runs each fold sequentially and saves the fit to a file named "f0.i.RData" where i is the fold.
cv_fit0(orig.fit=fit0, n.folds=5, which.fold=1:5, save.prefix="f0", return.objects=FALSE)
For only the first fold we would run
path.1 <- jade_path(fit0 = fit0.1, n.fits = 100, out.file = "example_path.1.RData")
The function cv_err
computes the cross validation error and determines the fit in the original path with the lowest cross validation error and 1 standard error rule cross validation error.
cv.path.list <- paste("example_path.", 1:5, ".RData", sep="") example_cv <- cv_err_wts(orig.path="example_path.RData", cv.path.list=cv.path.list)
f <- example_path$JADE_fits[[example_cv$cv.1se.l1]] res.tab <- get_separated_regions(fit = f, new.tol=1e-3) plot_jade(fits = f$fits, pos = 1:100, maxwidth = 0.5, sep.tab = res.tab$merged)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.