This vignette walks through an analysis of some simulated data using JADE via the jadeTF package. For more details about the method see

Morrison, J., Witten, D., & Simon, N. (2016). Joint Adaptive Differential Estimation: A tool for comparative analysis of smooth genomic data types.

In this example we compare the profiles of two groups. Running JADE and cross validating the tuning parameters can be computationally burdensome so the functions in this package are designed to allow opportunities to parallelize tasks. The basic outline of a JADE analysis is

  1. Fit JADE at gamma=0. The smoothing parameters f $\lambda$ will be chosen via cross validation if they aren't provided.
  2. (Optional and only available for the two group case) Estimate the variance of the fits using cross validation. These values are used to construct the $W$ matrix.
  3. Fit JADE at a sequence of gamma values.
  4. Repeat the above steps on cross validation data sets in which some data has been removed. Use these fits to select a value of gamma.
  5. Obtain a list of separated regions from the selected JADE fit. Plot the results.

Simulate Some Data

There are a few sets of smooth profiles built into the package. prof_1 is a matrix containing mean values at 100 evenly spaced positions.

library(jadeTF)
library(ggplot2)
library(IRanges)
ggplot(data.frame(prof_1), aes(x=1:100)) +
  geom_line(aes(y=X1)) + 
  geom_line(aes(y=X2), color="blue") + 
  labs(x="Position", y="Profile")

First we will simulate some data from these profiles. In one group we have 10 samples, in the other we have 15. JADE will only require the average value at each position.

set.seed(10000)
#Make some data from each group
y1 <- replicate(10, rnorm(100, prof_1[,1], 0.5))
y2 <- replicate(15, rnorm(100, prof_1[,2], 0.5))
y <- cbind(rowMeans(y1), rowMeans(y2))

Fit the data at $\gamma = 0$

JADE will choose the smoothing parameters using cross validation if they aren't provided

fit0 <- jade_admm(y, gamma=0, ord=2, sample.size = c(10, 15))
## 1 ..2 ..3 ..4 ..5 ..   <-------- Output from running (5-fold) cross validation
## 205.1597    

Plot the results to see smooth curves with no fusion:

data(fit0, package = "jadeTF")
df <- data.frame(y, fit0$fits); 
names(df)=c("Y1", "Y2", "F1", "F2")
ggplot(df, aes(x=1:100))+  
  geom_line(aes(y=F1)) + 
  geom_line(aes(y=F2), color="blue") + labs(x="Position", y="Estimated Profile")

Fit JADE for one particular value of $\gamma$

Here we've chosen an intermediate value of $\gamma$ which results in some fused and some separated regions. Usually at this point you would have no way of knowing what an interesting value of $\gamma$ might be.

fit1 <- jade_admm(y, gamma=10^(-0.12), ord=2, sample.size = c(10, 15), lambda=fit0$lambda)

The jade_admm function is the workhorse of the jadeTF package and executes the ADMM algorithm described in Morrison et al 2016. The parameters theta0, u.alpha0, rho.alpha, u.beta0, and rho.beta can be used to start the ADMM algorithm off at the solution for a nearby value of gamma. If you provide u.alpha0 or u.beta0 be sure to provide the corresponding step size. This ''warm start'' techinique can reduce the number of iterations to reach convergence and will be used when we run a sequence of fits using the jade_path_planned or jade_path functions.

The function get_separated_regions can be used to produce a table listing regions with separation. The function provides two tables - one merged over partition types and one with partition types separated. In the case of $K=2$ these tables are thes same. The tables automatically include a "Chrom" and "Window" collumn which can be useful for large problems that are broken into many windows. The tables can also be written out to text files as .bed files.

res.tab <- get_separated_regions(fit = fit1, new.tol=1e-2)
res.tab$separated

The results table can be passed to the plotting funciton plot_jade to illustrate the separated regions (in blue) and the estimated fits.

plot_jade(fits = fit1$fits, pos = 1:100, y = y, maxwidth = 0.5, sep.tab = res.tab$separated)

Run a path of values of $\gamma$

In most cases we don't just want the solution for a single value of gamma. We want a series of solutions for increasing values of gamma from very small to the point when all the profiles are fused. There are two functions which can be used to do this

Both are initiated with a fit at gamma=0. jade_path_planned will run a series of fits from a specified starting point and increasing by a given step size on the log scale until the path is finished. It might be difficult to know where to start or what step size to use. path_guess doesn't require this information and will try to fit a specified number of fits which are which have approximately evenly spaced values of $\sum_{i=1}^{K}\Vert \theta_{i}\Vert_{1}$. This can be faster or slower than jade_path_planned and is a more consistent way to guarantee a desired density of fits along the path. It is especially useful if you are fitting many windows and don't want to individually tinker around with each one to try to find good starting places.

Running a full path can take a while. Partial paths are saved to temp files that can be specified and the full path is always saved to an output file as well as returned.

path <- jade_path(fit0 = fit0, n.fits = 100, out.file = "example_path.RData")

The function jade_path doesn't require any knowledge about what the critical values of gamma at which the curves are totally fused at the high end or just begin to fuse at the low end. jade_path can be slower due to fitting more than the target number of fits - for example the above code takes 11.3 minutes and results in 108 fits rather than 100.

There is an alternate function jade_path_planned which requres inputting a starting and approximate stopping point but can be faster if the points provided are good guesses. The code below takes 10.6 minutes and fits JADE 101 times but has the advantage of knowing where to start and stop.

path.planned <- jade_path_planned(fit0 = fit0, 
                                  log.gamma.start = -1, log.gamma.stop= 0.1, 
                                  n.fits = 100, out.file = "example_path_planned.RData")

Choose $\gamma$ by cross validation

Here we use $k$-fold cross validation to select a value of $\gamma$. This involves repeating the previous steps on data sets with some values set to missing. Predictions at the missing values will be used to calculate the cross validation error and select a value of $\gamma$

Fit at $\gamma = 0$ for data with missing folds.

The cv_fit0 function uses a previous fit on the entire data at gamma=0 and performs an analogus fit on the same data set with missing values introduced. To fit the first fold when we are planning on five fold cross validaiont we would use

fit0.1 <- cv_fit0(orig.fit=fit0, n.folds=5, which.fold=1)

To fit all the folds sequentially which.fold could be set to 1:5 but if multiple processors are available these can be run simultaneously. The following runs each fold sequentially and saves the fit to a file named "f0.i.RData" where i is the fold.

cv_fit0(orig.fit=fit0, n.folds=5, which.fold=1:5, save.prefix="f0", return.objects=FALSE)

Fit paths for each fold

For only the first fold we would run

path.1 <- jade_path(fit0 = fit0.1, n.fits = 100, out.file = "example_path.1.RData")

Collect all the cross validation runs and choose a value of $\gamma$.

The function cv_err computes the cross validation error and determines the fit in the original path with the lowest cross validation error and 1 standard error rule cross validation error.

cv.path.list <- paste("example_path.", 1:5, ".RData", sep="")
example_cv <- cv_err_wts(orig.path="example_path.RData", cv.path.list=cv.path.list)
f <- example_path$JADE_fits[[example_cv$cv.1se.l1]]
res.tab <- get_separated_regions(fit = f, new.tol=1e-3)
plot_jade(fits = f$fits, pos = 1:100, maxwidth = 0.5, sep.tab = res.tab$merged)


jean997/jadeTF documentation built on May 18, 2019, 11:44 p.m.