knitr::opts_chunk$set(collapse = TRUE, comment = "#")
This vignette leads you through an analysis using FRET to identify differentially accessible regions from DNase-seq data. The analysis pipeline should be similar for any other data type once the data is in the right format. The steps of the analysis pipeline are:
There are five basic functions in the FRET
package that we will use:
fret_stats
: Calculates test statistics and permutation test statisticschoose_s0
: Selects a reasonable value for $s_0$choose_zmin
: Selects a reasonable value for $z_min$collect_fret_stats
fret_rates
fret_thresholds
You will need:
pos sample1 sample2 sample3 84419286 0 10 3.6 84419289 2 1 1.3 84419292 1.8 7 9
name x batch sample1 1.33 1 sample2 2.85 1 sample3 1.20 2
For this vignette, we will use 5 small data files and a phenotype file that are provided
library(readr) library(dplyr) library(ggplot2) library(tidyr) library(fret) data_files <- system.file("extdata", paste0("data", 1:5, ".txt"), package = "fret") trait_file <- system.file("extdata", "trait.txt", package = "fret")
We can take a look at the first data file and the phenotype file:
dat1 <- read_delim(data_files[1], delim=" ") head(dat1) trait <- read_delim(trait_file, delim=" ") head(trait)
The phenotype file has a binary phenotype x
, and a covariate covar1
. The data file has 30 columns, the first giving the chromosomal position and the rest giving genomic phenotype values for the 29 samples.
For small regions, we can visualize the data by smoothing and plotting. The ksmooth_0
function we use below is a box kernel smoother especially for genomic data where missing positions are assumed to have observed values of 0. We will use this function again later.
dat1_smooth <- data.frame(apply(dat1[,-1], 2, function(y){ksmooth_0(x=dat1$pos, y, bandwidth = 151)})) dat1_smooth$pos <- dat1$pos plt <- dat1_smooth %>% gather("name", "value", -pos) %>% inner_join(., trait) %>% ggplot(.) + geom_line(aes(x=pos, y=value, color=as.factor(x), group=name)) + scale_x_continuous(limits=c(61855000, 61860000)) + scale_color_discrete(name="x") + theme_bw() plt
fret_stats(pheno_file_list = data_files, trait_file = trait_file, trait="x", covariates = c("covar1"), pheno_transformation = log, mode="dry_run", labels=paste0("chr", 1:5), chunksize=2e3) fret_stats(pheno_file_list = data_files, trait_file = trait_file, trait="covar1", mode="dry_run", labels=paste0("chr", 1:5), chunksize=6e3, s0_est_size = 1e4)
fret_stats(pheno_file_list = data_files, trait_file = trait_file, trait="x", #covariates = c("covar1"), #pheno_transformation = log, mode="s0_only", labels=paste0("chr", 1:5), chunksize=1e4, s0_est_size = 5e4) t2 <- system.time(R <- fret_stats(pheno_file_list = data_files, trait_file = trait_file, bandwidth=151, smoother="ksmooth_0", trait="x", cores=7, #covariates = c("covar1"), pheno_transformation = function(x){log(x + 0.2)}, mode="s0_only", labels=paste0("chr", 1:5), s0 = 0.05, chunksize=1e4)) t3 <- system.time(R2 <- fret_stats(pheno_file_list = data_files, trait_file = trait_file, bandwidth=151, smoother="ksmooth_0", trait="x", cores=7, mode="full", labels=paste0("chr", 1:5), s0 = 0.05, zmin=1.18, n_perm = 4, chunksize=1e4, seed=1, temp_prefix = "temp")) file_list <- paste0("temp-chr", 1:5, ".1.RDS") peaks <- stats_to_rates(file_list, seg_type="by_file") peaks <- stats_to_rates(file_list, seg_type="by_chromosome", chromosome=paste0("chr", rep(c(1, 2), c(2, 3)))) peaks <- stats_to_rates(file_list, seg_type="find", chromosome=paste0("chr", rep(c(1, 2), c(2, 3))))
We need to set three constants before we can begin analysis: s0
, the variance inflation constant, zmin
, the smallest threshold value to permit and z0
, the merging level. We do this using a large, hopefully representative chunk of the data. You also might consider using an entire chromosome, especially if you have access to a cluster or multiple cores.
fret_stats
function. This function can calculate Huber or linear regression test statistics. It will also calculate permutation test statistics and smooth the test statistics if desired. For now we are just choosing s0
so we don't need any permutations and we also don't need to smooth yet so we set n.perm=0
and smoother="none"
. The fret_stats
function also has a range
argument that will let you calculate statistics only for positions in a desired range. Below we calculate statistics for the first chunk of datastats1 <- fret_stats(data_files[1], pheno_file, s0=0, n.perm=0, pheno.transformation=NULL, trait=c("x"), covariates=c("covar1"), smoother="none", stat.type="huber", chrom="chr1") names(stats1) head(stats1$sts)
The arguments pheno.transformation
, trait
and covariates
specify the linear model. In this example, at each position we are fitting
$$ pheno = b_{0} + b_{1}x + b_2 \cdot \text{covar1}$$
The model used in this step should match what you plan to use in the analysis.
The object returned (stats
above) is a list that contains the original parameters, the raw test statistics (stats1$sts
).
choose_s0
function which implements the strategy of Tusher, Tibshirani and Chu (2001) for choosing s0.s0 <- with(stats1$sts, choose_s0(beta=Beta, sd=SD)) #s0 <- 0.05
choose_zmin
function which adjusts the statistics using the variance inflation constant we chose in step 2 and then smooths. The minimum threshold, zmin
, will be a specified quantile (default is 90%) of these smoothed test statistics. The smoother used in this step should match what is used in the final analysis. We recomend choosing a bandwidth approximately equal to the size of discoveries you expect to make. The smoother type ksmooth_0
is appropriate for DNase-seq data and other data types where missing data can be interpreted as a phenotype value of 0. The ksmooth
smoother type is appropriate for bisulfite sequencing and other data types where base-pairs with no data should be treated as missing.
zmin <- with(stats1$sts, choose_zmin(beta=Beta, sd=SD, s0=s0, pos=pos, bandwidth=150, smoother="ksmooth_0", zmin_quantile=0.9))
0.3*zmin
.z0 <- 0.3*zmin
This can be a lot of computing and you may wish to break it into smaller than chromosome sized pieces if you can access lots of nodes. The fret_stats
function will break computation into chunksize
-sized chunks where chunksize
gives the number of lines in the phenotype file. If you want to process ony one or a few chunks you can use the which.chunks
argument to specify a list of chunks you wish to process. For each chunk, fret_stats
will write a temporary file. If which.chunks
isn't specified (i.e. we analyze the whole chromosome), then fret_stats
will automatically collate all the temporary files and delete them. Otherwise, you will need to follow up your analysis by calling the collect_fret_stats
function.
By default, fret_stats
will use all available cores. This can be adjusted using the parallel
and cores
options. If no argument is supplied to temp.prefix
a random string of letters will be used.
Here is an example where we process the first three chunks of size 100,000 linesusing 500 permutations :
fret_stats(data_files[1], pheno_file, s0=s0, n.perm=500, zmin=zmin, z0=0.3*zmin, pheno.transformation=NULL, trait=c("x"), covariates=c("covar1"), stat.type="huber", bandwidth=150, smoother="ksmooth_0", chrom="chr1")
This will write three files: ./test_chr1.1.RData
, ./test_chr1.2.RData
and ./test_chr1.3.RData
.
fret_stats
keeps only the information needed to identify peaks from the permuation statistics and discards the statistics themselves to avoid creating enormous objects.
Now we collect all the temporary files into one file:
collect_fret_stats(temp.dir="./", temp.prefix="test_chr1", which.chunk=1:3, out.file="test_chr1.RData", del.temp=TRUE)
This will combine the information from the three temporary files into one file test_chr1.RData
and delete the temporary files. collect_fret_stats
will also automatically determine segment boundaries for the intervals on which the threshold can vary.
Note that these are not the same as boundaries of differential regions.
By default, the minimum interval width is 50 times the smoothing bandwidth. This can be modified with the min.interval.width
option to collect_fret_stats
.
You can also determine new boundaries using the find_segments
function (this is the function called by collect_fret_stats
):
stats <- getobj("test_chr1.RData") seg.bounds <- find_segments(vv = stats$perm.var$var, pos=stats$perm.var$pos, min.length=50*150)
Ok. So now the hard part is done and all we need to do is use the information we have to associate sets of thresholds with false discovery rates. Lets say we have files called "test_chr1.RData", "test_chr2.RData" and "test_chr3.RData". We run
rates <- fret_rates(file.list=paste0("test_chr", 1:3, ".RData"))
Now we can pull out the set of discoveries with fdr < 0.05 or 0.1:
thresh_05 <- fret_thresholds(rates, target.fdr=0.05)
The thresholds object is a list with two elements: threhsolds
which reports how many discoveries, the positive and negative threshold and chromosome for each segment and discoveries
which lists all discoveries made at the desired level.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.