In jean997/fret: Association testing with spatially structured phenotypes

knitr::opts_chunk$set(collapse = TRUE, comment = "#")

Introduction

This vignette leads you through an analysis using FRET to identify differentially accessible regions from DNase-seq data. The analysis pipeline should be similar for any other data type once the data is in the right format. The steps of the analysis pipeline are:

Format the data
Choose constants $s_0$, $z_min$ and $z_0$ using a subset of the data
Calculate test statistics genome-wide; calculate permutation test statistics genome-wde (bulk of computing is here!)
Map test statistic thresholds to false discovery rates
Pull out differential regions

There are five basic functions in the FRET package that we will use:

fret_stats: Calculates test statistics and permutation test statistics
choose_s0: Selects a reasonable value for $s_0$
choose_zmin: Selects a reasonable value for $z_min$
collect_fret_stats
fret_rates
fret_thresholds

Analysis pipeline (in progress)

Format your data

You will need:

Phenotype Files: These files contain the genomic phenotype data You should have one file for each chromosome. Each file should have a header and one line per base-pair. The first column gives the genomic position. Each subsequent column gives the phenotype value for one sample with sample names given in the header. In the case of DNase-seq data these are counts of cleavages (read-ends) at each base-pair. If a base-pair is missing, we will either treat those values as zero (most appropriate for DNase-seq data) or as missing (appropriate for bisulfite sequencing data). This option will be discusesd later. The phenotype file will look something like this:

  pos sample1 sample2 sample3
  84419286 0 10 3.6
  84419289 2 1 1.3
  84419292 1.8 7 9

A trait file: This file should have a header. The first column should be the sample name (corresponding to the header of the phenotype file). Subsequent columns should contain the trait and any covariates you wish to adjust for. The trait file will look something like this:

  name x batch
  sample1 1.33 1
  sample2 2.85 1
  sample3 1.20 2

For this vignette, we will use 5 small data files and a phenotype file that are provided

library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(fret)

data_files <- system.file("extdata", paste0("data", 1:5, ".txt"), package = "fret")
trait_file <- system.file("extdata", "trait.txt", package = "fret")

We can take a look at the first data file and the phenotype file:

dat1 <- read_delim(data_files[1], delim=" ")
head(dat1)
trait <- read_delim(trait_file, delim=" ")
head(trait)

The phenotype file has a binary phenotype x, and a covariate covar1. The data file has 30 columns, the first giving the chromosomal position and the rest giving genomic phenotype values for the 29 samples.

For small regions, we can visualize the data by smoothing and plotting. The ksmooth_0 function we use below is a box kernel smoother especially for genomic data where missing positions are assumed to have observed values of 0. We will use this function again later.

dat1_smooth <- data.frame(apply(dat1[,-1], 2, function(y){ksmooth_0(x=dat1$pos, y, bandwidth = 151)}))
dat1_smooth$pos <- dat1$pos
plt <- dat1_smooth %>% gather("name", "value", -pos) %>% 
       inner_join(., trait) %>% 
       ggplot(.) + geom_line(aes(x=pos, y=value, color=as.factor(x), group=name)) + 
       scale_x_continuous(limits=c(61855000, 61860000)) + 
       scale_color_discrete(name="x") +
       theme_bw()
plt

Choose s0, zmin, and z0

fret_stats(pheno_file_list = data_files, trait_file = trait_file, 
           trait="x", covariates = c("covar1"), pheno_transformation = log,
           mode="dry_run", labels=paste0("chr", 1:5), chunksize=2e3)

fret_stats(pheno_file_list = data_files, trait_file = trait_file, 
           trait="covar1",
           mode="dry_run", labels=paste0("chr", 1:5), chunksize=6e3, s0_est_size = 1e4)

fret_stats(pheno_file_list = data_files, trait_file = trait_file, 
           trait="x", #covariates = c("covar1"), #pheno_transformation = log,
           mode="s0_only", labels=paste0("chr", 1:5), chunksize=1e4, s0_est_size = 5e4)

t2 <- system.time(R <- fret_stats(pheno_file_list = data_files, trait_file = trait_file, 
                                  bandwidth=151, smoother="ksmooth_0", trait="x", cores=7,
                                  #covariates = c("covar1"), pheno_transformation = function(x){log(x + 0.2)}, 
                                  mode="s0_only", labels=paste0("chr", 1:5), s0 = 0.05, 
                                  chunksize=1e4))

t3 <- system.time(R2 <- fret_stats(pheno_file_list = data_files, trait_file = trait_file, 
                                  bandwidth=151, smoother="ksmooth_0", trait="x", cores=7,
                                  mode="full", labels=paste0("chr", 1:5), s0 = 0.05, zmin=1.18,
                                  n_perm = 4, chunksize=1e4, seed=1, temp_prefix = "temp"))
file_list <- paste0("temp-chr", 1:5, ".1.RDS")
peaks <- stats_to_rates(file_list, seg_type="by_file")
peaks <- stats_to_rates(file_list, seg_type="by_chromosome", chromosome=paste0("chr", rep(c(1, 2), c(2, 3))))
peaks <- stats_to_rates(file_list, seg_type="find", chromosome=paste0("chr", rep(c(1, 2), c(2, 3))))

We need to set three constants before we can begin analysis: s0, the variance inflation constant, zmin, the smallest threshold value to permit and z0, the merging level. We do this using a large, hopefully representative chunk of the data. You also might consider using an entire chromosome, especially if you have access to a cluster or multiple cores.

Calculate test statistics for your chosen data chunk using thefret_stats function. This function can calculate Huber or linear regression test statistics. It will also calculate permutation test statistics and smooth the test statistics if desired. For now we are just choosing s0 so we don't need any permutations and we also don't need to smooth yet so we set n.perm=0 and smoother="none". The fret_stats function also has a range argument that will let you calculate statistics only for positions in a desired range. Below we calculate statistics for the first chunk of data

stats1 <- fret_stats(data_files[1], pheno_file, 
                    s0=0, n.perm=0, 
                    pheno.transformation=NULL, trait=c("x"), 
                    covariates=c("covar1"),
                    smoother="none", stat.type="huber", chrom="chr1")
names(stats1)
head(stats1$sts)

The arguments pheno.transformation, trait and covariates specify the linear model. In this example, at each position we are fitting $$ pheno = b_{0} + b_{1}x + b_2 \cdot \text{covar1}$$

The model used in this step should match what you plan to use in the analysis.

The object returned (stats above) is a list that contains the original parameters, the raw test statistics (stats1$sts).

Feed the resulting stats into the choose_s0 function which implements the strategy of Tusher, Tibshirani and Chu (2001) for choosing s0.

s0 <- with(stats1$sts, choose_s0(beta=Beta, sd=SD))
#s0 <- 0.05

Choose zmin using the choose_zmin function which adjusts the statistics using the variance inflation constant we chose in step 2 and then smooths. The minimum threshold, zmin, will be a specified quantile (default is 90%) of these smoothed test statistics. The smoother used in this step should match what is used in the final analysis.

We recomend choosing a bandwidth approximately equal to the size of discoveries you expect to make. The smoother type ksmooth_0 is appropriate for DNase-seq data and other data types where missing data can be interpreted as a phenotype value of 0. The ksmooth smoother type is appropriate for bisulfite sequencing and other data types where base-pairs with no data should be treated as missing.

zmin <- with(stats1$sts, choose_zmin(beta=Beta, sd=SD, s0=s0, 
                                      pos=pos, bandwidth=150, smoother="ksmooth_0", 
                                      zmin_quantile=0.9))

Select $z_0$. We recomend setting $z_0$ to 0.3*zmin.

z0 <- 0.3*zmin

Calculate test statistics and permuation test statistics genome-wide

This can be a lot of computing and you may wish to break it into smaller than chromosome sized pieces if you can access lots of nodes. The fret_stats function will break computation into chunksize-sized chunks where chunksize gives the number of lines in the phenotype file. If you want to process ony one or a few chunks you can use the which.chunks argument to specify a list of chunks you wish to process. For each chunk, fret_stats will write a temporary file. If which.chunks isn't specified (i.e. we analyze the whole chromosome), then fret_stats will automatically collate all the temporary files and delete them. Otherwise, you will need to follow up your analysis by calling the collect_fret_stats function.

By default, fret_stats will use all available cores. This can be adjusted using the parallel and cores options. If no argument is supplied to temp.prefix a random string of letters will be used.

Here is an example where we process the first three chunks of size 100,000 linesusing 500 permutations :

fret_stats(data_files[1], pheno_file, s0=s0, 
           n.perm=500, zmin=zmin, z0=0.3*zmin, 
           pheno.transformation=NULL, trait=c("x"), 
           covariates=c("covar1"),
           stat.type="huber", bandwidth=150, smoother="ksmooth_0", chrom="chr1")

This will write three files: ./test_chr1.1.RData, ./test_chr1.2.RData and ./test_chr1.3.RData.

fret_stats keeps only the information needed to identify peaks from the permuation statistics and discards the statistics themselves to avoid creating enormous objects.

Now we collect all the temporary files into one file:

collect_fret_stats(temp.dir="./", temp.prefix="test_chr1", which.chunk=1:3,
                               out.file="test_chr1.RData", del.temp=TRUE)

This will combine the information from the three temporary files into one file test_chr1.RData and delete the temporary files. collect_fret_stats will also automatically determine segment boundaries for the intervals on which the threshold can vary. Note that these are not the same as boundaries of differential regions. By default, the minimum interval width is 50 times the smoothing bandwidth. This can be modified with the min.interval.width option to collect_fret_stats.

You can also determine new boundaries using the find_segments function (this is the function called by collect_fret_stats):

stats <- getobj("test_chr1.RData")
seg.bounds <- find_segments(vv = stats$perm.var$var, pos=stats$perm.var$pos, min.length=50*150)

Map thresholds to false discovery rates

Ok. So now the hard part is done and all we need to do is use the information we have to associate sets of thresholds with false discovery rates. Lets say we have files called "test_chr1.RData", "test_chr2.RData" and "test_chr3.RData". We run

rates <- fret_rates(file.list=paste0("test_chr", 1:3, ".RData"))

Now we can pull out the set of discoveries with fdr < 0.05 or 0.1:

thresh_05 <- fret_thresholds(rates, target.fdr=0.05)

The thresholds object is a list with two elements: threhsolds which reports how many discoveries, the positive and negative threshold and chromosome for each segment and discoveries which lists all discoveries made at the desired level.

jean997/fret documentation built on May 18, 2019, 11:43 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com