roadmap/scripts/readme.md

Analysis pipeline

Folder structure

Generally there are following folders:

BASE_DIR
|- data
|- enrichedheatmap
|- gviz
|- plots
|- rds
|- rds_methylation
|- scripts

-data raw data from Roadmap or UCSC -enrichedheatmap plots -gviz plots -plots general plots -rds intermediate results -rds_methylation -scripts

Data

All the data files are stored under BASE_DIR/data/ folder.

Preprocess

filter transcriptome

In the analysis, we only look at protein coding genes. So we first filter out genes and transcripts which are not protein coding in the transcriptome annotation file.

cd data
sh ../scripts/filter_protein_coding.sh

A new file gen10_long_protein_coding.gtf is generated in the data/ folder.

build txdb

In the analysis, the transcriptome is stored as a Txdb object. So here we need to build it first.

Rscript generate_txdb.R

A new file gen10_long_protein_coding.sqlite is generated in the data/ folder.

merge methylation and do smoothing

We observed that some of the samples have too many missing methylation values (represented as -1 in the methylation files and CpG coverage files). We first calculate NA rate in each sample:

cd ../scripts
Rscript all_sample_coverage_NA_rate.R

After looking at the plot, we set the cutoff of NA rate to 10% and removed E054 (14%), E070 (%13), E084 (18%) and E085 (41%).

We only look at the methylation on CpG dinucleitide while in the methylation file, methylation rate is measured on the C on both strand. We calcualte the mean methylation rate of the CpG weighted by the coverage. CpG sites are removed if in more than 50% of the samples, they have CpG coverage <= 2. finally we use bsseq to smooth the methylation data.

Following script is executed for each chromosome, e.g.

Rscript make_rds_for_methylation_data.R --chr chr1

BASE_DIR/rds_methylation/*_roadmap_merged_bsseq.rds is generated.

configuration file



jokergoo/epik documentation built on Sept. 28, 2019, 9:20 a.m.