epik: Integrative Analysis and Visualization of Epigenomic Sequencing Data

Import datasets

Basicly, for epigenomic datasets, there are methylation datasets (we only consider whole genome bisulfite seuqencing datasets), expression datasets (e.g. RNA-seq) and ChIP-seq datasets. We assume in the study, methylation and expression are the major datasets that more samples are sequenced and only part of sammples have ChIP-seq datasets.

Since methylation datasets are always huge, they are stored and processed by chromosome. In epik package, users should define how to get the methylation data so that it will be used in further analysis. To define reading methylation datasets is simple that users only need to define a function for methylation_hooks$get_by_chr. The self-defined function only accepts one argument which is the chromosome name and it returnes a list that contains:

gr: a GRanges object which contains positions of CpG sites. Note we only take the start positions
meth: methylation matrix. It is used in most of the analysis. So if you apply smoothing to the methylation, set it to the smoothed values. The column name of this matrix will be used as sample names or sample IDs.
cov: CpG coverage matrix
raw: the unsmoothed methylation values

It should be noted that rows in gr should be sorted and rows in all four elements should be corresponded.

Following code shows how to read methylation data from roadmap datasets. Here the methylation has already been smoothed by bsseq package.

library(GetoptLong)
library(bsseq)
PROJECT_DIR = "/icgc/dkfzlsdf/analysis/B080/guz/epik_roadmap/"

methylation_hooks$get_by_chr = function(chr) {
    obj = readRDS(qq("@{PROJECT_DIR}/rds_methylation/@{chr}_roadmap_merged_bsseq.rds"))
    sample_id = c("E003", "E004", "E005", "E006", "E007", "E011", "E012", "E013", "E016",
              "E024", "E050", "E065", "E066", "E071", "E079", "E094", "E095", "E096",
              "E097", "E098", "E100", "E104", "E105", "E106", "E109", "E112", "E113")
    obj2 = list(gr = granges(obj),
                raw = getMeth(obj, type = "raw")[, sample_id],
                cov = getCoverage(obj, type = "Cov")[, sample_id],
                meth = getMeth(obj, type = "smooth")[, sample_id]
                )
    return(obj2)
}

After you defined get_by_chr hook, you can load data for a chromosome by set_chr():

methylation_hooks$set_chr("chr21")

Then there are following variables you can directly access: methylation_hooks$gr, methylation_hooks$meth, methylation_hooks$cov, methylation_hooks$raw and methylation_hooks$sample_id:

methylation_hooks$gr

## GRanges object with 373473 ranges and 0 metadata columns:
##            seqnames               ranges strand
##               <Rle>            <IRanges>  <Rle>
##        [1]    chr21   [9411552, 9411552]      *
##        [2]    chr21   [9411784, 9411784]      *
##        [3]    chr21   [9412099, 9412099]      *
##        [4]    chr21   [9412376, 9412376]      *
##        [5]    chr21   [9412503, 9412503]      *
##        ...      ...                  ...    ...
##   [373469]    chr21 [48119473, 48119473]      *
##   [373470]    chr21 [48119505, 48119505]      *
##   [373471]    chr21 [48119516, 48119516]      *
##   [373472]    chr21 [48119519, 48119519]      *
##   [373473]    chr21 [48119538, 48119538]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

head(methylation_hooks$meth)

##           E003      E004      E005      E006      E007      E011      E012
## [1,] 0.7142217 0.7872469 0.7511234 0.5207780 0.7184070 0.6581870 0.5192272
## [2,] 0.7084592 0.7791235 0.7426325 0.5219837 0.7105972 0.6497308 0.5177450
## [3,] 0.7009525 0.7682274 0.7314745 0.5236844 0.7003123 0.6388015 0.5161186
## [4,] 0.6946713 0.7588145 0.7220677 0.5252476 0.6916028 0.6297571 0.5150731
## [5,] 0.6918961 0.7545628 0.7178952 0.5259873 0.6877224 0.6257980 0.5147187
## [6,] 0.6855107 0.7445430 0.7082669 0.5278274 0.6787116 0.6167961 0.5142002
##           E013      E016      E024      E050      E065      E066      E071
## [1,] 0.4647443 0.7425690 0.7141583 0.5218743 0.4686218 0.4552843 0.4282101
## [2,] 0.4658386 0.7325431 0.7062145 0.5215680 0.4732506 0.4529363 0.4346531
## [3,] 0.4675165 0.7193164 0.6957982 0.5214453 0.4795770 0.4501663 0.4432953
## [4,] 0.4691907 0.7081275 0.6870251 0.5216328 0.4851741 0.4481370 0.4507999
## [5,] 0.4700247 0.7031553 0.6831328 0.5218156 0.4877494 0.4473363 0.4542125
## [6,] 0.4722093 0.6916657 0.6741405 0.5225146 0.4939534 0.4457523 0.4623402
##           E079      E094      E095      E096      E097      E098      E100
## [1,] 0.4984796 0.5249127 0.4106633 0.4421115 0.3493896 0.2544267 0.4569943
## [2,] 0.5070684 0.5309428 0.4189998 0.4515964 0.3606577 0.2634982 0.4643034
## [3,] 0.5183900 0.5389489 0.4300968 0.4641459 0.3756954 0.2757225 0.4740372
## [4,] 0.5280124 0.5458116 0.4396297 0.4748530 0.3886305 0.2863497 0.4824059
## [5,] 0.5323174 0.5489011 0.4439262 0.4796555 0.3944607 0.2911748 0.4861805
## [6,] 0.5423769 0.5561717 0.4540456 0.4909070 0.4081816 0.3026193 0.4950799
##           E104      E105      E106      E109      E112      E113
## [1,] 0.3858376 0.4606906 0.5422712 0.5123722 0.6406768 0.5194789
## [2,] 0.3960060 0.4659035 0.5476611 0.5160727 0.6413254 0.5250885
## [3,] 0.4095632 0.4728514 0.5548592 0.5211275 0.6422788 0.5325942
## [4,] 0.4212222 0.4788336 0.5610722 0.5255990 0.6431844 0.5390846
## [5,] 0.4264789 0.4815354 0.5638835 0.5276566 0.6436200 0.5420249
## [6,] 0.4388600 0.4879166 0.5705378 0.5326156 0.6447170 0.5489923

head(methylation_hooks$cov)

##      E003 E004 E005 E006 E007 E011 E012 E013 E016 E024 E050 E065 E066 E071
## [1,]   11   22   18    7   29   65   72   52   52   10   41   56   17   75
## [2,]   37   43   35    5   38   68   25   27   34   15   18   81    4   46
## [3,]   22   23    4    0   14    8    8    0    0   24    7   99    0    0
## [4,]   28   32   31    0   25    8    0    0    0   23    0   77    0    0
## [5,]   10    8    7    0    5    0    0    0    0    0    0   42    0    0
## [6,]   11   25    4    4    4    4    0    0    0   10    0   77    0    4
##      E079 E094 E095 E096 E097 E098 E100 E104 E105 E106 E109 E112 E113
## [1,]   67   61  105   67   44   49   62   55   56  111   32   35   53
## [2,]   73   68  130   54   43   75   95   60   72  107   44   58   61
## [3,]  109  124  214   90  102  121  137  128  119  180   72   88  111
## [4,]   85   80  154   80   86  100  100   90   82  159   57   57   69
## [5,]   38   42   71   31   28   44   52   46   37   66   18   30   26
## [6,]   93   89  174   87  102  104  133  111   96  177   69   81   78

head(methylation_hooks$sample_id)

## [1] "E003" "E004" "E005" "E006" "E007" "E011"

The data will be reloaded when you switch to a different chromosome, and the values in the five variables will change.

methylation_hooks$set_chr("chr21")
methylation_hooks$set_chr("chr22")
methylation_hooks$gr

## GRanges object with 550973 ranges and 0 metadata columns:
##            seqnames               ranges strand
##               <Rle>            <IRanges>  <Rle>
##        [1]    chr22 [16050633, 16050633]      *
##        [2]    chr22 [16050678, 16050678]      *
##        [3]    chr22 [16050688, 16050688]      *
##        [4]    chr22 [16050703, 16050703]      *
##        [5]    chr22 [16051245, 16051245]      *
##        ...      ...                  ...    ...
##   [550969]    chr22 [51244169, 51244169]      *
##   [550970]    chr22 [51244175, 51244175]      *
##   [550971]    chr22 [51244181, 51244181]      *
##   [550972]    chr22 [51244185, 51244185]      *
##   [550973]    chr22 [51244205, 51244205]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We assume sometimes, only a subset of samples have ChIP-seq data and across different histone marks, the samples under sequenced are not exactly the same. In this case, the data structure for ChIP-seq datasets is loose and will be stored as a simple list.

There are following functions that users should define:

chipseq_hooks$sample_id: given a histome mark, it returnes a vector of sample ids. This hook is important that it will be used to match samples in other types of datasets.
chipseq_hooks$peak: given a histome mark and a sample id (maybe additional arguments), it returns regions of peaks. There must be a meta column containing density of the peaks.

mark = "H3K4me1"
chipseq_hooks$sample_id = function(mark) {
    sample_id = dir(qq("@{PROJECT_DIR}/data/narrow_peaks"), pattern = qq("E\\d+-@{mark}.narrowPeak.gz"))
    sample_id = gsub(qq("-@{mark}.narrowPeak.gz"), "", sample_id)
    sample_id[1:4] # just for fast producing the document
}

chipseq_hooks$peak = function(mark, sid, ...) {
    df = read.table(qq("@{PROJECT_DIR}/data/narrow_peaks/@{sid}-@{mark}.narrowPeak.gz"), stringsAsFactors = FALSE)
    GRanges(seqnames = df[[1]], ranges = IRanges(df[[2]] + 1, df[[3]]), density = df[[5]])
}

sample_id = chipseq_hooks$sample_id(mark)
sample_id

## [1] "E001" "E002" "E003" "E004"

chipseq_hooks$peak(mark, sample_id[1])

## GRanges object with 308752 ranges and 1 metadata column:
##            seqnames                 ranges strand |   density
##               <Rle>              <IRanges>  <Rle> | <integer>
##        [1]    chr20 [ 30298909,  30300550]      * |       423
##        [2]    chr17 [ 75284496,  75284928]      * |       410
##        [3]    chr12 [114024796, 114025578]      * |       400
##        [4]     chr2 [172173141, 172175014]      * |       400
##        [5]     chr6 [158250559, 158252604]      * |       391
##        ...      ...                    ...    ... .       ...
##   [308748]    chr17 [ 76929272,  76929591]      * |        20
##   [308749]     chr8 [ 76007005,  76007198]      * |        20
##   [308750]     chr4 [128765826, 128766052]      * |        20
##   [308751]    chr15 [ 73656516,  73656709]      * |        20
##   [308752]    chr20 [ 55452364,  55452624]      * |        20
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

After these two hooks are defined, there is a get_peak_list() function which gives peaks in all supported samples for a given histome mark.

peak_list = get_peak_list(mark)
names(peak_list)

## [1] "E001" "E002" "E003" "E004"

length(peak_list)

## [1] 4

In chipseq_hooks$peak, ... is useful, e.g. you can only read peaks in a single chromosome:

chipseq_hooks$peak = function(mark, sid, chr) {
    df = read.table(pipe(qq("zcat @{PROJECT_DIR}/data/narrow_peaks/@{sid}-@{mark}.narrowPeak.gz | grep @{chr}")), 
        stringsAsFactors = FALSE)
    GRanges(seqnames = df[[1]], ranges = IRanges(df[[2]] + 1, df[[3]]), density = df[[5]])
}
chipseq_hooks$peak(mark, sample_id[1], "chr21")

## GRanges object with 3194 ranges and 1 metadata column:
##          seqnames               ranges strand |   density
##             <Rle>            <IRanges>  <Rle> | <integer>
##      [1]    chr21 [44752216, 44752778]      * |       295
##      [2]    chr21 [27520010, 27523105]      * |       262
##      [3]    chr21 [37581770, 37582452]      * |       258
##      [4]    chr21 [35320733, 35321807]      * |       255
##      [5]    chr21 [39221506, 39222338]      * |       247
##      ...      ...                  ...    ... .       ...
##   [3190]    chr21 [18980024, 18980259]      * |        20
##   [3191]    chr21 [30687622, 30687887]      * |        20
##   [3192]    chr21 [31387319, 31387564]      * |        20
##   [3193]    chr21 [34077608, 34077850]      * |        20
##   [3194]    chr21 [39355897, 39356122]      * |        20
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then these additional arguments can be passed in get_peak_list():

peak_list = get_peak_list(mark, chr = "chr21")
peak_list[[1]]

## GRanges object with 3194 ranges and 1 metadata column:
##          seqnames               ranges strand |   density
##             <Rle>            <IRanges>  <Rle> | <integer>
##      [1]    chr21 [44752216, 44752778]      * |       295
##      [2]    chr21 [27520010, 27523105]      * |       262
##      [3]    chr21 [37581770, 37582452]      * |       258
##      [4]    chr21 [35320733, 35321807]      * |       255
##      [5]    chr21 [39221506, 39222338]      * |       247
##      ...      ...                  ...    ... .       ...
##   [3190]    chr21 [18980024, 18980259]      * |        20
##   [3191]    chr21 [30687622, 30687887]      * |        20
##   [3192]    chr21 [31387319, 31387564]      * |        20
##   [3193]    chr21 [34077608, 34077850]      * |        20
##   [3194]    chr21 [39355897, 39356122]      * |        20
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

There is also another hook for reading chromHMM results if it is available. The returned GRanges object must have a state column which contains prediced chromatin states.

And also you can set addition arguments by ....

chipseq_hooks$chromHMM = function(sid, ...) {
    f = qq("@{PROJECT_DIR}/data/chromatin_states/@{sid}_15_coreMarks_mnemonics.bed.gz")
    gr = read.table(f, sep = "\t", stringsAsFactors = FALSE)
    GRanges(seqnames = gr[[1]], ranges = IRanges(gr[[2]] + 1, gr[[3]]), states = gr[[4]])
}
chipseq_hooks$chromHMM(sample_id[1])

## GRanges object with 510150 ranges and 1 metadata column:
##            seqnames               ranges strand |      states
##               <Rle>            <IRanges>  <Rle> | <character>
##        [1]    chr10     [     1, 119600]      * |    15_Quies
##        [2]    chr10     [119601, 120400]      * |      1_TssA
##        [3]    chr10     [120401, 136200]      * | 14_ReprPCWk
##        [4]    chr10     [136201, 139400]      * |    15_Quies
##        [5]    chr10     [139401, 145200]      * |       9_Het
##        ...      ...                  ...    ... .         ...
##   [510146]     chrY [59003801, 59005800]      * |    15_Quies
##   [510147]     chrY [59005801, 59006000]      * |       9_Het
##   [510148]     chrY [59006001, 59011800]      * |    15_Quies
##   [510149]     chrY [59011801, 59026000]      * |       9_Het
##   [510150]     chrY [59026001, 59373400]      * |    15_Quies
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

Or get the data for all samples:

chromHMM_list = get_chromHMM_list(sample_id)
length(chromHMM_list)

## [1] 4

The expression datasets are always represented as matrix, so the data importing is straightforward.

Normally, we can use GenomicFeatures::makeTranscriptDbFromGFF() to import the GTF file into R. Here import_gencode_as_txdb() is a modified version which additionally allows to do pre-filtering on the GTF file and also retrieve some mapping information which cannot be provided by the original function.

GTF_FILE = qq("@{PROJECT_DIR}/data/gen10.long.chr21.gtf")
txdb = import_gencode_as_txdb(GTF_FILE)

The TxDb database which only contains protein coding genes:

txdb = import_gencode_as_txdb(GTF_FILE, gene_type == "protein_coding")

In Gencode annotation file, there are some additional useful information such as gene symbols and gene types, this information can be retrieved later by extract_field_from_gencode(). This function uses an external Perl script.

mapping = extract_field_from_gencode(GTF_FILE, level = "gene", primary_key = "gene_id", field = "gene_name")
head(mapping)

## ENSG00000223662.1 ENSG00000235277.1 ENSG00000229047.1 ENSG00000231201.1 
##      "SAMSN1-AS1"      "AF127577.8"     "AF127577.10"     "AF127577.11" 
## ENSG00000226771.1 ENSG00000233783.1 
##      "AP001136.2"      "AP001442.2"

In Roadmap project, the transcriptome annotation is Gencode v10 which is quite out-of-date. In this analysis, we removed genes which have different annotation to Gencode v19 which is the newest annotation for human genome hg19. Genes are kept only if the positions are exactly the same in the two annotations. The matching can be done by match_gencode() function. In following example, we also filter transcripts by `transcript_type == "protein_coding".

g19 = qq("@{PROJECT_DIR}/data/gencode.v19.annotation.chr21.gtf")
match_by_gencode(GTF_FILE, g19, transcript_type == "protein_coding")

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Genome: NA
## # transcript_nrow: 666
## # exon_nrow: 2492
## # cds_nrow: 1975
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2017-01-26 15:11:28 +0100 (Thu, 26 Jan 2017)
## # GenomicFeatures version at creation time: 1.24.5
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

jokergoo/epik documentation built on Sept. 28, 2019, 9:20 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jokergoo/epik
Integrative Analysis and Visualization of Epigenomic Sequencing Data

roadmap/docs/01.import_data.md
In jokergoo/epik: Integrative Analysis and Visualization of Epigenomic Sequencing Data

Import datasets

Methylation datasets

ChIP-seq datasets

Expression datasets

Processing Gencode annotations

R Package Documentation

Browse R Packages

We want your feedback!

jokergoo/epik Integrative Analysis and Visualization of Epigenomic Sequencing Data

roadmap/docs/01.import_data.md In jokergoo/epik: Integrative Analysis and Visualization of Epigenomic Sequencing Data

Import datasets

Methylation datasets

ChIP-seq datasets

Expression datasets

Processing Gencode annotations

R Package Documentation

Browse R Packages

We want your feedback!

jokergoo/epik
Integrative Analysis and Visualization of Epigenomic Sequencing Data

roadmap/docs/01.import_data.md
In jokergoo/epik: Integrative Analysis and Visualization of Epigenomic Sequencing Data