%\VignetteIndexEntry{GSE84920.parser: Working with the GSE84920 Hi-C Data Sets} %\VignetteAuthor{Henrik Bengtsson} %\VignetteEngine{GSE84920.parser::self} %\VignetteEncoding{UTF-8} %\VignetteKeyword{R} %\VignetteKeyword{package} %\VignetteKeyword{vignette}
This R package provides a shell script and an R API for working with the Ramani et al. (2017) Hi-C data set.
The Ramani data set is published on NCBI's Gene Expression Omnibus (GEO) in the GEO series GSE84920 (titled 'Massively multiplex single-cell Hi-C'), which contains:
| GEO Sample | GEO Title | Cell Types | ------------------- | --------------------------------- | ------------ | GSM2254215 | Combinatorial scHi-C Library ML1 | human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski') | GSM2254216 | Combinatorial scHi-C Library ML2 | human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski') | GSM2254217 | Combinatorial scHi-C Library ML3 | human ('GM12878', 'K562'), mouse ('MEF', 'Patski') | GSM2254218 | Combinatorial scHi-C Library PL1 | human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski') | GSM2254219 | Combinatorial scHi-C Library PL2 | human ('HAP1', 'HeLa'), mouse ('MEF', 'Patski') | GSM2438426** | Combinatorial scHi-C Library ML4 | human ('Asynchronous', 'Nocadazole'), mouse ('Patski')
The subsetted example GSM2254215_ML1 files in the system.file("extdata", package = "GSE84920.parser")
folder were obtained by first downloading relevant GEO files and truncating using the following Bash script:
url_path="https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM2254215&format=file"
sample=GSM2254215_ML1
types=(percentages validPairs assignments)
tmpfile=$(mktemp)
mkdir -p inst/extdata/
for type in "${types[@]}"; do
src=$sample.$type.txt.gz
dest=inst/extdata/$sample.rows=1-1000.$type.txt
curl "$url_path&file=$src" -o "$tmpfile"
zcat "$tmpfile" | head -1000 > "$dest"
gzip "$dest"
done
rm -- "$tmpfile"
This R package provides functions for reading the above GEO data set, and other data with the same file formats, into R. It also includes a small subset of the ML1 data for the purpose of illustrating and testing the different functions:
> library(GSE84920.parser)
> path <- system.file("extdata", package = "GSE84920.parser")
> dir(path)
[1] "combo_hg19_mm10.genomesize"
[2] "GSM2254215_ML1.rows=1-1000_assignments.txt.gz"
[3] "GSM2254215_ML1.rows=1-1000.percentages.txt.gz"
[4] "GSM2254215_ML1.rows=1-1000.validPairs.txt.gz"
Lets look at the content of each of these *.txt.gz
files:
> file <- file.path(path, "GSM2254215_ML1.rows=1-1000_assignments.txt.gz")
> a <- read_assignments(file)
> a
# A tibble: 1,000 x 4
readname left_inner_barcode right_inner_barc… outer_barcode
<chr> <chr> <chr> <chr>
1 D00584:136:HMTLJBCXX:1:11… CTCTCACG CTCTCACG TCAGATGC
2 D00584:136:HMTLJBCXX:1:11… GCACCATG GCACCATG GTGTAGCA
3 D00584:136:HMTLJBCXX:1:11… AGGTGCGA AGGTGCGA GTATCTAT
4 D00584:136:HMTLJBCXX:1:11… GCCTTAGG GCCTTAGG CAGCATAT
5 D00584:136:HMTLJBCXX:1:11… CACCTGTG CACCTGTG TACTAAGC
6 D00584:136:HMTLJBCXX:1:11… CCGCTACG CCGCTACG CAGCATAT
7 D00584:136:HMTLJBCXX:1:11… GCCTCGAA GCCTCGAA GTATCTAT
8 D00584:136:HMTLJBCXX:1:11… CTGGTCAC CTGGTCAC TTGACCAT
9 D00584:136:HMTLJBCXX:1:11… CTGCGTAG CTGCGTAG TATCTTGT
10 D00584:136:HMTLJBCXX:1:11… CACGACCT CACGACCT GATGATCC
# … with 990 more rows
> p <- read_percentages(file)
> p
# A tibble: 1,000 x 16
hg19_frac mm10_frac hg19_count mm10_count pair_count inner_barcode outer_barcode is_observed Col10 dpnii_1x
<dbl> <dbl> <int> <int> <int> <chr> <chr> <chr> <chr> <int>
1 1.000 0.0000356 28050 1 28052 ACCACCAC TCAGATGC True All 55603
2 0.680 0.320 19081 8970 28052 ACCACCAC TCAGATGC Randomized All 55603
3 1.000 0.0000370 27010 1 28052 ACCACCAC TCAGATGC True Long 55603
4 0.683 0.317 18444 8555 28052 ACCACCAC TCAGATGC Randomized Long 55603
5 0 1 0 1 1 CATAGCGC ACTTGATA True All 2
6 1 0 1 0 1 CATAGCGC ACTTGATA Randomized All 2
7 0 1 0 1 1 CATAGCGC ACTTGATA True Long 2
8 1 0 1 0 1 CATAGCGC ACTTGATA Randomized Long 2
9 1 0 2 0 2 GGCCGTTC GCCATTAA True All 4
10 1 0 2 0 2 GGCCGTTC GCCATTAA Randomized All 4
# … with 990 more rows, and 6 more variables: dpnii_2x <int>, dpnii_3x <int>, dpnii_4x <int>,
# cistrans_ratio <dbl>, hela_allele_frac <dbl>, celltype <chr>
> file <- file.path(path, "GSM2254215_ML1.rows=1-1000.validPairs.txt.gz")
> vp <- read_validpairs(file)
> vp
# A tibble: 1,000 x 17
chr_a start_a end_a chr_b start_b end_b readname col8 col9 col10 col11 inner_barcode outer_barcode
<chr> <int> <int> <chr> <int> <int> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 mous… 2.28e7 2.28e7 mous… 5.33e7 5.33e7 D00584:… 37 42 + - ATCCGCGG GTCATAGT
2 huma… 8.86e7 8.86e7 huma… 8.91e7 8.91e7 D00584:… 42 42 - - GAGGAGCA CGATGACA
3 huma… 1.27e8 1.27e8 huma… 1.27e8 1.27e8 D00584:… 42 42 + + GCTACGGT AGTCGTAT
4 huma… 4.21e7 4.21e7 huma… 4.27e7 4.27e7 D00584:… 42 42 + + AGGTGCGA ATACATGT
5 huma… 2.09e8 2.09e8 huma… 2.32e8 2.32e8 D00584:… 42 35 + - GCCTCGAA GAGTACGT
6 huma… 5.57e6 5.57e6 huma… 1.50e8 1.50e8 D00584:… 42 42 + - GCTCGCTA CTAGTGAA
7 mous… 4.12e7 4.12e7 mous… 4.12e7 4.12e7 D00584:… 42 42 + - GAGGAGCA CGTTACTT
8 mous… 6.54e7 6.54e7 mous… 6.59e7 6.59e7 D00584:… 42 42 - + TCCGGACA TGTCTGCA
9 huma… 1.82e8 1.82e8 huma… 1.82e8 1.82e8 D00584:… 42 42 + - CAGGCTTG GATATAAC
10 mous… 5.86e7 5.86e7 mous… 6.33e7 6.33e7 D00584:… 42 42 + + TCACGAGC TGAGGCAA
# … with 990 more rows, and 4 more variables: col14 <chr>, col15 <int>, col16 <chr>, col17 <int>
Reading the full Ramani HiC files can take quite a while, particularly the ones with "assignments" and "validPairs" data. Because of this, you might want to import these data (once) into a local database and work with the data from there. See help("import_assignments", package = "GSE84920.parser")
for an example showing how to import into an SQLite database on file, which is easy since it requires zero setup.
Footnote:
**: The ML4 files are of slightly different file format than the other sets. Specifically, the *.validPairs.txt.gz
file has an additional three columns appended at the end - GSE84920.parser::read_validpairs()
will not be able to read those data because of that.
The package also provides a Bash script for splitting a raw HiC-count data file into chromosome-pair files, e.g. HiC sequence data files in GSE35156 (Dixon, et al., 2012). This split.sh
script is located under system.file("scripts", package = "GSE84920.parser")
.
$ split.sh
Split HiC Count Data File into Chromosome-Pair Files
Usage:
./split.sh [options] path/to/*.summary.txt.gz
Options:
--help Display this help
--version Display the version
--dryrun Don't run anything
--intraonly Only process intra-chromosome pairs
Examples:
./split.sh --intraonly hicData/GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary.txt.gz
This outputs *.tsv.gz files:
1. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chr1_vs_chr1.tsv.gz
2. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chr2_vs_chr2.tsv.gz
...
25. GSE35156_GSM862720_J1_mESC_HindIII_ori_HiC.nodup.hic.summary_chrM_vs_chrM.tsv.gz
to folder hicData/GSE35156_GSM862720/ with names
Details:
The produced files are names '<name>,<chr_i>_vs_<chr_j>.tsv.gz' where <name>
is the name of input '*.summary.txt.gz' file without the extension. The
files are written to folder 'hicData/<gse_id>_<gsm_id>/' where <gse_id> and
<gsm_id> are inferred from the <name>. This folder is automatically created,
if missing.
Version: 0.1.2
Copyright: Henrik Bengtsson (2017-2019)
License: GPL (>= 3.0)
Source: https://github.com/HenrikBengtsson/ramani
Comment: This split.sh
script is unrelated to the Ramani et al. (2017) study. It might be better fitted for the TopDomData R package.
Ramani, V., Deng, X., Qiu, R., Gunderson, K. L., Steemers, F. J., Disteche, C. M., ..., Shendure, J. (2017). Massively multiplex single-cell Hi-C. Nature methods, 14(3), 263–266. doi:10.1038/nmeth.4155, PMC5330809.
Dixon JR, Selvaraj S, Yue F, Kim A, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 2012 Apr 11; 485(7398):376-80, doi: 10.1038/nature11082, PMCID: PMC3356448, PMID: 22495300.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.