```r
library(misha)
```
Starting with misha 5.3.0, databases can be stored in two formats: the indexed format (the default) and the legacy per-chromosome format.

The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).
The indexed format uses unified files:
Sequence data:
- seq/genome.seq - All chromosome sequences concatenated
- seq/genome.idx - Index mapping chromosome names to positions
Track data:
- tracks/mytrack.track/track.dat - All chromosome data concatenated
- tracks/mytrack.track/track.idx - Index with offset/length per chromosome
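The layout above can be inspected directly from R. A minimal sketch, assuming an indexed database at the hypothetical path /path/to/mydb containing a track named mytrack:

```r
# List the unified sequence files of an indexed database
# (expect genome.seq and genome.idx)
list.files("/path/to/mydb/seq")

# List the files of a single track
# (expect track.dat and track.idx)
list.files("/path/to/mydb/tracks/mytrack.track")
```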
Advantages:

- Fewer file descriptors (important for genomes with 100+ contigs)
- Better performance for large workloads (14% faster)
- Smaller disk footprint
- Faster track creation and conversion
The per-chromosome format uses separate files:
Sequence data:
- seq/chr1.seq, seq/chr2.seq, ... - One file per chromosome
Track data:
- tracks/mytrack.track/chr1.track, chr2.track, ... - One file per chromosome
When to use:

- Compatibility with older misha versions (<5.3.0)
- Small genomes (<25 chromosomes) where the performance difference is negligible
By default, new databases use the indexed format:
```r
# Create database from FASTA file
gdb.create("mydb", "/path/to/genome.fa")

# Or download a pre-built genome
gdb.create_genome("hg38", path = "/path/to/install")
```
To create a database in legacy format (for compatibility with older misha):
```r
# Set option before creating the database
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")
```
Use gdb.info() to check your database format:
```r
gsetroot("/path/to/mydb")
info <- gdb.info()
print(info$format) # "indexed" or "per-chromosome"
```
Example output:
```r
info <- gdb.info()
# $path
# [1] "/path/to/mydb"
#
# $is_db
# [1] TRUE
#
# $format
# [1] "indexed"
#
# $num_chromosomes
# [1] 24
#
# $genome_size
# [1] 3095693983
```
Convert all tracks and sequences to indexed format:
```r
gsetroot("/path/to/mydb")
gdb.convert_to_indexed()
```
This will:
1. Convert sequence files (chr*.seq → genome.seq + genome.idx)
2. Convert all tracks to indexed format
3. Validate conversions
4. Remove old files after successful conversion
Convert specific tracks while keeping others in legacy format:
```r
gtrack.convert_to_indexed("mytrack")
```
Note that 2D tracks cannot be converted to indexed format yet.
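Because 2D tracks cannot be converted yet, a batch conversion should skip them. A sketch, assuming gtrack.info() reports a dimensions field distinguishing 1D from 2D tracks:

```r
# Convert every 1D track in the current database; skip 2D tracks,
# which cannot be converted to the indexed format yet
for (track in gtrack.ls()) {
  if (gtrack.info(track)$dimensions == 1) {
    gtrack.convert_to_indexed(track)
  }
}
```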
Convert interval sets to indexed format:
```r
# 1D intervals
gintervals.convert_to_indexed("myintervals")

# 2D intervals
gintervals.2d.convert_to_indexed("my2dintervals")
```
High priority (significant benefits):

- Genomes with many contigs (>50 chromosomes)
- Large-scale analyses (frequent queries over 10M+ bp regions)
- 2D track workflows
- File descriptor limit issues
Medium priority (moderate benefits):

- Repeated extraction workflows
- Regular analyses on medium-sized regions (1-10M bp)
Low priority (minimal benefits):

- Small genomes (<25 chromosomes)
- One-off analyses
- Simple queries on small regions
Step 1: Backup (optional but recommended)
```r
# Create a backup of an important database
system("cp -r /path/to/mydb /path/to/mydb.backup")
```
Step 2: Check current format
```r
gsetroot("/path/to/mydb")
info <- gdb.info()
print(paste("Current format:", info$format))
```
Step 3: Convert
```r
gdb.convert_to_indexed()
```
Step 4: Verify
```r
# Check that the format changed
info <- gdb.info()
print(paste("New format:", info$format))

# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))
```
Step 5: Remove backup (after validation)
```r
# After thorough testing
system("rm -rf /path/to/mydb.backup")
```
You can freely copy tracks between databases with different formats.
```r
# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
  iterator = "mytrack",
  file = "/tmp/mytrack.txt"
)

# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to the target database format!
```
```r
# Copy multiple tracks
tracks <- c("track1", "track2", "track3")
for (track in tracks) {
  # Export
  gsetroot("/path/to/source_db")
  file_path <- sprintf("/tmp/%s.txt", track)
  gextract(track, gintervals.all(), iterator = track, file = file_path)
  info <- gtrack.info(track) # Get description while the source db is active

  # Import
  gsetroot("/path/to/target_db")
  gtrack.import(track, info$description, file_path, binsize = 0)
  unlink(file_path)
}
```
The recommendations above are based on comprehensive benchmarks comparing the indexed and legacy formats.

Databases in both formats can coexist, and you can switch between them freely within the same session:

```r
# Work with both formats in the same session
gsetroot("/path/to/legacy_db")
data1 <- gextract("track1", gintervals(1, 0, 1000))

gsetroot("/path/to/indexed_db")
data2 <- gextract("track2", gintervals(1, 0, 1000))
```
Errors from hitting the open file descriptor limit occur with many-contig genomes in legacy format, which opens one file per chromosome:
Solution: Convert to indexed format
```r
gdb.convert_to_indexed()
```
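If converting immediately is not an option, you can at least check the per-process open-file limit you are running against. A sketch, assuming a POSIX shell is available to system():

```r
# Print the open file descriptor limit of the shell spawned by system()
system("ulimit -n")
```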
If tracks are not visible after manually copying track directories:
Solution: Reload database
```r
gdb.reload()
```
Conversion needs 2x track size temporarily:
Solution: Free disk space or convert tracks individually
```r
# Convert one track at a time
gtrack.convert_to_indexed("track1")
gtrack.convert_to_indexed("track2")
# etc.
```
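Before converting, it can help to check that the filesystem has headroom for the temporary 2x footprint. A sketch, assuming a POSIX system with df and the hypothetical path /path/to/mydb:

```r
# Show free space on the filesystem holding the database
system("df -h /path/to/mydb")
```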
Key functions:

- gdb.create_genome() for standard genomes
- gdb.create() with a multi-FASTA file for custom genomes
- gdb.info() to check the database format
- gdb.convert_to_indexed() to migrate an existing database