gsynth.random: Generate random genome sequences
In misha: Toolkit for Analysis of Genomic Data

gsynth.random

R Documentation

Generate random genome sequences

Description

Generates random DNA sequences based on nucleotide probabilities without using a trained Markov model. Each nucleotide is sampled independently according to the specified probabilities.

Usage

gsynth.random(
  intervals = NULL,
  output_path = NULL,
  output_format = c("misha", "fasta", "vector"),
  nuc_probs = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25),
  mask_copy = NULL,
  seed = NULL,
  n_samples = 1,
  iterator = 1
)

Arguments

`intervals`	Genomic intervals to sample. If NULL, uses all chromosomes.
`output_path`	Path to the output file. Ignored when output_format = "vector".
`output_format`	Output format. One of "misha" (.seq binary format, the default), "fasta" (FASTA text format), or "vector" (return the sequences as a character vector without writing to a file).
`nuc_probs`	Nucleotide probabilities. Can be specified either as a named vector, e.g. `c(A = 0.3, C = 0.2, G = 0.2, T = 0.3)`, or as an unnamed vector in A, C, G, T order, e.g. `c(0.3, 0.2, 0.2, 0.3)`. Probabilities are automatically normalized to sum to 1. Default is uniform (0.25 each).
`mask_copy`	Optional intervals to copy from the original genome instead of random sampling. Use this to preserve specific regions exactly as they appear in the reference.
`seed`	Random seed for reproducibility. If NULL, uses current random state.
`n_samples`	Number of samples to generate per interval. Default is 1.
`iterator`	Iterator for position resolution. Default is 1 (base-pair resolution). Larger values may speed up processing but are typically not needed for random sampling.

Details

Unlike gsynth.sample, which uses a trained Markov model to generate sequences that preserve k-mer statistics, gsynth.random samples each nucleotide independently from nuc_probs, so the output carries no dependence between neighboring positions. This is useful for generating baseline random sequences or sequences with a specific GC content.
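
The independent sampling described above can be sketched in base R. This is only an illustration of the idea, not misha's implementation: `random_dna` is a hypothetical helper, and it ignores intervals, mask_copy and iterator entirely.

```r
# Hypothetical helper (not part of misha): draw one sequence of length n,
# sampling every base independently according to `probs`.
random_dna <- function(n,
                       probs = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25),
                       seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # sample() treats `prob` as weights, so they need not sum to exactly 1
  bases <- sample(names(probs), size = n, replace = TRUE, prob = probs)
  paste(bases, collapse = "")
}

# A 1000 bp sequence with an expected GC content of 60%
s <- random_dna(1000, c(A = 0.2, C = 0.3, G = 0.3, T = 0.2), seed = 42)
nchar(s)

# The observed GC fraction fluctuates around 0.6; it is not exactly 0.6
mean(strsplit(s, "")[[1]] %in% c("C", "G"))
```

Because each position is drawn independently, the realized composition of any one sequence only approximates the requested probabilities, with the deviation shrinking as the sequence gets longer.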

Nucleotide ordering: When using an unnamed vector for nuc_probs, the order is A, C, G, T. Named vectors can be in any order.
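
As a rough sketch of the conventions above (named vectors in any order, unnamed vectors in A, C, G, T order, automatic normalization), the following base R helper shows one way such input could be canonicalized. `normalize_nuc_probs` is a hypothetical name used for illustration; the validation rules are assumptions and may differ from what gsynth.random does internally.

```r
# Hypothetical helper (not part of misha): return probabilities in
# A, C, G, T order, rescaled to sum to 1.
normalize_nuc_probs <- function(p) {
  nucs <- c("A", "C", "G", "T")
  # Assumed validation: four non-negative weights with a positive total
  stopifnot(is.numeric(p), length(p) == 4, all(p >= 0), sum(p) > 0)
  if (is.null(names(p))) {
    # Unnamed input is interpreted positionally as A, C, G, T
    names(p) <- nucs
  } else {
    # Named input may come in any order; reorder it
    stopifnot(setequal(names(p), nucs))
    p <- p[nucs]
  }
  p / sum(p)
}

# Unnamed weights that do not sum to 1
normalize_nuc_probs(c(3, 2, 2, 3))

# Named vector given out of order
normalize_nuc_probs(c(T = 0.3, G = 0.2, A = 0.3, C = 0.2))
```

Both calls yield A = 0.3, C = 0.2, G = 0.2, T = 0.3.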

Value

When output_format is "misha" or "fasta", returns invisible NULL and writes the random sequences to output_path. When output_format is "vector", returns a character vector of sequences (length = n_intervals * n_samples).

Examples

gdb.init_examples()

# Generate random sequences with uniform nucleotide probabilities
seqs <- gsynth.random(
    intervals = gintervals(1, 0, 1000),
    output_format = "vector",
    seed = 42
)

# Generate GC-rich sequences (60% GC)
gc_rich <- gsynth.random(
    intervals = gintervals(1, 0, 1000),
    output_format = "vector",
    nuc_probs = c(A = 0.2, C = 0.3, G = 0.3, T = 0.2),
    seed = 42
)

# Generate AT-rich sequences
at_rich <- gsynth.random(
    intervals = gintervals(1, 0, 1000),
    output_format = "vector",
    nuc_probs = c(A = 0.35, C = 0.15, G = 0.15, T = 0.35),
    seed = 42
)
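
To check that sequences returned with output_format = "vector" roughly match the requested composition, a small base R helper can compute the GC fraction per sequence. `gc_fraction` is a hypothetical helper, not a misha function; it is shown here on literal strings so that it runs without a genome database.

```r
# Hypothetical helper (not part of misha): GC fraction of each sequence
# in a character vector.
gc_fraction <- function(seqs) {
  vapply(seqs, function(s) {
    chars <- strsplit(toupper(s), "")[[1]]
    mean(chars %in% c("C", "G"))
  }, numeric(1), USE.NAMES = FALSE)
}

gc_fraction(c("ACGT", "GGCC", "ATAT"))
# 0.5, 1.0, 0.0
```

Applied to the results above, `gc_fraction(gc_rich)` would be expected to land near 0.6 and `gc_fraction(at_rich)` near 0.3, with some sampling noise.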

misha documentation built on Feb. 20, 2026, 5:08 p.m.