Shazam: Simulating sequence mutations

SHazaM provides two functions for simulating mutated sequences, one at the sequence level (shmulateSeq), and the other at the lineage level (shmulateTree).

Both functions rely on a 5-mer targeting model for computing the probabilities of mutations at each position along the input sequence. The 5-mer targeting models currently availbale in SHazaM are:

Simulate mutations in a single sequence

shmulateSeq generates random mutations in an input sequence. This sequence is provided by the user as a string, with the acceptable alphabet being {A, T, G, C, N, .}. Note that - is not accepted as part of the input sequence. If the input sequence has a non-triplet overhang at the end, it will be trimmed to the last codon. For example, ATGCATGC will be trimmed to ATGCAT before mutations are introduced.

The total number or frequency of mutations to be introduced is user-specified via numMutations with frequency set to FALSE (default) or TRUE respectively. For frequency=TRUE, the number of mutations to be introduced is calculated as the length of the sequence multiplied by the specified mutation frequency and rounded down to the nearest whole number (floor). Mutations are not introduced to positions in the input sequence that contain . or N.

Mutations are introduced iteratively using a targeting model. Targeting probabilities at each position are updated after each iteration.

# Import required packages
library(shazam)

# Input sequence
sequence <- "NGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGATA.TTTA"

# Simulate introduction of 6 mutations using the default HH_S5F targeting model
shmulateSeq(sequence, numMutations=6)

# Simulate introduction of mutations at frequency 0.2 using the default HH_S5F targeting model
shmulateSeq(sequence, numMutations=0.2, frequency=TRUE)

# Simulate introduction of 4 mutations using the MK_RS5NF targeting model
shmulateSeq(sequence, numMutations=4, targetingModel=MK_RS5NF)

Simulate mutations in a lineage tree

shmulateTree generates a set of simulated sequences based on an input sequence and a lineage tree. The input sequence will act as the most recent common ancestor (MRCA) of the lineage tree, and sequences in the offspring nodes will be simulated with the numbers of mutations corresponding to the edge weights of the tree.

The lineage tree is supplied by the user as an igraph graph object, such as that returned by buildPhylipLineage of the alakazam package. For details, see the Reconstruction of Ig lineage trees vignette of alakazam. It is assumed that the name vertex attribute of the root node is Germline, as is the case with the trees built by buildPhylipLineage.

# Import required packages
library(alakazam)
library(igraph)
library(shazam)

# Load example lineage
data(ExampleTrees, package="alakazam")
graph <- ExampleTrees[[17]]

# Input sequence to be used as MRCA of the lineage tree
sequence <- "NGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGATAGTTTA"

# Simulate using the default HH_S5F targeting model
shmulateTree(sequence, graph)

It is possible to exclude certain specified nodes from being considered as the MRCA and from being included as part of the simulation. To specify such nodes, use the field argument to indicate which annotation field in vertex_attr(graph) contains information relevant to deciding which nodes to exclude, and the exclude argument to indicate the value in the annotation field that nodes to be excluded carry.

Note that when excluding some nodes, additional nodes that have not been explicitly specified by the user to be excluded may also get excluded. For example, suppose that node B is an offspring of node A; and node A has been specified by the user to be excluded. As a corollary of node A being excluded, its offspring node B will also become excluded, despite not being specified explicitly.

# The annotation field called "sample_id"
vertex_attr(graph)$sample_id

# notice that node "GN5SHBT01AKANC" is an offspring of "Inferred1"
par(mar=c(0, 0, 0, 0) + 0.1)
plot(graph, layout=layout_as_tree, edge.arrow.mode=0, vertex.label.cex=0.75)

# Exclude nodes without a sample identifier
# The nodes "Germline" and "Inferred1" are thus excluded
# As a corollary, "GN5SHBT01AKANC", the offspring of "Inferred1", is also excluded
# In this case, "GN5SHBT07JDYW5" is then taken to be the MRCA
shmulateTree(sequence, graph, field="sample_id", exclude=NA)

It is also possible to add a proportional number of mutations to the immediate offsprings of the MRCA based on the fraction of the nucleotide sequence that is within the junction region. This is achieved via the optional junctionWeight argument, to be supplied as a numeric value between 0 and 1.

As an example, suppose that the MRCA has two immediate offsprings, each containing 2 and 4 mutations respectively compared to the MRCA. With junctionWeight=0.2, the number of mutations to be introduced to these two offsprings will become round(2*(1+0.2)) (2) and round(4*(1+0.2)) (5) respectively.

# The "Inferred1" node is taken to be the MRCA and has 2 immediate offsprings
par(mar=c(0, 0, 0, 0) + 0.1)
plot(graph, layout=layout_as_tree, edge.arrow.mode=0, vertex.label.cex=0.75)

# Add 20% mutation rate to the immediate offsprings of the MRCA
shmulateTree(sequence, graph, junctionWeight=0.2)


Try the shazam package in your browser

Any scripts or data that you put into this service are public.

shazam documentation built on Oct. 3, 2023, 1:06 a.m.