simulateSV: Structural Variant Simulation

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

A tool for simulating deletions, insertions, inversions, tandem duplications and translocations in any genome available as FASTA-file or BSgenome data package. Structural variations (SVs) are placed within the given genome, or only a subset of it, in a random, non-overlapping manner or at given genomic coordinates. SV breakpoints can be positioned uniformly or with a bias towards repeat regions and regions of high homology.

Usage

1
simulateSV(output=".", genome, chrs, dels=0, ins=0, invs=0, dups=0, trans=0, size, sizeDels=10, sizeIns=10, sizeInvs=10, sizeDups=10, regionsDels, regionsIns, regionsInvs, regionsDups, regionsTrans, maxDups=10, percCopiedIns=0, percBalancedTrans=1, bpFlankSize=20, percSNPs=0, indelProb=0, maxIndelSize=10, repeatBias=FALSE, weightsMechanisms, weightsRepeats, repeatMaskerFile, bpSeqSize=100, random=TRUE, seed, verbose=TRUE)

Arguments

output

Output directory for the rearranged genome and SV lists; turn this off by passing NA (default: current directory)

genome

The genome as DNAStringSet or as filename pointing to a FASTA-file containing the genome sequence

chrs

Restrict simulation to certain chromosomes only (default: all chromosomes available)

dels

Number of deletions

ins

Number of insertions

invs

Number of inversions

dups

Number of tandem duplications

trans

Number of translocations

size

Size of SVs in bp (a single numeric value); a quick way to set a size, which is applied to all simulated SVs

sizeDels

Size of deletions: Either a single number for all deletions or a vector with a length for every single deletion

sizeIns

Size of insertions: Either a single number for all insertions or a vector with a length for every single insertion

sizeInvs

Size of inversions: Either a single number for all inversions or a vector with a length for every single inversion

sizeDups

Size of tandem duplications: Either a single number for all tandem duplications or a vector with a length for every single tandem duplication

regionsDels

GRanges object with regions within the genome where to place the deletions

regionsIns

GRanges object with regions within the genome where to place the Insertions

regionsInvs

GRanges object with regions within the genome where to place the inversions

regionsDups

GRanges object with regions within the genome where to place the tandem duplications

regionsTrans

GRanges object with regions within the genome where to place the translocations

maxDups

Maximum number of repeats for tandem duplications

percCopiedIns

Percentage of copy-and-paste-like insertions (default: 0, i.e. no inserted sequences are duplicated)

percBalancedTrans

Percentage of balanced translocations (default: 1, i.e. all translocations are balanced)

bpFlankSize

Size of the each breakpoint's flanking regions, which may contain additional SNPs and/or indels

percSNPs

Percentage of SNPs within a breakpoint's flanking region

indelProb

Probability for an indel within a breakpoint's flanking region

maxIndelSize

Maximum size of an indel

repeatBias

If TRUE, the breakpoint positioning is biased towards repeat regions instead of a uniform distribution; turned off by default (see details below)

weightsMechanisms

Weights for SV formation mechanisms (see details and examples below)

weightsRepeats

Weights for repeat regions (see details and examples below)

repeatMaskerFile

Filename of a RepeatMasker output file

bpSeqSize

Length of the breakpoint sequences in the output

random

If TRUE, the SVs will be placed randomly within the genome or the given regions; otherwise, the given regions will be used as SV coordinates (random can also be a vector of five elements with TRUE/FALSE for every SV in the following order: deletions, insertions, inversions, duplications, translocations)

seed

Fixed seed for generation of random SV positions

verbose

If TRUE, some messages about the progress of the simulation will be printed into the R console

Details

About the supported SV types:

About SV sizes and predefined regions:

About biases towards SV formation mechanisms and repeat regions:

About additional breakpoint mutations:

Misc:

Value

The rerranged genome as a DNAStringSet. Its metadata slot contains a named list of data.frames with information about the simulated SVs:

deletions

The coordinates of the implemented deletions and the breakpoint sequence

insertions

The coordinates of the origin (chrA) and destination (chrB) of the inserted sequence and the breakpoints sequences at both ends (5' and 3'). If the sequence is cut out from the original chromosome A, the sequence of this breakpoint is given as well.

inversions

The coordinates of the implemented inversions and the breakpoint sequences at both ends (5' and 3')

tandemDuplications

The coordinates of the duplicated sequence and the breakpoint sequence

translocations

The coordinates of the translocated sequences and the two breakpoint sequences (if balanced)

The coordinates in the tables refer to the "normal" reference genome.
All the list items can also be written to the specified output directory (which is the current directory by default). The genome will be saved in FASTA format and the SVs data.frames as CSV tables.

Author(s)

Christoph Bartenhagen

References

Chen W. et al., Mapping translocation breakpoints by next-generation sequencing, 2008, Genome Res, 18(7), 1143-1149. Lam H.Y. et al., Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library, 2010, Nat Biotechnol, 28(1), 47-55. Mills R.E. et al., Mapping copy number variation by population-scale genome sequencing, 2011, Nature, 470(7332), 59-65 Ou Z. et al., Observation and prediction of recurrent human translocations mediated by NAHR between nonhomologous chromosomes, 2011, Genome Res, 21(1), 33-46. Pang A.W. et al., Mechanisms of Formation of Structural Variation in a Fully Sequenced Human Genome, 2013, Hum Mutat, 34(2), 345-354. Smit et al., RepeatMasker Open-3.0., 1996-2010, <http://www.repeatmasker.org>

See Also

estimateSVSizes

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
## Toy example: Artificial genome with two chromosomes
genome = DNAStringSet(c("AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT", "GGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC"))
names(genome) = c("chr1","chr2")

## Three deletions of sizes 10bp each
sim = simulateSV(output=NA, genome=genome, dels=3, sizeDels=10, bpSeqSize=10)
sim
metadata(sim)

## Three insertions of 5bp each; all cut-and-paste-like (default)
sim = simulateSV(output=NA, genome=genome, ins=3, sizeIns=5, bpSeqSize=10)
sim
metadata(sim)
## Three insertions of 5bp each; all copy-and-paste-like (note the parameter \code{percCopiedIns})
sim = simulateSV(output=NA, genome=genome, ins=3, sizeIns=5, percCopiedIns=1, bpSeqSize=10)
sim
metadata(sim)

## Three inversions of sizes 2bp, 4bp and 6bp
sim = simulateSV(output=NA, genome=genome, invs=3, sizeInvs=c(2,4,6), bpSeqSize=10)
sim
metadata(sim)

## A tandem duplication of 4bp with at most ten duplications
## The duplication shall be placed somewhere within chr4:18-40
library(GenomicRanges)
region = GRanges(IRanges(10,30),seqnames="chr1")
sim = simulateSV(output=NA, genome=genome, dups=1, sizeDups=4, regionsDups=region, maxDups=10, bpSeqSize=10)
sim
metadata(sim)

## A balanced translocation (default)
sim = simulateSV(output=NA, genome=genome,trans=1, bpSeqSize=6, seed=246)
sim
metadata(sim)
## Another translocation, but unbalanced (note the parameter \code{percBalancedTrans})
sim = simulateSV(output=NA, genome=genome, trans=1, percBalancedTrans=0, bpSeqSize=6)
sim
metadata(sim)

## Simulate all four SV types at once:
## 2 deletions (5bp), 2 insertions (5bp),2 inversions (3bp), 1 tandem duplication (4bp), 1 translocations
sim = simulateSV(output=NA, genome=genome, dels=2, ins=2, invs=2, dups=1, trans=1, sizeDels=5, sizeIns=5, sizeInvs=3, sizeDups=4, maxDups=3, percCopiedIns=0.5, bpSeqSize=10)
sim
metadata(sim)

## Avoid random generation of coordinates and implement a given deletion of 10bp on chr2:16-25
knownDeletion = GRanges(IRanges(16,25), seqnames="chr2")
names(knownDeletion) = "myDeletion"
knownDeletion
sim = simulateSV(output=NA, genome=genome, regionsDels=knownDeletion, bpSeqSize=10, random=FALSE)
sim
metadata(sim)

## Avoid random generation of coordinates and implement a given insertion from chr1:16:25 at chr2:26
knownInsertion = GRanges(IRanges(16,25), seqnames="chr1", chrB="chr2", startB=26)
names(knownInsertion) = "myInsertion"
knownInsertion
sim = simulateSV(output=NA, genome=genome, regionsIns=knownInsertion, bpSeqSize=10, random=FALSE)
sim
metadata(sim)

## This example simulates a translocation t(9;22) leading to the BCR-ABL fusion gene.
## It uses simple breakpoints within 9q34.1 and 22q11.2 for demonstration
## Take care to add coordinates of both chromosomes to the GRanges object:
trans_BCR_ABL = GRanges(IRanges(133000000,141213431), seqnames="chr9", chrB="chr22", startB=23000000, endB=51304566, reciprocal=TRUE)
names(trans_BCR_ABL) = "BCR_ABL"
trans_BCR_ABL
## This example requires the \pkg{BSgenome.Hsapiens.UCSC.hg19} which is used by default (hence, no genome argument)
## Not run: sim = simulateSV(output=NA, chrs=c("chr9", "chr22"), regionsTrans=trans_BCR_ABL, bpSeqSize=30, random=FALSE)


## Add additional SNPs and indels at the flanking regions of each SV breakpoint:
## One deletion and 10% SNPs, 100% indel probability within 10bp up-/downstream of the breakpoint
sim = simulateSV(output=NA, genome=genome, dels=1, sizeDels=5, bpFlankSize=10, percSNPs=0.25, indelProb=1, maxIndelSize=3, bpSeqSize=10);
sim
metadata(sim)

## Setting the weights for SV formation mechanism and repeat biases demands a given data.frame structure
## The following weights are the default settings
## Please make sure your data.frames have the same row and column names, when setting your own weights
data(weightsMechanisms, package="RSVSim")
weightsMechanisms
data(weightsRepeats, package="RSVSim")
weightsRepeats
## The weights take effect, when no genome argument has been specified (i.e. the default genome hg19 will be used) and the argument repeatBias has been set to TRUE
## Not run: sim = simulateSV(output=NA, dels=10, invs=10, ins=10, dups=10, trans=10, repeatBias = TRUE, weightsMechanisms=weightsMechanisms, weightsRepeats=weightsRepeats)
## If weightsMechanisms and weightsRepeats were omitted, RSVSim loads the default weights automatically (see details section above for more info)
## Not run: sim = simulateSV(output=NA, dels=10, invs=10, ins=10, dups=10, trans=10, repeatBias = TRUE)

RSVSim documentation built on Nov. 8, 2020, 5:35 p.m.