06-generateSimulationSet: Generating sets of simulated tumors with SNP array and SNV...

Generating simulated tumor and data sets    R Documentation

Generating sets of simulated tumors with SNP array and SNV data

Description

Generates and saves a set of simulated tumors and data that can be used in clonal heterogeneity analysis to assess the accuracy of algorithms.

Usage

generateSimulationSet(simPath, dataPath, nPerK, rounds=400, nu=0,
                      pcnv=1, norm.contam=FALSE, dataPars=NULL)

Arguments

simPath

path to which simulated tumors will be saved.

dataPath

path to which simulated SNP array and/or SNV data will be saved.

nPerK

a vector of integers denoting the number of tumors to generate for each possible number of clones, where the nth entry dictates how many n-clone tumors will be generated.

rounds

integer; the number of branches or total 'historical' clones generated in the tumor simulation.

nu

an integer; the average number of mutations occurring per clonal branching event.

pcnv

a real number between 0 and 1; the probability of a CNV occurring at each clonal branching event.

norm.contam

a logical value; determines whether to include normal contamination in the simulated tumor.

dataPars

a list of parameters for data generation; see Details.
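
As a small illustration of the nPerK convention described above, the following base R sketch builds the vector used in the Examples section and tabulates what it requests. It does not call generateSimulationSet; the helper names perK, uneven and cloneCounts are hypothetical and used only for illustration.

```r
# nPerK[n] is the number of tumors to simulate with exactly n clones.
nPerK <- rep(60, 5)

# Label each entry by its clone count for readability.
perK <- setNames(nPerK, paste0(seq_along(nPerK), "-clone"))
print(perK)

# Total number of simulated tumors requested by this design.
total <- sum(nPerK)

# An uneven (illustrative) design: 5 one-clone, 10 two-clone, 20 three-clone.
uneven <- c(5, 10, 20)

# Expand the design into one clone count per requested tumor.
cloneCounts <- rep(seq_along(uneven), times = uneven)
print(table(cloneCounts))
```

With nPerK = rep(60, 5), the design requests 60 tumors for each clone count from 1 to 5.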

Details

A set of simulations can be generated, including both the simulated clonally heterogeneous tumors and the data generated therefrom. The size and general characteristics of the tumor set, as well as the types of data to be created from it (SNP array data and/or SNV data), are determined by the input parameters. The included script generates three simulated data sets, each with 300 simulations: one with only copy number alterations (and only SNP array data), one with only single nucleotide variants (SNVs) and SNV data, and one with both.
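
The three data sets described above could be configured along the following lines. This is a sketch under stated assumptions, not the script shipped with the package: only the CNV-only settings (nu = 0, pcnv = 1, and the paths 'sims-cnv' and 'data-cnv') come from the Examples section; the nu and pcnv values and the path names for the SNV-only and combined sets are assumed for illustration. The call itself is guarded so that the block runs even when generateSimulationSet is not available in the session.

```r
# Shared settings, taken from the documented example.
common <- list(nPerK = rep(60, 5), rounds = 400, norm.contam = FALSE)

# Per-set settings. Only the 'cnv' entry mirrors the documented example;
# the 'snv' and 'both' entries (paths, nu, pcnv) are ASSUMED values.
sets <- list(
  cnv  = list(simPath = "sims-cnv",  dataPath = "data-cnv",  nu = 0,   pcnv = 1),
  snv  = list(simPath = "sims-snv",  dataPath = "data-snv",  nu = 100, pcnv = 0),
  both = list(simPath = "sims-both", dataPath = "data-both", nu = 100, pcnv = 1)
)

# Merge shared and per-set arguments into complete argument lists.
argLists <- lapply(sets, function(s) c(s, common))

# Simulation is slow, so it is off by default and only attempted
# when the function is actually defined in the current session.
runNow <- FALSE
if (runNow && exists("generateSimulationSet", mode = "function")) {
  for (a in argLists) do.call(generateSimulationSet, a)
}
```

Keeping the shared arguments in one list makes it easy to check that every set requests the same number of tumors.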

Value

The generateSimulationSet function generates and saves two lists for each simulation:

  1. a simulated tumor (saved in the path simPath), with objects: psi, a vector of clonal fractions, and clones, which is a list of tumor clones, each of which in turn consists of a data frame cn and a data frame seq; and

  2. a simulated data object (saved in the path dataPath), with objects: cn.data and seq.data. Each component is itself a data frame. Note that in some cases, one of these data frames may have zero rows or may be returned as an NA.

Each cn data frame contains seven columns:

chr

the chromosome number;

start

the starting locus of each genomic segment;

end

the ending locus of each genomic segment;

A

the first allelic copy number at each genomic segment;

B

the second allelic copy number at each genomic segment;

seg

the segment number; and

parent.index

the index of the clone from which this clone is descended (equals 0 if the clone is an original tumor clone).

Each seq data frame contains seven columns:

chr

the chromosome number;

start

the locus of the simulated SNVs;

seg

the segment on which each SNV occurs;

mut.id

the unique id number for each simulated SNV;

mutated.copies

the number of copies of the mutated allele at each SNV;

allele

which allele (A or B) is mutated at each SNV; and

normal.copies

the number of copies of the unmutated allele at each SNV.
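
To make the layout of a simulated tumor easier to picture, here is a hand-built toy object with the same shape as described above: a psi vector plus a list of clones, each holding a cn and a seq data frame with the seven listed columns. All values are invented for illustration, makeClone is a hypothetical helper, and the assumption that psi sums to 1 is not stated in this page.

```r
# Build one toy clone: a copy-number data frame (cn) and an SNV data frame (seq).
makeClone <- function(parent) {
  cn <- data.frame(
    chr = c(1, 1, 2),
    start = c(1, 5000001, 1),
    end = c(5000000, 9000000, 8000000),
    A = c(1, 2, 1),            # first allelic copy number per segment
    B = c(1, 1, 0),            # second allelic copy number per segment
    seg = 1:3,
    parent.index = parent      # 0 marks an original tumor clone
  )
  seq <- data.frame(
    chr = c(1, 2),
    start = c(1200345, 456789),
    seg = c(1, 3),
    mut.id = 1:2,
    mutated.copies = c(1, 1),
    allele = c("A", "A"),
    normal.copies = c(1, 0)
  )
  list(cn = cn, seq = seq)
}

# A toy two-clone tumor; psi is ASSUMED to sum to 1.
tumor <- list(psi = c(0.7, 0.3),
              clones = list(makeClone(0), makeClone(1)))

# Total copy number per segment in the first clone.
totalCN <- with(tumor$clones[[1]]$cn, A + B)
print(totalCN)
```

Individual components are reached with ordinary list indexing, for example tumor$clones[[2]]$seq for the SNVs of the second clone.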

The cn.data component contains seven columns:

chr

the chromosome number;

seq

a unique segment identifier;

LRR

simulated segment-wise log ratios;

BAF

simulated segment-wise B allele frequencies;

X and Y

simulated intensities for two separate alleles/haplotypes per segment; and

markers

the simulated number of SNPs per segment.

The seq.data component contains eight columns:

chr

the chromosome number;

seq

a unique "segment" identifier;

mut.id

a unique mutation identifier;

refCounts and varCounts

the simulated numbers of reference and variant counts per mutation;

VAF

the simulated variant allele frequency;

totalCounts

the simulated total number of read counts; and

status

a character (that should probably be a factor) indicating whether a variant should be viewed as somatic or germline.
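
The sketch below builds a toy data object with the cn.data and seq.data columns listed above and shows one way to guard against a component that has zero rows or is NA, as the Value section warns can happen. All numbers are invented. The relations totalCounts = refCounts + varCounts and VAF = varCounts / totalCounts are the usual definitions and are assumed here rather than confirmed by this page, as are the status labels "somatic" and "germline". hasRows is a hypothetical helper.

```r
# Toy SNP array summary: one row per segment.
cn.data <- data.frame(
  chr = c(1, 1, 2),
  seq = 1:3,
  LRR = c(0.01, 0.35, -0.48),
  BAF = c(0.50, 0.34, 0.02),
  X = c(1.02, 1.95, 0.98),
  Y = c(0.99, 1.03, 0.05),
  markers = c(410, 275, 630)
)

# Toy SNV data: one row per mutation.
seq.data <- data.frame(
  chr = c(1, 2),
  seq = c(1, 3),
  mut.id = 1:2,
  refCounts = c(62, 18),
  varCounts = c(38, 42)
)
# ASSUMED relations between the count columns and VAF.
seq.data$totalCounts <- seq.data$refCounts + seq.data$varCounts
seq.data$VAF <- seq.data$varCounts / seq.data$totalCounts
# Store status as a factor; the level names are ASSUMED.
seq.data$status <- factor(c("somatic", "somatic"),
                          levels = c("germline", "somatic"))
# Arrange the columns in the documented order.
seq.data <- seq.data[, c("chr", "seq", "mut.id", "refCounts", "varCounts",
                         "VAF", "totalCounts", "status")]

dataObj <- list(cn.data = cn.data, seq.data = seq.data)

# TRUE only for a data frame with at least one row; FALSE for NA or empty.
hasRows <- function(x) is.data.frame(x) && nrow(x) > 0

if (hasRows(dataObj$seq.data)) print(summary(dataObj$seq.data$VAF))
```

Checking each component with a helper like hasRows before analysis avoids errors on CNV-only or SNV-only simulations, where one of the two components may be empty or NA.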

Author(s)

Kevin R. Coombes <krc@silicovore.com>, Mark Zucker <zucker.64@buckeyemail.osu.edu>

References

Zucker MR, Abruzzo LV, Herling CD, Barron LL, Keating MJ, Abrams ZB, Heerema N, Coombes KR. Inferring Clonal Heterogeneity in Cancer using SNP Arrays and Whole Genome Sequencing. Bioinformatics. To appear. doi: 10.1093/bioinformatics/btz057.

Examples

# Simulation set with just CNVs: 300 simulations in total, 60 with 1
# clone, 60 with 2 clones, ..., 60 with 5 clones.
## Not run: 
generateSimulationSet(simPath = 'sims-cnv', dataPath = 'data-cnv',
    nPerK = rep(60,5), rounds = 400, nu = 0, pcnv = 1, norm.contam = FALSE)

## End(Not run)

CloneData documentation built on July 1, 2022, 3 a.m.