07-generateMixtures: Generating sets of artificially mixed and altered...
In CloneData: Data to Support CloneSeeker Algorithm

Generating data from artificial mixtures

R Documentation

Generating sets of artificially mixed and altered heterogeneous data

Description

Generates and saves a 'simulated' tumor data set by artificially mixing and altering real SNP array data. The resulting data can be used in clonal heterogeneity analysis to assess the accuracy of algorithms.

Usage

generateMixtures(dataPath, mixPath, nPerK, segmentedData, ID_pool, pos)

Arguments

`dataPath`	path to which simulated tumors will be saved.
`mixPath`	path to which artificially mixed and altered SNP array data will be saved.
`nPerK`	a vector of integers denoting the number of tumors to generate for each possible number of clones, where the nth entry dictates how many n-clone tumors will be generated.
`segmentedData`	segmented SNP array data from which mixtures will be generated; must contain following columns: 'loc.start' (segment start locus), 'loc.end' (segment end locus), 'seg.median' (median Log R ratio), 'SamID' (sample ID), 'chrom' (chromosome number), 'AvgBAF' (average B allele fraction for segment), 'num.mark' (number of markers per segment).
`ID_pool`	a list of sample IDs from segmentedData from which samples will be drawn to generate artificial mixtures.
`pos`	a data frame with two columns, `Chr` and `Position`, defining the chromosomal locations of the simulated SNPs.
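Because the function depends on specific column names, it can help to check the inputs before starting a long run. The following is a minimal sketch, not part of CloneData: `checkMixtureInputs` is a hypothetical helper, the toy data are invented, and the requirement that every ID in `ID_pool` appears in the `SamID` column is an assumption based on the argument descriptions above.

```r
# Hypothetical helper (not part of CloneData): report anything missing
# from the inputs described in the Arguments section.
checkMixtureInputs <- function(segmentedData, ID_pool, pos) {
  segCols <- c("loc.start", "loc.end", "seg.median", "SamID",
               "chrom", "AvgBAF", "num.mark")
  posCols <- c("Chr", "Position")
  missingSeg <- setdiff(segCols, colnames(segmentedData))
  missingPos <- setdiff(posCols, colnames(pos))
  # Assumption: every pooled ID should occur in the SamID column.
  knownIDs <- unique(as.character(segmentedData$SamID))
  missingIDs <- setdiff(unlist(ID_pool), knownIDs)
  list(missingSeg = missingSeg,
       missingPos = missingPos,
       missingIDs = missingIDs,
       ok = length(missingSeg) == 0 && length(missingPos) == 0 &&
            length(missingIDs) == 0)
}

# Toy inputs, invented purely for illustration
toySeg <- data.frame(loc.start = c(1, 5001), loc.end = c(5000, 9000),
                     seg.median = c(0.01, -0.30), SamID = c("S1", "S2"),
                     chrom = c(1, 1), AvgBAF = c(0.50, 0.33),
                     num.mark = c(120, 80))
toyPos <- data.frame(Chr = c(1, 1), Position = c(2500, 7000))

res <- checkMixtureInputs(toySeg, ID_pool = c("S1", "S3"), pos = toyPos)
print(res$missingIDs)  # "S3" is not a sample ID in toySeg
print(res$ok)
```

With the packaged data, the same helper could be called as `checkMixtureInputs(hapmapSegments, IDset, snpPositions)`.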

Details

A set of artificial mixtures (with CNVs artificially added) can be generated from real SNP array data. The total number of artificial mixtures, and how many mixtures to generate for each possible number of clones, is set through the nPerK argument.
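As a small illustration of how `nPerK` encodes the design, the sketch below expands the vector into one clone count per tumor. The variable names are illustrative, and the ordering of tumors is an assumption; only the counts per clone number follow from the argument description.

```r
# nPerK as used in the Examples section: 60 tumors for each of 1..5 clones
nPerK <- rep(60, 5)
totalTumors <- sum(nPerK)

# One entry per tumor giving its number of clones
# (the generation order is assumed here, not documented)
clonesPerTumor <- rep(seq_along(nPerK), times = nPerK)
print(totalTumors)
print(table(clonesPerTumor))

# An unbalanced design: ten 1-clone tumors, no 2-clone tumors, five 3-clone tumors
nPerK2 <- c(10, 0, 5)
clonesPerTumor2 <- rep(seq_along(nPerK2), times = nPerK2)
print(table(clonesPerTumor2))
```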

Value

The generateMixtures function generates and saves two lists for each mixture. The first is a 'tumor' (consisting of the artificially altered real data that make up the 'clones' of the mixture, saved in the path 'simpath'), with objects: psi, a vector of clonal fractions; clones, a list of tumor clones, each of which in turn consists of a data frame cn and a data frame seq; altered, a list of the segments that were artificially altered; and change, a list of the copy number changes introduced to the altered segments. The second is a simulated data object (saved in the path 'datapath'), with objects cn.data and seq.data, each of which is itself a data frame. Note that in some cases, one of these data frames may have zero rows or may be returned as an NA.

Each cn data frame contains seven columns:

chr: the chromosome number;
start: the starting locus of each genomic segment;
end: the ending locus of each genomic segment;
A: the first allelic copy number at each genomic segment;
B: the second allelic copy number at each genomic segment;
seg: the segment number; and
parent.index: the index of the clone from which this clone is descended (equals 0 if the clone is an original tumor clone).

Each seq data frame contains seven columns:

chr: the chromosome number;
start: the locus of the simulated SNVs;
seg: the segment on which each SNV occurs;
mut.id: the unique ID number for each simulated SNV;
mutated.copies: the number of copies of the mutated allele at each SNV;
allele: which allele (A or B) is mutated at each SNV; and
normal.copies: the number of copies of the unmutated allele at each SNV.
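To make the layout of the 'tumor' object concrete, here is a toy stand-in built by hand. All values are invented, `makeClone` is a hypothetical helper, and the `altered` and `change` components are left out because their exact layout is not specified here. The final step uses a psi-weighted average of total copy number, a common mixing model that is assumed for illustration and is not confirmed as the computation performed by generateMixtures.

```r
# Hypothetical helper: one clone with a cn data frame (seven columns)
# and a zero-row seq data frame (seven columns), as described above.
makeClone <- function(A, B, parent = 0) {
  cn <- data.frame(chr = c(1, 1, 2),
                   start = c(1, 5001, 1),
                   end = c(5000, 9000, 8000),
                   A = A, B = B,
                   seg = 1:3,
                   parent.index = parent)
  seq <- data.frame(chr = numeric(0), start = numeric(0), seg = integer(0),
                    mut.id = integer(0), mutated.copies = integer(0),
                    allele = character(0), normal.copies = integer(0))
  list(cn = cn, seq = seq)
}

# Toy two-clone tumor; psi is assumed to sum to 1
toyTumor <- list(
  psi = c(0.7, 0.3),
  clones = list(makeClone(A = c(1, 1, 1), B = c(1, 1, 1)),
                makeClone(A = c(1, 2, 1), B = c(1, 1, 0)))
)

# Total copy number per segment for each clone (segments x clones matrix)
totalCN <- sapply(toyTumor$clones, function(cl) cl$cn$A + cl$cn$B)

# Assumed mixing model: psi-weighted average copy number per segment
mixedCN <- as.vector(totalCN %*% toyTumor$psi)
print(mixedCN)

# Guard against empty seq components before using them
hasSNVs <- sapply(toyTumor$clones, function(cl) nrow(cl$seq) > 0)
print(hasSNVs)
```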

The cn.data component contains seven columns:

chr: the chromosome number;
seq: a unique segment identifier;
LRR: simulated segment-wise log ratios;
BAF: simulated segment-wise B allele frequencies;
X and Y: simulated intensities for two separate alleles/haplotypes per segment; and
markers: the simulated number of SNPs per segment.

The seq.data component contains eight columns:

chr: the chromosome number;
seq: a unique "segment" identifier;
mut.id: a unique mutation identifier;
refCounts and varCounts: the simulated numbers of reference and variant counts per mutation;
VAF: the simulated variant allele frequency;
totalCounts: the simulated total number of read counts; and
status: a character (that should probably be a factor) indicating whether a variant should be viewed as somatic or germline.
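The sketch below builds a toy data frame with the same eight columns as seq.data and converts status to a factor. The values are invented, the status labels "somatic" and "germline" are assumed spellings, and the relationships totalCounts = refCounts + varCounts and VAF = varCounts / totalCounts are the usual definitions, assumed here instead of being taken from the package.

```r
# Toy stand-in for a seq.data component; all values are invented
toySeq <- data.frame(chr = c(1, 1, 2),
                     seq = c(1, 2, 3),
                     mut.id = 1:3,
                     refCounts = c(40, 55, 30),
                     varCounts = c(10, 45, 0),
                     status = c("somatic", "germline", "somatic"),
                     stringsAsFactors = FALSE)

# Assumed definitions of the derived columns
toySeq$totalCounts <- toySeq$refCounts + toySeq$varCounts
toySeq$VAF <- toySeq$varCounts / toySeq$totalCounts

# status is documented as character; convert it to a factor if preferred
toySeq$status <- factor(toySeq$status, levels = c("germline", "somatic"))

# Keep only the variants labeled somatic
somatic <- subset(toySeq, status == "somatic")
print(somatic[, c("mut.id", "VAF", "status")])
```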

Author(s)

Kevin R. Coombes krc@silicovore.com, Mark Zucker zucker.64@buckeyemail.osu.edu

References

Zucker MR, Abruzzo LV, Herling CD, Barron LL, Keating MJ, Abrams ZB, Heerema N, Coombes KR. Inferring Clonal Heterogeneity in Cancer using SNP Arrays and Whole Genome Sequencing. Bioinformatics. To appear. doi: 10.1093/bioinformatics/btz057.

Examples

# Set of 300 simulated 'tumors' generated by artificially mixing and
# altering real data; 60 samples with 1 clone, 60 with 2 clones,
# ..., 60 with 5 clones.
data("hapmapSegments", package = "CloneData")
data("snpPositions", package = "CloneData")
IDset <- c('NA07019', 'NA12234', 'NA12249', 'NA12753', 'NA12761',
           'NA18545', 'NA18975', 'NA18999', 'NA18517')
# Generating the data set:
## Not run: 
generateMixtures(dataPath = 'mixdat', mixPath = 'mixsim',
                 nPerK = rep(60,5),  segmentedData = hapmapSegments,
                 ID_pool = IDset, pos = snpPositions)

## End(Not run)

CloneData documentation built on July 1, 2022, 3 a.m.