SeqKat

Background

Kataegis is a pattern of localized hypermutation in which a genomic region is enriched in somatic SNVs (Nik-Zainal S. et al 2012). It can result from multiple cytosine deaminations catalyzed by the AID/APOBEC family of proteins (Lada AG et al 2012). A first step towards understanding kataegis is the ability to identify it reproducibly and reliably. Although a formal, quantifiable definition of kataegis has not been reached, we provide the first operational definition in the form of SeqKat, an R package that predicts kataegis from paired tumour-normal human whole-genome samples. The package contains functions to detect kataegis from SNVs in BED format.

Approach

SeqKat uses a fixed-width sliding-window approach to test whether the observed SNV trinucleotide content and inter-mutational distance deviate from what is expected by chance alone. In addition, an exact binomial test checks whether the proportion of each of the 32 trinucleotides within each window is higher than expected. The resulting p-values are adjusted for multiple hypothesis testing using the false discovery rate (FDR). Hypermutation and kataegic scores are then calculated for each window as follows:
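The testing step described above can be sketched in base R. This is an illustration only and not SeqKat's internal code: the helper name `test_window_enrichment`, the per-window counts and the expected proportion are all made up, and treating the test as "SNVs in one trinucleotide context out of all SNVs in the window" is an assumption. It relies only on `binom.test()` and `p.adjust()` from the stats package.

```r
# The 32 pyrimidine-centred trinucleotides (middle base C or T)
bases <- c("A", "C", "G", "T");
contexts <- as.vector(outer(outer(bases, c("C", "T"), paste0), bases, paste0));

# Hypothetical helper: one-sided exact binomial test per window,
# followed by FDR adjustment across windows.
# observed      : SNVs in one trinucleotide context, per window (assumed input)
# total         : all SNVs per window (assumed input)
# expected.prop : proportion of that context expected by chance (assumed)
test_window_enrichment <- function(observed, total, expected.prop) {
    p.values <- mapply(
        function(x, n) {
            binom.test(x, n, p = expected.prop, alternative = "greater")$p.value;
            },
        observed,
        total
        );
    # Benjamini-Hochberg false discovery rate adjustment
    p.adjust(p.values, method = "fdr");
    }

# Made-up counts for three windows
observed <- c(9, 2, 1);
total <- c(10, 10, 10);
p.adj <- test_window_enrichment(observed, total, expected.prop = 0.1);
print(p.adj);
```

Only the first made-up window, where 9 of 10 SNVs fall in the tested context, remains significant after adjustment.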

$\text{hypermutation score} = -\log_{10}(\text{binomial } p_{adj}) \times \frac{N_{\text{observed mutations}}}{N_{\text{expected mutations}}}$ [Equation 1]

$\text{kataegis score} = \text{hypermutation score} \times \frac{N_{\text{TCX bases}}}{N_{\text{expected TCX bases}}}$ [Equation 2]
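As a worked illustration of Equations 1 and 2, the two scores can be transcribed directly into R. The function names and all input values below are hypothetical; how SeqKat derives the expected counts is not shown here.

```r
# Hypothetical helper transcribing Equation 1
hypermutation_score <- function(p.adj, n.observed, n.expected) {
    -log10(p.adj) * (n.observed / n.expected);
    }

# Hypothetical helper transcribing Equation 2
kataegis_score <- function(hm.score, n.tcx, n.expected.tcx) {
    hm.score * (n.tcx / n.expected.tcx);
    }

# Made-up inputs for a single window
hm <- hypermutation_score(p.adj = 1e-4, n.observed = 6, n.expected = 0.5);
kat <- kataegis_score(hm, n.tcx = 3, n.expected.tcx = 1.5);
print(c(hypermutation = hm, kataegis = kat));

# A window with no TCX bases gets a kataegis score of 0
kat.none <- kataegis_score(hm, n.tcx = 0, n.expected.tcx = 1.5);
```

With these made-up inputs the hypermutation score is 4 x 12 = 48 and the kataegis score is 48 x 2 = 96. Because Equation 2 multiplies by the observed TCX count, a window without TCX bases always scores 0 regardless of its hypermutation score.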

SeqKat reports both the hypermutation score and an APOBEC-mediated kataegic score, along with the start and end positions of each detected event. A reference paper will be added upon publication in an upcoming version of this package.

Input

SeqKat accepts one SNV BED file per patient with four columns: chromosome, position, reference base and variant base. For example:

chr4    17185   G   A
chr4    38640   T   C
chr4    52598   C   T
chr4    53102   C   G
chr4    71989   G   A
chr4    91099   C   G
chr4    91139   G   C
chr4    192852  G   C
chr4    201573  G   C
chr4    212498  C   G

Running SeqKat

seqkat(sigcutoff = 5,
       mutdistance = 3.2,
       segnum = 4,
       ref.dir = NULL,
       bed.file = "./",
       output.dir = "./",
       chromosome = "all",
       chromosome.length.file = NULL,
       trinucleotide.count.file = NULL
       )
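Before passing a file as bed.file, it can help to inspect it in R. The sketch below reads the first sample SNV rows and computes the distance between consecutive SNVs, the inter-mutational distance that the sliding-window test works with. The column names are assumptions chosen for readability; they are not required by SeqKat.

```r
# First rows of the sample BED, embedded as text so the snippet runs on
# its own; for a real file, pass its path as the first argument instead.
bed.text <- "chr4 17185 G A
chr4 38640 T C
chr4 52598 C T
chr4 53102 C G
chr4 71989 G A";

snvs <- read.table(
    text = bed.text,
    header = FALSE,
    col.names = c("chr", "pos", "ref", "alt"),  # assumed column names
    # explicit classes stop a column of only "T" bases being read as TRUE
    colClasses = c("character", "integer", "character", "character")
    );

# Distance in bases between each SNV and the previous one (NA for the first)
snvs$distance <- c(NA, diff(snvs$pos));
print(snvs);
```

Setting colClasses explicitly is a defensive choice: without it, a base column containing only T (or only T and F-like values) would be converted to logical.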
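
Example chromosome.length.file (hg19 lengths; sum.f is the total of rows 1 to 23 and sum.m also includes row 24):

"num" "length"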
"1" 249250621
"2" 243199373
"3" 198022430
"4" 191154276
"5" 180915260
"6" 171115067
"7" 159138663
"8" 146364022
"9" 141213431
"10" 135534747
"11" 135006516
"12" 133851895
"13" 115169878
"14" 107349540
"15" 102531392
"16" 90354753
"17" 81195210
"18" 78077248
"19" 59128983
"20" 63025520
"21" 48129895
"22" 51304566
"23" 155270560
"24" 59373566
"sum.f" 3036303846
"sum.m" 3095677412
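The layout of the chromosome length file can be checked with a few lines of R. The sketch below embeds the hg19 table above as text, reads it with `read.table()`, and confirms that the two summary rows are consistent with the per-chromosome rows (rows 23 and 24 hold the hg19 lengths of chromosomes X and Y). The variable and helper names are illustrative only.

```r
# Contents of the chromosome length file, embedded so the snippet runs
# without the package installed.
length.text <- '"num" "length"
"1" 249250621
"2" 243199373
"3" 198022430
"4" 191154276
"5" 180915260
"6" 171115067
"7" 159138663
"8" 146364022
"9" 141213431
"10" 135534747
"11" 135006516
"12" 133851895
"13" 115169878
"14" 107349540
"15" 102531392
"16" 90354753
"17" 81195210
"18" 78077248
"19" 59128983
"20" 63025520
"21" 48129895
"22" 51304566
"23" 155270560
"24" 59373566
"sum.f" 3036303846
"sum.m" 3095677412';

# Read lengths as numeric (double): the totals exceed R's integer range
chr.lengths <- read.table(
    text = length.text,
    header = TRUE,
    colClasses = c("character", "numeric")
    );

# Hypothetical lookup helper
get_length <- function(id) chr.lengths$length[chr.lengths$num == id];

total.1.to.23 <- sum(chr.lengths$length[chr.lengths$num %in% as.character(1:23)]);
total.1.to.24 <- total.1.to.23 + get_length("24");

# Both checks pass silently for the table above
stopifnot(total.1.to.23 == get_length("sum.f"));
stopifnot(total.1.to.24 == get_length("sum.m"));
```

A custom file for another reference build would presumably need the same two-column layout, including the two summary rows; that requirement is an inference from this example rather than something stated here.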

Example trinucleotide.count.file (genome-wide count of each of the 32 trinucleotides):

trinucleotide   count
ACA 118307548
ACC 67377361
ACG 15031779
ACT 94371148
ATA 120046083
ATC 78231773
ATG 107040211
ATT 145343907
CCA 107257513
CCC 75979793
CCG 16160851
CCT 103230644
CTA 75285140
CTC 98529730
CTG 118268845
CTT 117139678
GCA 84247974
GCC 68786498
GCG 13995012
GCT 81425256
GTA 65911575
GTC 54961008
GTG 88082699
GTT 86132552
TCA 114943318
TCC 90485129
TCG 13117247
TCT 130196437
TTA 120231763
TTC 117019750
TTG 111364601
TTT 225211336
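To show how such a count table might feed into the "expected TCX bases" term of Equation 2, the sketch below computes the genome-wide proportion of TCX trinucleotides from the counts above. Two assumptions are made: TCX is read as any trinucleotide starting with TC, and the background proportion is taken as a simple ratio of counts. Whether SeqKat computes its expectation this way is not stated here.

```r
# Trinucleotide count table from above, embedded as text
tri.text <- "trinucleotide count
ACA 118307548
ACC 67377361
ACG 15031779
ACT 94371148
ATA 120046083
ATC 78231773
ATG 107040211
ATT 145343907
CCA 107257513
CCC 75979793
CCG 16160851
CCT 103230644
CTA 75285140
CTC 98529730
CTG 118268845
CTT 117139678
GCA 84247974
GCC 68786498
GCG 13995012
GCT 81425256
GTA 65911575
GTC 54961008
GTG 88082699
GTT 86132552
TCA 114943318
TCC 90485129
TCG 13117247
TCT 130196437
TTA 120231763
TTC 117019750
TTG 111364601
TTT 225211336";

# Counts are read as numeric (double) so that their sum cannot overflow
tri.counts <- read.table(
    text = tri.text,
    header = TRUE,
    colClasses = c("character", "numeric")
    );

# Assumed definition: TCX = trinucleotides beginning with "TC"
is.tcx <- startsWith(tri.counts$trinucleotide, "TC");
tcx.total <- sum(tri.counts$count[is.tcx]);
tcx.proportion <- tcx.total / sum(tri.counts$count);
print(tcx.proportion);
```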

Note: the default values of sigcutoff, mutdistance and segnum were optimized using the dataset from Alexandrov et al's "Signatures of mutational processes in human cancer".

Note: a trinucleotide.count.file and a chromosome.length.file are also provided for the GRCh38 reference.

Output

If kataegic events are detected, SeqKat generates a tab-delimited file with details about them. Each line represents one detected hypermutation or kataegic event. The file includes the following columns: sample, chr, start, end, variants, score.hm and score.kat.

Note: if no event is detected, no file is generated.
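Because the output file only exists when at least one event was found, downstream code should check for it before reading. The helper below is a hypothetical convenience wrapper, not part of SeqKat. The demonstration writes a temporary file that mimics the output layout, using the single event row from the package example.

```r
# Hypothetical helper: returns the detected events as a data frame, or
# NULL when no output file exists (i.e. no event was detected).
read_seqkat_events <- function(path) {
    if (!file.exists(path)) {
        return(NULL);
        }
    read.delim(path, header = TRUE, stringsAsFactors = FALSE);
    }

# Temporary file mimicking the tab-delimited output layout
demo.file <- tempfile(fileext = ".txt");
writeLines(
    c(
        "sample\tchr\tstart\tend\tvariants\tscore.hm\tscore.kat",
        "PD4120a-chr4-1-2000000\t4\t1009070\t1009541\t4\t119920.029973896\t0"
        ),
    demo.file
    );

events <- read_seqkat_events(demo.file);
no.events <- read_seqkat_events(tempfile());  # path that does not exist

# Events with a non-zero kataegic score (none in this demonstration)
apobec.events <- events[events$score.kat > 0, ];
print(events);
```

Returning NULL for a missing file lets a batch script distinguish "no events" from a read error without wrapping every call in error handling.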

Example

A subset BED file from the publicly available breast cancer sample PD4120a is provided in the package. This BED file contains 2804 SNVs in the first 2,000,000 bases of chromosome 4. A subset FASTA file and a chromosome length file are also provided, for testing purposes only.

example.bed.file <- paste0(
    path.package("SeqKat"),
    "/extdata/test/PD4120a-chr4-1-2000000_test_snvs.bed"
    );
example.ref.dir <- paste0(
    path.package("SeqKat"),
    "/extdata/test/ref/"
    );
example.chromosome.length.file <- paste0(
    path.package("SeqKat"),
    "/extdata/test/length_hg19_chr_test.txt"
    );
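# Run SeqKat on chromosome 4 of the test sample; the first three
# arguments are named to match the seqkat() signature
seqkat(
    sigcutoff = 5,
    mutdistance = 3.2,
    segnum = 2,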
    bed.file = example.bed.file,
    output.dir = ".",
    chromosome = "4",
    ref.dir = example.ref.dir,
    chromosome.length.file = example.chromosome.length.file
    );

To view the detected events, you can check the file PD4120a-chr4-1-2000000_chr4_cutoff5_mutdist3.2_segnum2.txt

sample  chr start   end variants    score.hm    score.kat
PD4120a-chr4-1-2000000  4   1009070 1009541 4   119920.029973896    0

In this example, SeqKat detected one hypermutation window on chromosome 4 between positions 1009070 and 1009541, containing 4 SNVs with a hypermutation score of 119920.03. The kataegic score is 0, indicating that it is not an APOBEC-mediated event.

References



SeqKat documentation built on March 13, 2020, 1:59 a.m.