1. Per-sample
    1. Set up sample workflow
    2. Set up sample workflow reference data
    3. Set up sample list and sample data
    4. Run sample pipeline
    5. Publish sample pipeline data to web
2. Cohort
    1. Set up cohort workflow
    2. Set up cohort workflow reference data
    3. Set up cohort list and cohort data
    4. Run cohort pipeline
    5. Publish cohort pipeline data to web

Questions?

  1. Does viral counting differ between original (bwa) and re-aligned (abra)?

  2. What features might be associated with false positives, and how do these differ between expected false positives, unknowns with count levels similar to expected false positives, and suspected true positives with high count levels?

    • total read count
    • filtered read count and filtered/unfiltered ratios.
      • clipping
      • split reads
      • missing pair
      • mismatches
      • low base quality
      • low map quality
    • coverage patterns

Library Prep

[TODO]

Alignment

[TODO]

Virus counts

Overview

Reads aligned to viruses are counted using bedtools coverage summaries. The tumor and normal are counted separately to look for tumor-specific effects, but these samples all come from a tumor-only pipeline, so normals are not really applicable, except as a whole.
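As a minimal base-R sketch (not the pipeline's actual code) of how a bedtools-coverage-style table might be collapsed to per-virus read totals; the four-column layout here is an assumption for illustration only:

```r
# Toy bedtools-coverage-like table: one row per interval per sample run.
# Columns (virus, start, end, readCount) are assumed, not bedtools' exact output.
cov <- read.table(text = "
HPV16 0 7906 120
HPV16 0 7906 35
HPV18 0 7857 4
", col.names = c("virus", "start", "end", "readCount"))

# Sum read counts within each virus
virusTotals <- tapply(cov$readCount, cov$virus, sum)
virusTotals
```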

The virus read count file

Set up to load the virus count file

library(FusionClust)
virusCountName <- "all.tumor.virus.counts"
dataRoot <- "/srj/Projects/HpvIntegration"

virusCountFile <- file.path( dataRoot, virusCountName )
print( paste0( "Loading ", virusCountFile, ": ", file.info(virusCountFile)$size, " bytes"))

Loading the virus count file

virusCounts <- FusionClust::readVirusCounts(virusCountFile)
print( paste0( "Loaded virus counts for ", nrow(virusCounts),
               " samples X ", ncol(virusCounts), " viruses." ))
knitr::kable( t( head( virusCounts, 3 )))
print("Max counts, over all samples")
knitr::kable( as.data.frame(apply( virusCounts, 2, max )))

Integration sites

The abra structural variant file

Overview

Integration sites are generated by abra during reassembly of exome capture alignment data. Abra will detect viral/human fusion points if:

[TODO]

These are saved into the file "abra.sv.txt". Results from multiple samples can be concatenated if a leading "sample" column is added. Doing this is left as an exercise for the reader, but on Linux a good start, run from a shared parent directory, is

egrep "^" **/abra.sv.txt > all.abra.sv.txt

(With multiple input files, egrep prefixes each line with its file path, which serves as a crude sample column; the ** glob requires bash's globstar option.)
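One way to sketch the full exercise, assuming bash and awk; a throwaway two-sample fixture is created here so the commands run as-is, with the parent directory name standing in for the sample:

```shell
# Build a toy layout: one abra.sv.txt per sample directory (fixture only)
mkdir -p demo/sampleA demo/sampleB
printf 'chr1\t100\tHPV16\t50\n' > demo/sampleA/abra.sv.txt
printf 'chr2\t200\tHPV18\t7\n'  > demo/sampleB/abra.sv.txt

cd demo
# Prepend each file's sample directory name as a leading tab-separated column
for f in */abra.sv.txt; do
  sample=${f%%/*}
  awk -v s="$sample" 'BEGIN{OFS="\t"} {print s, $0}' "$f"
done > all.abra.sv.txt
cat all.abra.sv.txt
```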

Setting up the abra integration site fusions file to load:

library(FusionClust)
abraSvFileName <- "all.abra.sv.txt"

fusionFile <- file.path( dataRoot, abraSvFileName )
print( paste0( "Loading ", abraSvFileName, ": ", file.info(fusionFile)$size, " bytes"))

Loading the fusions:

fusions <- FusionClust::readAbraSv(fusionFile)
print( paste0( "Loaded ", nrow(fusions), " fusions from ",
               length(unique(fusions$sample)), " samples."))
knitr::kable(head(fusions))

Clustering fusions

Why clustering

Due to symmetric sequence at the junction point, it is often difficult to determine exactly where a fusion occurs. For example, if one end is on chr1 as ...AGG~ and the other is on chr2 as ~GGT..., where is the fusion point ~? The sequence found is ...AGGT..., so is it A~GGT, AG~GT, or AGG~T? There is no way to know which G's were kept from chr1 and which from chr2.
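The ambiguity can be demonstrated directly: with chr1 ending ...AGG and chr2 beginning GGT..., every candidate breakpoint reconstructs the same observed junction sequence.

```r
left  <- "AGG"  # bases retained from chr1
right <- "GGT"  # bases retained from chr2

# Try each candidate breakpoint: keep i bases from the left end, then
# resume the right end at its overlapping position.
candidates <- sapply(1:3, function(i)
  paste0(substr(left, 1, i), substr(right, i, nchar(right))))
candidates
# all three candidates are the same observed sequence, "AGGT"
```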

We will ignore the details of the above and report all fusions within some agglomerative window as the same fusion, named like:

<chr>:<low>-<high>:<strand>~<chr>:<low>-<high>:<strand>

The strandedness of each end is often unknown, but the relative "strandedness" of a fusion is usually known, either as parallel or anti-parallel, so strand may be +, -, *, P, or A. If either end is P or A, both must be P or A.
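A tiny helper (hypothetical, not part of FusionClust) shows how a name in this format might be composed:

```r
# Compose a fusion cluster name as <chr>:<low>-<high>:<strand>~<chr>:<low>-<high>:<strand>
fusionName <- function(chrA, lowA, highA, strandA,
                       chrB, lowB, highB, strandB) {
  sprintf("%s:%d-%d:%s~%s:%d-%d:%s",
          chrA, lowA, highA, strandA, chrB, lowB, highB, strandB)
}

fusionName("chr13", 73210001, 73210458, "P", "HPV16", 801, 910, "P")
# "chr13:73210001-73210458:P~HPV16:801-910:P"
```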

Window size

The window size depends on the length of the feature we care about locating fusions with respect to. Without additional input, we start with the size of an average gene, which is about 50 kb.

Before we choose a window, though, let's look at the distribution of gaps between successive fusion end points. This will give a feel for where good clustering gaps might be. To do this we'll merge both ends into one list of chr/pos, sort it, calculate the differences between successive positions within each chr separately, then plot these differences. This is not perfect, as it can include a short deletion on the same chromosome as successive fusion points, but that might be good anyway.
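The computation just described can be sketched in base R (the real work below is done by FusionClust::fusionGaps; the toy positions are made up):

```r
# Pooled fusion end positions, both ends in one chr/pos list
ends <- data.frame(chr = c("chr1", "chr1", "chr1", "HPV16", "HPV16"),
                   pos = c(100, 150, 5000, 10, 12))

# Per chromosome: sort positions, then diff successive positions
gapList <- lapply(split(ends$pos, ends$chr), function(p) diff(sort(p)))
gapList
# $HPV16 -> 2 ; $chr1 -> 50 4850
```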

gaps <- FusionClust::fusionGaps(fusions)
fusedChromosomes <- names(gaps)
print( paste( fusedChromosomes, sapply(gaps, length), sep=" - "))
chrFusionEndNames <- c( paste0("chr", 1:22), "chrX", "chrY", "chrMT" )
chrFusionEnds <- gaps[chrFusionEndNames]
length(chrFusionEnds)
chrBreaks <- unlist(chrFusionEnds)
names(chrBreaks) <- NULL
hist(chrBreaks, breaks = 200)
hist(chrBreaks[chrBreaks > 10 & chrBreaks < 1e6], breaks = 200)
hist(chrBreaks[chrBreaks > 100 & chrBreaks < 1e6], breaks = 200)
hist(chrBreaks[chrBreaks > 1000 & chrBreaks < 1e6], breaks = 200)
hist(chrBreaks[chrBreaks > 10000 & chrBreaks < 1e6], breaks = 200)
hist(chrBreaks[chrBreaks > 10000 & chrBreaks < 50000], breaks = 200)
hist(chrBreaks[chrBreaks > 100000 & chrBreaks < 1e6], breaks = 200)

A window of 20 kb would probably work, but 50 kb is not far wrong.

Let's look at the gap sizes in the viruses:

setdiff(fusedChromosomes, chrFusionEndNames)
nonChrFusionEnds <- gaps[setdiff(fusedChromosomes, chrFusionEndNames)]
chrBreaks <- unlist(nonChrFusionEnds)
names(chrBreaks) <- NULL
hist(chrBreaks, breaks = 20)
hist(chrBreaks[chrBreaks > 1 & chrBreaks < 50], breaks = 20)
hist(chrBreaks[chrBreaks > 2 & chrBreaks < 50], breaks = 20)
hist(chrBreaks[chrBreaks > 4 & chrBreaks < 50], breaks = 20)
hist(chrBreaks[chrBreaks > 10 & chrBreaks < 1e4], breaks = 20)
hist(chrBreaks[chrBreaks > 10 & chrBreaks < 100], breaks = 20)
hist(chrBreaks[chrBreaks > 100 & chrBreaks < 1e4], breaks = 20)
hist(chrBreaks[chrBreaks > 1000 & chrBreaks < 1e4], breaks = 20)

Looks like a window of about 5 is appropriate in the short (viral) chromosomes.

Sorting fusions

ends <- sortFusions( fusions )
knitr::kable(head(ends))

Clustering

Cluster the ends by window, creating numbered position-ranges that merge nearby fusion points, and report the original fusion list annotated with the clustering info.
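The core idea can be sketched as single-pass windowed clustering of sorted positions (a hedged stand-in for clusterFusionEnds, not its actual algorithm):

```r
# Assign sorted positions to the same cluster while successive gaps stay
# within the window; a gap larger than the window starts a new cluster.
clusterByWindow <- function(pos, window) {
  pos <- sort(pos)
  cumsum(c(1, diff(pos) > window))
}

clusterByWindow(c(100, 500, 90000, 90100, 250000), window = 50000)
# 1 1 2 2 3
```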

windowSize <- 100000

clusteredEnds <- clusterFusionEnds(ends, window= windowSize)
str(clusteredEnds)
knitr::kable(head(clusteredEnds, 5))
knitr::kable(tail(clusteredEnds, 5))

Looking at groupings

sum(clusteredEnds$clusterLow != clusteredEnds$clusterHigh)
diffs <- clusteredEnds$clusterHigh - clusteredEnds$clusterLow
selectRow <- (diffs == max(diffs))
knitr::kable( head( clusteredEnds[selectRow, ] ))
selectRow <- (clusteredEnds$clusterLow != clusteredEnds$clusterHigh)
knitr::kable( head( clusteredEnds[selectRow, ] ))

Now that we have fusion end ranges, report fusions by cluster.

clusterAnnotatedFusions <- clusterFusions(fusions, window= windowSize)
knitr::kable(head(clusterAnnotatedFusions, 10))
clusterAnnotatedFusionsFile <- paste0("all.abra.sv.clustered.", windowSize, ".tsv")
clusterAnnotatedFusionsPath <- file.path( dataRoot, clusterAnnotatedFusionsFile )
write.table(clusterAnnotatedFusions, file = clusterAnnotatedFusionsPath, sep="\t")

Ok, now that we have fusion clusters identified (single- and double-ended), let's merge counts within each sample by fusion cluster id.
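A base-R sketch of this per-sample, per-cluster aggregation (a stand-in for mergeCountsByCluster; the column names and toy values are assumptions for illustration):

```r
# Toy cluster-annotated fusion rows
fusionRows <- data.frame(sample = c("S1", "S1", "S1", "S2"),
                         fusionClusterId = c(1, 1, 2, 1),
                         readCount = c(5, 3, 7, 2))

# Sum read counts within each sample x cluster combination
merged <- aggregate(readCount ~ sample + fusionClusterId,
                    data = fusionRows, FUN = sum)
merged
```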

mergedClusterCounts <- mergeCountsByCluster( clusterAnnotatedFusions )
knitr::kable(head(mergedClusterCounts, 10))

Write out the fusion clusters file.

mergedClusterCountsFile <-
  paste0( "all.abra.sv.clustered.aggregated.", windowSize, ".tsv" )
mergedClusterCountsPath <- file.path( dataRoot, mergedClusterCountsFile )
write.table(mergedClusterCounts, file = mergedClusterCountsPath, sep="\t", row.names=FALSE )

Look at some cluster distributions

fusionDist <- table(mergedClusterCounts$fusionClusterId)
fusionDupDist <- table(fusionDist)
fusionDupDist
fusionDup <- fusionDist[fusionDist > 1]
fusionDup
for (clusterId in as.integer(unlist(dimnames(fusionDup)))) {
  print(clusterAnnotatedFusions[clusterAnnotatedFusions$fusionClusterId == clusterId,][1,])
}

Samples with probable false positives

Selection is based on data from Michelle's spreadsheet, UNCseq_HPVcounts_20170106.xlsx, which is encrypted. My excerpt, cleaned of potential protected health information, is saved as UNCseq_HPVcounts_20170106_srj_extract.csv

samplesWithHpvReadsFilename <- "UNCseq_HPVcounts_20170106_srjPosExtractNoPHI.csv"
samplesWithHpvReadsFile <- file.path( dataRoot, samplesWithHpvReadsFilename )

print( paste0( "Loading ", samplesWithHpvReadsFile, ": ",
               file.info(samplesWithHpvReadsFile)$size, " bytes"))
# The extract is read as tab-delimited (sep= "\t") despite its .csv extension
samplesWithHpvReadsDF <- read.csv( samplesWithHpvReadsFile, header= TRUE,
                                 sep= "\t", stringsAsFactors= FALSE )
table(samplesWithHpvReadsDF$TumorOnlySequencing)
pattern <- "\\s+-\\s+.*$"
samplesWithHpvReadsDF$mainType <- gsub(pattern, "", samplesWithHpvReadsDF$CancerType)
str(samplesWithHpvReadsDF)
typeTable <- table(samplesWithHpvReadsDF$CancerType)
typeTable <- typeTable[order(typeTable)]
types <- unlist(dimnames(typeTable))
paste0( typeTable, " - ", types )
mainTypeTable <- table(samplesWithHpvReadsDF$mainType)
mainTypeTable <- mainTypeTable[order(mainTypeTable)]
mainTypes <- unlist(dimnames(mainTypeTable))
paste0( mainTypeTable, " - ", mainTypes )
selectHighBreast <- (samplesWithHpvReadsDF$mainType == 'Breast' & samplesWithHpvReadsDF$TUMOR > 9 )
sum(selectHighBreast)
selectHighBrain <- ( samplesWithHpvReadsDF$mainType == 'Brain/CNS' & samplesWithHpvReadsDF$TUMOR > 9 )
sum(selectHighBrain)
sampleNames <- samplesWithHpvReadsDF[ selectHighBrain | selectHighBreast, "StudyNumber" ]
length(sampleNames)
sampleNames <- unique(sampleNames)
length(sampleNames)
falsePosSampleNamesFile <- file.path(dataRoot, "falsePosSampleNames.txt")
write.table(sampleNames, falsePosSampleNamesFile, row.names = FALSE, col.names = FALSE )
file.info(falsePosSampleNamesFile)


jefferys/FusionClust documentation built on May 22, 2019, 2:39 p.m.