Does viral counting differ between original (bwa) and re-aligned (abra)?
What features might be associated with false positives, and how do these differ between expected false positives, unknowns with similar count levels to expected false posities, and susptected true positives with high count levels?
[TODO]
[TODO]
Counting reads aligned to viruses using bedtools coverage summary. Count the tumor and normal separately to look for tumor-specific effects, but these samples are all from a tumor-only pipeline, so normals are not really applicable, except as a whole.
Set up to load the virus count file
library(FusionClust) virusCountName <- "all.tumor.virus.counts" dataRoot <- "/srj/Projects/HpvIntegration" virusCountFile <- file.path( dataRoot, virusCountName ) print( paste0( "Loading ", virusCountFile, ": ", file.info(virusCountFile)$size, " bytes"))
Loading the virus count file
virusCounts <- FusionClust::readVirusCounts(virusCountFile) print( paste0( "Loaded virus counts for ", nrow(virusCounts), " samples X ", ncol(virusCounts), " viruses." )) knitr::kable( t( head( virusCounts, 3 ))) print("Max counts, over all samples") knitr::kable( as.data.frame(apply( virusCounts, 2, max )))
Integration sites are generated by abra during reassembly of exome capture alignment data. Abra will detect viral/human fusion points if:
These are saved into the file "abra.sv.txt". Results from multiple samples can be concatenated if a leading "sample" column is added. Doing this is left as an exercise for the reader, but on Linux a good start is from a shared parent directory run
egrep "^" **/abra.sv.txt > all.sv.txt
library(FusionClust) abraSvFileName <- "all.abra.sv.txt" fusionFile <- file.path( dataRoot, abraSvFileName ) print( paste0( "Loading ", abraSvFileName, ": ", file.info(fusionFile)$size, " bytes"))
fusions <- FusionClust::readAbraSv(fusionFile) print( paste0( "Loaded ", nrow(fusions), " fusions from ", length(unique(fusions$sample)), " samples.")) knitr::kable(head(fusions))
Due to symmetric sequence at the junction point, it is often difficult to determine exactly where a fusion occurs, i.e if one end is on chr1 as ...AGG~ and the other is on chr2 as ~GGT..., where is the fusion ~? The sequence found is ...AGGT..., so is it A~GGT, AG~GT, or AGG~T. No way to know which G's were kept from chr1 and which from chr2.
We will ignore the details of the above and report all fusions within some agglomerative window as the same fusion, named like:
<chr>:<low>-<high>:<strand>~<chr>:<low>-<high>:<strand>
The strandedness is often unknown, but the relative "strandedness" of a fusion is usually known, either as parallel or anti-parallel, so
The window size depends on the length of the feature we care about locating fusions with respect to. Without additional input, starting with the size of an average gene, which is about 50k.
Before we choose a window, though, lets look at the distribution of gaps between successive fusion end points. This will give a feel for where good clusting gaps might be. To do this we'll just merge both ends to one list of chr/pos, then sort, then calculate the differences between each successive position in each chr separately, then plot these differences. This is not perfect as it can include a short deletion on the same chromosome as successive fusion points, but that might be good anyway.
gaps <- FusionClust::fusionGaps(fusions) fusedChromosomes <- names(gaps) print( paste( fusedChromosomes, sapply(gaps, length), sep=" - "))
chrFusionEndNames <- c( paste0("chr", 1:22), "chrX", "chrY", "chrMT" ) chrFusionEnds <- gaps[chrFusionEndNames] length(chrFusionEnds) chrBreaks <- unlist(chrFusionEnds) names(chrBreaks) <- NULL hist(chrBreaks, breaks = 200) hist(chrBreaks[chrBreaks > 10 & chrBreaks < 1e6], breaks = 200) hist(chrBreaks[chrBreaks > 100 & chrBreaks < 1e6], breaks = 200) hist(chrBreaks[chrBreaks > 1000 & chrBreaks < 1e6], breaks = 200) hist(chrBreaks[chrBreaks > 10000 & chrBreaks < 1e6], breaks = 200) hist(chrBreaks[chrBreaks > 10000 & chrBreaks < 50000], breaks = 200) hist(chrBreaks[chrBreaks > 100000 & chrBreaks < 1e6], breaks = 200)
Probably could use a window of 20K, but 50k is not far wrong.
Lets look at the window sizes in the viruses:
setdiff(fusedChromosomes, chrFusionEndNames) nonChrFusionEnds <- gaps[setdiff(fusedChromosomes, chrFusionEndNames)] chrBreaks <- unlist(nonChrFusionEnds) names(chrBreaks) <- NULL hist(chrBreaks, breaks = 20) hist(chrBreaks[chrBreaks > 1 & chrBreaks < 50], breaks = 20) hist(chrBreaks[chrBreaks > 2 & chrBreaks < 50], breaks = 20) hist(chrBreaks[chrBreaks > 4 & chrBreaks < 50], breaks = 20) hist(chrBreaks[chrBreaks > 10 & chrBreaks < 1e4], breaks = 20) hist(chrBreaks[chrBreaks > 10 & chrBreaks < 100], breaks = 20) hist(chrBreaks[chrBreaks > 100 & chrBreaks < 1e4], breaks = 20) hist(chrBreaks[chrBreaks > 1000 & chrBreaks < 1e4], breaks = 20)
Looks like an integration size of 5 in the short chr
ends <- sortFusions( fusions ) knitr::kable(head(ends))
Clustering end by window, creating numbered position-ranges that merge nearby fusion points, and reporting the original fusion list annotated with the clustering info.
windowSize <- 100000 clusteredEnds <- clusterFusionEnds(ends, window= windowSize) str(clusteredEnds) knitr::kable(head(clusteredEnds, 5)) knitr::kable(tail(clusteredEnds, 5))
Looking at groupings
sum(clusteredEnds$clusterLow != clusteredEnds$clusterHigh) diffs <- clusteredEnds$clusterHigh - clusteredEnds$clusterLow selectRow <- (diffs == max(diffs)) knitr::kable( head( clusteredEnds[selectRow, ] )) selectRow <- (clusteredEnds$clusterLow != clusteredEnds$clusterHigh) knitr::kable( head( clusteredEnds[selectRow, ] ))
Now that we have fusion end ranges, report fusions by cluster.
clusterAnnotatedFusions <- clusterFusions(fusions, window= windowSize) knitr::kable(head(clusterAnnotatedFusions, 10)) clusterAnnotatedFusionsFile <- paste0("all.abra.sv.clustered.", windowSize, ".tsv") clusterAnnotatedFusionsPath <- file.path( dataRoot, clusterAnnotatedFusionsFile ) write.table(clusterAnnotatedFusions, file = clusterAnnotatedFusionsPath, sep="\t")
Ok, now that we have fusion clusters identified (single and double ended) lets merge counts within each sample by fusion id
mergedClusterCounts <- mergeCountsByCluster( clusterAnnotatedFusions ) knitr::kable(head(mergedClusterCounts, 10))
Write out the fusion clusters file.
mergedClusterCountsFile <- paste0( "all.abra.sv.clustered.aggregated.", windowSize, ".tsv" ) mergedClusterCountsPath <- file.path( dataRoot, mergedClusterCountsFile ) write.table(mergedClusterCounts, file = mergedClusterCountsPath, sep="\t", row.names=FALSE )
Look at some cluster distributions
fusionDist <- table(mergedClusterCounts$fusionClusterId) fusionDupDist <- table(fusionDist) fusionDupDist fusionDup <- fusionDist[fusionDist > 1] fusionDup for (clusterId in as.integer(unlist(dimnames(fusionDup)))) { print(clusterAnnotatedFusions[clusterAnnotatedFusions$fusionClusterId == clusterId,][1,]) }
Basing selection on data from michelles' spreadsheet: UNCseq_HPVcounts_20170106.xlsx, which is encrypted. My excerpt, clean of potential protected health information, is saved as UNCseq_HPVcounts_20170106_srj_extract.csv
samplesWithHpvReadsFilename <- "UNCseq_HPVcounts_20170106_srjPosExtractNoPHI.csv" samplesWithHpvReadsFile <- file.path( dataRoot, samplesWithHpvReadsFilename ) print( paste0( "Loading ", samplesWithHpvReadsFile, ": ", file.info(samplesWithHpvReadsFile)$size, " bytes")) samplesWithHpvReadsDF <- read.csv( samplesWithHpvReadsFile, header= TRUE, sep= "\t", stringsAsFactors= FALSE ) table(samplesWithHpvReadsDF$TumorOnlySequencing) pattern <- "\\s+-\\s+.*$" samplesWithHpvReadsDF$mainType <- gsub(pattern, "", samplesWithHpvReadsDF$CancerType) str(samplesWithHpvReadsDF) typeTable <- table(samplesWithHpvReadsDF$CancerType) typeTable <- typeTable[order(typeTable)] types <- unlist(dimnames(typeTable)) paste0( typeTable, " - ", types ) mainTypeTable <- table(samplesWithHpvReadsDF$mainType) mainTypeTable <- mainTypeTable[order(mainTypeTable)] mainTypes <- unlist(dimnames(mainTypeTable)) paste0( mainTypeTable, " - ", mainTypes ) selectHighBreast <- (samplesWithHpvReadsDF$mainType == 'Breast' & samplesWithHpvReadsDF$TUMOR > 9 ) sum(selectHighBreast) selectHighBrain <- ( samplesWithHpvReadsDF$mainType == 'Brain/CNS' & samplesWithHpvReadsDF$TUMOR > 9 ) sum(selectHighBrain) sampleNames <- samplesWithHpvReadsDF[ selectHighBrain | selectHighBreast, "StudyNumber" ] length(sampleNames) sampleNames <- unique(sampleNames) length(sampleNames) falsePosSampleNamesFile <- file.path(dataRoot, "falsePosSampleNames.txt") write.table(sampleNames, falsePosSampleNamesFile, row.names = FALSE, col.names = FALSE ) file.info(falsePosSampleNamesFile)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.