In anasrana/spector: Computes a measure for the signal roughnes for genomic sequencing files

# Lodaing packages
library(tidyverse)
library(stringr)
library(spector)
library(knitr)

knitr::opts_chunk$set(collapse = T, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)

theme_anas <- theme_set(theme_bw())
theme_anas <- theme_update(
  axis.text.x = element_text(size = rel(0.9)),
  axis.text.y = element_text(size = rel(0.9)),
  axis.title.x = element_text(size = rel(1)),
  axis.title.y = element_text(angle = 90, size = rel(1)),
  axis.ticks = element_line(colour = "grey90"),
  legend.key = element_rect(colour = "grey99", size = 0.1),
  legend.background = element_blank(),
  plot.background = element_blank(),
  panel.background = element_blank(),
  panel.border = element_rect(fill = NA, colour = "grey50"),
  panel.grid.major = element_line(colour = "grey90", size = 0.2),
  panel.grid.minor = element_line(colour = "grey98", size = 0.5),
  strip.text = element_text(colour="#DBDBDB"),
  strip.background = element_rect(colour="black", fill='black', size = 0.2)
  )

Background

Recent advances in high-throughput sequencing of cancer genomes has elucidated the landscape of genetic alterations in many tumour types, thus expanding molecular classification of cancer and facilitating routine whole genome sequencing in clinical oncology.

In samples obtained by high throughput sequencing we find aberrations in the coverage. It is difficult to elucidate the source of such aberrations, as they can stem from any part of the sequencing process or can be due to biological factors.

All whole genome sequencing (WGS) data contains noise the challenge is identifying the limits of this acceptable noise. One the one hand we have acceptable levels of noise which can be used further detailed study of a sample, on the other we have samples or regions along the genome of a sample that contain significant aberrations that make further study difficult.

Aim

Understanding the location and strength of aberrations is central to any further investigation as to it's source or simple quantification of aberrations per sample and along the sample. We have developed a Local Aberration Score (LAS) to quantify local aberrations for equally sized regions along the genome. The package also computes a sample wide Genome Integrity Metric (GIM) which is based on a comparison of LAS for a sample with a baseline computed as part of the package. The LAS should allow for comparison of whole genome sequence datasets (e.g. samples preserved with FFPE and FF protocols) as well as regions of particular concern.

Method

This package uses wavelets to decompose the signal into coefficients on different levels, these are used to set a threshold between acceptable noise and strong aberrations. The algorithm simple segments the genome into regions of equal size and computes a LAS for each region. It can also be used in a targeted fashion by supplying a custom bed file.

The default regions used in spector are based on the genome in a bottle (giab) project (more details). The regions are further subset to identify reliable regions using ReliableGenome.

First run

To run spector the first time all we need it is access to a bam file. There are two very basic bam files provided with the package. The files are sample1.bam and sample2.bam. To obtain the path of these files we can use the function spector_sample().

The best way to obtain QC results from spector is to use the function spector_qc(). With default settings the default region size $= 2^{13}$. We notice that the package provides a progress update while running, which is especially useful for larger files.

bam_f <- spector_sample("sample1.bam")
results_qc <- spector_qc(f_bam = bam_f)

Region size

We can specify region sizes when running spector_qc() using the region_size variable. This variable should be a power of $2$ if it isn't it is coerced to the largest power of $2$ which is smaller than < region_size inside the package.

bam_f <- spector_sample("sample1.bam")
results_qc <- spector_qc(f_bam = bam_f, region_size = 2^16)

All files in folder

It is also possible to run spector on all bam files contained inside a folder. All we need to do is to set the variable f_bam = as the path to the folder containing the bam files. We can also specify a custom bed file, we select one supplied with the package corresponding to the bam files in the package.

The package will first compute chromosomal overlaps between the bed file and the bam file and only compute LAS values for regions in chromosomes contained in both the bam and the bed file.

bam_folder <- spector_sample("")
basic_path <- spector_sample("basic.bed")

results_qc <- spector_qc(f_bam = bam_folder, f_bed = basic_path)

Parameter file

The final way to pass bam files to spector is to use a parameter file. The file needs to have a very specific structure.

par_file <- spector_sample("sample_id_2.txt")

cat(readChar(par_file, 1e5))

The required format is a tab-delimited file with three columns. The first contains the path to the bam file relative to the current working directory, it is recommended to use the absolute path to avoid ambiguity. The second column contains a unique sample id. The third column contains any further description of the sample type. Any further columns will be ignored. There should be no header in the parameter file, if there is it should start with a #. To read this parameter file spector has the function read_par_file().

read_par_file(par_file) %>% 
  kable()

Results

The results of a spector_qc run are a tbl_df object, which is part of the dplyr package. It contains an id variablec which encodes the genomic region in standard format the LAS values and any further sample specific parameters passed to the method.

glimpse(results_qc)

Compare results

A simple way to visualise the results is to plot comparative boxplots, as shown in this example.

results_qc %>% 
  ggplot(aes(x = id_bam, y = las)) +
  geom_boxplot()

This is not always satisfactory depending on the application. For example in this example the samples contain different chromosomes. We could do a chromosome specific comparison to get a more fine-grained picture.

results_qc %>% 
  ggplot(aes(x = id_bam, y = las, fill = chrom)) +
  geom_boxplot() +
  facet_grid(.~chrom)

In this case this shows us that the samples are different only in chromosome 1 and chromosome 2. This is because the sample bam files originate from the same source file with different sub-setting.

anasrana/spector documentation built on May 14, 2019, 2:36 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com