knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "README-" )
An R package designed to facilitate generating reliable, full-length phase-defined reference sequences for novel HLA and KIR alleles.
Note, that this package is still maturing. There's no guarantee yet that the API or the under-the-hood workings of the package won't change substantially
This package is only available on GitHub for now. It depends on a local installation
of samtools
, bwa
(>= 0.7.11), python and the pysam library, and a C++11
compliant compiler.
For longreads, an alternative mapper with better results is minimap2
, the successor
of bwa
For installation over git via SSH, you need to have installed libssh2-1 and libssh2-1-dev prior to installing git2r/devtools or reinstall the package afterwards.
install.packages("devtools") # if not already installed library("devtools") devtools::install_github("gschofl/DR2S)
DR2S
is designed to integrate longread HLA and KIR data (e.g., PacBio or Oxford
Nanopore sequences) and shortread shotgun data (Illumina). It is also possible
to run DR2S
in "longread-only mode", but don't expect reference-quality
consensus sequences that way.
As input, we expect longread and, optionally, shortread FASTQ files to be placed in
separate subdirectories within a working directory and to follow the naming
convention SAMPLEID_LOCUS_.*.fastq(.gz)?
.
SAMPLEID
can be any arbitrary unique identification code and LOCUS
should
be one of A
, B
, C
, DQB1
, DRB1
, or DPB1
for HLA, or one of 2DL1
,
2DL2
, 2DL3
, 2DL4
, 2DL5A
, 2DL5B
, 2DP1
, 2DS1
, 2DS2
, 2DS3
,
2DS4
, 2DS5
, 3DL1
, 3DL2
, 3DL3
, 3DP1
, or 3DS1
for KIR.
All output is placed in a directory tree output\LOCUS\SAMPLEID\
.
An example:
~/dr2s_data | +-- pacbio | |-- ID12912701_DPB1_lbc23.fastq.gz | | +-- illumina | |-- ID12912701_DPB1_S23_L001_R1_001.fastq.gz | |-- ID12912701_DPB1_S23_L001_R2_001.fastq.gz | +-- output | +-- DPB1 | +-- ID12912701
Once all input files are put in place a DR2S analysis is started with a call
to the functions createDR2SConf()
and InitDR2S()
:
## a minimal example: x <- InitDR2S(createDR2SConf( sample = "ID12912701", locus = "DPB1", longreads = list(dir = "pacbio", type = "pacbio", mapper = "minimap"), shortreads = list(dir = "illumina", type = "illumina", mapper = "bwamem"), datadir = "~/dr2s_data", outdir = "~/dr2s_data/output" ))
sample
: A unique sample identifier. The FASTQs associated with a sample need
need to be prefixed with this identifier.locus
: One of the allowed HLA and KIR loci above. If allele information for
a sample is available it can be specified as, e.g. DPB1*04:02:01:01
.
In this case this allele will be used as a reference against which an initial
mapping of the longreads is performed. If this information is not given a
generic locus-specific reference is used. NOTE: generic references are not yet
implemented for KIR.longreads
: The location, type, and mapper for longreads as a named list
with the fields dir
, type
("pacbio" or "nanopore") and mapper
("bwamem" or "minimap").shortreads
: (optional) The location, type, and mapper for shortreads as a
named list with the fields dir
, type
("illumina") and mapper
("bwamem"
or "minimap").datadir
: The data directory (see above).outdir
: The output directory (see above).reference
: (optional) Path to a fasta file containing the reference sequence.details
: (optional) Named list of sample metadata. These data will be
included in the fasta headers of the final sequences and stored in the
config file.opts
: (optional) Named list of arguments to the DR2S pipeline steps. They
will be stored in the config file. See below for a detailed descriptions of
options that control the DR2S pipeline.This call generates an R6
object of class DR2S
that encapsulates all data
and methods for all subsequent analysis steps.
An analysis proceeds in a number of steps that can be chained together
using the pipe %>%
:
x %>% mapInit() %>% partitionLongreads() %>% mapIter() %>% partitionShortreads() %>% mapFinal() %>% polish() %>% report() %>% cache()
Alternatively, the complete pipeline can be run in one go:
x$runPipeline()
The individual steps perform the following analyses:
mapInit()
:partitionLongreads()
:mapIter()
: partitionShortreads()
:mapFinal()
:polish()
:report()
:Throughout this process a number of diagnostic plots are produced and placed in the output directory for later inspection.
While this process works remarkably well, there are situations where alignment artefacts or plain bad luck may introduce errors in the final consensus sequences. You should never accept the result as ground truth without some manual and visual consistency checks!
DR2S
provides some facilities to aid checking and signing-off of finalised
consensus sequences.
A typical post-processing workflow may look as follows:
plot_diagnostic_alignment(x) run_igv(x, 3000) check_alignment_file(x) refineAlignment(x, "A") run_igv(x, "refine") report_checked_consensus(x)
plot_diagnostic_alignment: Displays an alignment of three preliminary long-read-based consensus sequences and the final short-read-based consensus sequence for all alleles in your browser. This can be used to spot inconsistencies between the long-read and the short-read evidence.
run_igv: Opens an instance of the IGV Genome Browser for each haplotype at a specified position (one for each allele) displaying both the long read and short read data for manual inspection.
check_alignment_file : Opens a pairwise or multiple alignment of the final consensus sequences in your text editor. Use this to perform any manual edits on the consensus sequences. Editor options are: "subl", "gvim" and "gedit". Defaults to systems standard editor.
report_checked_consensus:
Export final consensus sequences from the edited pairwise or multiple alignment
as FASTAs into a separate subdirectory ./checked
in the output
directory.
DR2S
creates bash scripts for the convenient access to important postprocessing functions:
check_alignment_file
command.A more fine-grained control of the DR2S pipeline is available via a json
-based
config file. This config file can be created externally and read in using the
readDR2Sconf()
function. Alternatively the config file is created by the
createDR2Sconf()
function. Pipeline control options are set using the opts
argument in createDR2Sconf()
, e.g.:
conf <- createDR2SConf( sample = "ID12912701", locus = "A*01:01:01:01", longreads = list(dir = "pacbio", type = "pacbio", mapper = "minimap"), shortreads = list(dir = "illumina", type = "illumina", mapper = "bwamem"), datadir = "~/dr2s_data", outdir = "~/dr2s_data/output", opts = list( mapInit = list(topx = "auto", createIgv = FALSE), partitionLongreads = list(threshold = 1/5, noGapPartitioning = TRUE, selectCorrelatedPositions = TRUE, selectAllelesBy = "distance"), mapIter = list(iterations = 2), mapFinal = list(createIgv = FALSE), report = list(createIgv = FALSE) ))
The complete set of available options and their defaults are:
## ## mapInit() defaults #### ## mapInit = list( ## <includeDeletions>: include deletions in pileup. includeDeletions = TRUE, ## <includeInsertions>: include insertions in pileup. includeInsertions = TRUE, ## <callInsertionThreshold>: if <includeInsertions == TRUE>, an insertion ## needs to be at frequency <callInsertionThreshold> for it to be included ## in the pileup. callInsertionThreshold = 1/5, ## <microsatellite>: if <pipeline == "SR">, perform a second mapping of ## shortreads to the inferred reference. Set to TRUE if you suspect ## microsatellites or repetitive regions in your sequence. This extends ## the reference to a maximum length and enables a better mapping. microsatellite = FALSE, ## <forceMapping>: set to TRUE if you want to force processing of "bad" ## shortreads when the distribution of coverage is heavily unequal. ## Aborts the program if maximum coverage > 75 % quantile * 5. forceMapping = FALSE, ## <minMapq>: don't filter longreads for mapping quality unless specified. ## NOTE: for shortreads <minMapq = 50> is hardcoded. minMapq = 0, ## <topx>: pick the x top-scoring reads. Set to an integer value to pick ## a fixed number of reads. Set to "auto" to use a dynamically determined ## number of reads to be selected. topx = FALSE, ## <pickiness>: if <topx == "auto">: <pickiness < 1>: bias towards higher ## scores/less reads; <pickiness > 1>: bias towards lower scores/more reads pickiness = 1, ## <increasePickiness>: if <topx == "auto">: increase pickiness for the ## second iteration of LR mapping increasePickiness = 1, ## <lowerLimit>: if <topx == "auto"> or <topx > 0>: the minimum number ## of reads to pick if available. lowerLimit = 200, ## <updateBackgroundModel>: estimate the indel noise in a pileup and use ## this information to update the background model for PWM scoring updateBackgroundModel = FALSE, ## <createIgv>: subsample bam files for visualisation with IgvJs in the ## DR2S shiny app. createIgv = TRUE, ## <plot>: generate diagnostic plots. plot = TRUE ) ## ## partitionLongreads() defaults #### ## partitionLongreads = list( ## Threshold to call a polymorphic position. A minority nucleotide frequency ## below this threshold is considered noise rather than a valid polymorphism. threshold = 1/5, ## The expected number of distinct alleles in the sample. This should be 2 ## for heterozygous samples, 1 for homozygous samples may be >2 for some ## KIR loci. distAlleles = 2, ## The minumum frequency of the gap character required to call a gap position. skipGapFreq = 2/3, ## Don't partition based on gaps. Useful for samples with only few SNPs but ## with homopolymers. The falsely called gaps could mask the real variation. ## Set to override global default. noGapPartitioning = TRUE, ## Correlate polymorphic positions and cluster based on the absolute ## correlation coefficient. Extract positions from the cluster with the ## higher absolute mean correlation coefficient. This gets rid of positions ## that are not well distributed across the two alleles. selectCorrelatedPositions = FALSE, ## if <selectCorrelatedPositions> == TRUE, use <measureOfAssociation> ## ("cramer.V" or "spearman") to determine linkage between all polymorphic ## positions. measureOfAssociation = "cramer.V", ## We perform an equivalence test on clusters of polymorhic positions: ## Calculate the lower 1-sigma bound of the high-association cluster i. ## Calculate the upper 1-sigma bound of the low-association cluster j. ## Reject the clusters, if this bounds overlap by more than <proportionOfOverlap> ## of the average distance (dij) between clusters. proportionOfOverlap = 1/3, ## By how much do we expect 2 clusters to minimally differ in mean Cramér's V. ## BIC-informed model-based clustering tends to split rather than lump ## and this is a heuristical attempt to forestall this. minimumExpectedDifference = 0.06, ## If more than <distAlleles> clusters are found select clusters based on: ## (1) "distance": The hamming distance of the resulting variant consensus ## sequences or (2) "count": Take the clusters with the most reads as the ## true alleles. selectAllelesBy = "distance", ## Minimum size of an allele cluster minClusterSize = 20, ## When selecting reads from allele clusters using a dynamic threshold: ## pickiness < 1: bias towards higher scores/less reads ## pickiness > 1: bias towards lower scores/more reads pickiness = 1, ## When selecting reads from allele clusters the minimum number of ## reads to pick if available. lowerLimit = 40, ## Generate diagnostic plots. plot = TRUE ) ## ## mapIter() defaults #### ## mapIter = list( ## Number of <mapIter> iterations. How often are the ## clustered reads remapped to updated reference sequences. iterations = 1, ## Minimum occupancy (1 - fraction of gap) below which ## bases at insertion position are excluded from from consensus calling. columnOccupancy = 2/5, ## an insertion needs to be at frequency <callInsertionThreshold> for it ## to be included in the pileup. callInsertionThreshold = 1/5, ## Generate diagnostic plots. plot = TRUE ) ## ## mapFinal() defaults #### ## mapFinal = list( ## include deletions in pileup. includeDeletions = TRUE, ## include insertions in pileup. includeInsertions = TRUE, ## an insertion needs to be at frequency <callInsertionThreshold> for it ## to be included in the pileup. callInsertionThreshold = 1/5, ## (for shortreads only) trim softclips and polymorphic ends of reads before ## the final mapping trimPolymorphicEnds = FALSE, ## Subsample bam files for visualisation with IgvJs in the ## DR2S shiny app. createIgv = TRUE, ## Generate diagnostic plots. plot = TRUE ) ## ## polish() defaults #### ## polish = list( ## Threshold to call a polymorphic position. Set to override global default. threshold = NULL, ## Check the number of homopolymer counts. Compare the resulting sequence ## with the mode value and report differences. checkHpCount = TRUE, ## The minimal length of a homopolymer to be checked. hpCount = 10 ) ## ## report() defaults #### ## report = list( ## Maximum number of sequence letters per line in pairwise alignment. blockWidth = 80, ## Suppress remapping of reads against final consensus. noRemap = FALSE, ## Subsample bam files for visualisation with IgvJs in the ## DR2S shiny app. createIgv = TRUE )
An example config file:
{ "sampleId": "ID12912701", "locus": "A", "datadir": "/home/user/dr2s_data", "outdir": "/home/user/dr2s_data/A/ID12912701", "reference": "HLA-A*01:01:01:01", "longreads": { "dir": "pacbio", "type": "pacbio", "mapper": "minimap" }, "shortreads": { "dir": "illumina", "type": "illumina", "mapper": "bwamem" }, "pipeline": "SR", "opts": { "mapInit": { "includeDeletions": true, "includeInsertions": true, "callInsertionThreshold": 0.2, "microsatellite": true, "forceMapping": false, "minMapq": 0, "topx": false, "pickiness": 1, "increasePickiness": 1, "lowerLimit": 200, "updateBackgroundModel": false, "createIgv": false, "plot": true }, "partitionLongreads": { "threshold": 0.2, "distAlleles": 2, "skipGapFreq": 0.6667, "noGapPartitioning": true, "selectCorrelatedPositions": false, "measureOfAssociation": "cramer.V", "proportionOfOverlap": 0.3333, "minimumExpectedDifference": 0.06, "selectAllelesBy": "distance", "minClusterSize": 20, "pickiness": 1, "lowerLimit": 40, "plot": true }, "mapIter": { "iterations": 2, "columnOccupancy": 0.4, "callInsertionThreshold": 0.2, "plot": true }, "mapFinal": { "includeDeletions": true, "includeInsertions": true, "callInsertionThreshold": 0.2, "trimPolymorphicEnds": false, "createIgv": false, "plot": true }, "polish": { "checkHpCount": true, "hpCount": 10 }, "report": { "blockWidth": 80, "noRemap": false, "createIgv": false } }, "format": "json" }
If you want to find allele sequences only based on longreads you just need to set
shortreads = NULL
in createDR2SConf()
and skip the the partitionShortreads()
step in the DR2S pipeline.
x <- InitDR2S(createDR2SConf( sample = "ID12912701", locus = "DPB1", longreads = list(dir = "pacbio", type = "pacbio", mapper = "minimap"), datadir = "~/dr2s_data", outdir = "~/dr2s_data/output" )) %>% mapInit() %>% partitionLongreads() %>% mapIter() %>% mapFinal() %>% polish() %>% report() %>% cache()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.