We now provide a ready-to-run Docker container that includes the package and all prerequisites (see section Docker). Alternatively, you can follow the installation procedure below.
Numbat uses cellsnp-lite for generating SNP pileup data and eagle2 for phasing. Please follow their installation instructions and make sure their binary executables can be found in your $PATH.
Additionally, Numbat needs a common SNP VCF and phasing reference panel. You can use the 1000 Genome reference below:
# hg38 wget https://sourceforge.net/projects/cellsnp/files/SNPlist/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz # hg19 wget https://sourceforge.net/projects/cellsnp/files/SNPlist/genome1K.phase3.SNP_AF5e2.chr1toX.hg19.vcf.gz
# hg38 wget http://pklab.med.harvard.edu/teng/data/1000G_hg38.zip # hg19 wget http://pklab.med.harvard.edu/teng/data/1000G_hg19.zip
You can install the numbat R package via CRAN:
install.packages('numbat')
Alternatively, you can install the GitHub version:
devtools::install_github("https://github.com/kharchenkolab/numbat")
We provide a ready-to-run Docker container that includes the Numbat R package and all of its dependencies. You can launch it as follows:
docker run -v /work:/mnt/mydata -it pkharchenkolab/numbat-rbase:latest /bin/bash
where /work
is a local data folder you would like to access and write to. Within the container, cellsnp-lite
, eagle2
and samtools
are available in path and the 1000 Genome SNP VCF and phasing panel files (hg38) are stored under /data
. You can also launch R and run interactive analysis using Numbat (or run an R script).
numbat/inst/bin/pileup_and_phase.R
) to count alleles and phase SNPs.usage: pileup_and_phase.R [-h] --label LABEL --samples SAMPLES --bams BAMS [--barcodes BARCODES] --gmap GMAP [--eagle EAGLE] --snpvcf SNPVCF --paneldir PANELDIR --outdir OUTDIR --ncores NCORES [--UMItag UMITAG] [--cellTAG CELLTAG] [--smartseq] [--bulk] Run SNP pileup and phasing with 1000G Arguments: -h, --help show this help message and exit --label LABEL Individual label --samples SAMPLES Sample names, comma delimited --bams BAMS BAM files, one per sample, comma delimited --barcodes BARCODES Cell barcode files, one per sample, comma delimited --gmap GMAP Path to genetic map provided by Eagle2 (e.g. Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz) --eagle EAGLE Path to Eagle2 binary file --snpvcf SNPVCF SNP VCF for pileup --paneldir PANELDIR Directory to phasing reference panel (BCF files) --outdir OUTDIR Output directory --ncores NCORES Number of cores
For example, within the Numbat Docker container you can run the preprocessing script like this:
Rscript /numbat/inst/bin/pileup_and_phase.R \ --label {sample} \ --samples {sample} \ --bams /mnt/mydata/{sample}.bam \ --barcodes /mnt/mydata/{sample}_barcodes.tsv \ --outdir /mnt/mydata/{sample} \ --gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \ --snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \ --paneldir /data/1000G_hg38 \ --ncores ncores
This will produce a file named {sample}_allele_counts.tsv.gz
under the specified output directory, which contains cell-level phased allele counts. If multiple samples from the same individual was provided, there will be one allele count file for each sample. Other outputs include phased vcfs under phasing/
folder and raw pileup counts under pileup/
.
Prepare the expression data. Numbat takes a gene by cell integer UMI count matrix as input. You can directly use results from upstream transcriptome quantification pipelines such as 10x CellRanger.
Prepare the expression reference, which is a gene by cell type matrix of normalized expression values (raw gene counts divided by total counts). For a quick start, you may use a our HCA collection (ref_hca
) that ships with the package. If you have matched normal cells (ideally, of various cell type) from the same patient or dataset and would like to make your own references, you may use this utility function:
# count_mat is a gene x cell raw count matrices # cell_annot is a dataframe with columns "cell" and "group" ref_internal = aggregate_counts(count_mat, cell_annot)
In this example (ATC2 from Gao et al), the gene expression count matrix and allele dataframe are already prepared for you.
library(numbat) count_mat_ATC2 = readRDS(url('http://pklab.med.harvard.edu/teng/data/count_mat_ATC2.rds')) df_allele_ATC2 = readRDS(url('http://pklab.med.harvard.edu/teng/data/df_allele_ATC2.rds')) # run out = run_numbat( count_mat_ATC2, # gene x cell integer UMI count matrix ref_hca, # reference expression profile, a gene x cell type normalized expression level matrix df_allele_ATC2, # allele dataframe generated by pileup_and_phase script genome = "hg38", t = 1e-5, ncores = 4, plot = TRUE, out_dir = './test' )
Note: If you wish to use your own custom reference, please use the aggregate_counts
function as per the example in preparing data. You do not need to include the reference cells in count_mat
or df_allele
; only provide them as lambdas_ref
.
There are a few parameters you can consider tuning to a specific dataset.
CNV detection
t
: the transition probability used in the HMM. A lower t
is more appropriate for tumors with more complex copy number landscapes (from which you can expect more breakpoints) and is sometimes better for detecting subclonal events. A higher t
is more effective for controlling false-positive rates of CNV calls.gamma
: Overdispersion in allele counts (default: 20). For 10x data, 20 is recommended. Non-UMI protocols (e.g. SMART-Seq) usually produce noisier allele data, and a smaller value of gamma is recommended (e.g. 5). min_cells
: minimum number of cells for which an pseudobulk HMM will be run. If the allele coverage per cell is very sparse for a dataset, then I would consider setting this threshold to be higher.multi_allelic
: Whether to enable calling of multiallelic CNVsCNV filtering
min_LLR
: minimum log-likelihood ratio threshold to filter CNVs (default: 5). To ensure quality of phylogeny inference, we only use confident CNVs to reconstruct the phylogeny.max_entropy
: another criteria that we use to filter CNVs before phylogeny construction (default: 0.5). The entropy of the binary posterior quantifies the uncertainty of an event across single cells. The value can be from 0 to 1 where 1 is the least stringent.Phylogeny
tau
: Stringency to simplify the mutational history (0-1). A higher tau
produces less clones and a more simplified evolutionary history.Iterative optimization
init_k
: initial number of subclusters to use for the hclust
initialization. Numbat by default uses hierarchical clustering (hclust
) of smoothed expression values to approximate an initial phylogeny. This will cut the initial tree into k clusters. More clusters means more resolution at the initial stage for subclonal CNV detection. By default, we set init_k to be 3.max_iter
: maximum number of iterations. Numbat iteratively optimizes the phylogeny and copy number estimations. In practice, we find that results after 2 iterations are usually stable. check_convergence
: stop iterating if the results have converged (based on consensus CNV segments).Parallelization
ncores
: number of cores to use for single-cell CNV testingncores_nni
: number of cores to use for phylogeny inferenceFor cell line data and high-purity tumors, we recommend running the below procedure to identify and exclude regions of clonal deletions/LOH before running Numbat. To do so, aggregate all cells into a pseudobulk, and run the SNP-density HMM:
bulk = get_bulk( count_mat = count_mat, lambdas_ref = ref_hca, df_allele = df_allele, gtf = gtf_hg38, genetic_map = genetic_map_hg38 ) segs_loh = bulk %>% detect_clonal_loh(t = 1e-4)
Then pass segs_loh
to the Numbat run for those regions to be excluded from baseline during analysis.
out = run_numbat( ...., segs_loh = segs_loh )
A detailed vignette on how to interpret and visualize Numbat results is available:
- Interpreting Numbat results
Numbat generates a number of files in the output folder. The file names are post-fixed with the i
th iteration of phylogeny optimization. Here is a detailed list:
gexp_roll_wide.tsv.gz
: window-smoothed normalized expression profiles of single cellshc.rds
: initial hierarchical clustering result based on smoothed expressionexp_roll_clust.png
: visualization of single-cell smoothed gene expression profilesbulk_subtrees_{i}.tsv.gz
: subtree pseudobulk data based on current cell lineage treebulk_subtrees_{i}.png
: visualization of subtree pseudobulk CNV profiles segs_consensus_{i}.tsv.gz
: consensus segments from subtree pseudobulk HMMsbulk_clones_{i}.tsv.gz
: clone-leve pseudobulk data based on current cell lineage treebulk_clones_{i}.png
: visualization of clone pseudobulk CNV profilesbulk_clones_final.tsv.gz
: clone-leve pseudobulk data based on final cell lineage treebulk_clones_final.png
: visualization of final clone pseudobulk CNV profiles exp_post_{i}.tsv
: single-cell expression posteriorsallele_post_{i}.tsv
: single-cell allele posteriorsjoint_post_{i}.tsv
: single-cell joint posteriorsclone_post_{i}.tsv
: single-cell clone assignment and tumor versus normal classification posteriorstree_list_{i}.rds
: list of candidate phylogeneies in the maximum likelihood tree searchpanel_{i}.png
: integrated visualization of single-cell phylogeny and CNV landscapeThe CNV detection power of Numbat can be further enhanced by conducting population-based phasing with a larger and more diverse reference panel (i.e. reference haplotypes). The default pipeline above uses the 1000 Genome panel, which contains genotypes from 2,504 individuals. Larger reference panels include:
gnomAD HGDP + 1KG panel (n=4,099). You can download the reference files using gsutil: gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes
.
TOPMed panel (n=97,256). You can upload your VCF to the TOPMed imputation server.
Q: We expect a certain CNV to be in a particular sample, but it is not detected by Numbat.
A: In general, there are three scenarios that a CNV is not called.
The CNV is not called in pseudobulk HMM. You can check bulk_subtrees_*.png
to see if the event was called. If it was not, I would suggest trying to find a better expression reference to reduce the noise in logFC, or changing the parameters used to configure the HMM (gamma
, t
, etc). Sometimes the CNV is very subclonal and was not isolated as a pseudobulk by the initial clustering. You can see if this was the case by looking at the exp_roll_clust.png
and if so, try increasing k_init
.
CNV is called in pseudobulks but did not appear in segs_consensus_*.tsv
and joint_post_*tsv
. This is because the event is filtered out due to low LLR. Lowering min_LLR
threshold will help.
CNV is called in pseudobulks, is present in segs_consensus_*.tsv
and joint_post_*tsv
, but did not appear in phylogeny. This is because the event is filtered out due to high entropy (weak evidence) in single cells. Raising max_entropy
threshold will help. You can check the entropy of specific events in joint_post_*tsv
in the column avg_entropy
.
Q: Numbat predicts most of the genome to be amplified, and there are gaps in the pseudobulk allele profile with no heterozygous SNPs.
A: This is a sign that the sample consists mostly of tumor (with very few normal cells), and clonal deletions are present. The heterozygous SNPs cannot be identified in these regions, therefore leaving gaps. You can identify these clonally deleted regions via the SNP-density HMM prior to running the main run_numbat
workflow (see section Detecting Clonal LOH).
Q: Does Numbat run on mouse data?
A: In principle, Numbat should work for a hybrid F1 mouse bred from two known lab mice strains. Since the parental haplotypes are known, you can just supply a VCF containing heterozygous sites and run cell-snp-lite
to create the allele count dataframe.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.