knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
CCLid
(Cancer Cell Line identification) is designed as a toolkit to address the lack of a publicly available resource for genotype-based cell line authentication. We developed this resource to allow for genotype-matching of any given cancer cell line to the 1,497 unique cancer cell lines found in the GDSC, CCLE or gCSI datasets. Using the B-allele frequencies (BAFs) for all SNPs found in common between the input and reference datasets, this tool will allow for a genotype matching operation that trains and uses a logistic model to calculate the probability of the best cell line matches.
If multiple cell lines are found to be genetically similar to the input cell line, there are several conclusions that can be reached: 1) the cell lines in the reference datasets are derivatives/clones of each other or originate from the same patient or 2) the input cell line is cross-contaminated. In the case of cross-contamination, we included a function to use NMF with a prior matrix containing all matching cell line BAFs to deconvolute the relative contribution of each cell line
As BAFs can be viewed as an indirect representation of karyotype (i.e. AB has a 0.5/0.50 BAF,while AAB would have a 0.66/0.33 BAF), we developed this tool to scan for segments of the genome that are significantly different between genotypically "matching" cell lines. This function allows for the inference of karyotypically discordant regions, hence representing what is likely to be genetic drift.
CCLid
requires that several packages are installed. However, all
dependencies are available from CRAN or Bioconductor.
library(devtools) devtools::install_github('bhklab/CCLid')
Load CCLid
into your current workspace:
library(CCLid)
CCLid
has been tested on the Cent OS 6 platforms. The packages
uses the core R package parallel
to preform parallel computations, and
therefore if parallelization is desired, the dependencies for the parallel
package must be met.
The reference BAF matrix for all cell lines (2,887) and all SNPs (507,452) can be downloaded using the loadRef()
function. The RDS file that gets downloaded (bafs-matrix.rds.gz
) and extracted from this function requires a total space of 4.1gb to process, and 1.9gb once intermediary files are removed. A variance estimate for each SNP in a set [bin.size]
will be generated and stored in the same folder for future loadings ``.
refdir <- '~/cclid' metadata <- c("meta.df", "affy.omni", "cin70", "gne.meta", "melt.cells", "snp6.dat") sapply(metadata, downloadRefCCL, saveDir=refdir, verbose=verbose) ref_dat <- CCLid::loadRef(PDIR=refdir, analysis='baf', bin.size=5e5, just.var=TRUE)
To save on memory, computing resources, and time for future analyses, a pre-computed bigmemory .bin
and .desc
file can be stored and passed into PDIR
argument. These files can either be created using the bigmemory package, or downloaded from: https://zenodo.org/deposit/3891805. The function will use the pre-existing bigmemory objects rather than download and calculate variance estimates.
CCLid
requires a VCF as the input for all future analyses. An example of the vcf can be downloaded and viewed at https://pmgenomics.ca/pughlabwiki/lib/exe/fetch.php?media=data:public_data:a549.sample_id.vcf.gz or stored in the CCLid R package folder
path_to_vcf = file.path(system.file(file.path("extdata"), package="CCLid"), "A549_trim.vcf.gz") vcf_map <- mapVcf2Affy(path_to_vcf)
This vcf_map
object will be the main input for all future analysis. Alternatively, wrappers exist that run the entire analyses from start to end using the path_to_vcf
variable. Both methods will be shown in their respective sections.
The goal of this section is to read in the input VCF file and subset it for SNPs that overlap with the reference datasets. The compareVcf
function works by reading in the path_to_vcf
variable, mapping the SNPs to the reference dataset matrix, filtering out non-informative SNPs, keeping the Ref/Alt allele consistent with the reference SNP probesets, and finally reducing the dataset to a certain number of SNPs for memory purposes. A max.snps
parameter is used to limit the memory footprint of this process and it is recommended that this value is greater than 100 SNPs to maximize detection.
path_to_vcf = file.path(system.file(file.path("extdata"), package="CCLid"), "a549.sample_id.vcf") vcf_mat <- compareVcf(path_to_vcf, var.dat=ref_dat$var, ref.mat=ref_dat$ref, max.snps=200, snp6.dat=snp6.dat)
This compareVcf
function returns a matrix of BAf values for the SNPs that overlap the input VCF. For example:
RNA_varscan2_KMS11 201T 22RV1 GDSC_23132-87 42-MG-BA SNP_A-2241506 1.00 0.97 0.99 0.50 0.95 SNP_A-8383579 0.49 -0.05 0.56 0.60 0.01 SNP_A-8318107 0.37 1.01 0.01 0.97 0.98 SNP_A-2206183 0.99 0.05 0.62 1.13 0.52 SNP_A-8311452 1.00 1.02 0.57 0.47 0.01
The goal of this section is to build a logistic regression model and test the fit of the input VCF. With the newly created vcf_mat
matrix, the checkForConcordance
function will create a logistic regression model on the reference dataset by splitting the data into cell lines with Matching annotations (M) and cell lines with Non-Matching annotations (NM). The euclidean distance between the BAFs for SNPs that overlap the input VCF for each pair of cell lines are calculated and used to train the model. Using this trained model, the input sample will be used to predict either M or NM given the probability of being a NM. This function will return a list that is split into these two categories.
colnames(vcf_mat)[1] <- sample_name pred <- checkForConcordance(x.mat=vcf_mat, sampleID=sample_name, rm.gcsi = FALSE, meta.df=meta.df)
With the results looking as followed:
## Matching samples > head(pred$pred$M) Var1 Var2 baf baf.fit baf.p.fit z p q NM.2627 CCLE_KMS-11 varscan2_KMS11 0.06557439 0.005653928 M -2.924280 0.003452542 0.8503905 NM.423 KMS-11 varscan2_KMS11 0.10295630 0.007490657 M -2.817130 0.004845484 0.8503905 ## Non-matching samples > head(pred$pred$NM) Var1 Var2 baf baf.fit baf.p.fit z p q NM.2649 CCLE_HSC-2 varscan2_KMS11 0.7485987 0.5009832 NM -0.9664978 0.3337952 0.955501 NM.1845 CCLE_NCI-H2085 varscan2_KMS11 0.7497333 0.5031317 NM -0.9632456 0.3354243 0.955501
The goal of this section is to the look for genetic drift segments between the input sample and the matching cell lines from the reference dataset. There are 3 main parts to do this: 1) Isolate for cell lines with the same genotype, 2) Re-assemble the BAF matrix with all possible SNPs, and 3) Scan for genetic drift.
# Get all isogenic cell line IDs all.ids <- unique(unlist(pred$pred$M[,c('Var1', 'Var2')]))
vcf_mat
matrix. However, this time we do not need to worry about memory usage as much so the max.snps
parameter is omitted and the default will be to use all matching SNPs. This is needed to increase the density of SNPs across the genome, allowing for a more robust segmentation.# Subset the VCF matrix for just the cell lines of interest vcf_mat <- compareVcf(vcfFile, var.dat=ref.dat$var, ref.mat=ref.dat$ref, ids=all.ids, snp6.dat=snp6.dat) ## starts at 2.5g, ramps up to 6.8Gb
bafDrift
function which will go through each sample in the vcf_mat
matrix, and do an all by all comparison.# Drift estimation bdf <- tryCatch({ bafDrift(vcf_mat, centering='median', snp6.dat=snp6.dat) }, error=function(e){NULL})
The bafDrift
function will return a list containing frac
and cna.obj
lists.
The bdf$frac
returns the fraction of the genome found drifted for each value of t
The bdf$cna.obj
returns a list composed of the SNPs that populate this analysis in the a data
dataframe. The output
dataframe will contain the segmented BAF-difference where seg.mean
is the absolute difference in BAF, and the seg.z
is the z-score calculated based on the mean and sd across the entire genome. The t
column is a rounded version of the seg.z
meant to be used as a threshold.
$output ID chrom loc.start loc.end num.mark seg.mean arm seg.sd seg.z t seg.diff 1 CCLE_KMS-11 chr6 72899530 72899530 1 0.02 q NA NA NA NA 2 CCLE_KMS-11 chr14 107142448 107142448 1 -0.02 q NA NA NA NA 3 CCLE_KMS-11 chr22 17414640 22673639 2 0.00 q 0.021 0 0 0
The bdf$cna.obj
list elements contains objects of the CCLid class, which can be plotted using:
plot(bdf$cna.obj[[1]])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.