We developed the FLCNA method based on a fused lasso model to detect copy number aberrations (CNAs) and identify subclones simultaneously. To capture the biological heterogeneity between potential subclones, we developed the FLCNA method which is capable of subcloning, and simultaneously detecting breakpoints with scDNA-seq data. First, procedures including quality control (QC), normalization, logarithm transformation are used for pre-processing of the datasets. Subclone clustering is achieved based on a Gaussian Mixture Model (GMM), and breakpoints detection is conducted by adding a fused lasso penalty term to the typical GMM model. Finally, based on these shared breakpoints in each cluster, candidate CNA segments for each cell were clustered into different CNA states using a GMM-based clustering strategy. The framework of the FLCNA method is summarized and illustrated in Figure 1.

library(devtools) install_github("FeifeiXiao-lab/FLCNA")
For public data from NCBI SRA, starting with Sequence Read Archive (SRA) files, FASTQ files can be generated with Fastq-dump from SRA-Toolkit, and then aligned to NCBI hg19 reference genome and converted to BAM files. For the 10× Genomics datasets, we need to demultiplex the original integrated BAM file into separate BAM files. Raw read depth of coverage data are generated from the BAM files with bin size 100kb. SCOPE R package can be utilized for generating coverage data, mappability and GC content. Specifically, get_bam_bed() can be used for generating bed files. get_coverage_scDNA() function can be applied for computing the depth of coverage for each cell and each marker. get_mapp() and get_gc() function can be used to calculate mappability and GC content, respectively.
To help explain how FLCNA can be used, we utilized a TNBC dataset from triple-negative breast cancer (TNBC) patients in NCBI SRA (SRP114962). TNBC displays extensive intratumor heterogeneity and frequently develops resistance to neoadjuvant chemotherapy (NAC) treatment. The KTN126 patient was used in this vignette, where tumor cells were only reported in the pre-treatment samples. Cells were sequenced at two time points (pre- and mid/post-treatment) with 93 cells (46 pre- and 47 post-treatment) in the KTN126 patient. Raw read depth of coverage data were generated from the BAM files with bin size 100k.
FLCNA_QC() R function can be used to remove bins that have extreme GC content (less than 20% and greater than 80%) and low mappability (less than 0.9) to reduce artifacts.
\newpage
# The example data have 3,000 markers and 93 cells. library(FLCNA) data(KTN126_data_3000) data(KTN126_ref_3000) RD <- KTN126_data_3000 dim(RD)
ref <- KTN126_ref_3000 head(ref)
\newpage
QCobject <- FLCNA_QC(Y_raw=RD, ref_raw=ref, mapp_thresh = 0.9, gc_thresh = c(20, 80))
A two-step median normalization approach is implemented to remove the effect of biases from the GC-content and mappability. We further calculate the ratio of normalized RC and its sample specific mean, and the logarithm transformation of this ratio (log2R) is used in the main step of the FLCNA method. FLCNA_normalization() R function is used for the normalization.
log2Rdata <- FLCNA_normalization(Y=QCobject$Y, gc=QCobject$ref$gc, map=QCobject$ref$mapp)
Subclone clustering is achieved based on a GMM, and breakpoints detection is conducted by adding a fused lasso penalty term to the typical GMM model. FLCNA() R function can be used for the CNA detection and simultaneous subclone clustering. There are two hyperparameters to be pre-defined in the FLCNA method, including the number of clusters K and the tuning parameter lambda. The tuning hyperparameter lambda is used to control the overall number of change points that less change points tend to be generated with larger lambda value. To find the optimal values of K and lambda, we use a BIC-type criterion, and the model with smallest BIC value is selected as the optimal model.
\newpage
# K: The number of clusters. # lambda: The tuning parameter in the penalty term, the default is 3. output_FLCNA <- FLCNA(K=c(3,4,5), lambda=3, Y=log2Rdata)
data(output_FLCNA)
# The number of clusters in the optimal model output_FLCNA$K.best
# The estimated mean matrix for K clusters output_FLCNA$mu.hat.best[,1:11]
# The cluster index for each cell output_FLCNA$s.hat.best[1:20] table(output_FLCNA$s.hat.best)
After the mean vector is estimated for each cluster, we locate and quantify all the change points, and identify segments that share the same underlying copy number profile. CNA.out() R function is used for the clustering of candidate CNAs for each cell. Change-point can be identified from the estimate of mean vector where for the marker before and after the change point show different values. Typically, different CNA states are required to be assigned for each segment in each cell to help locate significant CNA signatures. A GMM-based clustering strategy is implemented for CNA clustering using the normalized read counts data (i.e., log2R). Segments sharing similar intensity levels (i.e., the log2R values) in a cell are identified as the ones with same copy number states. Each segment is classified using a five-state classification scheme with deletion of double copies (Del.d), deletion of a single copy (Del.s), normal/diploid, duplication of a single copy (Dup.s) and duplication of double copies (Dup.d).
\newpage
# mean.matrix: The cluster mean matrix estimated from FLCNA R function. # cutoff: Cutoff value to further control the number of CNAs, # the larger cutoff, the smaller number of CNAs. # The default is 0.35. # L: Repeat times in the EM algorithm, defaults to 100. CNA.output <- CNA.out(mean.matrix = output_FLCNA$mu.hat.best, LRR=log2Rdata, Clusters=output_FLCNA$s.hat.best, ref=ref, cutoff=0.35, L=100) head(CNA.output)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.