knitr::opts_chunk$set(collapse = TRUE, comment = "#>", echo = TRUE,
                      message = FALSE, warning = FALSE)

## PACKAGES
library(dplyr)
library(CNpare)

## DATA DIRECTORY
# The file `component_parameters.rds` must be downloaded from the cnsignatures repo
# (https://bitbucket.org/britroc/cnsignatures/src/master/data/).
# It has not been added to this repo because of size limitations.
# Adjust this path to the folder where the file was saved.
path_to_data<-"C:/Users/bhernando/Desktop/CNIO/Projects/CNpare_analyses/data/"

Finding the best model of chromosomal instability

The CNpare tool aims to find the best cell line model for studying a given type of chromosomal instability (CIN) in vitro. To this end, model selection can be based either on copy-number profiles or on exposures to copy-number signatures.

This R Markdown document is an example of how to use the tool.

Here, we selected the OVKATE ovarian cancer cell line from the Cancer Cell Line Encyclopedia (CCLE) project as our sample of interest.

We first identified the cell line with the most similar copy-number profile, regardless of tissue of origin.

Then, we selected cell lines of the same tissue of origin, quantified their copy-number signatures and obtained a list of cell lines representing the same type of chromosomal instability. To quantify copy-number signatures, this script implements the computational approach published by Macintyre et al. in 2018.

1) Input data

We first load the datasets shipped with the package (the RData files in the data/ folder):

cells_segcn<-CNpare::cells_segcn
cells_mapping<-CNpare::cells_mapping
CCLE_metadata<-CNpare::CCLE_metadata
CCLE_metadata$Cell.line.primary.name<-toupper(CCLE_metadata$Cell.line.primary.name)

2) Get the cell line model with the closest copy-number profile

[Cell lines data preprocessing]{.ul}

To compare them bin by bin, segmented copy-number profiles of cell lines are fitted to the genomic positions of bins. We therefore split the genome into evenly sized bins (1 Mb here) and get their positions.

allchr <- 1:22  # add 23 to include chrX
lengthChr <- CNpare::lengthChr  # chromosome lengths object provided by CNpare
posBins <- lapply(allchr,function(chr) 
    getBinsStartsEnds(window=1000000, chr, lengthChr[chr]))
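
To make the binning step concrete, here is a minimal base R sketch of how a chromosome can be cut into evenly sized bins. The helper `make_bins` is hypothetical and is not CNpare's `getBinsStartsEnds`; it assumes 1-based coordinates and truncates the last bin at the chromosome end.

```r
# Hypothetical helper (not part of CNpare): split one chromosome into
# consecutive bins of `window` bp; the last bin stops at the chromosome end.
make_bins <- function(window, chr_length) {
  starts <- seq(1, chr_length, by = window)
  ends <- pmin(starts + window - 1, chr_length)
  data.frame(start = starts, end = ends)
}

# Toy chromosome of 2.5 Mb cut into 1 Mb bins
bins <- make_bins(window = 1e6, chr_length = 2500000)
print(bins)
```

With these toy values the chromosome yields two full bins and one shorter final bin.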

We selected the CCLE models with an available copy-number profile and kept only profiles with detectable CIN, defined as at least 20 copy-number alterations (CNAs) in a profile; profiles without detectable chromosomal instability were discarded. Finally, a matrix with copy numbers per bin is generated.

#map CCLE file ids (sample) to cell line ids (cellid)
ccle<-as.data.frame(cbind(cellid=cells_mapping$cellid[cells_mapping$study=="CCLE"],sample=cells_mapping$fileid[cells_mapping$study=="CCLE"]))
#keep CCLE segments only, then keep profiles with detectable CIN
cells_segcn <- cells_segcn[cells_segcn$sample%in%ccle$sample,]
cells_segcn <- getCINProfiles(segcn=cells_segcn, samples=unique(cells_segcn$sample))
#replace file ids with cell line ids in the sample column
cells_segcn <- left_join(cells_segcn,ccle, by="sample")[,c(1:4,6)]
colnames(cells_segcn)[5]<-"sample"

#get copy-number data per bin
ccle_cn <- getCNbins(posBins=posBins, data=cells_segcn, samples=unique(cells_segcn$sample))
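
The CIN filter described above can be pictured with a small base R sketch. It is not the implementation of `getCINProfiles`. The column names `sample` and `segVal`, the definition of a CNA as a segment whose rounded copy number differs from 2, and the lowered threshold are all assumptions made for this toy example.

```r
# Toy segment table (made-up values); column names are assumptions
segs <- data.frame(
  sample = c(rep("A", 5), rep("B", 3)),
  segVal = c(2, 3.1, 1, 4.2, 2, 2, 2.1, 1.9)
)

# Assumption: a CNA is a segment whose rounded copy number differs from 2
count_cnas <- function(seg_values) sum(round(seg_values) != 2)

cna_counts <- tapply(segs$segVal, segs$sample, count_cnas)

min_cnas <- 3  # the text uses 20; lowered here to suit the toy data
keep <- names(cna_counts)[cna_counts >= min_cnas]
filtered <- segs[segs$sample %in% keep, ]
```

Only sample A passes the toy threshold, so sample B is dropped as a profile without detectable CIN.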

For this example, we compare the copy-number profile of OVKATE (our sample of interest) with the copy-number profiles of the cell line models. We use [four similarity metrics]{.ul}:

exp_cell=as.matrix(ccle_cn[,colnames(ccle_cn)=="OVKATE"])
colnames(exp_cell)<-"OVKATE"
measures<-getSimilarities(dat1=exp_cell,dat2=ccle_cn,method="all")
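
As a rough illustration of what such metrics measure, the sketch below computes four common ones on two toy per-bin copy-number vectors in base R. The text names the Manhattan distance and Pearson's r; treating Euclidean distance and cosine similarity as the other two is an assumption, and none of this code is taken from `getSimilarities`.

```r
# Toy per-bin copy numbers for two profiles (made-up values, not CCLE data)
a <- c(2, 2, 3, 4, 2, 1)
b <- c(2, 3, 3, 5, 2, 1)

manhattan <- sum(abs(a - b))        # sum of absolute per-bin differences
euclidean <- sqrt(sum((a - b)^2))   # straight-line distance
pearson   <- cor(a, b, method = "pearson")
cosine    <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
```

For the two distances lower values mean more similar profiles, while for Pearson's r and cosine similarity higher values do.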

Then, we can print the cell line with the most similar copy-number profile (lowest Manhattan distance). We used the Manhattan distance because it is often preferred over the more common Euclidean distance for high-dimensional data.

The user can also compute the empirical p-value of each comparison by adding the argument pvalue=TRUE to the getSimilarities function. The empirical p-value represents the probability of obtaining the corresponding similarity value by chance. It is based on the observed values (the similarities obtained after comparing all cell lines included in the CNpare dataset) and is calculated by dividing the number of pairwise comparisons with a better similarity value than the computed one by the total number of comparisons made.
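
The empirical p-value definition above can be written as a short base R sketch. The helper `empirical_pvalue` and the reference values are hypothetical. "Better" is taken to mean lower for distances and higher for correlations, which is an assumption about how the comparison is made.

```r
# Hypothetical helper: fraction of reference comparisons that are better
# than `value` (lower for distances, higher for correlations).
empirical_pvalue <- function(value, reference, lower_is_better = TRUE) {
  n_better <- if (lower_is_better) sum(reference < value) else sum(reference > value)
  n_better / length(reference)
}

# Made-up reference distributions of observed similarity values
reference_dist <- c(10, 25, 40, 55, 70, 85, 100, 115, 130, 145)
reference_r <- c(0.1, 0.2, 0.3, 0.5, 0.9)

p_dist <- empirical_pvalue(30, reference_dist)                       # distance
p_r <- empirical_pvalue(0.8, reference_r, lower_is_better = FALSE)   # correlation
```

A small value indicates that few reference comparisons reach a similarity as good as the one observed.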

measures<-measures[order(measures$manhattan),]
head(measures,5)
exp_cell=cells_segcn[cells_segcn$sample=="OVKATE",]
mod1_cell=cells_segcn[cells_segcn$sample=="Panc 02.03",]
CNPlot_events(exp_cell,mod1_cell,method_diff="non-normalized")

Then, we can print the cell line with the most similar copy-number profile regardless of ploidy (highest Pearson's r). Interestingly, in this example, the same cell line comes out as the most representative model for OVKATE.

measures<-measures[order(-measures$r),]
head(measures,5)
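
The reason Pearson's r is insensitive to ploidy can be seen in a small base R example with made-up values: scaling a profile, as in an idealized whole-genome doubling, changes the Manhattan distance but leaves the correlation at 1.

```r
profile <- c(2, 2, 3, 4, 2, 1, 2, 3)  # toy per-bin copy numbers
doubled <- profile * 2                # same shape at twice the ploidy

manhattan_d <- sum(abs(profile - doubled))  # grows with the ploidy difference
pearson_d <- cor(profile, doubled)          # unaffected by scaling
```

Here the two profiles have the same pattern of gains and losses, so only the distance-based metric separates them.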

CNpare lets the user visually inspect how unusual a similarity value is with respect to the observed values in the precomputed dataset, using the plot_simdensity function. The method argument selects the similarity metric to explore.

plot_simdensity(measures, method="pearson")

[Visualization of genome differences between two copy-number profiles]{.ul}

CNpare also provides a visual representation of the genome differences. To enable it, we changed the default value of the plot_diff argument to TRUE. Here, we plotted the differences in Panc 02.03 with respect to OVKATE:

CNPlot_events(exp_cell,mod1_cell,method_diff="non-normalized",plot_diff=TRUE)

3) Get the cell line models of the same tissue origin representing the CIN type of interest

[Selection of samples with the same tissue origin]{.ul}

We aim to find the cell line of the same tissue of origin that best represents the CIN type operating in OVKATE (our sample of interest). We selected this high-grade serous ovarian (HGSO) cancer cell line because it is known to have extreme CIN. We therefore selected the profiles of all cell lines derived from ovarian tissue.

sample="OVKATE"
cells.tissue<-as.character(na.omit(CCLE_metadata$Cell.line.primary.name[CCLE_metadata$Site.Primary == CCLE_metadata$Site.Primary[CCLE_metadata$Cell.line.primary.name == sample]]))
tissue_segcn<-cells_segcn[cells_segcn$sample %in% cells.tissue,]
tissue.profiles<-getProfiles(segcn=tissue_segcn, samples=unique(tissue_segcn$sample))

[Quantification of copy-number signatures]{.ul}

Now, we generate a matrix with the activity level of each copy-number signature per sample. The file component_parameters.rds must first be downloaded from https://bitbucket.org/britroc/cnsignatures/src/master/data/

component_parameters<-readRDS(paste0(path_to_data,"component_parameters.rds"))

CN_features <- extractCopynumberFeatures(tissue.profiles)
sample_by_component <- generateSampleByComponentMatrix(CN_features, all_components=component_parameters)
signature_quantification <- quantifySignatures(sample_by_component)

[Clustering samples by copy-number signatures]{.ul}

K-means clustering (MacQueen, 1967) is one of the most commonly used unsupervised machine learning algorithms for partitioning a data set into k groups. We use it to detect samples with similar copy-number signature exposures.

Since the aim is to match OVKATE with a group of ovarian cell lines with similar signature activities, the clustering can be run without the input sample, and we then ask which cluster is closest to OVKATE. By default, the getClusterSamples function includes the input sample in the clustering analysis. Here, we remove OVKATE from the matrix first:

exp_cell=as.matrix(signature_quantification[,colnames(signature_quantification)=="OVKATE"])
colnames(exp_cell)<-"OVKATE"
signature_quantification=signature_quantification[,colnames(signature_quantification)!="OVKATE"]

The getClusterSamples function performs k-means clustering using cosine similarity and outputs the list of samples in the closest cluster. The function first calculates the optimal number of clusters (k) and then outputs the list of samples clustering with the input sample. If the input sample is not included in the clustering, the function determines which cluster is closest to it based on copy-number signatures. To do this, it computes the cosine similarity between OVKATE and each cluster center (the mean values of the cluster).

samples<-getClusterSamples(matrix=signature_quantification, cell=exp_cell)
print(samples)
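
The idea of assigning an excluded sample to its closest cluster can be sketched in base R as follows. This is not the code of `getClusterSamples`. The exposures are made up, k is fixed at 3 instead of being optimized, samples are placed in rows (the transpose of the matrix used above), and rows are scaled to unit length so that the Euclidean `kmeans` approximates a cosine-based clustering. All of these are assumptions of the sketch.

```r
set.seed(1)  # k-means starts from random centers

# Toy exposures: rows are samples, columns are signatures (made-up values)
exposures <- rbind(
  s1 = c(0.80, 0.10, 0.10),
  s2 = c(0.75, 0.15, 0.10),
  s3 = c(0.10, 0.80, 0.10),
  s4 = c(0.15, 0.75, 0.10),
  s5 = c(0.10, 0.10, 0.80),
  s6 = c(0.10, 0.15, 0.75)
)
query <- c(0.70, 0.20, 0.10)  # exposures of the sample left out of the clustering

# Scale each row to unit length before running Euclidean k-means
unit_rows <- exposures / sqrt(rowSums(exposures^2))
fit <- kmeans(unit_rows, centers = 3, nstart = 20)

# Cosine similarity between the query and each cluster center
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
sims <- apply(fit$centers, 1, cosine, y = query)

# Samples belonging to the cluster whose center is most similar to the query
closest <- as.integer(names(which.max(sims)))
closest_samples <- names(fit$cluster)[fit$cluster == closest]
print(closest_samples)
```

In this toy data the query is dominated by the first signature, so it is matched to the cluster containing s1 and s2.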

The data are then shown in a scatter plot, in the spirit of a principal-component projection: the two signatures with the highest variability across clusters are used as axes. Only the cluster closest to the input sample is colored.

m<-cbind(signature_quantification,exp_cell)
plotClusters(matrix=m, samples=samples)
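
One way to pick the two plotting axes is to rank signatures by the variance of their cluster centers. The base R sketch below shows this on made-up centers. Whether `plotClusters` measures variability in exactly this way is an assumption, and the names `sig1` to `sig4` are hypothetical.

```r
# Toy cluster centers: rows are clusters, columns are signatures (made-up values)
centers <- rbind(
  c1 = c(sig1 = 0.70, sig2 = 0.10, sig3 = 0.15, sig4 = 0.05),
  c2 = c(sig1 = 0.10, sig2 = 0.65, sig3 = 0.20, sig4 = 0.05),
  c3 = c(sig1 = 0.15, sig2 = 0.15, sig3 = 0.10, sig4 = 0.60)
)

# Variance of each signature across the cluster centers
sig_var <- apply(centers, 2, var)

# The two most variable signatures would serve as the x and y axes
top2 <- names(sort(sig_var, decreasing = TRUE))[1:2]
print(top2)
```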


macintyrelab/CNpare documentation built on April 15, 2022, 4:46 a.m.