knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The SCclust package implements feature selection based on breakpoints, permutations to estimate FDRs for Fisher test p-values, and identification of the clone structure in single-cell copy number profiles.
In this tutorial we show how to use the SCclust package with data prepared by the sgains pipeline, as described in Example usage of sGAINS pipeline. SCclust is normally called as the last step when processing data with the sgains pipeline; here we show how it can be used independently of the pipeline.
We assume that you have an R environment and have installed the SCclust package as described in the README.md.
This tutorial is based on data published in: Navin N, Kendall J, Troge J, et al. Tumor Evolution Inferred by Single Cell Sequencing. Nature. 2011;472(7341):90-94. doi:10.1038/nature09807. In particular, we will use the data for the polygenomic breast tumor T10, available from SRA. A description of the T10 samples can be found in Supplementary Table 1 (Summary of 100 Single Cells in the Polygenomic Tumor T10).
We are going to run the SCclust package on the output of the varbin step of the sgains pipeline.
You can go through all the steps in the sgains T10 tutorial and prepare this data yourself.
For the purposes of this tutorial, we recommend downloading the already prepared varbin data from the example data.
Apart from the T10 varbin data, you will need the binning scheme used in the analysis, which can be found here.
We will also need cytoBand.txt for hg19, which you can download from the UCSC Genome Browser.
Let us create a directory in which to store all the data used in this tutorial:
mkdir T10data
cd T10data
Next, let us download and extract the T10 varbin data:
wget -c \
https://github.com/KrasnitzLab/SCclust/releases/download/v1.0.0RC3/navin_t10_varbin_data.tar.gz
tar zxvf navin_t10_varbin_data.tar.gz
rm navin_t10_varbin_data.tar.gz
Let us also download and extract the binning scheme used to prepare the varbin data:
wget -c \
https://github.com/KrasnitzLab/SCclust/releases/download/v1.0.0RC3/hg19_R50_B20k_bins_boundaries.txt.gz
gunzip hg19_R50_B20k_bins_boundaries.txt.gz
Finally, let us download cytoBand.txt for the human reference genome hg19:
wget -c \
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
gunzip cytoBand.txt.gz
Our data directory should have the following structure:
.
|-- T10data
    |-- cytoBand.txt
    |-- hg19_R50_B20k_bins_boundaries.txt
    |-- varbin
        |-- SRR052047.varbin.20k.txt
        |-- SRR052148.varbin.20k.txt
        |-- SRR053437.varbin.20k.txt
        ...
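Before moving on, it can help to confirm that the downloads landed where the rest of the tutorial expects them. The sketch below uses only base R. The helper name `check_t10_layout` is ours, not part of SCclust, and it assumes R is started in the directory that contains T10data.

```r
# Hypothetical helper (not part of SCclust): report which of the expected
# inputs are present under a data directory.
check_t10_layout <- function(data_dir = "T10data") {
  expected <- c(
    cytobands = file.path(data_dir, "cytoBand.txt"),
    bins      = file.path(data_dir, "hg19_R50_B20k_bins_boundaries.txt"),
    varbin    = file.path(data_dir, "varbin")
  )
  present <- file.exists(expected)
  names(present) <- names(expected)

  # Count the per-cell varbin files; an absent directory simply yields 0.
  all_files <- list.files(expected[["varbin"]])
  n_samples <- sum(endsWith(all_files, ".varbin.20k.txt"))

  list(present = present, n_samples = n_samples)
}

layout <- check_t10_layout("T10data")
print(layout$present)
cat("varbin samples found:", layout$n_samples, "\n")
```

If any entry of `present` is `FALSE`, repeat the corresponding download step before continuing.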
We are going to use the SCclust package, so let us load it:
library("SCclust")
First, let us load the binning scheme and look at its first rows:
gc_df <- read.csv("T10data/hg19_R50_B20k_bins_boundaries.txt", header = TRUE, sep = "\t")
knitr::kable(head(gc_df))
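Next, let us load the cytoband table. The file has no header row; in the UCSC cytoBand layout its five columns are chromosome, start, end, band name and Giemsa stain: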
cytobands <- read.csv("T10data/cytoBand.txt", header = FALSE, sep = "\t")
knitr::kable(head(cytobands))
The main reason we need cytoBand.txt is to get the location of the centromeres. Since centromere areas contain a lot of repetitive sequences, they are excluded from the analysis when segmenting and clustering samples.
To find the regions where centromeres are located, we use the calc_centroareas function:
centroareas <- calc_centroareas(cytobands)
knitr::kable(head(centroareas, 5))
So, for each chromosome, centroareas holds the region where the centromere is located.
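To make the idea concrete: in the UCSC cytoBand table the Giemsa stain column marks centromeric bands as `acen`, so one plausible way to obtain a per-chromosome centromere region is to take the span of those bands. The sketch below does that on a small made-up table. It is an illustration under that assumption, not necessarily what `calc_centroareas` does internally; the column names and the helper `acen_span` are ours.

```r
# Toy cytoband table with made-up coordinates. Columns follow the UCSC
# cytoBand layout: chromosome, start, end, band name, Giemsa stain.
toy_bands <- data.frame(
  chrom = c("chrA", "chrA", "chrA", "chrA", "chrB", "chrB", "chrB"),
  start = c(0, 100, 150, 200, 0, 50, 80),
  end   = c(100, 150, 200, 400, 50, 80, 300),
  name  = c("p12", "p11", "q11", "q12", "p11", "q11", "q12"),
  stain = c("gneg", "acen", "acen", "gpos50", "acen", "acen", "gneg"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: span of the "acen" bands on each chromosome.
acen_span <- function(bands) {
  acen <- bands[bands$stain == "acen", ]
  starts <- tapply(acen$start, acen$chrom, min)
  ends   <- tapply(acen$end, acen$chrom, max)
  data.frame(
    chrom = names(starts),
    from  = as.vector(starts),
    to    = as.vector(ends[names(starts)]),
    stringsAsFactors = FALSE
  )
}

print(acen_span(toy_bands))
```

With the real file, the same helper could be tried after naming the columns, for example `names(cytobands) <- c("chrom", "start", "end", "name", "stain")`.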
Each varbin sample is stored in its own tab-separated file. For example, let us load two of them and look at their first rows:
sample_df <- read.csv("T10data/varbin/SRR052047.varbin.20k.txt", header = TRUE, sep = "\t")
knitr::kable(head(sample_df))
sample_df <- read.csv("T10data/varbin/SRR052148.varbin.20k.txt", header = TRUE, sep = "\t")
knitr::kable(head(sample_df))
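Loading samples one at a time gets tedious for a hundred cells. A minimal base R sketch that reads every file in the varbin directory into a named list might look like this. The helper name `read_varbin_dir` is ours, and the file-name suffix is inferred from the directory listing earlier in this tutorial.

```r
# Hypothetical helper (not part of SCclust): read every *.varbin.20k.txt
# file in a directory into a list of data frames named by sample id.
read_varbin_dir <- function(varbin_dir, suffix = ".varbin.20k.txt") {
  files <- list.files(varbin_dir, full.names = TRUE)
  files <- files[endsWith(files, suffix)]

  samples <- lapply(files, read.csv, header = TRUE, sep = "\t")
  # Strip the suffix so each list element is named after its sample.
  names(samples) <- sub(suffix, "", basename(files), fixed = TRUE)
  samples
}

samples <- read_varbin_dir("T10data/varbin")
cat("samples loaded:", length(samples), "\n")
```

Individual cells can then be inspected by name, for example `head(samples[["SRR052047"]])`.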
# centrobins <- calc_regions2bins(gc_df, centroareas)
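The commented-out call above hints at the next step, mapping centromere regions onto bins. We do not reproduce `calc_regions2bins` here. As a rough illustration of the idea, the following base R sketch flags bins that overlap a region on the same chromosome. All column names (`chrom`, `start`, `end`, `from`, `to`), the toy coordinates and the helper `flag_overlapping_bins` are assumptions made for the example, not the package's data layout.

```r
# Toy bins and one toy centromere region (made-up coordinates).
toy_bins <- data.frame(
  chrom = c("chrA", "chrA", "chrA", "chrB"),
  start = c(0, 100, 200, 0),
  end   = c(100, 200, 300, 100),
  stringsAsFactors = FALSE
)
toy_regions <- data.frame(chrom = "chrA", from = 150, to = 250,
                          stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for each bin that overlaps any region on the
# same chromosome, treating intervals as half-open: [start, end).
flag_overlapping_bins <- function(bins, regions) {
  vapply(seq_len(nrow(bins)), function(i) {
    same_chrom <- regions$chrom == bins$chrom[i]
    any(same_chrom &
          regions$from < bins$end[i] &
          regions$to > bins$start[i])
  }, logical(1))
}

in_centromere <- flag_overlapping_bins(toy_bins, toy_regions)
print(toy_bins[!in_centromere, ])  # bins that would be kept
```

The row-by-row loop is chosen for readability; for a genome-wide binning scheme a vectorised or interval-based approach would likely be faster.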