knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The SCclust package implements feature selection based on breakpoints, permutations for FDRs for Fisher test p-values and identification of the clone structure in single cell copy number profiles.

In this tutorial we show how to use SCclust package using data, prepared by sgains pipeline as described in Example usage of sGAINS pipeline. SCclust package is called as the last step in processing data from sgains pipeline. In this tutoral we show how SCclust package could be used independently from sgains pipeline.

We assume that you have an R environment and have installed SCclust package as described in the README.md.

Data

Data for the T10 case

This tutorial is based on data published in: Navin N, Kendall J, Troge J, et al. Tumor Evolution Inferred by Single Cell Sequencing. Nature. 2011;472(7341):90-94. doi:10.1038/nature09807. In particular we will use the data for polygenomic breast tumor T10 case available from SRA. Description of samples for T10 could be found in Supplementary Table 1 | Summary of 100 Single Cells in the Polygenomic Tumor T10

We are going to run SCclust package on prepared by sgains pipeline varbin step. You can go through all the step in sgains T10 tutorial and prepare this data.

For the purposes of this tutorial we recomend you to download already prepared varbin data from example data. Apart from varbin T10 data you will need the binning scheme used in the analysis, that could be found here. And also we will need cytoBand.txt for HG19 that you can download it from UCSC Genome Browser.

Collect the Neccessary Data

Let us create a directory, where to store all the data used in this tutorial:

mkdir T10data
cd T10data

and let us download and extract T10 varbin data:

wget -c \
  https://github.com/KrasnitzLab/SCclust/releases/download/v1.0.0RC3/navin_t10_varbin_data.tar.gz
tar zxvf navin_t10_varbin_data.tar.gz
rm navin_t10_varbin_data.tar.gz

Let us also download and extract the binning scheme used in preparation of varbin data:

wget -c \
  https://github.com/KrasnitzLab/SCclust/releases/download/v1.0.0RC3/hg19_R50_B20k_bins_boundaries.txt.gz
gunzip hg19_R50_B20k_bins_boundaries.txt.gz

And finally let us download the cytoBand.txt for Human reference genome hg19:

wget -c \
  http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
gunzip cytoBand.txt.gz

Our data directory should have following structure:

.
|-- T10data
    |-- cytoBand.txt
    |-- hg19_R50_B20k_bins_boundaries.txt
    |-- varbin
        |-- SRR052047.varbin.20k.txt
        |-- SRR052148.varbin.20k.txt
        |-- SRR053437.varbin.20k.txt
        ...

Explore the Dowloaded Data

We are going to use SCclust package so let us load it:

library("SCclust")

Binning schema

\todo{Describe the data.}

gc_df <- read.csv("T10data/hg19_R50_B20k_bins_boundaries.txt", header = T, sep='\t')
knitr::kable(head(gc_df))

Cytobands and Centromeres for HG19

\todo{Describe the data.}

cytobands <- read.csv("T10data/cytoBand.txt", header = F, sep='\t')
knitr::kable(head(cytobands))

The main reason we need cytoBand.txt is to get the location of centromeres. Since centromere areas contain a lot of repetitive sequencies they are excluded from analysis when segmenting and clustering samples.

To find regions where centromeres are located we are using calc_centroareas function:

centroareas <- calc_centroareas(cytobands)
knitr::kable(head(centroareas, 5))

So, in centroareas for each chromosome we have the region where the centromere is located.

Varbin Samples Data

\todo{Describe the data.}

For each varbin sample

sample_df <- read.csv("T10data/varbin/SRR052047.varbin.20k.txt", header=T, sep='\t')
knitr::kable(head(sample_df))
sample_df <- read.csv("T10data/varbin/SRR052148.varbin.20k.txt", header=T, sep='\t')
knitr::kable(head(sample_df))

Segmentation of Varbin Data

# centrobins <- calc_regions2bins(gc_df, centroareas)


KrasnitzLab/SCclust documentation built on Dec. 11, 2021, 5:12 p.m.