In daewoooo/SaaRclust: R Package to cluster long sequencing reads into chromosomes to facilitate de novo genome assembly.

Introduction

Strand-seq is a single-cell sequencing technique able to preserve contiguity of individual parental homologues in single-cell. This feature has been shown to be valuable for scaffolding early build genome assemblies as well finding chimeric or misoriented contigs. Here we introduce a SaaRclust as an R based package that implements a novel latent variable model and a corresponding Expectation Maximization (EM) algorithm in order to reliably cluster long sequencing reads by chromosome. Briefly, our approach produces, for each long read, a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it allows to assess the amount of uncertainty inherent to sparse Strand-seq data on the level of individual reads.

\newpage

Parameters

\textbf{inputfolder:} A folder name where minimap file(s) is stored. \hfill \break \textbf{outputfolder:} A folder name to export the results. \hfill \break \textbf{minimap.file:} A path to the minimap file to load. \hfill \break \textbf{num.clusters:} Expected number of clusters. (for 22 human autosomes == 44 clusters). However overclusterring (~ 54 clusters) is recommended such that smaller chromosomes are not missed \hfill \break \textbf{EM.iter:} Number of iteration of EM algorithm. \hfill \break \textbf{alpha:} Estimated level of background in Strand-seq reads. In other words expected noise in sequencing data caused either by library prepration or mapping to repetitive parts of the genome. \hfill \break \textbf{minLib:} Minimal number of different Strand-seq libraries being represent per every long read. \hfill \break \textbf{upperQ:} Filter out given percentage of long reads with the highest number of Strand-seq alignments. \hfill \break \textbf{logL.th:} Set the difference between objective function from the current and the previous interation for EM algorithm to converge. \hfill \break \textbf{theta.constrain:} Recalibrate theta values to meet expected distribution of W and C strands across all Strand-seq libraries. \hfill \break \textbf{store.counts:} Logical TRUE/FALSE if to store raw read counts aligned to each long read. \hfill \break \textbf{store.bestAlign:} If set to TRUE (best) representative alignements will be stored in RData object \hfill \break \textbf{numAlignments:} Required number of (best) representative alignmnets to be used in hard clustering. \hfill \break \textbf{HC.only:} If set to TRUE only the hard clustering will be performed and the rest of the clustering pipeline will be skipped. \hfill \break \textbf{HC.input:} A location to a filaname where the hard clustering results are stored. \hfill \break \textbf{verbose:} Set to TRUE if progress messages should be printed. \hfill \break

Quick Start

Download example data from the github repository

git clone https://github.com/daewoooo/SaaRclustExampleData

Set the location of the example data

inputfolder <- 'SaaRclust_exampleData'

Hard Clustering

In order to run only k-means based hard clustering on a example data.

# Hard clustering
# Remember to set HC.only=TRUE

runSaaRclust(inputfolder=inputfolder, outputfolder="SaaRclust_results", num.clusters=54,
EM.iter=100,alpha=0.01, minLib=10, upperQ=0.95, logL.th=1, theta.constrain=FALSE,
store.counts=FALSE, store.bestAlign=TRUE, numAlignments=3000, HC.only=TRUE, verbose=TRUE)

Soft Clustering

If RData object containing hard clustering results is already available you can run only soft clustering.

# Setting some variables

HC.input='SaaRclust_results/Clusters/hardClusteringResults.RData'
minimap.file='SaaRclust_exampleData/NA12878_WashU_PBreads_chunk9126.maf.gz'

# If theta.param & pi.param are set to NULL SaaRclust will try to load them from HC.input.

SaaRclust(minimap.file=minimap.file, outputfolder='SaaRclust_results', num.clusters=47,
EM.iter=100, alpha=0.1, minLib=10, upperQ=0.95, theta.param=NULL, pi.param=NULL, logL.th=1,
theta.constrain=FALSE, store.counts=FALSE, HC.input=HC.input)

Hard & Soft Clustering

In order to run both, hard and soft clustering in a single command.

# Hard clustering
# Remember to set HC.only=FALSE

runSaaRclust(inputfolder=inputfolder, outputfolder="SaaRclust_results", num.clusters=54,
EM.iter=100,alpha=0.01, minLib=10, upperQ=0.95, logL.th=1, theta.constrain=FALSE,
store.counts=FALSE, store.bestAlign=TRUE, numAlignments=3000, HC.only=FALSE, verbose=TRUE)

Export clustered long reads

In order export soft clustered long sequencing reads use function below. Set required threshold for probalility values and minimal number of libraries being represented per long read.

exportClusteredReads(inputfolder="SaaRclust_results", prob.th=0.5, minLib=5)

Plot clustering accuracy plots

NOTE: Working only for data from the original publication. \hfill \break Rscript below plots clustering accuracy measures presented in the orignal paper (Fig4 b,d,c) \hfill \break Before running the script make sure that biovizBase and ggplot2 packages are installed on your machine.

plotScripts = /SaaRclust/utils/postProcessing.R
inputdir = SaaRclust_results
outputdir = user_defined

run from the commnad line:
Rscript /SaaRclust/utils/runPostProcessing.R <plotScripts> <inputdir> <outputdir>

\newpage