Circular genome clustering

Optimal versus heuristic cluster borders on CpG sites of a circular bacterial genome

The fast optimal circular clustering (FOCC) [@Debnath21] and the heuristic repeated $K$-means circular clustering (HEUC) algorithms are applied on the CpG sites of the Candidatus Carsonella ruddii genome (GenBank accession number CP019943.1). Both algorithms clustered the CpG sites into 14 groups, as shown in the figure below.

library(OptCirClust)
library(ape)
library(knitr)
library(graphics)

opar <- par(mar=c(0,0,2,0))

opts_chunk$set(fig.width=6, fig.height=4) 

Event <- "CG"

K <- 14

# Seq <- read.GenBank("CP019943.1", as.character = TRUE)[[1]]
file <- system.file("extdata", "CP019943.1.fasta", package = "OptCirClust")

Seq <- ape::read.dna(file, format="fasta", as.matrix=FALSE, as.character = TRUE)

Seq <- toupper(paste(Seq$`CP019943.1 Candidatus Carsonella ruddii strain BC chromosome, complete genome`, collapse = ''))

V <- gregexpr(Event, Seq)

O <- sort(V[[1]][1:length(V[[1]])])

Circumference <- nchar(Seq)

set.seed(1)

result <- CirClust(O, K, Circumference, method = "FOCC")

plot(result, main = "Optimal circular clustering")

# arrows(.58, - 1.75, 0.48, -1.45, length = 0.125, angle = 30, code = 2, col="orange", lwd=4)
# arrows(0, -10, 0, 0, length = 0.125, angle = 30, code = 2, col="orange", lwd=4)
arrows(0.167, -0.55, 0,-0.145, length = 0.125, angle = 30, code = 1, col="orange", lwd=4)

result_km <- CirClust(O, K, Circumference, method = "HEUC")

plot(result_km, main = "Heuristic circular clustering",)

# arrows(.58, - 1.75, 0.4, -1.5, length = 0.125, angle = 30, code = 2, col="orange", lwd=4)

arrows(0.135, -0.55, 0,-0.145, length = 0.125, angle = 30, code = 1, col="orange", lwd=4)

par(opar)

The clusters obtained by FOCC algorithm are more compact and justifiable as compared to the HEUC ones. The cluster border between the C8 and C9 clusters of the optimal clustering are more subjectively justifiable as compared to the border between C4 and C8 clusters of the heuristic clustering outcome. The cluster borders are pointed by orange arrows inside the circular genome. A fixed seed for random number generation is used to force $K$-means to always return the same results.

Therefore, the advantage of optimal clustering over the heuristic clustering algorithm is evident in this example representing practical applications.

References



Try the OptCirClust package in your browser

Any scripts or data that you put into this service are public.

OptCirClust documentation built on July 28, 2021, 9:06 a.m.