SOMMD is an R package containing utilities for the analysis of biomolecular simulations, through the use of Self-Organizing Maps (SOM). SOM is a type of artificial neural network useful for effective identification of patterns in the data and has been widely used in many fields. The most interesting property of an SOM is that it performs a dimensionality reduction by mapping multidimensional data on the SOM grid, retaining topological relationships between neurons, that is, keeping similar input data close to each other on the map. Features of the SOMMD package include the ability to read trajectory data in different formats, compute descriptors for the process of interest, training of SOM and analysis of pathways.
The aim of this notebook is to provide a brief introduction to SOM training with SOMMD and kohonen packages.
Load the SOMMD package:
library(SOMMD)
Then use the command [help(package=SOMMD)
]{style="color:magenta"} to list the functions within the package and [help(FunctionName)
]{style="color:magenta"} to obtain more information about an individual function.
A number of example data sets are shipped with the SOMMD package. In the examples below we will input, process and analyze molecular dynamic trajectories of FOXP1 DNA-binding domain. This trajectory is stored in gromacs xtc format and has had all solvent atoms excluded to reduce overall file size.
The code below sets the file paths for the trajectory data (xtc) and reference structure data (gro).
utils::download.file("https://figshare.com/ndownloader/articles/26935432/versions/3", destfile = "FOXP1_dataset.zip", method = "curl") utils::unzip("FOXP1_dataset.zip") xtcfile <- "./01_FOXP1_dataset/FOXP1_WT-1-PROD.xtc" strfile <- "./01_FOXP1_dataset/FOXP1_WT.gro"
Note that in the above example the [system.file()
]{style="color:magenta"} command returns a character string with the complete path to the example dataset included with the SOMMD package. This is required as users may install the package in different locations. When using your own input files the [system.file()
]{style="color:magenta"} command will not be required.
Now read in the trajectory:
trj <- read.trj(trjfile=xtcfile, topfile=strfile)
The [read.trj
]{style="color:magenta"} function processes the input files and returns the new object [trj
]{style="color:magenta"}. We can check the basic structure of this objects with the following command:
print(trj)
We can see that the system is composed of 1482 atoms, and the trajectory have 2001 frames. A peculiarity of the [trj
]{style="color:magenta"} object is that it contains the atomic information of the system (read from the topfile
{style="color:magenta"} argument).
Self-Organizing maps can be trained using different descriptors. One of the most generic descriptor of a protein conformation is its atomic coordinate position. This is the most simple descriptor and will separate conformation based on their root mean square deviation.
In this simple example we will train a SOM using the protein C⍺ atomic coordinates. Given that this metric is sensible to the roto-translation of the system, we start performing a frame superposition on C⍺ atoms.
ca.inds <- which(trj$top$elety=="CA") trj.fit <- fit.trj(trj, trj.inds=ca.inds)
At this point the atomic xyz coordinates can be directly passed to the SOM:
SOM <- kohonen::som(trj2xyz(trj.fit, inds=ca.inds), grid = kohonen::somgrid(8, 8, "hexagonal", neighbourhood.fct="gaussian", toroidal=TRUE), dist.fcts="euclidean", rlen=500, mode='pbatch')
Note that here we are telling the SOM to use a particular toroidal shape. This means that the map is periodic, and the neurons at the border of the map are near to the neuron at the opposite side of the map.
The shape of the SOM can be inspected looking at the U-Matrix
plot(SOM, type = 'dist.neighbours', heatkey = TRUE, shape='straight', main="U-Matrix")
This plot is telling us regions of the map with high gradient of difference (yellow/white), and regions that contain neurons similar to each other (red).
An alternative and more accurate descriptor of a protein conformation, is the distance matrix. This type of descriptor have also the advantage to be independent to system roto-translation. For this reason it do not need a frame superposition step. To reduce the number of distances included as descriptors, only those between residues forming contacts in the native structure (distance between C𝛃 \< 1.1nm) will be considered. Moreover, the C- and N-terminal portion of the protein have been removed from the analysis, because if their intrinsic and noisy flexibility.
The following code read the native structure as a gro file, compute the selection of distances forming a native contacts and then compute the descriptors that will be used to train the SOM.
ref.str <- read.struct(strfile) cb.inds <- which(trj$top$elety=="CB" & trj$top$resno > 470 & trj$top$resno < 542) sele.dists <- native.cont(struct=ref.str, distance=1.1, atoms=cb.inds) DIST <- calc.distances(trj, atoms=cb.inds, sele=sele.dists)
Note that in the example above, the native.cont
{style="color:magenta"} function analyze the gro structure ad return an index of distances that are below the reference distance value in the native structure. This index is then used from the calc.distances
{style="color:magenta"} function to retain only the selected distances.
SOM is then trained based on the computed descriptors.
SOM <- kohonen::som(DIST, grid = kohonen::somgrid(8, 8, "hexagonal", neighbourhood.fct="gaussian", toroidal=TRUE), dist.fcts="euclidean", rlen=500, mode='pbatch')
As before we can plot the U-matrix to inspect the shape of the SOM.
plot(SOM, type = 'dist.neighbours', heatkey = TRUE, shape='straight', main="U-Matrix")
To further group neurons similar to each other into clusters an agglomerative clustering method can be applied to the neuron vectors. As all the hierarchical clustering methods, one should provide the number of clusters into which the neurons will be divided. To choose a reasonable number of clusters one can look at the silhouette profiles and choose a reasonable number of clusters:
par(mfrow=c(1,2)) #Plot the silhouette score sil.score <- silhouette.score(SOM, clust_method="complete", interval=seq(2,30)) plot(sil.score, type='b', pch=19, lwd=1, xlab="Number of clusters", ylab='Average silhouettes') for(i in seq(0, max(sil.score[,1]), by=2)){ abline(v=i, col="grey", lwd=0.5) } #Plot the silhouette profile for 8 number of clusters sil.prof <- silhouette.profile(SOM, Nclus=8, clust_method="complete") plot(sil.prof, main="", do.clus.stat = FALSE, do.n.k = FALSE) abline(v=mean(sil.prof[,3]), col='red', lty=3, lwd=2)
From this plot one can choose a good value for the number of clusters and plot the resulting SOM. This should ideally be a maximum in the average silhouette score plot, but looking at the single silhouette profiles (the figure on the right) one can further inspect the quality of the clusters.
The division of the SOM in clusters can be represented with the following code:
#Divide the SOM in the selected number of clusters Nclus <- 8 SOM.hc <- cutree(hclust(dist(SOM$codes[[1]], method="euclidean"), method="complete"), Nclus) #Choose a pleasing to the eye set of colors COL.SCALE <- c("#1f78b4", "#33a02c", "#e31a1c", "#ffff88", "#6a3d9a", "#a0451f", "#96c3dc", "#fbb25c", "#ff7f00", "#bea0cc", "#747474", "#f88587", "#a4db77") #Plot the SOM colored by clusters plot(SOM, type = "mapping", bgcol=COL.SCALE[SOM.hc], col=rgb(0,0,0,0), shape='straight', main="") kohonen::add.cluster.boundaries(SOM, SOM.hc, lwd=5)
Note that here a set of cluster color is passed to the SOM as a vector of hex colors.
Every neuron and cluster have a number/name. The numbering/naming scheme can be of shown:
#Plot the SOM with neuron numbers plot(SOM, type = "mapping", bgcol=COL.SCALE[SOM.hc], col=rgb(0,0,0,0), shape='straight', main="") kohonen::add.cluster.boundaries(SOM, SOM.hc, lwd=5) som.add.numbers(SOM, scale=0.5, col="black") som.add.clusters.legend(Nclus=8, color.scale=COL.SCALE)
To inspect the conformations associated to each neuron, a representative structure can be extracted from the SOM:
#Get a vector of representative frames for each neuron NEUR_repres <- neur.representatives(SOM) cat("Frames representatives of neurons: ") cat(NEUR_repres) #Get representatives for each cluster CL_repres <- cluster.representatives(SOM, SOM.hc) cat("\n\nNeurons representatives of each cluster: ") cat(CL_repres$neurons) cat("\n\nFrames representatives of each cluster: ") cat(CL_repres$frames)
If NA
values appear, they are associated to empty neurons/clusters.
To extract a conformation saving the pdb structure file simply use the trj2pdb
{style="color:magenta"} function:
#Estract the representative conformation of Neuron 8 trj2pdb(trj = trj, frame=NEUR_repres[8], filename = "./01_FOXP1_dataset/Neuron_8.pdb") #Estract the representative conformation of Cluster B trj2pdb(trj = trj, frame=CL_repres$frames["B"], filename = "./01_FOXP1_dataset/Cluster_B.pdb")
Note that trj2pdb
{style="color:magenta"} saves a pdb file with the coordinate of a selected frame of the simulation. Here we are passing as input the frame of the representative structures.
The population of each neuron can be represented using circles with the size proportional to the value of its population:
#Plot the SOM with circles with size proportional to the neuron population plot(SOM, type = "mapping", bgcol=COL.SCALE[SOM.hc], col=rgb(0,0,0,0), shape='straight', main="") kohonen::add.cluster.boundaries(SOM, SOM.hc, lwd=5) population <- neur.population(SOM) som.add.circles(SOM, population, scale=0.9)
Neurons with bigger circles are populated by more frames. Note that in the function SOM.add.circles
{style="color:magenta"} the scale
argument regulates the dimension of the circles and can be used to avoid circles that exceed the dimension of the neuron.
Colors-scales can be used to represent the average property of frames belonging to each neurons. To do that, simply compute the average property for each neuron and then color the map accordingly. Here we compute the distance between the Calpha atoms of the C- and N- terminal ends of the domain and plot this property:
RMSD_file <- "./01_FOXP1_dataset/FOXP1_RMSD_WT-1-PROD.dat" RMSD <- read.table(RMSD_file) Neur.avg.RMSD <- average.neur.property(SOM, RMSD[,1]) par(mfrow=c(1,2)) #Plot the SOM with circles with size proportional to the average RMSD values plot(SOM, type = "mapping", bgcol=COL.SCALE[SOM.hc], col=rgb(0,0,0,0), shape='straight', main="") kohonen::add.cluster.boundaries(SOM, SOM.hc, lwd=5) som.add.circles(SOM, Neur.avg.RMSD, scale=0.5) #Plot the SOM colored according to the average property value of each neuron plot(SOM, type = "property", property=Neur.avg.RMSD, shape='straight', palette.name=colorRampPalette(c("blue", "green", "red")), main="") kohonen::add.cluster.boundaries(SOM, SOM.hc, lwd=5)
You will probably see that there is a neuron with a very extreme value of the RMSD. Any external property can be used provided that every property value is associated to a frame (the length of the vector of the property must be equal to the number of frames in trj).
As an excercise try to identify the motion that differentiate the neuron/neurons with the highest RMSD values.
If you have multiple simulations and you want to analyze them using SOM, you can concatenate the simulations and then use to SOM to inspect differences. Within SOMMD this is done using the cat.trj
{style="color:magenta"} function that concatenates two or more simulations. In the following example we will compare five replicas of the same system.
At first we read all the five xtc files:
#Read the xtc files in the folder folder <- "./01_FOXP1_dataset" xtcfiles <- list.files(folder, pattern = "FOXP1_WT-.-PROD.xtc") strfile <- "./01_FOXP1_dataset/FOXP1_WT.gro" MD1 <- read.trj(trjfile=paste(folder, "/", xtcfiles[1], sep=''), topfile=strfile) MD2 <- read.trj(trjfile=paste(folder, "/", xtcfiles[2], sep=''), topfile=strfile) MD3 <- read.trj(trjfile=paste(folder, "/", xtcfiles[3], sep=''), topfile=strfile) MD4 <- read.trj(trjfile=paste(folder, "/", xtcfiles[4], sep=''), topfile=strfile) MD5 <- read.trj(trjfile=paste(folder, "/", xtcfiles[5], sep=''), topfile=strfile)
And then we will concatenate all the simulations:
trj <- cat.trj(MD1, MD2, MD3, MD4, MD5)
Note that this command also update the trj\$start and trj\$end fields that will be useful later in the notebook.
To save memory, remove the single simulations once you have completed the concatenation.
rm(MD1, MD2, MD3, MD4, MD5)
At this point you can use the concatenated trajectory to train a SOM as done in the previous exercise:
ref.str <- read.struct(strfile) cb.inds <- which(trj$top$elety=="CB" & trj$top$resno > 470 & trj$top$resno < 542) sele.dists <- native.cont(struct=ref.str, distance=1.1, atoms=cb.inds) DIST <- calc.distances(trj, atoms=cb.inds, sele=sele.dists) SOM <- kohonen::som(DIST, grid = kohonen::somgrid(8, 8, "hexagonal", neighbourhood.fct="gaussian", toroidal=TRUE), dist.fcts="euclidean", rlen=1000, mode='pbatch')
Choose the best number of clusters
sil.score <- silhouette.score(SOM, clust_method="complete", interval=seq(2,30)) plot(sil.score, type='b', pch=19, lwd=1, xlab="Number of clusters", ylab='Average silhouettes') for(i in seq(0, max(sil.score[,1]), by=2)){ abline(v=i, col="grey", lwd=0.5) }
NClus=8 SOM.hc <- cutree(hclust(dist(SOM$codes[[1]], method="euclidean"), method="complete"), Nclus)
And do the plots.
COL.SCALE <- c("#1f78b4", "#33a02c", "#e31a1c", "#ffff88", "#6a3d9a", "#a0451f", "#96c3dc", "#fbb25c", "#ff7f00", "#bea0cc", "#747474", "#f88587", "#a4db77") #Plot the SOM with neuron numbers plot(SOM, type = "mapping", bgcol=COL.SCALE[SOM.hc], col=rgb(0,0,0,0), shape='straight', main="") kohonen::add.cluster.boundaries(SOM, SOM.hc, lwd=5) som.add.numbers(SOM, scale=0.5, col="black") som.add.clusters.legend(Nclus=8, color.scale=COL.SCALE)
To inspect the space sampled by each replica, plot the population of each replica:
par(mfrow=c(2,3)) for(rep in 1:5){ plot(SOM, type = "mapping", bgcol=COL.SCALE[SOM.hc], col=rgb(0,0,0,0), shape='straight', main="") kohonen::add.cluster.boundaries(SOM, SOM.hc, lwd=3) population <- neur.population(SOM, start=trj$start, end=trj$end, N=rep) som.add.circles(SOM, population, scale=0.4) }
From this plots it is possible to understand which simulations sampled an overlapping space, and which cluster was sampled by unique replicas.
As an exercise, select a cluster/neuron exclusive for a particular replica and identify which are the structural differences that make it different from the others.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.