Prodyna is a versatile software package for the analysis of protein dynamics data (typically taken from MD simulations) written for the R environment. The library includes tools for coordinate transformation and selection, dimensionality reduction and clustering. Moreover, several convenience functions exist, e.g. to easily plot comprehensible diagrams or to interact with visualization tools like VMD.
The R statistical programming environment was chosen as host language because of its universal availability and the plethora of options for advanced statistical analysis and machine learning.
Prodyna is free and open source software, published under the simplified BSD license.
You can find all source code on Github.
For performance reasons, prodyna is tightly connected several high-performance tools which are also available as free software, either on https://github.com/lettis/prodyna or from other resources.
We assume you have a recent Linux environment. While prodyna and its dependencies may also run other systems (most probably also on MacOS, maybe on windows), the software has not been tested on these. If you are trying to install the software on such a system, do not hesitate to ask us for help. However, we cannot give you any guarantees…
For the rest of the tutorial, we will assume that prodyna and all its third-party dependencies are installed and working.
Using the R-devtools package, prodyna can easily be installed directly from github. To this end, install devtools via
if ( ! ("devtools" %in% installed.packages())) { install.packages("devtools") }
Then, install prodyna and its R-dependencies (that might take a while) via
devtools::install_github("lettis/prodyna")
As mentioned above, prodyna employs a set of high-performance software packages that need to be installed additionally:
Gromacs
For most of the data handling regarding MD data, prodyna uses the Gromacs software suite. Gromacs can be downloaded from http://www.gromacs.org/Downloads. It is advised to use the most recent version, but at least version 5.X. The downloaded package includes installation instructions, which however are also available on the Gromacs website.
FastPCA
FastPCA is a parallelized C++ implementation of standard PCA and dPCA+ (TODO: citation). Its code and installation instructions are available at https://github.com/lettis/FastPCA.
Clustering
The clustering packages provides methods for density-based geometric clustering, dynamic clustering, frame filtering and variable dynamic coring. Being written in C++ and CUDA, it is a high-performance package for the efficient clustering of trajectories with millions of frames (as long as one has a decent Nvidia-GPU available). Code and documentation are available at https://github.com/lettis/Clustering.
VMD
To visualize protein structures, VMD is one of the best solutions. The software is free for academic use and can be downloaded and installed after registration on the VMD website.
This tutorial comes with a small data set of HP-35 NLE/NLE, subsampled (every 100th frame) from a larger data set thankfully made available by D.E. Shaw research (TODO: cite). HP-35 NLE/NLE is a mutant of the chicken villin headpiece, consists of 35 amino acids and forms three alpha-helices, connected by two short loops. In its native state, HP-35 folds to form a small hydrophobic cavity between the helices.
The data is already adapted to prodyna/Gromacs, with
HP35.pdb
HP35_every100.xtc
residuetypes.dat
)The data can be downloaded from TODO.
You can view the reference structure using VMD:
prodyna::vmd.run()
When VMD has properly started, load the structure file and let VMD depict the protein in the familier cartoon style:
prodyna::vmd.loadPdb("HP35.pdb") prodyna::vmd.setStyle("NewCartoon")
The dynamics of proteins are very complicated. A typical protein has hundreds to thousands of amino acids (HP35, the protein used in the test data is a very small example) and thousands to ten thousands of atoms. With the position of each atom distinctly defined by three Cartesian coordinates, a protein structure can be identified as a point in a very high-dimensional space.
However, we are in general interested in the functional motion and thermodynamic properties like energy distributions, etc., which are intricately connected to the protein structure. The high dimensionality of the data makes a direct analysis impossible due to the so-called curse of dimensionality. Therefore, we need to reduce the number of coordinates for analysis. This is either done by constructing physically ‘meaningful’ coordinates, e.g. certain inter-molecular distances, contacts, or angles measuring the torsions of the protein chain, or by applying statistical methods like PCA to project the essential motion of protein onto a view principal components. Usually, one performs a combination of both.
Having defined proper reaction coordinates, further analysis methods like clustering are applied to construct metastable states. Then, the dynamics of the protein might be described in a simple state model with certain, memoryless transition probabilities from one state to another (i.e., a Markov chain, hence called Markov state models). Every state can be described by its associated molecular structure, level of free energy, equilibrium population, etc.
Currently, the construction of reaction coordinates and state models and the visualization of related observables is the focus of the prodyna functionality - and hence the focus of this tutorial.
A more detailed theoretical background of the protein analysis procedure described in the following can be found in Sittel & Stock (2016), Robust density-based clustering to identify metastable conformational states of proteins. In addition, the clustering tutorial may provide additional information.
TODO. The computation of contacts and distances is not yet implemented in prodyna.
Being a typical folder, i.e. a protein that constantly folds and unfolds, HP-35 is best described by the backbone dihedral angles, which encode the secondary structure. For a first assessment of the data, we will construct the dihedral angles with
prodyna::set.binary("gmx", "/usr/bin/gmx") prodyna::generate.dihedrals(ref = "HP35_NLE.pdb", traj = "HP35_every100.xtc", skipCA = c(1, 35))
Here, the Gromacs software is used, so make sure the path to the gmx
binary is set correctly.
This will produce two files
HP35_every100.xtc.dih
with columns corresponding to $\varphi$ and $\psi$ dihedral angles of residues 2 to 34
(first and last residue is skipped) HP35_every100.xtc.dih.info
containing information on the reference and trajetory files, the total number of residues, as well as the indices of skipped residues. # Read phi and psi angles for residues 2 and 3 dih <- prodyna::read.dihedrals("HP35_every100.xtc.dih", resnos=c(2,3)) str(dih)
dihInfo <- prodyna::read.dihedrals.info("HP35_every100.xtc.dih.info") str(dihInfo)
Review the Ramachandran plots (with complete dihedral angle pairs) of residues 2 and 3 by running
for (residue in c(2,3)) { show(prodyna::plt.ramachandran(residue, dihedrals=dih)) }
To reduce the dimensionality of the data set, we run a dPCA+ on the freshly generated dihedrals. Since prodyna uses the fastpca binary, which is not installed in a standard path, we need to make it aware of its existence. Adapt this to your needs.
prodyna::set.binary("fastpca", "/usr/local/bin/fastpca") prodyna::run.dPCAplus("HP35_every100.xtc.dih", corr = T)
This will produces the following files:
HP35_every100.xtc.dih.val
contains the eigenvalues.HP35_every100.xtc.dih.vec
contains the eigenvectors.HP35_every100.xtc.dih.cov
contains the covariance matrix. HP35_every100.xtc.dih.proj
contains the projected data.HP35_every100.xtc.dih.stats
contains mean values, sigmas and boundary shifts.To check on the results, we plot on overview for the first ten principal components (PCs). This shows in the lower triangle the 2D projections of the PC pairs, their 1D projections on the diagonal, and the eigenvector contributions in the upper triangle.
prodyna::plt.pcaOverview(coords="HP35_every100.xtc.dih", pcs=1:10)
The projections suggest, that PCs higher than seven do not really show structure and probably encode thermal fluctuations only. Furthermore, PC six does not seem to be very interesting. Therefore, we perform dimensionality reduction by selecting PCs 1-5 and 7 as reaction coordinates.
prodyna::generate.reactionCoordinates(coords = "HP35_every100.xtc.dih.proj", columns = c(1:5, 7), output = "reaction_coords")
This produces the file HP35_every100.xtc.dih.proj.1-5.7
containing only the selected coordinates, as well as the file HP35_every100.xtc.dih.proj.1-5.7.info
which stores which columns of which files were selected.
The reduced coordinate set will from now on define our reaction coordinates.
rc <- "reaction_coords" # reaction coordinates file
To identify metastable states, i.e. structures that are low in free energy and thus often sampled, we perform density-based geometric clustering on the reduced coordinate set.
prodyna::set.binary("clustering", "/usr/local/bin/clustering")
First, we need to find a suitable radius for density estimation. We therefore compute neighbourhood populations and corresponding free energies for different neighbourhood radii.
As a whole bunch of files are generated during the (geometric) clustering process, we create a separate directory.
dir.create("geomClust") prodyna::clustering.estimate.radii(rc = rc, radii = c(0.25, 0.3, 0.35, 0.4, 0.45), fe_prefix = "geomClust/freeEnergy", pop_prefix = "geomClust/population")
This will create a directory called geomClust that contains files named freeEnergy_<radius>
and population_<radius>
where <radius>
takes the values 0.2, 0.3, 0.4, 0.5, 0.6.
Choosing a large radius will ensure good sampling. However, the population distribution should still show an exponential decay. This can be checked by plotting the neighbourhood populations per frame sorted in descending order.
pops <- prodyna::read.populations(pop_prefix = "geomClust/population", radii=NULL) prodyna::plt.populations(pops=pops, radii=NULL, logy=T, select = 10)
Based on this plot we choose a radius of 0.35 which is now used to compute the neighbourhood per frame and store the result in a file neighbours.
radius <- 0.35 prodyna::clustering.compute.neighborhood(rc = rc, radius = radius, fe_prefix = "geomClust/freeEnergy", nn = "geomClust/neighbours")
Next, screening of the energy landscape is performed, i.e. frames below a certain energy threshold are clustered into microstates based on their distances. Here, the considered energy thresholds are 0.1, 0.2, 0.3, ... up to the maximum free energy level in the dataset.
prodyna::clustering.screening(rc = rc, radius = radius, fe_prefix = "geomClust/freeEnergy", nn = "geomClust/neighbours", clust_prefix = "geomClust/clust", min = 0.1, step = 0.1, max = NULL)
This will create files named clust.<threshold>
in the clustering directory that contain the microstate assignments per frame for different energy thresholds.
The density network describes the geometrical similarity of microstates at different levels of free energy. It can be computed via:
prodyna::clustering.densityNetwork(minpop = 100, clust_prefix = "geomClust/clust", step = 0.1)
This will produces several files in the geomClust
directory:
remapped_clust_<feThreshold>
network_nodes.dat
, network_links.dat
, network_leaves.dat
network_end_node_traj.dat
network_visualization.html
The remapped_clust
files contain the same state assignments as the clust
files. However, microstates are given a new id that is unique w.r.t. the various energy levels.
The .dat
files describes the (multi-)tree structure of the density network. A visualization of this network is given by the .html
file.
The network_end_node_traj.dat
file contains state assignments per frame where each frame is given a non-zero id only if it is assigned to a leaf node. These state assignments are used as a seed for separating the free energy landscape into different microstates.
prodyna::clustering.microstates(rc = rc, radius = radius, fe_prefix = "geomClust/freeEnergy", nn = "geomClust/neighbours", init = "geomClust/network_end_node_traj.dat", microstates = "geomClust/microstates", sorted = T)
The file microstates
contains the final state assignments per frame obtained by the geometric clustering procedure. Note that the state ids may have large values and are not in consecutive order.
To obtain states that are both geometrically similar and somewhat stable over time, the MPP (most-probable-path) algorithm is used to cluster the microstates that resulted from geometric clustering by their transition times.
Here, the paramter lagtime
specifies how many frames should be skipped when computing transition probabilities from one state to another. This way processes on very short timescales will be discarded and -given a high enough lagtime- the resulting system will be approximately markovian.
The metastability criterion \code{Q} controls how stable a state has to be to remain as single state. States with a self-transition probability lower than the metastability \code{Q} will be combined with the state to which their transition probability is highest.
dir.create("dynamicClust") prodyna::clustering.mpp(microstates = "geomClust/microstates", fe_prefix = "geomClust/freeEnergy", radius = 0.3, lagtime = 25, mpp = "dynamicClust/mpp", Q = 0.3)
This will generate files called mpp_pop_0.200.dat
and mpp_traj_0.200.dat
. (TODO: and other output files ...)
The population files hold the population information of the clusters for the given metastibility, while the trajectory files contain the resulting cluster/state trajectories.
Compute the waiting time distributions resulting from coring with different window sizes.
# window sizes to test ws <- c(5,10,15,20) prodyna::clustering.wtDistributions(traj = "dynamicClust/mpp_traj_0.300.dat", wtd_prefix = "dynamicClust/wtDist", wsizes = ws) wtd <- prodyna::read.wtDistributions(wtd_prefix = "dynamicClust/wtDist", states = c(1,2,3), wsizes = ws) prodyna::plt.wtDistributions(wtd, max_frame = 300, logy=F)
We use a coring window of 10 frames for state 1 and 2, and a coring window of 5 for state 3. The cored state trajectory is obtained via:
prodyna::clustering.coring(traj = "dynamicClust/mpp_traj_0.300.dat", traj_cored = "dynamicClust/coredTrajectory", states = c(1,2,3), wsizes = c(10,10,5)) prodyna::plt.stateTrajComparison("dynamicClust/mpp_traj_0.300.dat", "dynamicClust/coredTrajectory")
The cored state trajectory is approximately Markovian. The transition matrix of this system can be estimated from the trajectory.
T <- prodyna::msm.transitionMatrix(traj = "dynamicClust/coredTrajectory", lag = 1) simulated_trajectory <- prodyna::msm.sim(T, 1, 100)
The metastable states resulting from the clustering procedure can be described by a Ramacolor plot which shows the color-coded dihedral angle content per residue.
prodyna::plt.ramacolor(statetraj = "dynamicClust/coredTrajectory", dihedrals = "HP35_every100.xtc.dih")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.