Analysis of Protein Dynamics: The Prodyna Tutorial

Introduction

Prodyna is a versatile software package for the analysis of protein dynamics data (typically taken from MD simulations) written for the R environment. The library includes tools for coordinate transformation and selection, dimensionality reduction and clustering. Moreover, several convenience functions exist, e.g. to easily plot comprehensible diagrams or to interact with visualization tools like VMD.

The R statistical programming environment was chosen as host language because of its universal availability and the plethora of options for advanced statistical analysis and machine learning.

Prodyna is free and open source software, published under the simplified BSD license.

You can find all source code on Github.

For performance reasons, prodyna is tightly connected several high-performance tools which are also available as free software, either on https://github.com/lettis/prodyna or from other resources.

Installation

We assume you have a recent Linux environment. While prodyna and its dependencies may also run other systems (most probably also on MacOS, maybe on windows), the software has not been tested on these. If you are trying to install the software on such a system, do not hesitate to ask us for help. However, we cannot give you any guarantees…

For the rest of the tutorial, we will assume that prodyna and all its third-party dependencies are installed and working.

Prodyna

Using the R-devtools package, prodyna can easily be installed directly from github. To this end, install devtools via

if ( ! ("devtools" %in% installed.packages())) {
  install.packages("devtools")
}

Then, install prodyna and its R-dependencies (that might take a while) via

devtools::install_github("lettis/prodyna")

Third-party packages

As mentioned above, prodyna employs a set of high-performance software packages that need to be installed additionally:

Gromacs

For most of the data handling regarding MD data, prodyna uses the Gromacs software suite. Gromacs can be downloaded from http://www.gromacs.org/Downloads. It is advised to use the most recent version, but at least version 5.X. The downloaded package includes installation instructions, which however are also available on the Gromacs website.

FastPCA

FastPCA is a parallelized C++ implementation of standard PCA and dPCA+ (TODO: citation). Its code and installation instructions are available at https://github.com/lettis/FastPCA.

Clustering

The clustering packages provides methods for density-based geometric clustering, dynamic clustering, frame filtering and variable dynamic coring. Being written in C++ and CUDA, it is a high-performance package for the efficient clustering of trajectories with millions of frames (as long as one has a decent Nvidia-GPU available). Code and documentation are available at https://github.com/lettis/Clustering.

VMD

To visualize protein structures, VMD is one of the best solutions. The software is free for academic use and can be downloaded and installed after registration on the VMD website.

Available test data

This tutorial comes with a small data set of HP-35 NLE/NLE, subsampled (every 100th frame) from a larger data set thankfully made available by D.E. Shaw research (TODO: cite). HP-35 NLE/NLE is a mutant of the chicken villin headpiece, consists of 35 amino acids and forms three alpha-helices, connected by two short loops. In its native state, HP-35 folds to form a small hydrophobic cavity between the helices.

The data is already adapted to prodyna/Gromacs, with

The data can be downloaded from TODO.


You can view the reference structure using VMD:

prodyna::vmd.run()

When VMD has properly started, load the structure file and let VMD depict the protein in the familier cartoon style:

prodyna::vmd.loadPdb("HP35.pdb")
prodyna::vmd.setStyle("NewCartoon")

Analysis of protein dynamics

The dynamics of proteins are very complicated. A typical protein has hundreds to thousands of amino acids (HP35, the protein used in the test data is a very small example) and thousands to ten thousands of atoms. With the position of each atom distinctly defined by three Cartesian coordinates, a protein structure can be identified as a point in a very high-dimensional space.

However, we are in general interested in the functional motion and thermodynamic properties like energy distributions, etc., which are intricately connected to the protein structure. The high dimensionality of the data makes a direct analysis impossible due to the so-called curse of dimensionality. Therefore, we need to reduce the number of coordinates for analysis. This is either done by constructing physically ‘meaningful’ coordinates, e.g. certain inter-molecular distances, contacts, or angles measuring the torsions of the protein chain, or by applying statistical methods like PCA to project the essential motion of protein onto a view principal components. Usually, one performs a combination of both.

Having defined proper reaction coordinates, further analysis methods like clustering are applied to construct metastable states. Then, the dynamics of the protein might be described in a simple state model with certain, memoryless transition probabilities from one state to another (i.e., a Markov chain, hence called Markov state models). Every state can be described by its associated molecular structure, level of free energy, equilibrium population, etc.

Currently, the construction of reaction coordinates and state models and the visualization of related observables is the focus of the prodyna functionality - and hence the focus of this tutorial.

A more detailed theoretical background of the protein analysis procedure described in the following can be found in Sittel & Stock (2016), Robust density-based clustering to identify metastable conformational states of proteins. In addition, the clustering tutorial may provide additional information.

Contacts and distances

TODO. The computation of contacts and distances is not yet implemented in prodyna.

Dihedral angles

Being a typical folder, i.e. a protein that constantly folds and unfolds, HP-35 is best described by the backbone dihedral angles, which encode the secondary structure. For a first assessment of the data, we will construct the dihedral angles with

prodyna::set.binary("gmx", "/usr/bin/gmx")
prodyna::generate.dihedrals(ref  = "HP35_NLE.pdb", 
                            traj = "HP35_every100.xtc",
                            skipCA = c(1, 35))

Here, the Gromacs software is used, so make sure the path to the gmx binary is set correctly.


This will produce two files

# Read phi and psi angles for residues 2 and 3
dih <- prodyna::read.dihedrals("HP35_every100.xtc.dih", resnos=c(2,3))
str(dih)
dihInfo <- prodyna::read.dihedrals.info("HP35_every100.xtc.dih.info")
str(dihInfo)

Review the Ramachandran plots (with complete dihedral angle pairs) of residues 2 and 3 by running

for (residue in c(2,3)) {
  show(prodyna::plt.ramachandran(residue, dihedrals=dih))
}

dPCA

To reduce the dimensionality of the data set, we run a dPCA+ on the freshly generated dihedrals. Since prodyna uses the fastpca binary, which is not installed in a standard path, we need to make it aware of its existence. Adapt this to your needs.

prodyna::set.binary("fastpca", "/usr/local/bin/fastpca")
prodyna::run.dPCAplus("HP35_every100.xtc.dih", corr = T)

This will produces the following files:


To check on the results, we plot on overview for the first ten principal components (PCs). This shows in the lower triangle the 2D projections of the PC pairs, their 1D projections on the diagonal, and the eigenvector contributions in the upper triangle.

prodyna::plt.pcaOverview(coords="HP35_every100.xtc.dih", pcs=1:10)

The projections suggest, that PCs higher than seven do not really show structure and probably encode thermal fluctuations only. Furthermore, PC six does not seem to be very interesting. Therefore, we perform dimensionality reduction by selecting PCs 1-5 and 7 as reaction coordinates.

prodyna::generate.reactionCoordinates(coords  = "HP35_every100.xtc.dih.proj",
                                      columns = c(1:5, 7),
                                      output  = "reaction_coords")

This produces the file HP35_every100.xtc.dih.proj.1-5.7 containing only the selected coordinates, as well as the file HP35_every100.xtc.dih.proj.1-5.7.info which stores which columns of which files were selected.

The reduced coordinate set will from now on define our reaction coordinates.

rc <- "reaction_coords"  # reaction coordinates file

Geometric, density-based clustering: Finding metastable states

To identify metastable states, i.e. structures that are low in free energy and thus often sampled, we perform density-based geometric clustering on the reduced coordinate set.

prodyna::set.binary("clustering", "/usr/local/bin/clustering")

First, we need to find a suitable radius for density estimation. We therefore compute neighbourhood populations and corresponding free energies for different neighbourhood radii.

As a whole bunch of files are generated during the (geometric) clustering process, we create a separate directory.

dir.create("geomClust")
prodyna::clustering.estimate.radii(rc         = rc,
                                   radii      = c(0.25, 0.3, 0.35, 0.4, 0.45),
                                   fe_prefix  = "geomClust/freeEnergy",
                                   pop_prefix = "geomClust/population")

This will create a directory called geomClust that contains files named freeEnergy_<radius> and population_<radius> where <radius> takes the values 0.2, 0.3, 0.4, 0.5, 0.6.

Choosing a large radius will ensure good sampling. However, the population distribution should still show an exponential decay. This can be checked by plotting the neighbourhood populations per frame sorted in descending order.

pops <- prodyna::read.populations(pop_prefix = "geomClust/population", radii=NULL)
prodyna::plt.populations(pops=pops, radii=NULL, logy=T, select = 10)

Based on this plot we choose a radius of 0.35 which is now used to compute the neighbourhood per frame and store the result in a file neighbours.

radius <- 0.35
prodyna::clustering.compute.neighborhood(rc        = rc, 
                                         radius    = radius,
                                         fe_prefix = "geomClust/freeEnergy",
                                         nn        = "geomClust/neighbours")

Next, screening of the energy landscape is performed, i.e. frames below a certain energy threshold are clustered into microstates based on their distances. Here, the considered energy thresholds are 0.1, 0.2, 0.3, ... up to the maximum free energy level in the dataset.

prodyna::clustering.screening(rc           = rc, 
                              radius       = radius, 
                              fe_prefix    = "geomClust/freeEnergy",
                              nn           = "geomClust/neighbours",
                              clust_prefix = "geomClust/clust",
                              min          = 0.1,
                              step         = 0.1, 
                              max          = NULL)

This will create files named clust.<threshold> in the clustering directory that contain the microstate assignments per frame for different energy thresholds.


The density network describes the geometrical similarity of microstates at different levels of free energy. It can be computed via:

prodyna::clustering.densityNetwork(minpop       = 100,
                                   clust_prefix = "geomClust/clust",
                                   step         = 0.1)

This will produces several files in the geomClust directory:

The remapped_clust files contain the same state assignments as the clust files. However, microstates are given a new id that is unique w.r.t. the various energy levels.

The .dat files describes the (multi-)tree structure of the density network. A visualization of this network is given by the .html file.

The network_end_node_traj.dat file contains state assignments per frame where each frame is given a non-zero id only if it is assigned to a leaf node. These state assignments are used as a seed for separating the free energy landscape into different microstates.

prodyna::clustering.microstates(rc          = rc, 
                                radius      = radius,
                                fe_prefix   = "geomClust/freeEnergy",
                                nn          = "geomClust/neighbours", 
                                init        = "geomClust/network_end_node_traj.dat", 
                                microstates = "geomClust/microstates",
                                sorted      = T) 

The file microstates contains the final state assignments per frame obtained by the geometric clustering procedure. Note that the state ids may have large values and are not in consecutive order.

Dynamical clustering

To obtain states that are both geometrically similar and somewhat stable over time, the MPP (most-probable-path) algorithm is used to cluster the microstates that resulted from geometric clustering by their transition times.

Here, the paramter lagtime specifies how many frames should be skipped when computing transition probabilities from one state to another. This way processes on very short timescales will be discarded and -given a high enough lagtime- the resulting system will be approximately markovian.

The metastability criterion \code{Q} controls how stable a state has to be to remain as single state. States with a self-transition probability lower than the metastability \code{Q} will be combined with the state to which their transition probability is highest.

dir.create("dynamicClust")
prodyna::clustering.mpp(microstates = "geomClust/microstates",
                        fe_prefix   = "geomClust/freeEnergy",
                        radius      = 0.3,
                        lagtime     = 25,
                        mpp         = "dynamicClust/mpp",
                        Q           = 0.3)

This will generate files called mpp_pop_0.200.dat and mpp_traj_0.200.dat. (TODO: and other output files ...)

The population files hold the population information of the clusters for the given metastibility, while the trajectory files contain the resulting cluster/state trajectories.


Compute the waiting time distributions resulting from coring with different window sizes.

# window sizes to test
ws <- c(5,10,15,20)

prodyna::clustering.wtDistributions(traj       = "dynamicClust/mpp_traj_0.300.dat",
                                    wtd_prefix = "dynamicClust/wtDist",
                                    wsizes     = ws)

wtd <- prodyna::read.wtDistributions(wtd_prefix  = "dynamicClust/wtDist",
                                      states      = c(1,2,3),
                                      wsizes      = ws)
prodyna::plt.wtDistributions(wtd, max_frame = 300, logy=F)

We use a coring window of 10 frames for state 1 and 2, and a coring window of 5 for state 3. The cored state trajectory is obtained via:

prodyna::clustering.coring(traj       = "dynamicClust/mpp_traj_0.300.dat",
                           traj_cored = "dynamicClust/coredTrajectory",
                           states = c(1,2,3),
                           wsizes = c(10,10,5))

prodyna::plt.stateTrajComparison("dynamicClust/mpp_traj_0.300.dat", "dynamicClust/coredTrajectory")

The cored state trajectory is approximately Markovian. The transition matrix of this system can be estimated from the trajectory.

T <- prodyna::msm.transitionMatrix(traj = "dynamicClust/coredTrajectory",
                                   lag  = 1)
simulated_trajectory <- prodyna::msm.sim(T, 1, 100)

The metastable states resulting from the clustering procedure can be described by a Ramacolor plot which shows the color-coded dihedral angle content per residue.

prodyna::plt.ramacolor(statetraj = "dynamicClust/coredTrajectory",
                       dihedrals = "HP35_every100.xtc.dih")


lettis/prodyna documentation built on May 21, 2019, 5:11 a.m.