Analysis of Protein Dynamics: xgbAnalysis Tutorial

Introduction

The xgbAnalysis package is a tool to find suitable reaction coordinates of biomolecular systems, e.g. proteins, using the XGBoost algorithm. The package can be found on GitHub. Given a trajectory from a molecular dynamics simulation and suitable corresponding states, it evaluates which of the input coordinates describe the system best, resulting in a low-dimensional reaction coordinate of directly interpretable original coordinates. To obtain states for a given trajectory, the prodyna package can be used.

Installation

Install the xgbAnalysis package using the devtools package:

if ( ! ("devtools" %in% installed.packages())) {
  install.packages("devtools")
}
devtools::install_github("sbbrandt/xgbAnalysis")
library(xgbAnalysis)

Available test data

For a quick test run, internal data of HP-35 with every 100th frame is available:

Path: /u3/rc_classification/testrun/

Including the test files villin_360K_every100.dih (dihedral coordinates) and macrostates_every100 (state trajectory).

Parameters

You can find an introduction to boosted decision trees in the XGBoost documentation ("Introduction to Boosted Trees").

While XGBoost has many parameters that define how the decision trees are built, the following are the most important and are usually used to improve the learning process: the learning rate eta, gamma, the maximum tree depth max_depth, min_child_weight, subsample, colsample_bytree, and the number of training rounds nrounds.

When training a model to precisely predict states for unseen data, a good parameter choice is crucial. For the feature analysis, however, the parameter choice is not that important, as the XGBoost algorithm is robust against overfitting and comes with a good default choice of training parameters. Here, it is most important that the built model is "complex" enough, with enough training rounds (nrounds) of trees that are not too small (max_depth), so that enough features (coordinates) enter the evaluation as split candidates in the split nodes. Usually the default parameters suit that requirement.
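For orientation, the following minimal sketch (independent of xgbAnalysis, using random toy data) shows how these training parameters are set when calling the xgboost R package directly:

library(xgboost)

# toy data: 200 frames with 10 coordinates, 3 states labeled 0..2
x <- matrix(rnorm(200 * 10), nrow = 200)
y <- sample(0:2, 200, replace = TRUE)
dtrain <- xgb.DMatrix(data = x, label = y)

params <- list(objective        = "multi:softmax",  # multi-class state prediction
               num_class        = 3,
               eta              = 0.3,  # learning rate (default)
               max_depth        = 6,    # maximum tree depth (default)
               min_child_weight = 1,
               subsample        = 1,
               colsample_bytree = 1)

bst <- xgb.train(params = params, data = dtrain, nrounds = 20)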

Analysis

Set Up Working Directory and Import Data

In a first step, the trajectory and states are split into a training set (trainsplit = 0.7, i.e. 70% of the frames) and a test set containing the remaining 30%.
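As an aside, a 70/30 split over toy frame indices looks like this (whether import.data splits randomly or sequentially is not specified here; this sketch assumes a random split):

n_frames  <- 1000                                      # toy number of frames
train_idx <- sample(n_frames, size = 0.7 * n_frames)   # 70% training frames
test_idx  <- setdiff(seq_len(n_frames), train_idx)     # remaining 30% test frames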

getwd()  # check the current working directory
import.data(output_dir = "./data", 
            coords     = "villin_360K_every100.dih",  # dihedral coordinates
            states     = "macrostates_every100",      # state trajectory
            labels     = "dihedrals", 
            trainsplit = 0.7)                         # 70% training, 30% test

This will create the output directory, here ./data, containing the training and test sets along with auxiliary files such as feature.names and train.parameter, which are used by the functions below.

Train a Model

To simply train a model that predicts the corresponding states of the trajectory with the default parameters and 2 training rounds, use

train.model(data_dir   = "./data",
            output_dir = "./model",
            nrounds    = 2
            )

As both the data set and the number of training rounds are small, the accuracy will not be good. For better accuracy, try nrounds = 20 training rounds, for which the computation time will be roughly 10x larger.
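For example, the same call with more training rounds:

train.model(data_dir   = "./data",
            output_dir = "./model",
            nrounds    = 20)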

This will produce the trained model file xgb.model (among others) in the output directory; it is used below, e.g. by plt.single.class.importance.

Feature Selection

The feature selection iteratively trains a model, ranks the coordinates (features) by their importance, removes the least (decreasing = F) or most (decreasing = T) important coordinate, and retrains on the reduced data set. This reveals how many and which coordinates are needed to predict the states.

feature.selection(output_dir = "./featureSelection",
                  data       = "./data",
                  decreasing = F,
                  eta        = 0.3,
                  max_depth  = 6,
                  nrounds    = 20,
                  nthread    = 0,
                  savemode   = T)

This will produce the results of the feature selection in the output directory.

If savemode = TRUE, additional files are saved for every iteration round, indexed by n.

The index n gives the number of coordinates (features) that have been dismissed. In round 0, the model is trained with all coordinates; in round 1, the most/least important coordinate from round 0 has been removed from the data set. Given 66 dihedral angles for HP-35, at selectround 65 the model is trained using just the last remaining coordinate.
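Conceptually, the selection loop works as in the following sketch. rank_by_importance() is a hypothetical stand-in for training a model and ranking the coordinates by importance (stubbed with a random ranking here so the sketch runs on its own); it is not the package's actual internals.

# hypothetical stub: returns the features ranked by importance,
# most important first (random here, so the sketch is runnable)
rank_by_importance <- function(features) sample(features)

features   <- paste0("Phi", 1:10)  # toy coordinate names
decreasing <- TRUE                 # TRUE: dismiss the most important first
dismissed  <- character(0)

while (length(features) > 1) {
  ranking   <- rank_by_importance(features)
  drop      <- if (decreasing) ranking[1] else tail(ranking, 1)
  dismissed <- c(dismissed, drop)      # round n dismisses the n-th entry
  features  <- setdiff(features, drop) # remove one coordinate per round
}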

The prediction accuracy can be plotted as a function of the number of dismissed coordinates (features).

plt.feature.selection(dir      = "./featureSelection",
                      saveplot = T)

If saveplot = TRUE, the folder ./featureSelection/plot/, containing the plot of the feature selection, will be created. If FALSE, the plot will be returned instead.

With decreasing = T, the coordinates are dismissed in decreasing order of importance, i.e. the most important coordinate is removed every round.

feature.selection(output_dir = "./featureSelection_decreasing",
                  data       = "./data",
                  decreasing = T,
                  eta        = 0.3,
                  max_depth  = 6,
                  nrounds    = 20,
                  nthread    = 0,
                  savemode   = T)

Plot the feature selection of the decreasing run:

plt.feature.selection(dir      = "./featureSelection_decreasing",
                      saveplot = F)

If you want to continue a previously unfinished feature selection, or if you want to dismiss certain features from the beginning, you can pass a vector (fdismissed) with the names of the features (coordinates) to be dismissed before the first iteration. To define how many features (coordinates) should be dismissed in total, use ndismiss.

feature.selection(output_dir = "./featureSelection_decreasing",
                  data       = "./data",
                  decreasing = T,
                  ndismiss   = 8,
                  fdismissed = c("Phi2", "Psi13", "Psi4"),
                  eta        = 0.3,
                  max_depth  = 6,
                  nrounds    = 20,
                  nthread    = 0,
                  savemode   = T)

This will remove the coordinates "Phi2", "Psi13" and "Psi4" from the data set before iteratively removing the most important coordinate for the remaining 5 iterations (ndismiss = 8 in total).

Single Class Importance

Given a trained model, the feature importance for every state can be obtained. For every single state, the individual importance of all coordinates for that state is calculated and normalized, so that the importances of the coordinates for one state add up to 1. The results are shown in a plot that can be restricted either by setting a minimum importance a coordinate has to contribute for at least one state (colmin), or by setting the number of coordinates to show (nfeatures). Hence, colmin = 0 results in a plot showing all coordinates, and nfeatures = 6 results in a plot showing the 6 most important coordinates, weighted by their population.
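As a toy illustration of the normalization (made-up importance values, not actual package output):

raw <- c(Phi2 = 0.8, Psi13 = 0.3, Psi4 = 0.1)  # made-up raw importances for one state
raw / sum(raw)                                 # normalized importances, sum to 1
#>       Phi2      Psi13       Psi4
#> 0.66666667 0.25000000 0.08333333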

plt.single.class.importance(pre    = "./singleClassImportance/sci",
                            model  = "./model/xgb.model",
                            names  = "./data/feature.names",
                            colmin = 0)

This function will produce the file sci_colmin0.png in the directory singleClassImportance, using sci as the file name prefix.

plt.single.class.importance(pre       = "./singleClassImportance/sci",
                            model     = "./model/xgb.model",
                            names     = "./data/feature.names",
                            nfeatures = 6)

This function will produce the file sci_n6.png in the directory singleClassImportance, using sci as the file name prefix.

Multiple plots can be obtained in a single call as well:

plt.single.class.importance(pre       = "./singleClassImportance/sci",
                            model     = "./model/xgb.model",
                            names     = "./data/feature.names",
                            colmin    = c(0,0.2,0.5),
                            nfeatures = c(4,5,6))

Parameter Testing

To test the influence of different parameters on the prediction accuracy of the trained model, the parameter test function can be used. The defpar parameter defines which training parameters are used as the base. defpar = 'default' uses the XGBoost default parameters. defpar = NA takes the parameters set in the train.parameter file within the data directory. The training parameters can also be set manually with defpar = list(eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1). The parameters to test should be given as a data frame in the following format: data.frame(parameter = c(start, stop, stepsize, adjust nrounds (T/F))). As some parameters influence the computation time, more training rounds are necessary to get comparable prediction accuracies. The fourth entry (T/F) therefore defines whether to adjust the training rounds by nrounds*(max_value/current_value).

parameter.test(data       = "./data",
               output_dir = "./parameterTest",
               nthread    = 0,
               nrounds    = 10,
               defpar     = 'default',
               testpar    = data.frame(eta = c(0.1, 0.5, 0.1, T),
                                       max_depth = c(3, 10, 1, T))
               )

This will produce two plots in the output directory, showing the prediction accuracy in dependence of the learning rate eta and the maximum tree depth max_depth.
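To make the nrounds adjustment concrete, this is the arithmetic for the eta test above (base nrounds = 10, eta from 0.1 to 0.5, adjustment enabled):

nrounds <- 10
eta     <- seq(0.1, 0.5, by = 0.1)
nrounds * (max(eta) / eta)  # training rounds used per eta value
#> [1] 50.00000 25.00000 16.66667 12.50000 10.00000

Slower learners (small eta) thus get proportionally more training rounds, keeping the accuracies comparable.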


