Analysis of Protein Dynamics: xgbAnalysis Tutorial

Introduction

The xgbAnalysis package is a tool to find suitable reaction coordinates of biomolecular systems, e.g. proteins, using the XGBoost algorithm. The package can be found on GitHub. Given a trajectory from a molecular dynamics simulation and suitable corresponding states, it evaluates which of the input coordinates describe the system best, resulting in a low-dimensional reaction coordinate of directly interpretable original coordinates. To obtain states for a given trajectory, the prodyna package can be used.

Installation

Install the xgbAnalysis package using the devtools package:

if ( ! ("devtools" %in% installed.packages())) {
  install.packages("devtools")
}
devtools::install_github("sbbrandt/xgbAnalysis")
library(xgbAnalysis)

Available test data

For a quick test run, internal data of HP-35 with every 100th frame is available:

Path: /u3/rc_classification/testrun/

Including the test files villin_360K_every100.dih (dihedral coordinates) and macrostates_every100 (state trajectory).

Parameters

You can find an introduction to boosted decision trees in the XGBoost documentation ("Introduction to Boosted Trees").

While XGBoost has many parameters that define how the decision trees are built, the following are the most important and are usually used to improve the learning process: the learning rate eta, gamma, the maximum tree depth max_depth, min_child_weight, subsample, colsample_bytree, and the number of training rounds nrounds.

When training a model to precisely predict states for unseen data, a good parameter choice is crucial. For the feature analysis, however, the parameter choice is not that important, as the XGBoost algorithm is robust against overfitting and comes with a good default choice of training parameters. Here, it is most important that the built model is "complex" enough, with enough training rounds (nrounds) of trees that are not too small (max_depth), so that enough features (coordinates) enter the evaluation as split candidates in the split nodes. Usually the default parameters suit that requirement.
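For orientation, the following minimal sketch (independent of xgbAnalysis, using random toy data) shows how these training parameters are set when calling the xgboost R package directly:

library(xgboost)

# toy data: 200 frames with 10 coordinates, 3 states labeled 0..2
x <- matrix(rnorm(200 * 10), nrow = 200)
y <- sample(0:2, 200, replace = TRUE)
dtrain <- xgb.DMatrix(data = x, label = y)

params <- list(objective        = "multi:softmax",  # multi-class state prediction
               num_class        = 3,
               eta              = 0.3,  # learning rate (default)
               max_depth        = 6,    # maximum tree depth (default)
               min_child_weight = 1,
               subsample        = 1,
               colsample_bytree = 1)

bst <- xgb.train(params = params, data = dtrain, nrounds = 20)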

Analysis

Set Up Working Directory and Import Data

In a first step, the trajectory and states are split into a training set (trainsplit = 0.7, i.e. 70% of the frames) and a test set containing the remaining 30%.
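As an aside, a 70/30 split over toy frame indices looks like this (whether import.data splits randomly or sequentially is not specified here; this sketch assumes a random split):

n_frames  <- 1000                                      # toy number of frames
train_idx <- sample(n_frames, size = 0.7 * n_frames)   # 70% training frames
test_idx  <- setdiff(seq_len(n_frames), train_idx)     # remaining 30% test frames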

getwd()  # check the current working directory
import.data(output_dir = "./data", 
            coords     = "villin_360K_every100.dih",  # dihedral coordinates
            states     = "macrostates_every100",      # state trajectory
            labels     = "dihedrals", 
            trainsplit = 0.7)                         # 70% training, 30% test

This will create the output directory, here ./data, containing the training and test sets along with auxiliary files such as feature.names and train.parameter, which are used by the functions below.

Train a Model

To simply train a model that predicts the corresponding states of the trajectory with the default parameters and 2 training rounds, use

train.model(data_dir   = "./data",
            output_dir = "./model",
            nrounds    = 2
            )

As both the data set and the number of training rounds are small, the accuracy will not be good. For better accuracy, try nrounds = 20 training rounds, for which the computation time will be roughly 10x larger.
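For example, the same call with more training rounds:

train.model(data_dir   = "./data",
            output_dir = "./model",
            nrounds    = 20)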

This will produce the trained model file xgb.model (among others) in the output directory; it is used below, e.g. by plt.single.class.importance.

Feature Selection

The feature selection iteratively trains a model, ranks the coordinates (features) by their importance, removes the least (decreasing = F) or most (decreasing = T) important coordinate, and retrains on the reduced data set. This reveals how many and which coordinates are needed to predict the states.

feature.selection(output_dir = "./featureSelection",
                  data       = "./data",
                  decreasing = F,
                  eta        = 0.3,
                  max_depth  = 6,
                  nrounds    = 20,
                  nthread    = 0,
                  savemode   = T)

This will produce the results of the feature selection in the output directory.

If savemode = TRUE, additional files are saved for every iteration round, indexed by n.

The index n gives the number of coordinates (features) that have been dismissed. In round 0, the model is trained with all coordinates; in round 1, the most/least important coordinate from round 0 has been removed from the data set. Given 66 dihedral angles for HP-35, at selectround 65 the model is trained using just the last remaining coordinate.
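Conceptually, the selection loop works as in the following sketch. rank_by_importance() is a hypothetical stand-in for training a model and ranking the coordinates by importance (stubbed with a random ranking here so the sketch runs on its own); it is not the package's actual internals.

# hypothetical stub: returns the features ranked by importance,
# most important first (random here, so the sketch is runnable)
rank_by_importance <- function(features) sample(features)

features   <- paste0("Phi", 1:10)  # toy coordinate names
decreasing <- TRUE                 # TRUE: dismiss the most important first
dismissed  <- character(0)

while (length(features) > 1) {
  ranking   <- rank_by_importance(features)
  drop      <- if (decreasing) ranking[1] else tail(ranking, 1)
  dismissed <- c(dismissed, drop)      # round n dismisses the n-th entry
  features  <- setdiff(features, drop) # remove one coordinate per round
}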

The prediction accuracy can be plotted as a function of the number of dismissed coordinates (features).

plt.feature.selection(dir      = "./featureSelection",
                      saveplot = T)

If saveplot = TRUE, the folder ./featureSelection/plot/, containing the plot of the feature selection, will be created. If FALSE, the plot will be returned instead.

With decreasing = T, the coordinates are dismissed in decreasing order of importance, i.e. the most important coordinate is removed every round.

feature.selection(output_dir = "./featureSelection_decreasing",
                  data       = "./data",
                  decreasing = T,
                  eta        = 0.3,
                  max_depth  = 6,
                  nrounds    = 20,
                  nthread    = 0,
                  savemode   = T)

Plot the feature selection of the decreasing run:

plt.feature.selection(dir      = "./featureSelection_decreasing",
                      saveplot = F)

If you want to continue a previously unfinished feature selection, or if you want to dismiss certain features from the beginning, you can pass a vector (fdismissed) with the names of the features (coordinates) to be dismissed before the first iteration. To define how many features (coordinates) should be dismissed in total, use ndismiss.

feature.selection(output_dir = "./featureSelection_decreasing",
                  data       = "./data",
                  decreasing = T,
                  ndismiss   = 8,
                  fdismissed = c("Phi2", "Psi13", "Psi4"),
                  eta        = 0.3,
                  max_depth  = 6,
                  nrounds    = 20,
                  nthread    = 0,
                  savemode   = T)

This will remove the coordinates "Phi2", "Psi13" and "Psi4" from the data set before iteratively removing the most important coordinate for the remaining 5 iterations (ndismiss = 8 in total).

Single Class Importance

Given a trained model, the feature importance for every state can be obtained. For every single state, the individual importance of all coordinates for that state is calculated and normalized, so that the importances of the coordinates for one state add up to 1. The results are shown in a plot that can be restricted either by setting a minimum importance a coordinate has to contribute for at least one state (colmin), or by setting the number of coordinates to show (nfeatures). Hence, colmin = 0 results in a plot showing all coordinates, and nfeatures = 6 results in a plot showing the 6 most important coordinates, weighted by their population.
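As a toy illustration of the normalization (made-up importance values, not actual package output):

raw <- c(Phi2 = 0.8, Psi13 = 0.3, Psi4 = 0.1)  # made-up raw importances for one state
raw / sum(raw)                                 # normalized importances, sum to 1
#>       Phi2      Psi13       Psi4
#> 0.66666667 0.25000000 0.08333333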

plt.single.class.importance(pre    = "./singleClassImportance/sci",
                            model  = "./model/xgb.model",
                            names  = "./data/feature.names",
                            colmin = 0)

This function will produce the file sci_colmin0.png in the directory singleClassImportance, using sci as the file name prefix.

plt.single.class.importance(pre       = "./singleClassImportance/sci",
                            model     = "./model/xgb.model",
                            names     = "./data/feature.names",
                            nfeatures = 6)

This function will produce the file sci_n6.png in the directory singleClassImportance, using sci as the file name prefix.

Multiple plots can be obtained in a single call as well:

plt.single.class.importance(pre       = "./singleClassImportance/sci",
                            model     = "./model/xgb.model",
                            names     = "./data/feature.names",
                            colmin    = c(0,0.2,0.5),
                            nfeatures = c(4,5,6))

Parameter Testing

To test the influence of different parameters on the prediction accuracy of the trained model, the parameter test function can be used. The defpar parameter defines which training parameters are used as the base. defpar = 'default' uses the XGBoost default parameters. defpar = NA takes the parameters set in the train.parameter file within the data directory. The training parameters can also be set manually with defpar = list(eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1). The parameters to test should be given as a data frame in the following format: data.frame(parameter = c(start, stop, stepsize, adjust nrounds (T/F))). As some parameters influence the computation time, more training rounds are necessary to get comparable prediction accuracies. The fourth entry (T/F) therefore defines whether to adjust the training rounds by nrounds*(max_value/current_value).

parameter.test(data       = "./data",
               output_dir = "./parameterTest",
               nthread    = 0,
               nrounds    = 10,
               defpar     = 'default',
               testpar    = data.frame(eta = c(0.1, 0.5, 0.1, T),
                                       max_depth = c(3, 10, 1, T))
               )

This will produce two plots in the output directory, showing the prediction accuracy in dependence of the learning rate eta and the maximum tree depth max_depth.
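To make the nrounds adjustment concrete, this is the arithmetic for the eta test above (base nrounds = 10, eta from 0.1 to 0.5, adjustment enabled):

nrounds <- 10
eta     <- seq(0.1, 0.5, by = 0.1)
nrounds * (max(eta) / eta)  # training rounds used per eta value
#> [1] 50.00000 25.00000 16.66667 12.50000 10.00000

Slower learners (small eta) thus get proportionally more training rounds, keeping the accuracies comparable.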


