library(xnet)
data(drugtarget)
Networks or graphs[^graphs] are one of the most powerful tools to represent and analyze complex, structured information. Networks are used ubiquitously across the sciences, e.g.\ in chemistry (molecular graphs), molecular biology (protein-interaction networks), ecology (plant-pollinator networks) and sociology (social networks). Networks can be seen as both models and data, which explains their success. On the one hand, graphs are minimalistic mathematical models, conveying how individual components in a system are connected. On the other hand, graphs are also data structures; they can be constructed by performing experiments or gathering observational data.
[^graphs]: The term 'network' is prevalent in sciences, whereas 'graph' is a term predominantly used in mathematics and computer science. In this work, we will use both interchangeably.
Given the general importance of graphs, a plethora of machine learning methods exists to predict networks based on data. We can divide network prediction methods into two groups: unsupervised and supervised. Unsupervised network inference, also called de novo inference [@Vert2008], does not use a dataset of known interactions to predict the network. These methods often use the guilt-by-association principle: components that occur together in space or time are assumed to interact. Such approaches are commonly used in computational biology, in particular for inferring gene regulatory networks (e.g.\ @Huynh-Thu2010; @Marbach2012; @Maetschke2014) or to infer species interaction networks from co-occurrence data (e.g.\ @Basnou2015; @Cazelles2016). Recently, however, studies found that correlation-based methods perform poorly at uncovering the underlying network, at least for microbial networks [@Weiss2016;@Coenen2018]. Furthermore, a significant limitation of the unsupervised methods is that they are unable to generalize beyond the original data: they are not capable of predicting interactions for previously unobserved nodes.
Supervised network prediction methods start from an initial network to train a machine learning model. This model can then predict interactions for nodes within or outside the training network. Even with a small number of example interactions, supervised methods outperform their unsupervised counterparts [@Cazelles2016]. Supervised network prediction methods are particularly popular in molecular network prediction -- see @Vert2008 and @Ding2013 for an overview. In this work, we will often refer to supervised network prediction using the more general term pairwise learning: predicting properties of pairs of objects using machine learning -- in this case, predicting the edges between two nodes.
We present our \proglang{R} package xnet, a toolbox for performing pairwise learning using two-step kernel ridge regression [@Jung2013;@Pahikkala2014;@Romera-paredes2015]. Two-step kernel ridge regression is a straightforward combination of two standard kernel ridge regression methods to extend these to pairwise learning. This method yields a bilinear prediction model, capable of learning arbitrarily complex relations between two objects. Such models have been used extensively in molecular network prediction, e.g.\ @Vert2007; @VanLaarhoven2011a; @Pahikkala2015; @Pelossof2015 as well as for more general problems [@Brunner2012;@Pahikkala2013a;@Liu2015a].
Due to its simplicity, the two-step kernel ridge regression method has many computational advantages compared to other methods. It can be fitted exceptionally efficiently for large datasets, and the regularization parameters can be changed at no additional cost [@Stock2017tskrr]. Furthermore, performing leave-one-out cross-validation for the different prediction settings that arise in pairwise learning can also be done in constant time provided that a model is pretrained [@Stock2017tskrr]. We provide the necessary functionality to assess variable importance, impute missing values, and visualize the model. The connection with other packages for computing kernel matrices ensures that \pkg{xnet} can cope with a wide variety of complex and possibly structured data.
The remainder of this work is structured as follows. In Section \ref{sec:snp}, we give a short, self-contained background on kernel-based pairwise learning and two-step kernel ridge regression. We demonstrate the API of the package in Section \ref{sec:useofpackage}, while we present some small case studies in Section \ref{sec:casestudy}. Finally, in Section \ref{sec:relatedsoftware}, we provide some pointers to related software and end with the conclusion in Section \ref{sec:concl}.
\label{sec:snp}
In this section, we give a concise overview of the methods provided by \pkg{xnet}. First, in Section \ref{sec:pwl}, we explain how two-step kernel ridge regression for pairwise learning can be motivated by standard kernel ridge regression. In Section \ref{sec:imputation}, we discuss how two-step kernel ridge regression can be used when the data is incomplete, i.e.\ there is not always a label for each possible pairwise combination of objects. Finally, in Section \ref{sec:cvshortcuts}, we handle model selection and validation. We discuss cross-validation schemes and permutation methods tailored to the pairwise prediction setting to assess the model performance correctly. This section is introductory; we refer to standard works and original publications for in-depth motivations, derivations, and discussion of the various methods.
\label{sec:pwl}
Classical supervised machine learning methods start from a labeled training set. The training set $T$ is a set of $l$ labeled instances $(x_k, y_k)$, $k=1,\ldots,l$, in $\mathcal{X}\times \mathcal{Y}$. Here, $\mathcal{X}$ denotes the space of the instances (also called cases, observations or inputs) and $\mathcal{Y}$ the set of labels (also called targets or outputs). In this work, we assume that $\mathcal{Y}$ is numeric, i.e.\ the label could be real-valued, resulting in a regression problem, or binary, resulting in a binary classification problem.
The goal of supervised learning is to find a suitable prediction function $f : \mathcal{X}\rightarrow\mathcal{Y}$ such that $f(x_k) \approx y_k$ for all $k$. A vital step in designing the prediction function is choosing a suitable feature map $\phi:\mathcal{X}\rightarrow \mathcal{H}$ for the instances. This function links the abstract space of the instances to a concrete Hilbert space $\mathcal{H}$ in which the supervised learning problem is easy to solve. Depending on the data, $\phi(\cdot)$ might be a nonlinear function, and $\mathcal{H}$ could be of very high or even infinite dimension.
Rather than designing a feature mapping explicitly, kernel methods construct a Hilbert space implicitly [@Scholkopf2002]. A kernel function is a symmetric and positive-definite function $\Gamma : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ that expresses the similarity between two instances. The kernel trick states that any valid kernel function corresponds to a dot product in some Hilbert space $\mathcal{H}$, i.e.
$$ \Gamma(x,x') = \langle \phi(x),\phi(x')\rangle_\mathcal{H}\,. $$ From a practical point of view, this means that the kernel trick allows for fitting linear machine learning models in the Hilbert space $\mathcal{H}$ by using only the kernel function and not the explicit mapping $\phi(\cdot)$. Using kernels has two important advantages: an algorithmic one and a conceptual one. From an algorithmic point of view, it is often more efficient to work with kernels. The time and memory requirements[^memory] scale with the number of observations rather than the size of the feature space, which may be of prohibitively large dimensionality. Moreover, it is often easier to describe instances in terms of similarity than by designing a meaningful vector representation directly, especially if the instances are structured objects such as strings (e.g.\ @Vishwanathan2004) or graphs (e.g.\ @Lafferty2002; @Nikolentzos2019). The quintessential example is that of biological sequences, for which the sequence alignment score provides a biologically relevant degree of similarity between two DNA or protein sequences. Countless kernel functions have been developed to process both structured and unstructured data [@Shawe-Taylor2004].
Linear prediction functions in the Hilbert space admit the following dual representation: $$ f(x) = \sum_{k=1}^l w_k \Gamma(x_k,x)\,, $$ with $\mathbf{w} = (w_1,\ldots, w_l)^T$ the weights that are obtained when fitting the model to data.
There exists a plethora of learning methods that can find the weights for the general kernel-based prediction function. In this work, we use kernel ridge regression as a building block for pairwise learning. Kernel ridge regression (KRR) minimizes an objective function consisting of the squared loss with $L_2$ regularization on the weights. The parameters that minimize this objective function can be obtained in closed form:
$$ \mathbf{w} = (\mathbf{\Gamma}+\lambda_\Gamma I)^{-1}\mathbf{y}\,, $$
with $\mathbf{\Gamma}=[\Gamma(x_i,x_j)]$ the $l\times l$ Gram matrix containing the kernel function values of all pairwise combinations of the training set instances and $\lambda_\Gamma$ a regularization hyperparameter preventing overfitting. An attractive property of KRR is that if one computes the eigenvalue decomposition of the Gram matrix $\mathbf{\Gamma}$ (which can be done with a time complexity of $\mathcal{O}(l^3)$), the weights for any value of $\lambda_\Gamma$ can be obtained efficiently, e.g.\ see @Stock2017tskrr. This preprocessing makes it easy to tune the model. The hat matrix of kernel ridge regression is defined as $H_\Gamma=\mathbf{\Gamma}(\mathbf{\Gamma}+\lambda_\Gamma I)^{-1}$ and can be used to transform the vector of labels into a vector of predictions:
\begin{align} \mathbf{f} &= (f(x_1),\ldots, f(x_l))^T\\ &=\mathbf{\Gamma}(\mathbf{\Gamma}+\lambda_\Gamma I)^{-1}\mathbf{y}\\ &= H_\Gamma\mathbf{y}\,. \end{align}
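As a concrete illustration of these formulas, the following sketch (our own toy example in base \proglang{R}, not part of \pkg{xnet}) fits KRR on a small one-dimensional regression problem with a Gaussian kernel:

```r
# Toy data: 20 noisy observations of a sine curve
set.seed(1)
x <- seq(0, 1, length.out = 20)
y <- sin(2 * pi * x) + rnorm(20, sd = 0.1)

# Gaussian kernel Gram matrix: Gamma_ij = exp(-(x_i - x_j)^2 / 0.1)
Gamma <- exp(-outer(x, x, "-")^2 / 0.1)

lambda <- 1e-2
w <- solve(Gamma + lambda * diag(20), y)          # weights (Gamma + lambda I)^{-1} y
H <- Gamma %*% solve(Gamma + lambda * diag(20))   # hat matrix H_Gamma
f <- H %*% y                                      # predictions, identical to Gamma %*% w
```

The predictions `f` computed via the hat matrix coincide with those obtained from the dual representation `Gamma %*% w`.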
[^memory]: For large datasets, using standard kernel methods often becomes infeasible, as most of them scale with $l^2$ in memory. Adapting kernel methods to deal with large datasets is a well-developed and active field of research.
In pairwise learning settings, the goal is to learn properties of pairs of objects. As such, the instances in a dataset consist of pairs, i.e.\ $x=(u, v) \in\mathcal{U}\times\mathcal{V}$, with $\mathcal{U}$ and $\mathcal{V}$ the respective object spaces. Suppose the training data consists of $n$ unique objects in $\mathcal{U}$ and $m$ unique objects in $\mathcal{V}$. A pairwise dataset $T=\{(u_i, v_j, y_{ij}) \mid i\in \{1,\ldots,n\}, j\in \{1,\ldots,m\}\}$ is called complete because there is a label for each of the $nm$ unique combinations. In this case, the labels can be represented as an $n\times m$ label matrix $Y$, with the rows representing the objects in $\mathcal{U}$ and the columns the objects in $\mathcal{V}$.
For both $\mathcal{U}$ and $\mathcal{V}$, we have corresponding kernels that imply a suitable Hilbert feature space. Given two kernel functions $k: \mathcal{U} \times \mathcal{U} \rightarrow \mathbb{R}$ and $g: \mathcal{V} \times \mathcal{V} \rightarrow \mathbb{R}$, we use the following pairwise prediction function:
$$ f(u, v) = \sum_{i=1}^n\sum_{j=1}^m w_{ij} k(u_i,u)g(v_j, v) \,. $$ This prediction function arises when one uses the so-called Kronecker kernel to represent pairs of objects. Analogously to the previous section, we can also find the matrix of weights $W=[w_{ij}]$ using Kronecker KRR:
$$ \text{vec}(W) =(G\otimes K + \lambda I)^{-1} \text{vec}(Y)\,, $$ with $\text{vec}(\cdot)$ the vectorization operator and $\otimes$ the Kronecker product. Naively computing the above solution might be infeasible even for problems of moderate size due to the $nm\times nm$ Gram matrix. In practice, this can be solved efficiently using algebraic tricks, see e.g.\ @pahikkala2013conditional.
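For very small problems, the vectorized solution above can be evaluated naively in base \proglang{R}; this sketch (our own, purely for illustration) makes the cost of forming the $nm\times nm$ system apparent:

```r
set.seed(2)
n <- 5; m <- 4
K <- crossprod(matrix(rnorm(n * n), n))   # positive-definite Gram matrix for U
G <- crossprod(matrix(rnorm(m * m), m))   # positive-definite Gram matrix for V
Y <- matrix(rnorm(n * m), n, m)           # complete label matrix

lambda <- 0.1
# Solve the (nm x nm) system directly; as.vector() plays the role of vec()
w <- solve(kronecker(G, K) + lambda * diag(n * m), as.vector(Y))
W <- matrix(w, n, m)                      # reshape vec(W) back into a weight matrix
```

For realistic $n$ and $m$, the `kronecker(G, K)` matrix quickly exhausts memory, which is exactly why the algebraic shortcuts cited above are needed.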
Two-step kernel ridge regression (TSKRR) computes the weights differently:
$$ W = (K+\lambda_k I)^{-1}Y(G+\lambda_g I)^{-1}\,. $$ This approach can be motivated as applying KRR twice, once for objects in $\mathcal{U}$ and once for objects in $\mathcal{V}$ [@Pahikkala2013a;@Romera-paredes2015]. Each of these KRR steps has its own regularization parameter: $\lambda_k$ and $\lambda_g$. Studies have shown that it behaves similarly to Kronecker KRR [@Stock2017tskrr] and that it is very fast to fit for sizable complete pairwise datasets.
Given the respective hat matrices $H_k = K(K+\lambda_k I )^{-1}$ and $H_g = G(G+\lambda_g I )^{-1}$, the matrix containing the predictions is computed as $$ F = [f(u_i, v_j)] = KWG =H_kYH_g\,. $$ The above derivation focuses on the bipartite case, i.e.\ where $\mathcal{U}$ and $\mathcal{V}$ are different sets. In the homogeneous case, the two objects are in the same space (e.g.\ predicting protein-protein interactions). In this setting, $Y$ usually is a square adjacency matrix and one typically uses the kernel function $k(\cdot,\cdot)$ for both objects: $$ f(u, u') = \sum_{i=1}^n\sum_{j=1}^n w_{ij} k(u_i,u)k(u_j, u') \,. $$ The methodology for the bipartite case can be adapted to the homogeneous case by simply replacing $g(\cdot,\cdot)$ with $k(\cdot,\cdot)$.
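The identity $KWG = H_kYH_g$ can be checked numerically with a few lines of base \proglang{R} (a self-contained sketch of our own, not \pkg{xnet} code):

```r
set.seed(3)
n <- 5; m <- 4
K <- crossprod(matrix(rnorm(n * n), n))   # Gram matrix for objects in U
G <- crossprod(matrix(rnorm(m * m), m))   # Gram matrix for objects in V
Y <- matrix(rnorm(n * m), n, m)           # label matrix

lambda_k <- 0.1; lambda_g <- 0.1
# Two-step KRR: two ordinary linear solves instead of one giant Kronecker system
W  <- solve(K + lambda_k * diag(n), Y) %*% solve(G + lambda_g * diag(m))
Hk <- K %*% solve(K + lambda_k * diag(n))
Hg <- G %*% solve(G + lambda_g * diag(m))
all.equal(K %*% W %*% G, Hk %*% Y %*% Hg)   # both routes give the predictions F
```

Note that the two solves involve only $n\times n$ and $m\times m$ systems, which explains why two-step KRR scales so favourably compared to the naive Kronecker approach.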
\label{sec:imputation}
In the previous section, we assumed that the training set was complete. In many cases, we only possess a subset of these labels. Since the matrix $Y$ cannot be formed in that case, it is not possible to directly use the closed-form solutions for the parameters as given above. @Airola2017genvectric recommend using gradient-based methods together with a specialized algebraic trick to fit the pairwise model. In \pkg{xnet}, we provide a more straightforward iterative method to process incomplete data.
Suppose that $\mathcal{O}\subset \{1,\ldots, n\} \times \{1,\ldots, m\}$ is the subset of index pairs for which a label is available. We define an imputed label matrix $\tilde{Y}^{(t)}$ at iteration $t$ as follows: $$ \tilde{Y}_{ij}^{(t)}=\begin{cases} Y_{ij} & \text{if } (i,j)\in \mathcal{O}\,,\\ \tilde{F}_{ij}^{(t-1)} & \text{otherwise}\,.\end{cases} $$ Here, $\tilde{F}_{ij}^{(0)}$ are the initial guesses for the missing labels. In \pkg{xnet}, the average of the observed labels is used to this end. This imputed matrix $\tilde{Y}^{(t)}$ is subsequently used to construct the corresponding predictions, i.e. $$ \tilde{F}^{(t)} = H_k \tilde{Y}^{(t)} H_g\,. $$ This process is repeated until the predictions and the imputed labels converge. Concretely, one can show that when $\lambda_k$ and $\lambda_g$ are greater than zero, $\lim_{t\rightarrow \infty}\tilde{F}^{(t)}$ contains the predictions obtained by a transductive model on the observed labels, regardless of the initial guesses [@Stock2017phd].
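The iteration can be written down in a few lines of base \proglang{R}; the following self-contained sketch (our own illustration, not the \pkg{xnet} implementation) hides some labels from a toy network and imputes them:

```r
set.seed(4)
n <- 5; m <- 4
K <- crossprod(matrix(rnorm(n * n), n))
G <- crossprod(matrix(rnorm(m * m), m))
Y <- matrix(rnorm(n * m), n, m)
Y[sample(n * m, 6)] <- NA                  # hide some labels

Hk <- K %*% solve(K + 0.1 * diag(n))
Hg <- G %*% solve(G + 0.1 * diag(m))

obs  <- !is.na(Y)
Ytil <- Y
Ytil[!obs] <- mean(Y[obs])                 # initial guess: mean of observed labels
for (t in 1:1000) {
  Ftil <- Hk %*% Ytil %*% Hg               # predictions from the imputed matrix
  Ynew <- ifelse(obs, Y, Ftil)             # keep observed labels, impute the rest
  if (sum((Ynew - Ytil)^2) < 1e-12) break  # stop once the imputations converge
  Ytil <- Ynew
}
```

After convergence, the entries of `Ftil` at the missing positions are the transductive predictions mentioned above.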
\label{sec:cvshortcuts}
A vital aspect of building data-driven models is to perform a correct validation and testing of the model. Validation ensures both that the best model is selected (w.r.t.\ the feature descriptors and hyperparameter values) and that the reported performance is representative of new real-world data. It is well established that validation for supervised network prediction is more complicated than in traditional supervised settings [@Park2012; @Schrynemackers2013;@Pahikkala2015], since for matrix-valued network data the assumption of independence is violated: the individual objects occur in multiple pairs.
In earlier work, we have defined several specialized leave-one-out cross-validation schemes tailored to the pairwise prediction problem [@Stock2018cvshortcuts]. We distinguish between bipartite networks (where $\mathcal{U}\ne \mathcal{V}$, e.g.\ protein-ligand or plant-pollinator networks) and homogeneous networks (where $\mathcal{U} = \mathcal{V}$, e.g.\ protein-protein networks and food webs). For bipartite networks, we can leave out one element at a time (I), one row (R), one column (C), or both (B). For homogeneous networks, we can leave out edges (E) or vertices (V). For validation settings I and E, we can set the interaction values to zero rather than discarding the pairs altogether. We refer to these cases as settings I0 and E0, respectively. Figure \ref{fig:CV} depicts these different cross-validation settings. \pkg{xnet} provides efficient computational shortcuts for performing all network cross-validations. These allow both selecting the best possible model for each setting and exploring in which prediction settings the model performs well and in which it does not.
In addition to the cross-validation schemes, \pkg{xnet} also provides permutation-based methods to analyze the models. Here, the Gram matrices $K$, $G$, or both are randomly permuted: the rows and corresponding columns of the Gram matrices are re-arranged in an arbitrary order. Such a permutation destroys the link between the objects and their kernel function values. The decrease in model performance in leave-one-out cross-validation can then be recorded. The average decrease in performance over several repetitions is an indication of which kernel is most relevant for which prediction setting.
\label{sec:useofpackage}
In this section, we briefly describe the reasoning behind the design of the general user interface of \pkg{xnet}. We also demonstrate the use of the main functions of the package. To this end, we use two datasets, which are also contained within the package. We use a drug-target dataset as an example of a bipartite network and a protein-protein interaction dataset as an example of a homogeneous dataset.
Even though the package \pkg{xnet} has been developed with performance in mind, all functions were written entirely in \proglang{R}. We traced and, where possible, remedied all bottlenecks to obtain acceptable performance for larger networks without adding extra complexity to the code base. Adding compiled code to an \proglang{R} package also comes with extra dependencies and possibly makes maintenance of the package more complicated than necessary [@Claes2014]. As an extra benefit, this decision also allows the user to optimize the performance of the core matrix calculations by combining \proglang{R} with a BLAS library optimal for their own system [@RInstallAdmin].
The package uses the S4 system of class inheritance. We provide three virtual classes named tskrr for general models, tskrrTune for models that were tuned automatically, and tskrrImpute for models with imputed data. While the class tskrr provides the necessary slots to store the model information and outcome, the classes tskrrTune and tskrrImpute provide additional slots to store information on tuning or imputation of missing values.
Two real classes tskrrHeterogeneous and tskrrHomogeneous inherit from tskrr and provide the object structure for heterogeneous and homogeneous models, respectively. Also, for tuned models and models with imputed data, real classes with suffix Heterogeneous and Homogeneous can be found in the package. These inherit from the respective virtual class and the real classes for heterogeneous and homogeneous network models. The scheme ensures that the methods for both types of tskrr models also work for the tuned and imputed versions of these models. A schematic representation of the class structure can be found in Figure \ref{fig:Class}. A detailed description of the provided slots for each class can be found in the vignette "S4 class structure of the \pkg{xnet} package", contained within the package. It can be accessed using:
vignette("xnet_ClassStructure", package = "xnet")
Care has been taken to embed the user interface in the standard workflows of \proglang{R}. We provide methods for the generics dim, fitted, predict, update, plot, labels, dimnames, colnames and rownames. Thanks to the adapted inheritance scheme, the number of specific methods for these and other generics in the package can be kept limited.
In the following sections, we briefly describe and demonstrate the overall use of the main functions in the package. More elaborate examples are included in the vignette "A short introduction to cross-network analysis with \pkg{xnet}", accessible via:
vignette("xnet_ShortIntroduction", package = "xnet")
\label{sec:data}
The package \pkg{xnet} comes with two example datasets that can be used to illustrate the functionality of the package. The dataset drugtarget, first presented by @Yamanishi2008, serves as an example of a bipartite network. It is a small dataset containing all binary interactions between 26 nuclear receptor protein targets and 54 drug-like compounds. We took the adjacency matrix as well as the similarity matrix for the targets from the paper's supplementary materials. As the provided similarity matrix for the compounds turned out not to be entirely symmetric, we opted to recalculate these similarities using the Tanimoto coefficient. This method has been shown to be an appropriate alternative for fingerprint-based similarity calculations [@Ralaivola2005; @Bajusz2015]. The chemical compound structure was first downloaded from the KEGG database [@KEGG2019] using the \pkg{ChemmineR} package [@Cao2008]. The similarities between the chemical compounds were then calculated based on the Tanimoto coefficient, using the \pkg{fmcsR} package [@Wang2013].
The processed dataset consists of three matrices:

- drugTargetInteraction;
- targetSim;
- drugSim.

The example dataset proteinInteraction originates from a paper by @Yamanishi2004. The data describes the interactions between a subset of 150 proteins taken from the datasets provided by the KEGG pathway database [@KEGG2019]. It consists of two matrices:
- proteinInteraction, where 1 indicates an interaction between proteins;
- Kmat_y2h_sc, describing the similarity between the proteins, based on the two-hybrid method.

For details on this dataset, we refer to @Yamanishi2004. A more elaborate description of the example data, including the procedure used to prepare them for analysis, can be found in the vignette "Preparation of the example data" contained within the package. It can be accessed using:
vignette("Preparation_example_data", package = "xnet")
In this section, we explain the general use of the three main modelling functions: tskrr for fitting a two-step kernel ridge regression model, tune for performing a grid search to find optimal values of $\lambda$, and impute_tskrr for imputing missing values as explained in Section \ref{sec:imputation}. We illustrate how the user interface is kept consistent throughout the package and explain the controls specific to each function.
One can fit a two-step kernel ridge regression model using the function tskrr() with the following argument signature:
tskrr(y, k, g = NULL, lambda = 1e-04,
testdim = TRUE, testlabels = TRUE, keep = FALSE,
symmetry = c("auto", "symmetric", "skewed"))
This function takes as input the interaction matrix y and either one kernel matrix k in the case of a homogeneous network, or two kernel matrices k and g in the case of a heterogeneous network. It also allows for setting the tuning parameters $\lambda_k$ and, if applicable, $\lambda_g$ using the argument lambda. By default, the function uses $10^{-4}$ for both[^defaulthp]. If one fits a heterogeneous network but only specifies a single value for the argument lambda, this value will be used for both $\lambda_k$ and $\lambda_g$.
[^defaulthp]: These default values ensure a nearly unbiased estimation of the regression parameters. They are strictly greater than zero to ensure numerical stability.
Apart from fitting the model as described in Section \ref{sec:pwl}, the function tskrr also carries out some sanity checks. Checking whether dimensions of the different matrices are compatible can be turned off with testdim = FALSE. The mechanism to check whether row and column labels of the matrices match can be turned off with testlabels = FALSE. To speed up predictions from the model, users can also opt to store the hat matrices in the resulting model object by setting keep = TRUE.
In the case of a homogeneous network, one expects the interaction matrix y to be either symmetric or skew-symmetric. The function tskrr checks by default whether this condition is fulfilled. However, this check can be somewhat time-consuming for larger matrices. Hence, the user can specify the symmetry, if known, using the argument symmetry.
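To make the above concrete, fitting a heterogeneous model on the drug-target data from Section \ref{sec:data} takes a single call (using the default $\lambda = 10^{-4}$ for both kernels):

```r
library(xnet)
data(drugtarget)

# Heterogeneous model: targets along the rows (k), drugs along the columns (g)
fit <- tskrr(drugTargetInteraction, k = targetSim, g = drugSim)
fit
```

Printing the resulting object shows a summary of the fitted model, including the lambda values used.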
The function tune implements a grid search to find optimal values for lambda. It has the following argument signature:
tune(x, k, g = NULL, lambda = NULL,
testdim = TRUE, testlabels = TRUE, keep = FALSE,
symmetry = c("auto", "symmetric", "skewed"),
lim = c(1e-04, 1), ngrid = 10,
fun = loss_mse, exclusion = "edges", replaceby0 = FALSE,
onedim = TRUE)
It calculates the value of a given loss function for a range of values of $\lambda$ on a one-dimensional or two-dimensional grid, and can be used similarly to the function tskrr. Alternatively, one can first build a model using tskrr and then perform a grid search using the created model object as the input for the tune function.
To control which lambda values are used for the grid search, tune can take a vector or a list with two elements specifying the values of $\lambda$ used for constructing the grid. Alternatively, one can pass boundaries for the values of $\lambda$ with the argument lim and the number of values of $\lambda$ in each dimension with the argument ngrid. The package function create_grid will use that input to construct a grid that is evenly spaced on a logarithmic scale.
The performance of the model is evaluated using loss functions provided by the argument fun. This function can either be one of the built-in loss functions loss_mse or loss_auc, or a function provided by the user. The loss is calculated based on observed interactions versus leave-one-out predictions. The exact nature of the leave-one-out procedure can be set using the arguments exclusion and replaceby0 (see also Section \ref{sec:performloo}).
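As a minimal, self-contained example, a one-dimensional grid search (onedim = TRUE, so a single $\lambda$ is used for both kernels) on the drug-target data could look as follows:

```r
library(xnet)
data(drugtarget)

# Fit a model first, then tune it via the model-object interface
fit <- tskrr(drugTargetInteraction, k = targetSim, g = drugSim)
tuned <- tune(fit, lim = c(1e-4, 1), ngrid = 20,
              fun = loss_auc, exclusion = "interaction",
              replaceby0 = TRUE)
lambda(tuned)   # lambda value(s) minimizing the leave-one-out AUC loss
```

A two-dimensional search over separate $\lambda_k$ and $\lambda_g$ grids is shown in the case study in Section \ref{sec:casestudy}.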
For imputation of missing values as explained in Section \ref{sec:imputation}, the package \pkg{xnet} offers a function impute_tskrr with the following usage:
impute_tskrr(y, k, g = NULL, lambda = 0.01,
testdim = TRUE, testlabels = TRUE, keep = FALSE,
symmetry = c("auto", "symmetric", "skewed"),
niter = 10000, tol = sqrt(.Machine$double.eps),
start = mean(y, na.rm = TRUE), verbose = FALSE)
Again, this function provides the same arguments and checks as the function tskrr. Additionally, the function impute_tskrr allows the user to set the maximum number of iterations, the tolerance for convergence and the starting values for the algorithm. If not provided, the average over the interaction matrix is chosen as starting value for the imputation. The deviation between two subsequent iterations is computed as the sum of the squared differences between the predicted values in each iteration. This value determines whether the algorithm has converged. The imputation function also allows tracking the changes in deviation during the algorithm run by setting the argument verbose to a value larger than~1.
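As a small illustration (our own sketch; we artificially remove some labels from the complete drug-target data to create an incomplete network):

```r
library(xnet)
data(drugtarget)

# Artificially hide some interactions to mimic an incomplete dataset
y <- drugTargetInteraction
set.seed(5)
y[sample(length(y), 25)] <- NA

# Impute the missing labels iteratively
imp <- impute_tskrr(y, k = targetSim, g = drugSim)
```

The returned object behaves like a regular tskrr model, with additional slots recording the imputation.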
\label{sec:performloo}
The various shortcuts for leave-one-out cross-validation (LOO-CV), as described by @Stock2018cvshortcuts, form the most significant contribution of \pkg{xnet}. In the package \pkg{xnet}, the function loo() can perform LOO-CV using any of the settings mentioned in Section \ref{sec:cvshortcuts}. Moreover, multiple other functions make use of LOO-CV in their calculations. The previously discussed function tune is just one example. In every function that can make use of LOO-CV, the exact settings for performing LOO-CV are controlled by the same two arguments exclusion and replaceby0. The translation of different settings to values for the argument exclusion is shown in Table \ref{tab:shortcutsettings}.
\begin{table}
\begin{tabular}{l|ll}
\textbf{LOO-CV setting} & \textbf{heterogeneous network} & \textbf{homogeneous network} \\
\hline
one element (I) & \verb|"interaction"| & \verb|"edges"| \\
a row (R) & \verb|"row"| & not applicable \\
a column (C) & \verb|"column"| & not applicable \\
both row and column (B) & \verb|"both"| & \verb|"vertices"| \\
\end{tabular}
\caption{This table shows how the different options for the argument \texttt{exclusion} translate to the specific cross-validation settings explained in Section \ref{sec:cvshortcuts}.}
\label{tab:shortcutsettings}
\end{table}
In the case of observational data, not observing an interaction (a zero in the adjacency matrix) is no proof that the interaction is absent in reality. In those cases, it is more sensible to compute the LOO-CV values by replacing the interaction with 0 instead of removing it. By setting replaceby0 = TRUE, functions use the I0 and E0 cross-validation settings for bipartite and homogeneous networks, respectively.
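A direct call to loo() then looks as follows (a self-contained sketch on the drug-target data; setting I0 via replaceby0 = TRUE):

```r
library(xnet)
data(drugtarget)

fit <- tskrr(drugTargetInteraction, k = targetSim, g = drugSim)

# Leave out one interaction at a time, replacing it by 0 (setting I0)
loopreds <- loo(fit, exclusion = "interaction", replaceby0 = TRUE)
```

The leave-one-out predictions in loopreds can then be compared against the observed interactions with any loss function, as tune does internally.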
\label{sec:casestudy}
In this section, we illustrate the use of the different functions on the drug-target dataset described in Section \ref{sec:data}. To illustrate how to perform predictions using the model, we treat the first three drugs in the data as new drugs for which any interaction with nuclear receptors is unknown. The interactions between these drugs and nuclear receptors are removed from both the interaction matrix and the kernel matrix with drug similarities. To be able to make predictions later on, we also create a kernel matrix testdrugSim, which contains the kernel function values for every combination of the drugs of the training set and the drugs in the test set.
data(drugtarget)
interaction <- drugTargetInteraction[, -c(1:3)]
traindrugSim <- drugSim[-c(1:3), -c(1:3)]
testdrugSim <- drugSim[1:3, -c(1:3)]
We first use the tune function to search for optimal values of the parameters $\lambda$. For illustrative purposes, we use a 2D grid search, setting different parameters for the grid in each dimension. As the interaction matrix is a binary indicator of presence/absence, we use the area under the ROC curve as a loss function. As this is an observational dataset, we choose to calculate the leave-one-out values by replacing interactions by 0. The function lambda() will return the values at which the lowest loss is obtained.
trained <- tune(interaction, targetSim, traindrugSim,
                lim = list(k = c(1e-4, 10), g = c(1e-3, 10)),
                ngrid = list(k = 20, g = 10),
                fun = loss_auc,
                exclusion = "interaction",
                replaceby0 = TRUE)
lambda(trained)
Note that one does not need to refit the model with the obtained values of $\lambda$. Due to the inheritance structure, the resulting object trained can be treated as if it were a standard model fitted with the lambda values that minimize the loss. These values are also reported in the console output when inspecting the model object.
trained
The package \pkg{xnet} offers two different visualizations for the resulting model. First, we provide a plot.tskrr method to generate heatmaps for predictions, response, LOO-CV values or residuals. This way, the function plot() can be used to examine both the original interaction matrix and the standard, tuned or imputed model results. Additionally, the similarity information contained in the k and g kernel matrices can be represented by an optional dendrogram along the rows and columns, respectively. Using the arguments rows and cols, a subset of the predicted interactions can be plotted. Figure \ref{fig:modelplot} shows a subset of the residuals for the training model created earlier, plotted using the following code:
plot(trained, dendro = "none", which = "residuals",
     cols = 10:30, rows = 10:25,
     main = "Residuals for drug target interaction.")
The results indicate possible interactions not present in the original data, but a stark warning is warranted here. Due to the linear nature of two-step kernel ridge regression, predictions on a binary interaction matrix might fall outside the domain $[0,1]$ of possible values. It is tempting to interpret the predicted values as a measure of probability for interaction, but those values can only be considered a crude approximation of this probability at best. In reality, they are scores, suggesting whether an interaction is possible or not. The correct way of processing them is by ranking. The procedure followed in this example fits well in an exploratory analysis, but any result has to be confirmed through other means.
Specifically for tuned models, a function plot_grid gives a graphical representation of the loss values for different (combinations of) values of $\lambda$. With the following code, one can inspect the loss surface for the training model. The result is shown in Figure \ref{fig:lossplot}.
plot_grid(trained, main = "Loss surface for drug-target interaction")
Following the general user interface for models in R, a predict method is available in \pkg{xnet} for prediction of new interactions based on the kernel similarities for the nodes in the training dataset. These predictions can be made for new rows, new columns, or both. Predictions are made for the cases presented in the rows of every kernel matrix. In the current example, similarities between the drugs are represented by the g matrix. Hence we need the kernel similarities between every new drug and the drugs included in the training data. The new drugs are presented in the rows. The first six columns are shown below.
testdrugSim[, 1:6]
Using this matrix as the kernel for g, one can calculate the predictions for the three new drugs using the function predict. With the roc and auc functions of the package \pkg{pROC} [@pROC2011], the result can be compared with the actual documented interactions. Even with the limited information in this small sample, the model achieves an AUC value high enough for exploratory analyses.
library(pROC)
newpreds <- predict(trained, g = testdrugSim)
pvec <- as.vector(newpreds)
rvec <- as.vector(drugTargetInteraction[, 1:3])
curve <- roc(rvec, pvec)
auc(curve)
\label{sec:imputestudy}
Next to two-step kernel ridge regression, the package \pkg{xnet} also implements the iterative algorithm for missing value imputation as explained in Section \ref{sec:imputation}. In this example, we demonstrate how this can be carried out on a homogeneous network describing protein-protein interactions. The origin and structure of the data is described in Section \ref{sec:data}.
To illustrate the algorithm, we first remove a set of values from the original data. The resulting matrix, called proteiny in the code below, then serves as an example of a homogeneous network where some vertices are not yet annotated.
data("proteinInteraction")
idrow <- seq(10, 150, by = 10)
idcol <- seq(5, 145, by = 10)
proteiny <- proteinInteraction
proteiny[idrow, idcol] <- proteiny[idcol, idrow] <- NA
Next, we use this interaction matrix and the kernel matrix Kmat_y2h_sc to impute the missing values. It is tempting to select a minimal value for $\lambda$, but a smaller value of $\lambda$ also means a smaller difference between iterations and hence a slower convergence. One can even set the value of $\lambda$ so small that the difference between iterations becomes smaller than the specified tolerance, stopping the algorithm at the second iteration. For this reason, the package uses a higher default value for the argument lambda compared to the other functions.
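The effect of $\lambda$ on convergence can be checked directly. A minimal sketch, assuming impute_tskrr accepts lambda and verbose arguments as in the package documentation (verbose reporting iteration progress):

```r
# Smaller lambda: smaller updates per iteration, hence slower convergence
imp_small <- impute_tskrr(y = proteiny, k = Kmat_y2h_sc,
                          lambda = 0.001, verbose = TRUE)

# Larger lambda converges in fewer iterations
imp_large <- impute_tskrr(y = proteiny, k = Kmat_y2h_sc,
                          lambda = 0.1, verbose = TRUE)
```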
The choice for the starting value with which the missing values are replaced has less effect on the algorithm. The function impute_tskrr uses the average of the interaction matrix by default. As the protein-protein interaction matrix only contains 163 interactions between a set of 150 proteins, this average is very low. In such a setting, one can choose a more neutral starting value, as we do in this example.
imputed <- impute_tskrr(y = proteiny, k = Kmat_y2h_sc, start = 0.5)
To examine the imputed missing values quickly, the package \pkg{xnet} provides several convenient functions. The function is_imputed returns a matrix where a logical TRUE value indicates the position of an imputation. Using the function fitted, one can extract all fitted values from the resulting model object. Combining both, it becomes trivial to extract the imputed values. Moreover, one can use this information in combination with the rows and cols arguments of the plot.tskrr method to visualize the predictions for only those proteins for which a missing value is imputed. This is illustrated in the following code block. Note that we plot the response, i.e.\ the original interaction matrix with the imputed values added.
id <- is_imputed(imputed)
rowid <- rowSums(id) > 0
colid <- colSums(id) > 0
plot(imputed, rows = rowid, cols = colid,
     dendro = "none", which = "response")
The results, represented in Figure \ref{fig:plotimpute}, show a stark difference in value between the known interactions (dark red) and imputed values. Due to the sparse nature of the interaction matrix, strong predictions for new interactions are not to be expected.
To evaluate the imputation, we compare the imputed values against the original data using the AUC. To this end, we can use additional functions to extract information from the imputed model. The function which_imputed returns an integer vector with the positions for which the values are imputed. This allows one to extract the original values from the matrix proteinInteraction, and the imputations from the response matrix contained in the imputed model object. With these two vectors, calculating the AUC becomes trivial.
id <- which_imputed(imputed)
orig <- proteinInteraction[id]
imp <- response(imputed)[id]
curve <- roc(orig, imp)
auc(curve)
Even though the imputed values are all rather small, in this example too we obtain a moderate accuracy with limited information.
\label{sec:relatedsoftware}
Pairwise learning is a viable approach for various machine learning problems, including network inference, missing value imputation, collaborative filtering, multi-task learning, transfer learning and zero-shot learning. These problems can all be posed as a supervised network prediction problem, or, equivalently, predicting values, rows, and columns in a matrix. As such, the TSKRR model provided by our \pkg{xnet} package offers functionality similar to that of numerous other packages.
The conceptually most similar package is \pkg{RLScore} [@Pahikkala2016], a \proglang{Python} library implementing many least-squares-based kernel learning algorithms, including Kronecker-kernel ridge regression and two-step kernel ridge regression. The \proglang{Julia} library \pkg{Kronecker.jl} provides all the analytic shortcuts to develop Kronecker-based learning methods, though it is designed with a researcher in mind rather than an end-user. The linear filtering method [@Stock2016a], a special case of the TSKRR model [@Stock2017tskrr], is provided in the \pkg{EcologicalNetworks} package [@Poisot2019EN], also in the \proglang{Julia} language. This package contains various methods to analyze ecological networks.
Although the \proglang{R} programming language has several packages implementing kernel methods, our package is the first to offer this pairwise learning framework. Since \pkg{xnet} only provides functions for training, validation, and visualization, we recommend using other packages for computing the kernel matrices. For example, \pkg{kernlab} [@Karatzoglou2004] provides the most popular kernels for vectorial data, \pkg{graphkernels} provides kernel functions to compare graphs, while the \pkg{SHOGUN} toolbox [@Sonnenburg2010] provides, among other functionalities, string kernels.
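For example, a radial basis function kernel matrix computed with \pkg{kernlab} can serve directly as the k or g input. A minimal sketch on simulated data (the sigma value is an arbitrary choice for illustration):

```r
library(kernlab)

# Simulated feature matrix: 20 objects described by 5 features
x <- matrix(rnorm(100), nrow = 20)

# RBF kernel matrix between all pairs of objects
rbf <- rbfdot(sigma = 0.5)
K <- kernelMatrix(rbf, x)
dim(K)  # 20 x 20
```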
Given \proglang{R}'s large ecosystem of packages for analyzing biological data, there exist many packages for network prediction. For example, for gene regulatory network prediction, \pkg{SIRENE} [@Mordelet2008], \pkg{minet} [@Meyer2008] and \pkg{Parmigene} [@Sales2011] provide unsupervised methods for gene-gene interaction prediction. The importance of correct validation is illustrated by packages such as \pkg{NetBenchmark} [@Bellot2015a], which provides tools to test the robustness of different network inference methods using simulators. \pkg{xnet} complements this pool of unsupervised approaches with a supervised network prediction method.
\proglang{R} also has packages for missing value imputation, of which \pkg{mice} (Multivariate Imputation by Chained Equations) [@VanBuuren2011] and \pkg{missForest} [@Stekhoven2012] are well-known examples. Our package does not have the ambition to replace these. Most specialized methods for problems such as missing value imputation will likely outperform it, even if only by a small margin. The goal of \pkg{xnet} is to provide a single method that can handle various machine learning problems, including missing value imputation, but also supervised network prediction, multi-task learning, and zero-shot learning.
\label{sec:concl}
We have presented the \pkg{xnet} package, an easy-to-use software tool for supervised network prediction using two-step kernel ridge regression. As a kernel-based method, it allows one to work with complex, structured objects and learn general nonlinear associations. We provide fast cross-validation shortcuts, visualization methods, and tuning functions to tune, validate, and analyze the models correctly. We hope that \pkg{xnet} can be of use in a variety of disciplines.
\vspace{0.5cm} Acknowledgements
MS is supported by the Research Foundation - Flanders (FWO17/PDO/067). This research received funding from the Flemish Government under the ``Onderzoeksprogramma Artifici\"ele Intelligentie (AI) Vlaanderen'' program.
\vspace{0.5cm} Author contributions
Conceptualization: MS, BDB; data curation: MS, JM; methodology: MS; software: MS, JM; writing - original draft: MS, JM; writing - review \& editing: MS, JM; supervision: BDB.