Installation


To download the package from my GitHub account, use:

devtools::install_github("lkshrsch/gLVInterNetworks", build_vignettes = TRUE)

Note: After installing the package for the first time, or updating to a new version, restart the RStudio session before usage.

To load the package into the R environment

library(gLVInterNetworks)

Generation of in silico data

To generate in silico data sets, we use the following command with its mandatory arguments:

data <- gLVgenerateData(species = 2,
                        number_of_interactions = 2,
                        timepoints = 40,
                        noise = 0.01,
                        testData = 20)

Note that, depending on the size of the system to simulate (number of species, etc.), this command can print messages about the difficulty of the numerical integration of the solution; these can be ignored.

data("exampleData1")

The returned variable data is a list containing the following attributes (illustrated here with the bundled exampleData1):

names(exampleData1)

Of these, species, timepoints, noise, sparsity, and testData correspond to the input arguments.

To visualize the representation of the system as an interaction network, use the plotGraph() function on the object. The arguments vsize and edgeSize adjust the size of the vertices and edges, respectively, and must be tuned by the user when visualizing networks of different sizes. The verbose flag also prints the numeric values of the interaction parameters on the respective edges connecting the variables.

plotGraph(x = exampleData1, 
          vsize = 50, 
          edgeSize = 1, 
          main="Interaction network for exampleData1", 
          verbose = TRUE)

Optionally, the network can be plotted with vertex and edge sizes relative to the magnitudes they represent by activating the relativeVSize and relativeEdgeSize flags (setting them to TRUE; the default is FALSE). In this case, the vertex sizes are proportional to the mean abundance of the variable they represent, weighted by a user-set scaling factor given by the vsize argument. The relativeEdgeSize flag sets the width of the edges relative to the magnitude of the interaction parameter they represent.

plotGraph(x = exampleData1,
          relativeVSize = TRUE ,
          vsize = 25, 
          relativeEdgeSize = TRUE, 
          main="Relative vertex and edge sizes")

Basic plotting of the time series data:

plot(exampleData1, pch = 1:exampleData1$species)
legend("topright", 
       legend = colnames(exampleData1$obs[,-1]), 
       pch = 1:exampleData1$species, 
       col=1:exampleData1$species)

Alternatively, set the legend flag to TRUE for quick plot generation:

plot(exampleData1, legend = T)

Loading data sets

The package comes with a small example of raw data stored in a csv file, with a time column containing the dates when the observations were taken. The .csv file is a text file with entries separated by semicolons (";"). To get the absolute path of the file 'example1.csv' installed in the package directory 'rawdata/', use

fpath <- system.file("rawdata", "example1.csv", package="gLVInterNetworks")

Then load the file:

df <- read.csv2(file = fpath, header = TRUE)
head(df)
matplot(df[,-1])

We can convert the data.frame to a numerical matrix using

exampleData <- data.matrix(df)

This transforms the dates into numerical values. The original measurement interval (one observation per day) sets the time unit in which the estimated rate parameters are expressed after fitting the model to the data.

head(exampleData)

Fitting the generalized Lotka-Volterra model to the data

The generalized Lotka-Volterra model consists of a set of differential equations, nonlinear in the variables $x$. It describes the rate of change in time of the abundance of each variable (species, or taxonomic/functional unit):

$$\frac{dx_i}{dt} = \alpha_i x_i + \sum_j \beta_{ij}\, x_i x_j$$

Or, written more compactly,

$$\dot{x}_i = x_i \left( \alpha_i + \sum_j \beta_{ij}\, x_j \right)$$
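
As an illustration of these equations (not the package's internal code), the right-hand side can be written as a function usable with deSolve::ode(), assuming the deSolve package is installed; the parameter values below are arbitrary and for demonstration only.

library(deSolve)

# Right-hand side of the gLV equations: dx_i/dt = x_i * (alpha_i + sum_j beta_ij * x_j)
gLV_rhs <- function(t, x, parms) {
  dx <- x * (parms$alpha + parms$beta %*% x)
  list(as.vector(dx))
}

# Two-species example with arbitrary illustrative parameter values
parms <- list(alpha = c(0.5, 0.3),
              beta  = matrix(c(-0.1, 0.02, -0.05, -0.1), nrow = 2))
out <- ode(y = c(x1 = 1, x2 = 1), times = seq(0, 40, by = 1), func = gLV_rhs, parms = parms)
head(out)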

Written in matrix form for a two species system:

$$\begin{pmatrix} \dot{x}_1 \\ \dot{x}_2 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \times \left[ \begin{pmatrix} \alpha_1 & \beta_{11} & \beta_{12} \\ \alpha_2 & \beta_{21} & \beta_{22} \end{pmatrix} \cdot \begin{pmatrix} 1 \\ x_1 \\ x_2 \end{pmatrix} \right]$$

This parameter matrix, with the growth terms in the first column and the interaction parameters in the remaining square block, corresponds to the parameter output in the R code:

data("exampleData1")
print(exampleData1$Parms)

$$ = \begin{pmatrix} \alpha_1 & \beta_{11} & \beta_{12} \\ \alpha_2 & \beta_{21} & \beta_{22} \end{pmatrix}$$
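
Assuming the layout above (one row per species, growth terms in the first column), individual entries can be read off directly, for example:

print(exampleData1$Parms[1, 1])  # alpha_1, growth rate of species 1
print(exampleData1$Parms[2, 3])  # beta_22, self-interaction of species 2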

And for bigger systems, such as the six-species simulation in "exampleData2":

data("exampleData2")
print(exampleData2$Parms)

Model parameters for variables x~1~ to x~6~

$$= \begin{pmatrix} \alpha_1 & \beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} & \beta_{15} & \beta_{16} \\ \alpha_2 & \beta_{21} & \beta_{22} & \beta_{23} & \beta_{24} & \beta_{25} & \beta_{26} \\ \alpha_3 & \beta_{31} & \beta_{32} & \beta_{33} & \beta_{34} & \beta_{35} & \beta_{36} \\ \alpha_4 & \beta_{41} & \beta_{42} & \beta_{43} & \beta_{44} & \beta_{45} & \beta_{46} \\ \alpha_5 & \beta_{51} & \beta_{52} & \beta_{53} & \beta_{54} & \beta_{55} & \beta_{56} \\ \alpha_6 & \beta_{61} & \beta_{62} & \beta_{63} & \beta_{64} & \beta_{65} & \beta_{66} \end{pmatrix}$$

Nonlinear Regression to estimate values for the model parameters

To estimate the parameter matrix from the equations above, use

nlrData2 <- gLVnonlinearRegression(data = exampleData2)
data("nlrData2")

The resulting list contains the objects returned by the regression; their number and names can be inspected with length(names(nlrData2)) and names(nlrData2).

To see explanations of each input argument and of the output values at any time during the R session, use the help function:

?gLVnonlinearRegression

Interpretation of results and model identifiability

Without knowing the original mechanisms and causes that generated the data, the only reasonable interpretation of these results concerns the goodness/usefulness of the approximation given by the solution found and the uniqueness of this solution.

In this case, the algorithm achieved a fit with a total sum of squared residuals of

print(nlrData2$SSR)

Note that this value has no objective meaning on its own; it is only meaningful relative to other solutions when comparing them.

The estimated parameter matrix is

print(nlrData2$Parms)

With their standard errors:

nlrData2$SE

With the standard errors and degrees of freedom we can compute standard t-tests, with the null hypothesis that the parameter value is equal to zero (i.e. no interaction). This is the output of the summary function applied to the regression result:

summary(nlrData2)
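
For reference, the t statistics behind this summary can be reproduced approximately by hand. The sketch below assumes that Parms and SE are aligned element-wise and that the residual degrees of freedom equal the number of data points minus the number of parameters, which may differ from what summary() uses internally.

# By-hand t statistics (sketch; element-wise alignment of Parms and SE is assumed)
est  <- as.vector(nlrData2$Parms)
se   <- as.vector(nlrData2$SE)
tval <- est / se
df   <- nrow(exampleData2$obs) * exampleData2$species - length(est)  # assumed degrees of freedom
pval <- 2 * pt(-abs(tval), df)                                       # two-sided p-values
head(cbind(estimate = est, SE = se, t = tval, p = pval))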

We can plot the solution of the fitted model with estimated parameters against the original observations using the points() function

plot(nlrData2, type="l", main="Solution of fitted model vs original observations")
points(exampleData2)

The points function can also be used to compare single trajectories:

plot(nlrData2, type="l", main="Solution vs observations from gate 2")
points(exampleData2, index = 2)

Sensitivity analysis

We can perform a sensitivity analysis on the estimated parameter set with the function sensitivityAnalysis(). This function requires the original observations to be stored in a variable named "data".

data <- exampleData2  
ident <- sensitivityAnalysis(Parms = nlrData2$Parms)

This function produces two large tables, saved under the names "sens" and "coll"

names(ident)

"sens" corresponds to a sensitivity matrix. It evaluates how much the output of the model varies when the value of a single parameter is changed; in other words, how sensitive the model is to each individual parameter. Parameters that have a large effect on the model output are usually easier to identify, while parameters that have little effect on the model output are often harder to identify to a precise value. This analysis only considers parameters individually and thus does not account for correlations and multicollinearity issues between parameters.

head(ident$sens)

An overview plot showing the sensitivity of the parameters for variable 2 can be generated with the plot function on ident$sens

plot(ident$sens, which=2, legpos = -30)

And if we want to try to identify which sensitivity trajectory corresponds to which parameter, we can assign distinct plotting symbols and colors:

plot(ident$sens, which=2, type="b" , pch=c(1:10,1:10,1:10,1:14), col=1:44)

In this overpacked and hard-to-read plot we see that most parameters have little effect on the output for variable 2 (lines close to the zero line), while some parameters have a larger effect on the abundance of variable 2 (lines further away from zero, for example the gray line of parameter 16, which corresponds to $\beta_{42}$, i.e. the effect of variable 4 on variable 2). Trajectories that are symmetric about the zero line correspond to parameters with reciprocal sensitivities on the model output, meaning that they are potentially correlated and that changing their values simultaneously can produce a similar fitting solution.

When dealing with large systems with many parameters, visualizations of parameter sensitivities like the one above can be confusing and hard to evaluate. The main message we can retrieve is that most parameters seem to have similarly shaped sensitivity trajectories on the model output, which can be troublesome for finding unique solutions to the estimation problem.

Less sensitivity (by default, a lower L2 norm) means that a parameter is harder to identify and that its estimate is more correlated with the values that other parameters obtain. This is seen in the summary statistics of the parameter matrix with their standard deviations. Note that the more sensitive parameters are the ones with narrower standard deviations (significantly different from zero):

sSens <- summary(ident$sens)
sSens[order(sSens$L2, decreasing = TRUE),c("value","L2")]

The ten most sensitive parameters are given by rownames(sSens[order(sSens$L2, decreasing = TRUE), c("value","L2")])[1:10].

s <- cbind(1:42,summary(nlrData2))
s[order(s[,5]),]

The ten parameters with the most certain values (smallest standard errors) are given by order(s[,5])[1:10].

In these two lists, eight out of ten parameters coincide (40, 3, 30, 6, 1, 5, 35, 37).

Parameter collinearities

In order to evaluate parameter collinearities and the problems that may arise from similar sensitivity trajectories, we can take a look at the second output of the sensitivityAnalysis() function, namely the "coll" table:

head(ident$coll)

In contrast to the "sens" table, here groups of parameters are analyzed simultaneously to evaluate the degree of linear dependency between them; in other words, how a similar model output (and thus solution to the fitting problem) can be achieved using different values for the parameters involved, and at which level of precision.

Each column corresponds to a parameter in the model, which is either present in the evaluated subset (= 1) or absent (= 0). The second-to-last column, named "N", displays how large the subset is (i.e. how many parameters are contained in the subset for which the collinearity is evaluated). The last column, named "collinearity", displays the collinearity index for the subset of selected parameters. For example, row 6 evaluates the subset of size 2 containing parameters 1 and 7, which has a collinearity index of 2.7. The collinearity index indicates how closely we can approximate the model solution by simultaneously changing the parameters in the subset. In this case it means that we can compensate a change in the value of parameter 1 by appropriately changing the value of parameter 7, and in doing so reduce the resulting change in the model output to a fraction of 1/2.7.

The higher the collinearity index, the more closely we can approximate a given model solution with a different parameter set. Two linearly dependent parameters (for example $\alpha$ and $\beta$ in the model $y = x(\alpha + \beta)$) would have an infinite collinearity index, as we can achieve exactly the same solution with different values for $\alpha$ and $\beta$. As a rule of thumb, collinearity indices above 30 are considered troublesome for estimation algorithms.
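
For intuition, a collinearity index for a given parameter subset can be computed from the columns of a sensitivity matrix as the inverse square root of the smallest eigenvalue of the normalized cross-product matrix. The sketch below illustrates this definition; it is not the code used internally by sensitivityAnalysis().

# Collinearity index for a subset of columns of a numeric sensitivity matrix S
# (rows = model output points, columns = parameters)
collIndex <- function(S) {
  Sn <- apply(S, 2, function(s) s / sqrt(sum(s^2)))   # normalize each column to unit length
  lambda_min <- min(eigen(crossprod(Sn), symmetric = TRUE)$values)
  1 / sqrt(lambda_min)
}
# e.g. collIndex(S[, c(1, 7)]) for the subset containing parameters 1 and 7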

The function gLVnonlinearRegression() attempts to estimate all parameters simultaneously. Therefore, the relevant quantity here is the collinearity index for the full set of parameters:

ident$coll[ident$coll[,"N"]== length(nlrData2$Parms),]

We can retrieve this value

coll_index <- ident$coll[ident$coll[,"N"]==42,"collinearity"]
paste0("Collinearity index for the complete set of 42 parameters: ", coll_index)

Thus the minimum level of precision required for a unique solution with this parameterization is 1/coll_index $\sim 3.9 \times 10^{-4}$. This can be compared to the precision with which the model solution fits the original data, described by the standard deviation of the residuals:

Level of precision achieved by the solution:

nlrData2$residual_SD

Level of precision achievable by different parameterizations (different numeric solutions for the whole parameter matrix):

1/coll_index

Is the solution unique?

nlrData2$residual_SD < 1/coll_index

Linear regression

A linearized, algebraic and discrete version of the model equations is obtained by decoupling the variables:

$$\log(x_i(t+1)) = \alpha_i + \sum_j \beta_{ij}\, x_j(t)$$

An estimate of the parameters in this form is quickly and easily obtained through linear regression:

lrData1 <- gLVlinearRegression(data = exampleData1)
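
For intuition, a hand-rolled version of this linearized fit could look roughly like the following. This is a sketch assuming unit time steps and strictly positive abundances, based on the equation above; it is not the implementation used by gLVlinearRegression().

# Hand-rolled linearized fit (sketch): regress log abundance at t+1 on abundances at t
X <- as.matrix(exampleData1$obs[, -1])        # drop the time column
fits <- lapply(seq_len(ncol(X)), function(i) {
  lm(log(X[-1, i]) ~ X[-nrow(X), ])
})
round(t(sapply(fits, coef)), 3)               # rows: species; columns: alpha_i, beta_i1, ..., beta_in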

We can inspect the solution just like we did before with the results from the nonlinear regression.

In contrast to the nonlinear regression, the parameters of the linear model are a unique solution, as the sum of squared errors has a single global optimum.

Due to errors introduced by the discretization and the slope approximation, and due to neglecting the time correlation between residuals, the solution of the linear regression is distorted and only accurate for very small and simple systems. Nevertheless, it is often useful to compute the linear regression parameter estimates and provide them as alternative starting values (parms0 argument) for the nonlinear regression when it fails to converge from the zero start vector:

nlrData1 <- gLVnonlinearRegression(data = exampleData1, parms0 = lrData1$Parms)

In large systems the linear regression often returns very high parameter estimates, which are neither useful as start vectors for the nonlinear regression nor have realistic biological interpretations. For these cases it is useful to shrink the parameter matrix to avoid numerical instabilities. This can be done through regularization:

lrData2 <- gLVlinearRegression(data = exampleData2)
lrData2$Parms
paste0("range of estimated parameters: [", round(min(lrData2$Parms),1), " , ", round(max(lrData2$Parms),1), "]")
lrData2_regularized <- gLVlinearRegression(data = exampleData2, regularization = TRUE, alpha = 0.5)
lrData2_regularized$Parms
paste0("range of estimated parameters after regularization: [", round(min(lrData2_regularized$Parms),1), " , ", round(max(lrData2_regularized$Parms),1), "]")

Note that regularization is computed through an optimization method involving cross-validation with random partitions of the data, so the solution is not unique and is likely to change between runs.
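
If reproducible regularized estimates are needed, one option (assuming the cross-validation relies on R's random number generator) is to fix the seed before the call:

set.seed(1)  # arbitrary seed, to make the random cross-validation partitions reproducible
lrData2_regularized <- gLVlinearRegression(data = exampleData2, regularization = TRUE, alpha = 0.5)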

Comparing different Network Structures

In order to compare explicit network structures based on prior knowledge of the system or to test for different solutions, the functions networkStructures() and compareStructures() can be used.

A network structure or topology is defined here as the qualitative values taken by the edges connecting the nodes, in this case distinguishing between positive "+", negative "-", and nonexistent "0" interactions. Because the optimization algorithms never estimate a parameter value to be exactly zero, the results of a fitted model will always contain only "+" or "-" edges. Further inspection of the estimated values can give hints about values that approximate zero.

The codification of a network structure is then a vector derived from the parameter matrix.

$$ \begin{pmatrix} \alpha_1 & \beta_{11} & \beta_{12} \\ \alpha_2 & \beta_{21} & \beta_{22} \end{pmatrix}$$

Written as a vector, it looks like the following:

$$ ( \alpha_1 , \alpha_2 , \beta_{11} , \beta_{21} , \beta_{12} , \beta_{22} ) $$

For example, the following parameter matrix from "exampleData1":

print(exampleData1$Parms)

Has the following structure: "+ - - + - -"
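
This string can be derived directly from the parameter matrix; a minimal sketch, using R's column-major vectorization to match the ordering shown above:

# Qualitative structure: sign of each parameter, in column-major order
paste(ifelse(as.vector(exampleData1$Parms) > 0, "+", "-"), collapse = " ")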

The function networkStructures() generates either a complete list containing all possible network topologies for n species, or generates a list based on a pre-specified list of network structures to test.

## generate all possible structures for a network containing 2 species:
s <- networkStructures(n = 2)
head(s$bounds)
head(s$ub)
head(s$lb)

After having generated the list of structures to test, together with their respective lower and upper bounds for the parameter estimation process (given by s$ub and s$lb), we can pass this as input to the function compareStructures().

The iterations argument sets the number of times the regression algorithm is run on each structure. The default is 1, for a single parameter estimation run per structure. If set to an integer x > 1, the algorithm runs x times and keeps the best solution according to the goodness of fit. The number of different solutions found and the number of times the optimal solution was reached are stored in the regression attributes, in order to evaluate how many local optima were found and how confident the user can be that the retained optimum approximates a global optimum.

Having no prior knowledge about this system that would allow us to impose qualitative constraints on specific interactions, we can first test all 64 possible structures and see which ones perform better at explaining the data. We set iterations = 2 to perform two parameter estimation runs on each structure, each time with different randomly generated starting parameters (obeying the qualitative structure of the network under test). In general, such exhaustive evaluations are not practical, as they take a lot of time.

In practice this takes about 28 minutes to complete (Dell, Windows 7, Intel Core i3-2330M CPU @ 2.20 GHz).

comparisonAllStructuresData1 <- compareStructures(data = exampleData1, structures = s, iterations = 2)
data(comparisonAllStructuresData1)

The result is a list with all network structures tested, their goodness of fit described by the sum of squared residuals of the best solution, and the estimated variance of the data stochasticity:

comparisonAllStructuresData1$networks

First we see that most networks (38 in total) failed to achieve a viable solution for the observations in two tries each. This could either mean that the starting vectors on each try produced numerical solutions too unstable to allow convergence, or that the constraints on the parameters simply do not allow any solution that does not grow exponentially to infinity over the given timepoints (the latter is likely the reason behind the failure of network structures with many "+" parameters, which result in uncontrolled exponentially growing variables).

We can inspect the detailed result from the nonlinear regression by giving the index of the network in question to the saved runs in $runs:

plot(comparisonAllStructuresData1$runs[[29]], type="l")
points(exampleData1)

The best 10 networks:

as.character(comparisonAllStructuresData1$networks[1:10,"Network"])

Now we use networkStructures() to generate a list of networks to test based directly on two structures specified by ourselves. Note that the structures must contain only the characters "+" and "-" and be as long as the number of parameters to estimate.

structures4testing <- networkStructures(testStructures = c("++++--","+--+--"))
structures4testing
comparisonTwoStructuresData1 <- compareStructures(data = exampleData1, structures = structures4testing, iterations = 5)
data(comparisonTwoStructuresData1)
comparisonTwoStructuresData1$networks

Here we see how, even in a noisy system, a good prior guess about the proper structure lets us focus the evaluation on a subset of structures and even find better solutions with alternative structures that might explain the data better. In this case the "hunch" for testing the structure "+ - - + - -" came from knowing the original data structure, but it can also come from biological knowledge, previous literature research, etc. Testing different hypotheses is a crucial step when fitting models to noisy data, and for the presentation and further use of the results.


