FIT: a statistical modeling tool for transcriptome dynamics under fluctuating field conditions

Description

Provides functionality for constructing statistical models of transcriptomic dynamics in field conditions. It also offers a function to predict the expression of a gene from the attributes of samples and meteorological data.

Overview

The FIT package is an R implementation of a class of transcriptomic models that relate the gene expression of plants to the weather conditions to which the plants are exposed. (See [Nagano et al.] for the details of this class of models.)

By providing (a) gene expression profiles of plants grown in a field, and (b) the relevant weather history (temperature etc.) of that field, the user of the package can (1) construct optimized models (one for each gene) of gene expression, and (2) use them to predict expression under another weather history (possibly in a different field).

Below, we briefly explain the construction of the optimized models (“training phase”) and the way to use them to make predictions (“prediction phase”).

Model training phase

The models of [Nagano et al.] belong to the class of statistical models called “linear models” and are specified by a set of “parameters” and “(linear regression) coefficients”. The former are used to convert weather conditions to the “input variables” for a regression, and the latter are then multiplied by the input variables to form the expectation values for the gene expressions. The reader is referred to the original article [Nagano et al.] for the formulas for the input variables. (See also [Iwayama] for a review.)

The training phase consists of three stages:

  1. Init: fixes the initial model parameters

  2. Optim: optimizes the model parameters

  3. Fit: fixes the linear regression coefficients

The user can configure the training phase through a custom data structure (“recipe”), which can be constructed by using the utility function FIT::make.recipe().

The role of the first stage, Init, is to fix the initial values of the model parameters from which the parameter optimization starts. At the moment two methods, 'manual' and 'gridsearch', are implemented. With the 'manual' method the user simply specifies the set of initial values that they think is promising. With the 'gridsearch' method the user discretizes the parameter space into a grid by providing a finite number of candidate values for each parameter. FIT then searches the grid for the “best” combinations of initial parameters.
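The idea behind 'gridsearch' can be sketched in base R as follows. This is a minimal illustration and not FIT's implementation: the toy data, the single parameter threshold, and the helper fit.error() are all hypothetical, and the real models of [Nagano et al.] have many more parameters.

```r
# Hypothetical toy data: expression responds to temperature above 20 degrees.
set.seed(42)
temperature <- runif(100, 5, 35)
expr.level  <- 1 + 0.5 * pmax(temperature - 20, 0) + rnorm(100, sd = 0.3)

# Residual sum of squares of the best linear fit for a given threshold
# (hypothetical helper, not part of FIT).
fit.error <- function(threshold) {
  input <- pmax(temperature - threshold, 0)
  sum(residuals(lm(expr.level ~ input))^2)
}

# Discretize the parameter space and evaluate every grid point.
grid <- expand.grid(threshold = seq(10, 30, by = 5))
grid$error <- vapply(grid$threshold, fit.error, numeric(1))

# The grid point with the smallest error becomes the initial value
# handed to the subsequent optimization.
best.init <- grid[which.min(grid$error), ]
```

With several parameters, expand.grid() enumerates every combination of candidate values, so the cost grows quickly with the number of candidates per parameter.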

The second stage Optim is the main step of the model training, and FIT tries to gradually improve the model parameters using the Nelder-Mead method.

This stage can be run one or more times, and each run uses one of the methods 'none', 'lm' or 'lasso'. The 'none' method passes the given parameters as-is to the next method in the Optim pipeline or to the next stage, Fit. (Basically, the method is there so that the user can skip the entire Optim stage, but it could be used for slightly warming up the CPU as well.)

The 'lm' method uses a simple (weighted) linear regression to guide the parameter optimization. That is, FIT first computes the “input variables” from the current parameters and the associated weather data, and then finds the set of linear coefficients that best explains the “output variables” (gene expressions). Finally, the (weighted) sum of squared residuals is used as the measure of the error and is fed back to the Nelder-Mead method.
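The loop described above (parameters to input variables, linear regression, residual fed back to Nelder-Mead) can be sketched with stats::optim() and lm(). This is only a sketch under simplifying assumptions: the two threshold parameters, the toy data, and the function weighted.residual() are hypothetical stand-ins for the actual input-variable formulas of [Nagano et al.].

```r
# Hypothetical toy data with two weather variables.
set.seed(7)
n <- 200
temperature <- runif(n, 5, 35)
radiation   <- runif(n, 0, 10)
expr.level  <- 1 + 0.5 * pmax(temperature - 20, 0) +
  0.5 * pmax(radiation - 4, 0) + rnorm(n, sd = 0.3)
sample.weights <- rep(1, n)

# Objective for Nelder-Mead: given the model parameters, build the input
# variables, fit a weighted linear regression, and return the weighted
# sum of squared residuals.
weighted.residual <- function(params) {
  inputs <- data.frame(
    y       = expr.level,
    w       = sample.weights,
    temp.in = pmax(temperature - params[1], 0),
    rad.in  = pmax(radiation - params[2], 0)
  )
  fit <- lm(y ~ temp.in + rad.in, data = inputs, weights = w)
  sum(inputs$w * residuals(fit)^2)
}

# Start from initial values (e.g. as fixed by the Init stage) and let
# Nelder-Mead improve the parameters.
initial.params <- c(15, 5)
result <- optim(par = initial.params, fn = weighted.residual,
                method = "Nelder-Mead")
```

The regression coefficients are re-estimated inside every evaluation of the objective, so Nelder-Mead only has to search over the model parameters.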

The 'lasso' method is similar to the 'lm' method but uses the (weighted) Lasso regression (“linear” regression with an L1-regularization of the regression coefficients) instead of the simple linear regression. FIT uses the glmnet package to perform the Lasso regression, and the strength of the L1-regularization is fixed via cross validation. (See cv.glmnet() from the glmnet package.) The Lasso regression is said to suppress irrelevant input variables automatically and tends to create models with better predictive ability. On the other hand, 'lasso' runs considerably slower than 'lm'.
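How an L1 penalty suppresses irrelevant input variables can be illustrated without glmnet. The sketch below is a small base-R coordinate descent with soft thresholding, written only for illustration: the functions soft.threshold() and lasso.cd() are hypothetical, the data are synthetic, and the penalty strength lambda is fixed by hand rather than by cross validation as FIT does through cv.glmnet().

```r
# Soft-thresholding operator used by the coordinate-wise Lasso update.
soft.threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Minimal coordinate descent for
#   (1 / (2 n)) * sum((y - X b)^2) + lambda * sum(abs(b)),
# assuming centered y and columns of X with mean 0 and mean square 1.
lasso.cd <- function(X, y, lambda, n.iter = 200) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n.iter)) {
    for (j in seq_len(ncol(X))) {
      partial.resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      beta[j] <- soft.threshold(sum(X[, j] * partial.resid) / n, lambda)
    }
  }
  beta
}

# Synthetic data: only the first of three input variables is relevant.
set.seed(1)
n <- 200
X <- scale(matrix(rnorm(n * 3), n, 3)) * sqrt(n / (n - 1))
y <- 2 * X[, 1] + rnorm(n, sd = 0.5)
y <- y - mean(y)

beta.ols   <- lasso.cd(X, y, lambda = 0)    # no penalty: ordinary least squares
beta.lasso <- lasso.cd(X, y, lambda = 0.3)  # L1 penalty shrinks the coefficients
```

Comparing beta.ols with beta.lasso shows the two effects of the penalty: the coefficient of the relevant variable is shrunk toward zero, and the coefficients of the irrelevant variables are set to exactly zero.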

For example, passing a vector c('lm', 'lasso') to the argument optim (of make.recipe()) creates a recipe that instructs the Optim stage to (1) first optimize the parameters using the 'lm' method, and (2) then fine-tune them using the 'lasso' method.

After fixing the model parameters in the Optim stage, the Fit stage can be used to fix the linear coefficients of the models. Here, either 'fit.lm' or 'fit.lasso' can be used to find the “best” coefficients, the main difference being that the coefficients are penalized by an L1-norm for the latter. Note that it is perfectly okay to use 'fit.lasso' for the parameters optimized using 'lm'.

To cope with the possibly huge variation in expression data as measured by RNA-seq, FIT provides a way to weight the regression error of each sample individually, as in sum_{s in samples} (weight_s) (error_s)^2.
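The weighted error above corresponds to what base R's lm() minimizes when given a weights argument. The following sketch uses synthetic data and arbitrary, hypothetical weights; how FIT derives or accepts the weights is not shown here.

```r
# Synthetic data and hypothetical per-sample weights.
set.seed(3)
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.2)
w <- runif(50, 0.5, 2)

# Weighted fit: minimizes sum_s w_s * error_s^2.
weighted.fit   <- lm(y ~ x, weights = w)
weighted.error <- sum(w * residuals(weighted.fit)^2)

# An unweighted fit, scored with the same weighted error, cannot do better.
unweighted.fit       <- lm(y ~ x)
unweighted.fit.error <- sum(w * residuals(unweighted.fit)^2)
```

Samples with a larger weight pull the fitted line more strongly toward themselves, so noisy measurements can be down-weighted.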

Prediction phase

For each gene, the trained model of the previous subsection can be thought of as a black box that maps the field conditions (weather data), to which a plant containing the gene is exposed, to its expected expression. FIT provides a simple function FIT::predict() that does just this.

FIT::predict() takes as its arguments a list of pretrained models as well as actual or hypothetical plant sample attributes and weather data, and returns the predicted values of gene expressions.

When there is a set of actually measured expressions, an associated function FIT::prediction.errors() can be used to check the validity of the predictions made by the models.

Namespace contamination

The FIT package exports fairly ubiquitous names such as optim and predict as part of its API. Users are therefore advised to load FIT via requireNamespace('FIT') and to call its API functions with a namespace qualifier (e.g. FIT::optim()) rather than loading and attaching it via library('FIT').
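The difference can be checked directly in an R session. The sketch below only relies on base R; the variable names are illustrative, and the code also runs when FIT is not installed (requireNamespace() then simply returns FALSE).

```r
# Load the FIT namespace without attaching it to the search path.
fit.available <- requireNamespace("FIT", quietly = TRUE)

# Because nothing was attached, the unqualified names still refer to
# the functions from the stats package.
optim.is.stats   <- identical(optim, stats::optim)
predict.is.stats <- identical(predict, stats::predict)

# FIT's own functions are then reached with an explicit qualifier,
# e.g. FIT::make.recipe(), FIT::train() and FIT::predict().
```

After library('FIT'), by contrast, unqualified calls to optim() or predict() would resolve to the FIT versions and mask those from stats.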

Sample training and prediction data

See the package's extdata directory.

Basic usage

See vignettes for examples of actual scripts that use FIT.

References

[Nagano et al.] A.J. Nagano, et al. “Deciphering and prediction of transcriptome dynamics under fluctuating field conditions,” Cell 151, 6, 1358–69 (2012)

[Iwayama] K. Iwayama, et al. “FIT: statistical modeling tool for transcriptome dynamics under fluctuating field conditions,” (in preparation)

Examples

## Not run: 
# The following snippet shows the structure of a typical
# driver script of the FIT package.
# See vignettes for examples of actual scripts that use FIT.

##############
## training ##
##############
## discretized parameter space (for 'gridsearch')
grid.coords <- list(
  clock.phase = seq(0, 23*60, 1*60),
  # :
  gate.radiation.amplitude = c(-5, 5)
)

## End(Not run)


## create a training recipe
recipe <- FIT::make.recipe(c('temperature', 'radiation'),
                           init  = 'gridsearch',
                           init.data = grid.coords,
                           optim = c('lm'),
                           fit   = 'fit.lasso',
                           time.step = 10,
                           opts = list(lm    = list(maxit = 900),
                                       lasso = list(maxit = 1000))
                           )

## names of genes to construct models
genes <- c('Os12g0189300', 'Os02g0724000')



## Not run: 
## load training data
training.attribute  <- FIT::load.attribute('attribute.2008.txt')
training.weather    <- FIT::load.weather('weather.2008.dat', 'weather')
training.expression <- FIT::load.expression('expression.2008.dat', 'ex', genes)

## End(Not run)

## models will be a list of trained models (length: ngenes)
models <- FIT::train(training.expression,
                     training.attribute,
                     training.weather,
                     recipe)

################
## prediction ##
################

## Not run: 
## load validation data
prediction.attribute  <- FIT::load.attribute('attribute.2009.txt')
prediction.weather    <- FIT::load.weather('weather.2009.dat', 'weather')
prediction.expression <- FIT::load.expression('expression.2009.dat', 'ex', genes)

## End(Not run)



## predict
prediction.result <- FIT::predict(models[[1]],
                                 prediction.attribute,
                                 prediction.weather)