The TDCOR main function

Share:

Description

This is the main function to run the TDCOR algorithm and reconstruct the gene network topology.

Usage

1
TDCOR(dataset,l_genes, TPI, DPI, ...)

Arguments

dataset

Numerical matrix storing the non-log2 transcriptomic data (average of replicates). The rows of this matrix must be named by gene codes (e.g. the AGI gene code for Arabidopsis datasets). The columns must be organized in chronological order from the left to the right.

l_genes

A character vector containing the (AGI) gene codes of the genes one wishes to build the network with (gene codes -e.g. "AT5G26930"- by opposed to gene names -e.g."GATA23"- which are provided by the optional l_names argument).

TPI

A TPI database generated by CalculateTPI which contains some necessary statistical information for triangle motifs pruning. In particular it must have an entry for all the regulators included in the network analysis. The TPI database may also contain data for genes that are not included in l_genes.

DPI

A DPI database generated by CalculateDPI which contains some necessary statistical information for diamond motifs pruning. In particular it must have an entry for all the regulators included in the network analysis. The DPI database may also contain data for genes that are not included in l_genes.

...

Additional arguments to be passed to the TDCOR function (Some are necessary if dataset is not the LR dataset):

  • l_names: A character vector containing the 'everyday names' of the genes included in the analysis (e.g. "LBD16") . These are the names by which genes will be refered as to in the final network table. Gene names must be unique; repeats of the same name are not allowed. If no l_names parameter is given, it is by default equal to l_genes. The i-th element of the vector l_names contains the 'everyday name' of the i-th gene in l_genes.

  • l_prior: A numerical vector containing the prior information on the genes included in the analysis. By defining the l_prior, the user defines which genes are positive regulators, which are negative regulators and which can only be targets. The prior code is defined as follow: -1 for negative regulator; 0 for non-regulator (target only); 1 for positive regulator; 2 for genes that can act as both positive and negative regulators. If no l_prior parameter is provided, it is by default equal to a vector of 2s, meaning that all genes are regarded as being both potential activators and repressors. The i-th element of the vector l_prior contains the prior to associate to the i-th gene in l_genes.

  • times: A numerical vector containing the successive times (in hours) at which the samples were collected to generate the time-series transcriptomic dataset.If no times parameter is given, it is by default equal to the times parameter used for the lateral root transcriptomic dataset.

  • n0: An integer corresponding to the number of iterations to be performed in the external bootstrap loop. At the beginning of every iteration of this external loop, new random parameter values are sampled in the user-defined bootstrapping interval. If no n0 parameter is given, it is by default equal to 1000.

  • n1 : An integer corresponding to the number of iterations to be performed in the internal bootstrap loop. In this loop parameter values are kept the same but the order of node analysis is randomized at each iteration. If no n1 parameter is provided, it is by default equal to 10.

  • time_step: A positive number corresponding to the time step (in hours) i.e. the temporal resolution at which the gene profiles are analysed. If no time_step parameter is provided, it is by default equal to 1 hour.

  • delayspan: A positive number. It is the maximum time shift (in hours) which is analysed. It should be high enough for the time shift estimation process to be successful but argueably relatively small in comparison to the overall duration of the time series. (e.g. for the LR dataset which has data spanning over 54 hours, delayspan was set to 12 hours).

  • tol: A strictly positive number corresponding to the tolerance threshold on the final score of time shift estimates. The score is a positive number that measures the "level of disagreement" between the four time shift estimators. A time shift estimate is regarded as meaningful if it scores lower than the tol threshold. If all four estimators agree on a certain value of time shift, the estimate obtains the best possible score, which is 0. Increasing tol make the time shift estimation process LESS stringent. For more information see estimate.delay.

  • delaymin: A numerical vector containing one or two positive elements corresponding to the boundaries of the boostrapping interval for the minimum time shift above which putative interactions are regarded as possible. If no delaymin parameter is provided, it is by default equal to 0 hour. Gene pairs with time shift lower than or equal to delaymin are regarded as co-regulation and are therefore not included in the network.

  • delaymax: A numerical vector containing one or two positive elements corresponding to the boundaries of the boostrapping interval for the maximum time shift above which putative interactions are regarded as indirect. If no delaymax parameter is provided, it is by default equal to 3 hours. Putative indirect interactions are included in the network only when the putative target is not predicted any direct regulator.

  • thr_cor: A numerical vector containing one or two positive elements between 0 and 1 corresponding to the boundaries of the boostrapping interval for the threshold of Pearson's correlation. A gene pair is included in the preliminary network only if the correlation between the profiles (with the time shift correction) is higher than or equal to the thr_cor threshold. If no thr_cor parameter is provided, it is by default equal to [0.7;0.9]. Note that increasing thr_cor makes the correlation filter MORE stringent.

  • delay: A positive number corresponding to the most likely time shift (in hours) one could expect between the profile of a regulator and the profile of its direct targets. This parameter enables one to generate the reference profiles of the ideal regulator when calculating the index of directness (ID). If no delay parameter is provided, it is by default equal to 3 hours. Note that similar parameters serving the same purpose are also used to calculate the triangle pruning index and the diamond pruning index. But TDCOR reads the value to use for calculating those indices directly from the TPI and DPI databases (for consistency reasons).

  • thr_ind1: A numerical vector containing one or two positive elements corresponding to the boundaries of the boostrapping interval for the index of directness (ID) lower threshold. Gene pairs showing an ID below this threshold will be regarded as co-regulation and therefore eliminated from the network. If no thr_ind1 parameter is provided, it is by default equal to 0.5. Reminder: For direct interaction one expects ID values around 1. For indirect interactions one expect values greater than 1. For co-regulated genes, ID should be smaller than 1. Note that increasing thr_ind1 makes the ID-based "anti-coregulation filter" MORE stringent.

  • thr_ind2: A numerical vector containing one or two positive elements corresponding to the boundaries of the boostrapping interval for the index of directness (ID) upper threshold. Putative interactions showing an ID above this threshold are regarded as indirect. If no thr_ind2 parameter is provided, it is by default equal to 4.5. Reminder: For direct interaction one expects ID values around 1. For indirect interactions one expect values greater than 1. For co-regulated genes, ID should be smaller than 1. Note that increasing thr_ind2 makes the ID-based filter against indirect interactions LESS stringent.

  • thr_overlap : A numerical vector containing one or two positive elements smaller than 1. These correspond to the boundaries of the boostrapping interval for the index of overlap. If no thr_overlap parameter is provided, it is by default equal to [0.5,0.6]. Note that increasing thr_overlap makes the overlap filter MORE stringent. This filter aims at removing unlikely negative interactions where the putative regulator switches on too late to downregulate the putative target. Keep in mind that the filter is sensitive to the noise level in the data. It should only be used if the data has a very low level of noise. To inactivate the filter set the thr_overlap parameter to 0.

  • thrpTPI: A numerical vector containing one or two positive numbers smaller or equal to 1 in increasing order. These correspond to the boundaries of the boostrapping interval for the probability threshold used in the triangle filter. If no thrpTPI parameter is provided, it is by default equal to [0.5,0.75]. Note that increasing thrpTPI makes the triangle filter LESS stringent.

  • thrpDPI: A numerical vector containing one or two positive numbers smaller or equal to 1 in increasing order. These correspond to the boundaries of the boostrapping interval for the probability threshold used in the diamond filter. If no thrpTPI parameter is provided, it is by default equal to [0.8,0.9]. Note that increasing thrpDPI makes the diamond filter LESS stringent.

  • thr_isr: A numerical vector containing one or two positive elements corresponding to the boundaries of the boostrapping interval for the threshold of the index of directness above which the gene is predicted to negatively self-regulate. Genes will be predicted to positively self-regulate if the index of directness is smaller than 1/thr_isr. If no thr_isr parameter is provided, it is by default equal to [3,6]. Note that increasing thr_isr makes the search for self-regulating genes MORE stringent.

  • search.EP: A boolean to control whether Master-Regulator-Signal-Transducer (MRST) or signal Entry Point (EP) should be looked for or not. (If yes, set on TRUE which is the default value)

  • thr_bool_EP: A number between 0 and 1 used as threshold to convert normalized expression profiles (values between 0 and 1) into boolean expression profiles (values equal to 0 or 1). If no thr_bool_EP parameter is provided, it is by default equal to 0.8. The conversion of the continuous profiles into boolean profiles is part of the process of MRST analysis.

  • MinTarNumber: An integer. Minimum number of targets a regulator should have in order to be regarded as a potential MRST. If no MinTarNumber parameter is provided, it is by default equal to 5. Note that increasing MinTarNumber makes the search for MRST genes MORE stringent.

  • MinProp: A number between 0 and 1. Minimum proportion of targets which are not at steady state at t=0 that a regulator should have in order to be regarded as a potential MRST. If no MinProp parameter is provided, it is by default equal to 0.75. Note that increasing MinProp makes the search for MRST genes MORE stringent.

  • MaxEPNumber: An integer. Maximum number of MRST that can be predicted at each iteration. If no MaxEPNumber parameter is provided, it is by default equal to 1.

  • regmax: An integer. Maximum number of regulators that a target may have. If no regmax parameter is provided, it is by default equal to 6.

  • outfile_name: A string. Name of the file to print the network table in. By default it is "TDCor_output.txt".

Details

The default values are certainly not the best values to work with. The TDCOR parameters have to be optimized by the user based on its own knowledge of the network, the quality of the data etc... Because TDCOR works by pruning interactions, it is probably easier (as a first go) to optimize the parameter values following the order of the filters.

Before starting inactivate all the filters using the less stringent parameter values possible or for the MRST filter by setting search.EP to FALSE. You should as well set the bootstrap parameters to a relatively low value (e.g. n0=100 and n1=1). Hence the runs will be quick and you will be able to rapidly assess whether the changes you made in the parameter values were a good thing.

Start by optimizing the parameters involved in time shifts estimation. That is to say, essentially delayspan, time_step, tol and delaymax. The latter (together with delaymin) is a biological parameters and the range of possible values is argueably limited. Though they ought to be adapted to the organism (e.g. in prokaryotes, the delays are extremely short since polysomes couple transcription and translation). Note that the estimate.delay function can be very helpful to optimize these various parameters thanks to the visual output. Use it with pairs of genes that have been shown to interact directly or indirectly in your system and for which the relationship in the dataset in clearly linear. For network reconstruction with TDCor, good time shift estimation is absolutely crucial. Once this is done, proceed with optimizing the threshold for correlation thr_cor and the thresholds on the index of directness (thr_ind1, thr_ind2). Then optimize the parameters of the triangle and diamond pruning filters (thrpTPI and thrpDPI). You may have to try a couple of different TPI and DPI databases (i.e. databases built with different input parameters). In particular increasing the noise level when generating these database enables one to decrease the stringency of the triangle and diamond filters, when increasing the thrpTPI and thrpDPI value is not sufficient. Subsequently fine-tune the parameters of the MRST filter (thr_bool_EP, MinTarNumber, MinProp, MaxEPNumber) if you want it on. Remember to set search.EP back to TRUE first. Next optimize thr_isr (self-regulation). Finally, restrict the number of maximum regulators if necessary (regmax).

Value

The TDCOR main function returns a list containing 7 elements

input

A list containing the input parameters (as a reminder).

intermediate

A list containing three intermediate matrices. mat_cor is the matrix that stores the correlations, mat_isr stores the indices of self-regulations and mat_overlap contains the indices of overlap.

network

A matrix containing the network. The element [i,j] of this matrix contains the bootstrap value for the edge "gene j to gene i". The sign indicates the sign of the predicted interaction.

ID

A matrix containing the computed indices of directness (ID). The element [i,j] contains the ID for the edge "gene j to gene i".

delay

A matrix containing the computed time shifts. The element [i,j] of this matrix contains the estimated time shift between the profile of gene j and the profile of gene i.

EP

A vector containing the bootstrap values for the MRST predictions.

predictions

The edge predictions in the form of a table. The columns are organized in following order: Regulator name, Type of interaction (+ or-), Target name, Bootstrap, Index of Directness, Estimated time shift between the target and regulator profiles.

The table of predictions (without header) and the input parameters are printed at the end of the run in two separate text files located in the current R working directory (If you are not sure which directory this is, use the command getwd()).

Note

For a parameter to be involved in the bootstrapping process, one must feed the function a vector containing two values as input. These two values are respectively the lower and upper boundaries of the bootstrapping interval. If one chooses not to use a parameter for bootstrapping, one can either feed the function an input vector containing twice the same value, or only one value.

Author(s)

Julien Lavenus jl.tdcor@gmail.com

References

Lavenus et al., 2015, The Plant Cell

See Also

See also CalculateDPI, CalculateTPI, UpdateDPI, UpdateTPI, TDCor-package.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
## Not run: 
# Load the lateral root transcriptomic dataset
data(LR_dataset)

# Load the vectors of gene codes, gene names and prior
data(l_genes)
data(l_names)
data(l_prior)

# Load the vector of time points for the LR_dataset
data(times)

# Generate the DPI databases

DPI15=CalculateDPI(dataset=LR_dataset,l_genes=l_genes,l_prior=l_prior,
times=times,time_step=1,N=10000,ks_int=c(0.5,3),kd_int=c(0.5,3),delta_int=c(0.5,3),
noise=0.15,delay=3)

# Generate the TPI databases

TPI10=CalculateTPI(dataset=LR_dataset,l_genes=l_genes,
l_prior=l_prior,times=times,time_step=1,N=10000,ks_int=c(0.5,3),
kd_int=c(0.5,3),delta_int=c(0.5,3),noise=0.1,delay=3)

# Check/update if necessary the databases (Not necessary here though.
# This is just to illustrate how it would work.)

TPI10=UpdateTPI(TPI10,LR_dataset,l_genes,l_prior)
DPI15=UpdateDPI(DPI15,LR_dataset,l_genes,l_prior)


### Choose your TDCOR parameters ###
 
# Parameters for time shift estimatation 
# and filter on time shift value
ptime_step=1			
ptol=0.13			 
pdelayspan=12
pdelaymax=c(2.5,3.5)
pdelaymin=0

# Parameter of the correlation filter
pthr_cor=c(0.65,0.8)

# Parameters of the ID filter
pdelay=3		
pthr_ind1=0.65			
pthr_ind2=3.5

# Parameter of the overlap filter
pthr_overlap=c(0.4,0.6)

# Parameters of the triangle and diamond filters
pthrpTPI=c(0.55,0.8)
pthrpDPI=c(0.65,0.8)	
pTPI=TPI10			
pDPI=DPI15

# Parameter for identification of self-regulations				
pthr_isr=c(4,6)

# Parameters for MRST identification			
pMinTarNumber=5			
pMinProp=0.6

# Max number of regulators		
pregmax=5

# Bootstrap parameters
pn0=1000			
pn1=10

# Name of the file to print network in
poutfile_name="TDCor_output.txt"	

### Reconstruct the network ###

tdcor_out= TDCOR(dataset=LR_dataset,l_genes=l_genes,l_names=l_names,n0=pn0,n1=pn1,
l_prior=l_prior,thr_ind1=pthr_ind1,thr_ind2=pthr_ind2,regmax=pregmax,thr_cor=pthr_cor,
delayspan=pdelayspan,delaymax=pdelaymax,delaymin=pdelaymin,delay=pdelay,thrpTPI=pthrpTPI,
thrpDPI=pthrpDPI,TPI=pTPI,DPI=pDPI,thr_isr=pthr_isr,time_step=ptime_step,
thr_overlap=pthr_overlap,tol=ptol,MinProp=pMinProp,MinTarNumber=pMinTarNumber,
outfile_name=poutfile_name)


## End(Not run)