generateSeeds-methods: Generate seeds for biclustering
In rqubic: Qualitative biclustering algorithm for expression data analysis in R

Description Methods Details Note Author(s) Examples

generateSeeds takes either matrix or an ExpressionSet object to generate seeds. Seeds are defined as pairs of genes (edges) which share coincident expression levels in samples. The higher the coincidence, the higher the score of the seeds will be. The seeds are generated by subsequent comparing each pair of genes. When all seeds have been produced, they are sorted by the coincidence scores and returned as an object. See the details section for notes on implementation.

In the rqubic package, generateSeeds currently supports two data types: ExpressionSet (an inherited type of eSet, or numeric matrix.

Both methods requires in addition a parameter, minColWidth, specifying the minimum number of conditions shared by the two genes of each seed. Its default value is 2. When this default value is used, the minimum coincidence score is defined as max(2, ncol/20), where ncol represents the number of conditions. When a non-default value is provided, the value is used to select seeds.

signature(object = "eSet"): An object representing expression data. Note that the exprs must be a matrix of integers, otherwise the method warns and coerces the storage mode of matrix into integer.
signature(object = "matrix"): A matrix of integers. In case filled by non-integers, the method warns and coerces the storage mode into integer

The function compares all pairs of genes, namely all edges of a complete graph composed by genes. The weight of each edge is defined as the number of samples, in which two genes have the same expression level. This weight, also known as the coincidence score, reflects the co-regulation relationship between two genes.

The seed is chosen by picking edges with higher scores than the minimum score, provided by the minColWidth parameter (default: 2).

To implement such a selection algorithm, a Fibonacci heap is constructed in the C codes. Its size is predefined as a constant, which should be reduced in case the gene number is too large to run the algorithm. A new seed, which was selected by having a higher coincidence score than the minimum, is inserted to the heap. And dependent on whether the heap is full or not, it is either inserted by squeezing the minimum seed out, or put into the heap directly.

Once the heap is filled by examining all pairs of genes, it is dumped into an array of edge pointers, with decreasingly ordered edge pointers by their scores. This array is captured as an external pointer, attached as an attribute of an rqubicSeeds object.

An rqubicSeeds object holds an integer, which records the height of the heap. It has (besides the class identifier) two attributes: one for the external pointer, and the other one for the threshold of the coincidence score.

In the rqubic implementation, the variable arr_c[i][j] holds the level symbols (-1, 0, 1 in the default case), whereas in the QUBIC implementation, this variable holds the index of level symbols, and the level symbols are saved in the global variable symbols.

Jitao David Zhang <jitao_david.zhang@roche.com>

data(sample.ExpressionSet, package="Biobase")
sample.disc <- quantileDiscretize(sample.ExpressionSet)
sample.seeds <- generateSeeds(sample.disc)
sample.seeds

## with higher threshold of incidence score
sample.seeds.higher <- generateSeeds(sample.disc, minColWidth=5)
sample.seeds.higher

rqubic sorted seeds (associating C pointer):
  Number of seeds: 37161
  Minimum sample number: 2
rqubic sorted seeds (associating C pointer):
  Number of seeds: 2388
  Minimum sample number: 5