rm.get.threshold: Estimate an objective threshold for signal-noise separation...
In RMThreshold: Signal-Noise Separation in Random Matrices by using Eigenvalue Spectrum Analysis

Description Usage Arguments Details Value Note Author(s) References See Also Examples

This is the main function of the package. A suppositional threshold is incremented monotonically, thereby recording the eigenvalue spacing distribution (nearest neighbor spacing distribution, NNSD) of the thresholded matrix. According to Random Matrix Theory, that distribution undergoes a characteristic change when the threshold properly separates signal from noise. By subsequent removal of the below-threshold matrix elements, the modular structure of the matrix - or of a network characterized by this matrix - can be unraveled. The function works for real symmetric matrices.

 rm.get.threshold(rand.mat, nr.thresholds = 51, 
    unfold.method = "gaussian", bandwidth = "nrd0", nr.fit.points = 51,
    dist.method = "LL", nr.breaks = 51, discard.outliers = TRUE, 
    discard.zeros = FALSE, min.mat.dim = 40, max.ev.spacing = 3, interval = NULL,
    interactive = TRUE, smooth.par = 0.5, plot.comp = TRUE,
    save.fit = FALSE, plot.spacing = FALSE, wait.seconds = 0)

`rand.mat`	A random, real-valued, symmetric matrix.
`nr.thresholds`	Number of equidistant thresholds being probed and recorded.
`unfold.method`	A string variable that determines which type of unfolding algorithm is used. Must be one of 'gaussian' (Gaussian kernel density) or 'spline' (cubic spline interpolation on the cumulative distribution function).
`bandwidth`	Bandwidth used to calculate the Gaussian kernel density. Only active if `unfold.method = 'gaussian'` is used. See description of the `density` function.
`dist.method`	A string variable that determines which type of distance estimation to the limiting distributions is used. Must be one of 'LL' (Log Likelihood) or 'KLD' (Kullback-Leibler Distance).
`nr.fit.points`	Number of supporting points used for the cubic spline to the empirical cumulative distribution function. Only active if `unfold.method = 'spline'` is used.
`nr.breaks`	Number of bins used in the analysis to subdivide the empirical eigenvalue spacing distribution.
`smooth.par`	Parameter controlling the degree of smoothing in the loess regression curve presented in the final plot (distance vs. threshold).
`discard.outliers`	A logical variable that determines if outliers are to be discarded from the spectrum of eigenvalues (see 'Details').
`discard.zeros`	A logical variable specifying if rows and columns exclusively containing zeros outside the main diagonal are to be removed from the matrix after applying each threshold (see 'Details').
`min.mat.dim`	By thresholding a matrix, rows and columns exclusively containing zeros in the off-diagonal matrix elements likely emerge. The parameter `min.mat.dim` determines the minimum number of non-zero rows and columns of the probed matrix during the thresholding. The thresholding loop is stopped if the number of non-zero rows and columns is getting below `min.mat.dim`.
`max.ev.spacing`	A number determining the maximum eigenvalue spacing that is considered when plotting and analyzing the empirical eigenvalue spacing distribution (see 'Details').
`interval`	Interval of thresholds that is searched through. A numeric vector with two components (minimum and maximum threshold). The default is `interval = NULL` which means that the search interval ranges from the minimum absolute value of all matrix elements to the maximum absolute value of all matrix elements.
`interactive`	A logical variable that determines if the user wants to choose the candidate thresholds interactively, by mouse clicks.
`plot.comp`	A logical variable that determines if the empirical distribution of the eigenvalue spacings is displayed in a plot window during function execution.
`save.fit`	A logical variable that determines if a plot of the spline fit to the empirical cumulative distribution function is saved for each threshold. Can be used to check if fitting works well.
`plot.spacing`	A logical variable that determines if a scatterplot showing the eigenvalue spacing is saved at each suppositional threshold. Can be used to check if unfolding of the eigenvalue spacing works correctly (see 'Details').
`wait.seconds`	A numerical variable that, if set to non-zero values, enables viewing the plots with the eigenvalue spacing distribution during function execution for a predetermined duration. Useful on fast computers. Setting the variable to a specific number will cause the function to show the actual eigenvalue spacing distribution at least that number of seconds.

The function rm.get.threshold is the main function of the package. It takes a random matrix as input and applies a sequence of suppositional thresholds on it by setting all matrix elements to zero whose absolute value is below the actual threshold. The eigenvalues of the modified matrix are calculated at each threshold by using the eigen function of the R base package. The eigenvalue spectrum is then unfolded, i.e. it is calibrated in such a way that the average global spacing between the eigenvalues is constant over the whole spectrum. The latter can be tracked by setting plot.spacing = T. Two methods are provided for unfolding: one method is based on calculation of the Gaussian kernel density of the eigenvalue spectrum; another method is based on fitting a cubic spline function to the cumulative empirical eigenvalue distribution. The latter is determined by the parameter unfold.method. See the references for details of the unfolding procedure. For each threshold, a distance between the empirical eigenvalue spacing distribution (NNSD) and both limiting distributions (Wigner-Dyson distribution and Exponential distribution, respectively) is estimated. Two methods of distance computation are implemented: a method based on computation of the log likelihood of the empirical eigenvalue spacing presupposing each of the limiting distributions, and a method based on calculation of the Kullback-Leibler divergence between empirical eigenvalue spacing distribution and these limiting distributions. If the assumed modular structure of the matrix is completely covered by noise, the empirical eigenvalue spacing distribution is close to the Wigner-Dyson distribution, a curve that approaches zero for small eigenvalue spacings. In the opposite case, when the modular structure of the matrix is prevailing, the empirical eigenvalue spacing distribution comes closer to an Exponential distribution (which represents the distribution of the intervals between two consecutive events in a Poisson process). If the matrix possesses a modular structure (hidden by noise), we expect that the NNSD changes gradually from the Wigner-Dyson case to the Exponential case when the threshold is increased stepwise. This change is monitored in a plot window if the plot.comp variable is left at it's default (plot.comp = TRUE). Two additional parameters are critical for proper functioning of the algorithm. For some types of input matrices, it might be necessary to remove the outliers of the eigenvalue distribution, in order to correctly investigate the bulk of the eigenvalue spectrum. This is achieved by the setting discard.outliers = TRUE, which is the default setting. In some other cases, it might be useful to retain the outliers during analysis. Another critical parameter is discard.zeros. If set to TRUE, rows and columns exclusively containing zeros outside the main diagonal are removed from the matrix at each threshold. This causes the matrix to shrink during the loop over thresholds. Setting discard.zeros = TRUE can be especially useful when the NNSD piles up at the left tail of the histogram shown during program execution. For very fast computers, the argument wait.seconds can be set to a non-zero value, in order to enable the user to follow that change of the NNSD visually. The distance between NNSD and the limiting distributions is not calculated over the whole range of eigenvalue spacings but over the interval (0, max.ev.spacing). At a spacing of zero, the difference between the Wigner-Dyson distribution and the Exponential distribution is mostly pronounced. The maximum spacing considered in the distance calculation is determined by the parameter max.ev.spacing. This parameter should not be lower than √(2 / π), where the peak of the Wigner-Dyson distribution lies. On the other hand, is does not make sense to choose too high values for max.ev.spacing, because both the Wigner-Dyson and the Exponential distribution assume rather low values in the right tail (which might cause numerical errors). If the algorithm works well, a relatively sharp transition from the Wigner-Dyson case to the Exponential case should become apparent. The latter is (hopefully) confirmed in a plot which is shown after completion of the loop over the sequence of suppositional thresholds. This plot shows the calculated distance between the NNSD and both limiting distributions versus applied threshold. The user can interactively choose candidate thresholds by clickung with the left mouse button on the points of the red-coloured curve. The selection is terminated by a right mouse-click somewhere on the plot. Likewise, candidate thresholds can be chosen in a plot showing the p-values for a Kolmogorov-Smirnov test (testing for exponentiality), and in a plot showing the Sum of Squared Errors between the empirical NNSD and the Exponential distribution versus threshold. The hereby chosen candidate thresholds are returned by the function. The analysis can now be refined by narrowing down the search interval for the thresholds by specifying the interval parameter.

A list containing the following entries:

`unfold.method`	The method that was used for eigenvalue unfolding. Either 'gaussian' or 'spline'.
`dist.method`	The method that was chosen to estimate the distance to the limiting distributions. Either 'LL' (Log Likelihood) or 'KLD' (Kullback-Leibler Divergence).
`tested.thresholds`	A vector containing the probed thresholds.
`dist.Wigner`	A vector containing the numerical values of the estimated distance between empirical eigenvalue spacing distribution and Wigner-Dyson distribution.
`dist.Expon`	A vector containing the numerical values of the estimated distance between empirical eigenvalue spacing distribution and Exponential distribution.
`nr.zeros`	A vector containing the number of zero-valued matrix elements for each threshold.
`nr.uniq.ev`	An integer vector indicating the number of unique eigenvalues at each threshold (when degenerated eigenvalues are counted as one).
`max.ev.mult`	An integer vector indicating the maximum eigenvalue multiplicity (degeneracy) at each threshold.
`nr.spacings`	An integer vector containing the number of eigenvalue spacings at each threshold. This number is smaller than the matrix dimension if eigenvalues are degenerated.
`nr.small.spacings`	An integer vector containing the number of small spacings (`< max.ev.spacing/1000`) for each probed threshold.
`perc.small.spacings`	A vector containing the percentage of small spacings for each probed threshold.
`eff.dimension`	An integer number specifying the 'effective dimension' of the matrix after thresholding, i.e. the number of non-zero rows and columns.
`comparison.plots`	A character vector containing the names of the plots comparing the empirical eigenvalue spacing distribution with both limiting distributions (only if `plot.comp = T`).
`rm.dimension`	An integer vector indicating the dimension of the matrix after each thresholding (only if `discard.zeros = TRUE`).
`nr.outliers.removed`	An integer vector containing the number of outliers in the eigenvalue spectrum removed at each threshold. Only if `discard.outliers = TRUE`.
`p.ks`	A vector containing the p-values for the Kolmogorov-Smirnov test at each probed threshold.
`sse.exp`	A vector containing the Sum of Squared Errors (SSE) between observed NNSD and Exponential distribution for each probed threshold.
`number.zeros.plot`	A character string with the name of the plot depicting the number of zero-valued matrix elements versus threshold.
`number.uniq.ev.plot`	The name of the plot showing the number of unique eigenvalues vs. threshold.
`max.ev.mult.plot`	The name of the plot showing the maximum eigenvalue multiplicity vs. threshold.
`mat.dimension.plot`	The name of the plot showing the matrix dimension after each thresholding step (only if `discard.zeros = TRUE`).
`num.ev.spacings.plot`	The name of the plot showing the number of eigenvalue spacings vs. threshold.
`distance.plot`	The name of the (main) plot showing the distance of the empirical eigenvalue spacing distribution to the limiting distributions at each threshold.
`cumfit.plots`	A character vector containing the names of the plots showing the spline-function fitting of the cumulative eigenvalue spacing distribution (only if `save.fit = TRUE`).
`space.plots`	A character vector containing the names of the scatter plots with the eigenvalue spacing (only if `plot.spacing = TRUE`).
`chosen.thresholds`	A vector containing the potential thresholds chosen by the user from the distance plot.
`p.ks.plot`	The name of the plot showing the p-values for the Kolmogorov-Smirnow test versus probed thresholds.
`p.ks.chosen`	A vector containing the candidate thresholds chosen by the user based on the Kolmogorov-Smirnow test.
`sse.plot`	The name of the plot showing the SSE between observed NNSD and Exponential distribution versus probed thresholds.
`sse.chosen`	A vector containing the candidate thresholds chosen by the user based on the SSE test.

It is recommended to check the input matrix using rm.matrix.validation before running this function. If the histogram of empirical eigenvalue spacings piles up at zero, one should set discard.zeros = TRUE

Uwe Menzel uwemenzel@gmail.com

https://en.wikipedia.org/wiki/Random_matrix

Wigner, E. P. , Characteristic vectors of bordered matrices with infinite dimensions, Ann. Math. 62, 548-564, 1955.

Mehta, M., Random Matrices, 3nd edition. Academic Press, 2004.

Furht, B. and Escalante, A. (eds.), Handbook of Data Intensive Computing, Springer Science and Business Media, 2011.

Luo, F. et al., Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinformatics, 2007.

Creating a random matrix: create.rand.mat

## Run with pre-defined random matrix:
set.seed(777)
random.matrix <- create.rand.mat(size = 1000, distrib = "norm")$rand.matr
dim(random.matrix)		# 1000 1000

## Not run: 
res <- rm.get.threshold(random.matrix)	# threshold might be 3.21
str(res)								# List of 26
rm.show.plots(res$comparison.plots)  	# watch sequence of plots once more

## End(Not run)

## Try other parameters:
## Not run: 
res <- rm.get.threshold(random.matrix, unfold.method = "spline")	
res <- rm.get.threshold(random.matrix, dist.method = "KLD")			
res <- rm.get.threshold(random.matrix, discard.outliers = FALSE) # might cause problems
res <- rm.get.threshold(random.matrix, wait.seconds = 2)	 	 # slow down
res <- rm.get.threshold(random.matrix, discard.zeros = TRUE)    	
res <- rm.get.threshold(random.matrix, discard.zeros = TRUE, dist.method = "KLD")	

## End(Not run)

## Refine analysis by choosing narrower threshold range 
## Not run: 
res <- rm.get.threshold(random.matrix, interval = c(2.5, 3.5))		

## End(Not run)

## Apply the identified threshold to the matrix
cleaned.matrix <- rm.denoise.mat(random.matrix, threshold = 3.21)	
cleaned.matrix <- rm.discard.zeros(cleaned.matrix)	 					  
dim(cleaned.matrix)	# smaller

## Find the clusters in the thresholded matrix:
## Not run: 
  library(igraph)
  g  <- graph.adjacency(cleaned.matrix, mode = "undirected")
  clusters(g)		

## End(Not run)


## Not run: 

  ## Create modular matrix and validate:
  matlist = list()
  for (i in 1:4) matlist[[i]] = get.adjacency(erdos.renyi.game(250, 0.1))	
  mat <- bdiag(matlist)					# create block-diagonal matrix 		 
  rm.matrix.validation(as.matrix(mat))	# Exponential case, modular matrix
  
  ## Add noise:
  mat1 = add.Gaussian.noise(as.matrix(mat), mean = 0, stddev = 0.1)
  
  ## Find threshold, denoise, reconstruct the modules:
  res <- rm.get.threshold(mat1)	# threshold possibly about 0.46
  # a smaller interval had been ok as well:
  res <- rm.get.threshold(mat1, interval = c(0, 0.8)) 
  cleaned <- rm.denoise.mat(mat1, 0.5)
  matr <- cleaned != 0
  g  <- graph.adjacency(matr, mode = "undirected")
  clusters(g)	# 4 clusters reconstructed
  

## End(Not run)