cluster.diagnostic: Function for Plotting Summary Cluster Diagnostic Plots

View source: R/MVRr.r

cluster.diagnosticR Documentation

Function for Plotting Summary Cluster Diagnostic Plots

Description

Plot similarity statistic profiles and the optimal joint clustering configuration for the means and the variances by group.

Plot quantile profiles of means and standard deviations by group and for each clustering configuration, to check that the distributions of first and second moments of the MVR-transformed data approach their respective null distributions under the optimal configuration found, assuming independence and normality of all the variables.

Usage

    cluster.diagnostic(obj, 
                       span = 0.75, 
                       degree = 2, 
                       family = "gaussian", 
                       title = "Cluster Diagnostic Plots", 
                       device = NULL, 
                       file = "Cluster Diagnostic Plots",
                       path = getwd(),
                       horizontal = FALSE, 
                       width = 8.5, 
                       height = 11, ...)

Arguments

obj

Object of class "MVR" returned by mvr.

title

Title of the plot. Defaults to "Cluster Diagnostic Plots".

span

Span parameter of the loess() function (R package stats), which controls the degree of smoothing. Defaults to 0.75.

degree

Degree parameter of the loess() function (R package stats), which controls the degree of the polynomials to be used. Defaults to 2. (Normally 1 or 2. Degree 0 is also allowed, but see the "Note" in loess stats package.)

family

Family distribution in "gaussian", "symmetric" of the loess() function (R package stats), used for local fitting. If "gaussian" fitting is by least-squares, and if "symmetric" is used, a re-descending M estimator is used with Tukey's biweight function. Defaults to "gaussian".

device

Graphic display device in {NULL, "PS", "PDF"}. Defaults to NULL (standard output screen). Currently implemented graphic display devices are "PS" (Postscript) or "PDF" (Portable Document Format).

file

File name for output graphic. Defaults to "Cluster Diagnostic Plots".

path

Absolute path (without final (back)slash separator). Defaults to working directory path.

horizontal

Logical scalar. Orientation of the printed image. Defaults to FALSE, that is potrait orientation.

width

Numeric scalar. Width of the graphics region in inches. Defaults to 8.5.

height

Numeric scalar. Height of the graphics region in inches. Defaults to 11.

...

Generic arguments passed to other plotting functions.

Details

In a plot of a similarity statistic profile, one checks the goodness of fit of the transformed data relative to the hypothesized underlying reference distribution with mean-0 and standard deviation-1 (e.g. N(0, 1)). The red dashed line depicts the LOESS scatterplot smoother estimator. The subroutine internally generates reference null distributions for computing the similarity statistic under each cluster configuration. The optimal cluster configuration (indicated by the vertical red arrow) is found where the similarity statistic reaches its minimum plus/minus one standard deviation (applying the conventional one-standard deviation rule). A smaller cluster number configuration indicates under-regularization, while over-regularization starts to occur at larger numbers. This over/under-regularization must be viewed as a form of over/under-fitting (see Dazard, J-E. and J. S. Rao (2012) for more details). The quantile diagnostic plots uses empirical quantiles of the transformed means and standard deviations to check how closely they are approximated by theoretical quantiles derived from a standard normal equal-mean/homoscedastic model (solid green lines) under a given cluster configuration. To assess this goodness of fit of the transformed data, theoretical null distributions of the mean and variance are derived from a standard normal equal-mean/homoscedastic model with independence of the first two moments, i.e. assuming i.i.d. normality of the raw data. However, we do not require i.i.d. normality of the data in general: these theoretical null distributions are just used here as convenient ones to draw from. Note that under the assumptions that the raw data is i.i.d. standard normal ($N(0, 1)$) with independence of first two moments, the theoretical null distributions of means and standard deviations for each variable are respectively: N(0, \frac{1}{n}) and √{\frac{χ_{n - G}^{2}}{n - G}}, where G denotes the number of sample groups. The optimal cluster configuration found is indicated by the most horizontal red curve. The single cluster configuration, corresponding to no transformation, is the most vertical curve, while the largest cluster number configuration reaches horizontality. Notice how empirical quantiles of transformed pooled means and standard deviations converge (from red to black) to the theoretical null distributions (solid green lines) for the optimal configuration. One should see a convergence towards the target null, after which overfitting starts to occur (see Dazard, J-E. and J. S. Rao (2012) for more details). Both cluster diagnostic plots help determine (i) whether the minimum of the Similarity Statistic is observed within the range of clusters (i.e. a large enough number of clusters has been accommodated), and (ii) whether the corresponding cluster configuration is a good fit. If necessary, run the procedure again with larger value of the nc.max parameter in the mvr as well as in mvrt.test functions until the minimum of the similarity statistic profile is reached.

Option file is used only if device is specified (i.e. non NULL).

Value

None. Displays the plots on the chosen device.

Acknowledgments

This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. This project was partially funded by the National Institutes of Health (P30-CA043703).

Note

End-user function.

Author(s)

Maintainer: "Jean-Eudes Dazard, Ph.D." jean-eudes.dazard@case.edu

References

  • Dazard J-E. and J. S. Rao (2010). "Regularized Variance Estimation and Variance Stabilization of High-Dimensional Data." In JSM Proceedings, Section for High-Dimensional Data Analysis and Variable Selection. Vancouver, BC, Canada: American Statistical Association IMS - JSM, 5295-5309.

  • Dazard J-E., Hua Xu and J. S. Rao (2011). "R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization." In JSM Proceedings, Section for Statistical Programmers and Analysts. Miami Beach, FL, USA: American Statistical Association IMS - JSM, 3849-3863.

  • Dazard J-E. and J. S. Rao (2012). "Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data." Comput. Statist. Data Anal. 56(7):2317-2333.

See Also

loess (R package stats) Fit a polynomial surface determined by one or more numerical predictors, using local fitting.

Examples

## Not run: 
    #===================================================
    # Loading the library and its dependencies
    #===================================================
    library("MVR")

    #===================================================
    # MVR package news
    #===================================================
    MVR.news()

    #================================================
    # MVR package citation
    #================================================
    citation("MVR")

    #===================================================
    # Loading of the Synthetic and Real datasets
    # (see description of datasets)
    #===================================================
    data("Synthetic", "Real", package="MVR")
    ?Synthetic
    ?Real

    #===================================================
    # Mean-Variance Regularization (Real dataset)
    # Multi-Group Assumption
    # Assuming unequal variance between groups
    # Without cluster usage
    ===================================================
    nc.min <- 1
    nc.max <- 30
    probs <- seq(0, 1, 0.01)
    n <- 6
    GF <- factor(gl(n = 2, k = n/2, length = n), 
                 ordered = FALSE, 
                 labels = c("M", "S"))
    mvr.obj <- mvr(data = Real, 
                   block = GF, 
                   log = FALSE, 
                   nc.min = nc.min, 
                   nc.max = nc.max, 
                   probs = probs,
                   B = 100, 
                   parallel = FALSE, 
                   conf = NULL,
                   verbose = TRUE,
                   seed = 1234)

    #===================================================
    # Summary Cluster Diagnostic Plots (Real dataset)
    # Multi-Group Assumption
    # Assuming unequal variance between groups
    #===================================================
    cluster.diagnostic(obj = mvr.obj, 
                       title = "Cluster Diagnostic Plots 
                       (Real - Multi-Group Assumption)",
                       span = 0.75, 
                       degree = 2, 
                       family = "gaussian",
                       device = NULL,
                       horizontal = FALSE, 
                       width = 8.5, 
                       height = 11)

    
## End(Not run)

jedazard/MVR documentation built on July 16, 2022, 10:55 p.m.