distChooseCensored: Choose Best Fitting Distribution Based on Goodness-of-Fit...
In EnvStats: Package for Environmental Statistics, Including US EPA Guidance

distChooseCensored

R Documentation

Choose Best Fitting Distribution Based on Goodness-of-Fit Tests for Censored Data

Description

Perform a series of goodness-of-fit tests for censored data from a (possibly user-specified) set of candidate probability distributions to determine which probability distribution provides the best fit for a data set.

Usage

distChooseCensored(x, censored, censoring.side = "left", alpha = 0.05,
    method = "sf", choices = c("norm", "gamma", "lnorm"),
    est.arg.list = NULL, prob.method = "hirsch-stedinger",
    plot.pos.con = 0.375, warn = TRUE, keep.data = TRUE,
    data.name = NULL, censoring.name = NULL)

Arguments

`x`	a numeric vector containing data for the goodness-of-fit tests. Missing (`NA`), undefined (`NaN`), and infinite (`Inf`, `-Inf`) values are allowed but will be removed.
`censored`	numeric or logical vector indicating which values of `x` are censored. This must be the same length as `x`. If the mode of `censored` is `"logical"`, `TRUE` values correspond to elements of `x` that are censored, and `FALSE` values correspond to elements of `x` that are not censored. If the mode of `censored` is `"numeric"`, it must contain only `1`'s and `0`'s; `1` corresponds to `TRUE` and `0` corresponds to `FALSE`. Missing (`NA`) values are allowed but will be removed.
`censoring.side`	character string indicating on which side the censoring occurs. The possible values are `"left"` (the default) and `"right"`.
`alpha`	numeric scalar between 0 and 1 specifying the Type I error associated with each goodness-of-fit test. When `method="proucl"` the only allowed values for `alpha` are 0.01, 0.05, and 0.1. The default value is `alpha=0.05`.
`method`	character string defining which method to use. Possible values are: `"sw"`. Shapiro-Wilk. `"sf"`. Shapiro-Francia; the default. `"ppcc"`. Probability Plot Correlation Coefficient. `"proucl"`. ProUCL. The Shapiro-Wilk method is only available for singly censored data. See the DETAILS section for more information.
`choices`	a character vector denoting the distribution abbreviations of the candidate distributions. See the help file for `Distribution.df` for a list of distributions and their abbreviations. The default value is `choices=c("norm", "gamma", "lnorm")`, indicating the Normal, Gamma, and Lognormal distributions. This argument is ignored when `method="proucl"`.
`est.arg.list`	a list containing one or more lists of arguments to be passed to the function(s) estimating the distribution parameters. The name(s) of the components of the list must be equal to or a subset of the values of the argument `choices`. For example, if `choices=c("norm", "gammaAlt", "lnormAlt")`, setting `est.arg.list=list(lnormAlt=list(method="bcmle"))` indicates using the bias-corrected maximum-likelihood estimators (see the help file for `elnormAltCensored`). See the section Estimating Distribution Parameters in the help file EnvStats Functions for Censored Data for a list of available estimating functions for censored data. The default value is `est.arg.list=NULL` so that all default values for the estimating functions are used. In the course of testing for some form of normality (i.e., Normal, Lognormal), the estimated parameters are saved in the `test.results` component of the returned object, but the choice of the method of estimation has no effect on the goodness-of-fit test statistic or p-value. This argument is ignored when `method="proucl"`.
`prob.method`	character string indicating what method to use to compute the plotting positions (empirical probabilities) when `test="sf"` or `test="ppcc"`. Possible values are: `"modified kaplan-meier"` (modification of product-limit method of Kaplan and Meier (1958)) `"nelson"` (hazard plotting method of Nelson (1972)) `"michael-schucany"` (generalization of the product-limit method due to Michael and Schucany (1986)) `"hirsch-stedinger"` (generalization of the product-limit method due to Hirsch and Stedinger (1987)) The default value is `prob.method="hirsch-stedinger"`. The `"nelson"` method is only available for `censoring.side="right"`, and the `"modified kaplan-meier"` method is only available for `censoring.side="left"`. See the help files for `gofTestCensored` and `ppointsCensored` for more information.
`plot.pos.con`	numeric scalar between 0 and 1 containing the value of the plotting position constant to use when `test="sf"` or `test="ppcc"`. The default value is `plot.pos.con=0.375`. See the help files for `gofTestCensored` and `ppointsCensored` for more information.
`warn`	logical scalar indicating whether to print a warning message when observations with `NA`s, `NaN`s, or `Inf`s in `y` are removed. The default value is `warn=TRUE`.
`keep.data`	logical scalar indicating whether to return the original data. The default value is `keep.data=TRUE`.
`data.name`	optional character string indicating the name of the data used for argument `x`.
`censoring.name`	optional character string indicating the name for the data used for argument `censored`.

Details

The function distChooseCensored returns a list with information on the goodness-of-fit tests for various distributions and which distribution appears to best fit the data based on the p-values from the goodness-of-fit tests. This function was written in order to compare ProUCL's way of choosing the best-fitting distribution (USEPA, 2015) with other ways of choosing the best-fitting distribution.

Method Based on Shapiro-Wilk, Shapiro-Francia, or Probability Plot Correlation Test
(method="sw", method="sf", or method="ppcc")

For each value of the argument choices, the function distChooseCensored runs the goodness-of-fit test using the data in x assuming that particular distribution. For example, if
choices=c("norm", "gamma", "lnorm"), indicating the Normal, Gamma, and Lognormal distributions, and method="sf", then the usual Shapiro-Francia test is performed for the Normal and Lognormal distributions, and the extension of the Shapiro-Francia test is performed for the Gamma distribution (see the section Testing Goodness-of-Fit for Any Continuous Distribution in the help file for gofTestCensored for an explanation of the latter). The distribution associated with the largest p-value is the chosen distribution. In the case when all p-values are less than the value of the argument alpha, the distribution “Nonparametric” is chosen.

Method Based on ProUCL Algorithm (method="proucl")

When method="proucl", the function distChooseCensored uses the algorithm that ProUCL (USEPA, 2015) uses to determine the best fitting distribution. The candidate distributions are the Normal, Gamma, and Lognormal distributions. The algorithm used by ProUCL is as follows:

Remove all censored observations and use only the uncensored observations.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the Normal distribution, i.e., call the function gofTest with distribution="norm", test="sw" and
distribution = "norm", test="lillie". If either or both of the associated p-values are greater than or equal to the user-supplied value of alpha, then choose the Normal distribution. Otherwise, proceed to the next step.
Perform the “ProUCL Anderson-Darling” and “ProUCL Kolmogorov-Smirnov” goodness-of-fit tests for the Gamma distribution, i.e., call the function gofTest with
distribution="gamma", test="proucl.ad.gamma" and
distribution="gamma", test="proucl.ks.gamma". If either or both of the associated p-values are greater than or equal to the user-supplied value of alpha, then choose the Gamma distribution. Otherwise, proceed to the next step.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the Lognormal distribution, i.e., call the function gofTest with distribution = "lnorm", test="sw" and
distribution = "lnorm", test="lillie". If either or both of the associated p-values are greater than or equal to the user-supplied value of alpha, then choose the Lognormal distribution. Otherwise, proceed to the next step.
If none of the goodness-of-fit tests above yields a p-value greater than or equal to the user-supplied value of alpha, then choose the “Nonparametric” distribution.

Value

a list of class "distChooseCensored" containing the results of the goodness-of-fit tests. Objects of class "distChooseCensored" have a special printing method. See the help file for
distChooseCensored.object for details.

Note

In practice, almost any goodness-of-fit test will not reject the null hypothesis if the number of observations is relatively small. Conversely, almost any goodness-of-fit test will reject the null hypothesis if the number of observations is very large, since “real” data are never distributed according to any theoretical distribution (Conover, 1980, p.367). For most cases, however, the distribution of “real” data is close enough to some theoretical distribution that fairly accurate results may be provided by assuming that particular theoretical distribution. One way to asses the goodness of the fit is to use goodness-of-fit tests. Another way is to look at quantile-quantile (Q-Q) plots (see qqPlotCensored).

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.

Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.

Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.

Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.

D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of g1. Biometrika 57, 679-681.

D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.

D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York.

D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality. Empirical Results for the Distributions of b2 and \sqrt{b1}. Biometrika 60(3), 613-622.

D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of \sqrt{b1}. Biometrika 60(1), 169-173.

Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.

Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.

Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.

Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.

Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell' Istituto Italiano degle Attuari 4, 83-91.

Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v008.i18")}.

Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, pp.63-95.

Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.

Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.

Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.

Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.

Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.

Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvannia State University, Department of Statistics.

Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.

Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.

Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.

Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.

Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.

Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R. B., and M.A. Stevens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.

USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.

Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.

Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.

Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.

Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.

Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.

Examples

  # Generate 30 observations from a gamma distribution with
  # parameters mean=10 and cv=1 and then censor observations less than 5.
  # Then:
  #
  # 1) Call distChooseCensored using the Shapiro-Wilk method and specify
  #    choices of the
  #      normal,
  #      gamma (alternative parameterzation), and
  #      lognormal (alternative parameterization)
  #    distributions.
  #
  # 2) Compare the results in 1) above with the results using the
  #    ProUCL method.
  #
  # Notes:  The call to set.seed lets you reproduce this example.
  #
  #         The ProUCL method chooses the Normal distribution, whereas the
  #         Shapiro-Wilk method chooses the Gamma distribution.

  set.seed(598)

  dat <- sort(rgammaAlt(30, mean = 10, cv = 1))
  dat
  # [1]  0.5313509  1.4741833  1.9936208  2.7980636  3.4509840
  # [6]  3.7987348  4.5542952  5.5207531  5.5253596  5.7177872
  #[11]  5.7513827  9.1086375  9.8444090 10.6247123 10.9304922
  #[16] 11.7925398 13.3432689 13.9562777 14.6029065 15.0563342
  #[21] 15.8730642 16.0039936 16.6910715 17.0288922 17.8507891
  #[26] 19.1105522 20.2657141 26.3815970 30.2912797 42.8726101

  dat.censored <- dat
  censored <- dat.censored < 5
  dat.censored[censored] <- 5


  # 1) Call distChooseCensored using the Shapiro-Wilk method.
  #----------------------------------------------------------

  distChooseCensored(dat.censored, censored, method = "sw",
    choices = c("norm", "gammaAlt", "lnormAlt"))

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   Shapiro-Wilk
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Gamma
  #
  #Estimated Parameter(s):          mean = 12.4911448
  #                                 cv   =  0.7617343
  #
  #Estimation Method:               MLE
  #
  #Data:                            dat.censored
  #
  #Sample Size:                     30
  #
  #Censoring Side:                  left
  #
  #Censoring Variable:              censored
  #
  #Censoring Level(s):              5
  #
  #Percent Censored:                23.33333%
  #
  #Test Results:
  #
  #  Normal
  #    Test Statistic:              W = 0.9372741
  #    P-value:                     0.1704876
  #
  #  Gamma
  #    Test Statistic:              W = 0.9613711
  #    P-value:                     0.522329
  #
  #  Lognormal
  #    Test Statistic:              W = 0.9292406
  #    P-value:                     0.114511

  #--------------------

  # 2) Compare the results in 1) above with the results using the
  #    ProUCL method.
  #---------------------------------------------------------------

  distChooseCensored(dat.censored, censored, method = "proucl")

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   ProUCL
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Normal
  #
  #Estimated Parameter(s):          mean = 15.397584
  #                                 sd   =  8.688302
  #
  #Estimation Method:               mvue
  #
  #Data:                            dat.censored
  #
  #Sample Size:                     30
  #
  #Censoring Side:                  left
  #
  #Censoring Variable:              censored
  #
  #Censoring Level(s):              5
  #
  #Percent Censored:                23.33333%
  #
  #ProUCL Sample Size:              23
  #
  #Test Results:
  #
  #  Normal
  #    Shapiro-Wilk GOF
  #      Test Statistic:            W = 0.861652
  #      P-value:                   0.004457924
  #    Lilliefors (Kolmogorov-Smirnov) GOF
  #      Test Statistic:            D = 0.1714435
  #      P-value:                   0.07794315
  #
  #  Gamma
  #    ProUCL Anderson-Darling Gamma GOF
  #      Test Statistic:            A = 0.3805556
  #      P-value:                   >= 0.10
  #    ProUCL Kolmogorov-Smirnov Gamma GOF
  #      Test Statistic:            D = 0.1035271
  #      P-value:                   >= 0.10
  #
  #  Lognormal
  #    Shapiro-Wilk GOF
  #      Test Statistic:            W = 0.9532604
  #      P-value:                   0.3414187
  #    Lilliefors (Kolmogorov-Smirnov) GOF
  #      Test Statistic:            D = 0.115588
  #      P-value:                   0.5899259

  #--------------------

  # Clean up
  #---------

  rm(dat, censored, dat.censored)

  #====================================================================

  # Check the assumption that the silver data stored in Helsel.Cohn.88.silver.df
  # follows a lognormal distribution.
  # Note that the small p-value and the shape of the Q-Q plot
  # (an inverted S-shape) suggests that the log transformation is not quite strong
  # enough to "bring in" the tails (i.e., the log-transformed silver data has tails
  # that are slightly too long relative to a normal distribution).
  # Helsel and Cohn (1988, p.2002) note that the gross outlier of 560 mg/L tends to
  # make the shape of the data resemble a gamma distribution, but
  # the distChooseCensored function decision is neither Gamma nor Lognormal,
  # but instead Nonparametric.

  # First create a lognormal Q-Q plot
  #----------------------------------

  dev.new()
  with(Helsel.Cohn.88.silver.df,
    qqPlotCensored(Ag, Censored, distribution = "lnorm",
      points.col = "blue", add.line = TRUE))

  #----------

  # Now call the distChooseCensored function using the default settings.
  #---------------------------------------------------------------------

  with(Helsel.Cohn.88.silver.df,
    distChooseCensored(Ag, Censored))

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   Shapiro-Francia
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Nonparametric
  #
  #Data:                            Ag
  #
  #Sample Size:                     56
  #
  #Censoring Side:                  left
  #
  #Censoring Variable:              Censored
  #
  #Censoring Level(s):               0.1  0.2  0.3  0.5  1.0  2.0  2.5  5.0  6.0 10.0 20.0 25.0
  #
  #Percent Censored:                60.71429%
  #
  #Test Results:
  #
  #  Normal
  #    Test Statistic:              W = 0.3065529
  #    P-value:                     8.346126e-08
  #
  #  Gamma
  #    Test Statistic:              W = 0.6254148
  #    P-value:                     1.884155e-05
  #
  #  Lognormal
  #    Test Statistic:              W = 0.8957198
  #    P-value:                     0.03490314

  #----------

  # Clean up
  #---------

  graphics.off()

  #====================================================================

  # Chapter 15 of USEPA (2009) gives several examples of looking
  # at normal Q-Q plots and estimating the mean and standard deviation
  # for manganese concentrations (ppb) in groundwater at five background
  # wells (USEPA, 2009, p. 15-10).  The Q-Q plot shown in Figure 15-4
  # on page 15-13 clearly indicates that the Lognormal distribution
  # is a good fit for these data.
  # In EnvStats these data are stored in the data frame EPA.09.Ex.15.1.manganese.df.

  # Here we will call the distChooseCensored function to determine
  # whether the data appear to come from a normal, gamma, or lognormal
  # distribution.
  #
  # Note that using the Probability Plot Correlation Coefficient method
  # (equivalent to using the Shapiro-Francia method) yields a decision of
  # Lognormal, but using the ProUCL method yields a decision of Gamma.
  #----------------------------------------------------------------------


  # First look at the data:
  #-----------------------

  EPA.09.Ex.15.1.manganese.df

  #   Sample   Well Manganese.Orig.ppb Manganese.ppb Censored
  #1       1 Well.1                 <5           5.0     TRUE
  #2       2 Well.1               12.1          12.1    FALSE
  #3       3 Well.1               16.9          16.9    FALSE
  #...
  #23      3 Well.5                3.3           3.3    FALSE
  #24      4 Well.5                8.4           8.4    FALSE
  #25      5 Well.5                 <2           2.0     TRUE

  longToWide(EPA.09.Ex.15.1.manganese.df,
    "Manganese.Orig.ppb", "Sample", "Well",
    paste.row.name = TRUE)

  #         Well.1 Well.2 Well.3 Well.4 Well.5
  #Sample.1     <5     <5     <5    6.3   17.9
  #Sample.2   12.1    7.7    5.3   11.9   22.7
  #Sample.3   16.9   53.6   12.6     10    3.3
  #Sample.4   21.6    9.5  106.3     <2    8.4
  #Sample.5     <2   45.9   34.5   77.2     <2


  # Use distChooseCensored with the probability plot correlation method,
  # and for the gamma and lognormal distribution specify the
  # mean and CV parameterization:
  #------------------------------------------------------------

  with(EPA.09.Ex.15.1.manganese.df,
    distChooseCensored(Manganese.ppb, Censored,
      choices = c("norm", "gamma", "lnormAlt"), method = "ppcc"))

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   PPCC
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Lognormal
  #
  #Estimated Parameter(s):          mean = 23.003987
  #                                 cv   =  2.300772
  #
  #Estimation Method:               MLE
  #
  #Data:                            Manganese.ppb
  #
  #Sample Size:                     25
  #
  #Censoring Side:                  left
  #
  #Censoring Variable:              Censored
  #
  #Censoring Level(s):              2 5
  #
  #Percent Censored:                24%
  #
  #Test Results:
  #
  #  Normal
  #    Test Statistic:              r = 0.9147686
  #    P-value:                     0.004662658
  #
  #  Gamma
  #    Test Statistic:              r = 0.9844875
  #    P-value:                     0.6836625
  #
  #  Lognormal
  #    Test Statistic:              r = 0.9931982
  #    P-value:                     0.9767731

  #--------------------

  # Repeat the above example using the ProUCL method.
  #--------------------------------------------------

  with(EPA.09.Ex.15.1.manganese.df,
    distChooseCensored(Manganese.ppb, Censored, method = "proucl"))

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   ProUCL
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Gamma
  #
  #Estimated Parameter(s):          shape =  1.284882
  #                                 scale = 19.813413
  #
  #Estimation Method:               MLE
  #
  #Data:                            Manganese.ppb
  #
  #Sample Size:                     25
  #
  #Censoring Side:                  left
  #
  #Censoring Variable:              Censored
  #
  #Censoring Level(s):              2 5
  #
  #Percent Censored:                24%
  #
  #ProUCL Sample Size:              19
  #
  #Test Results:
  #
  #  Normal
  #    Shapiro-Wilk GOF
  #      Test Statistic:            W = 0.7423947
  #      P-value:                   0.0001862975
  #    Lilliefors (Kolmogorov-Smirnov) GOF
  #      Test Statistic:            D = 0.2768771
  #      P-value:                   0.0004771155
  #
  #  Gamma
  #    ProUCL Anderson-Darling Gamma GOF
  #      Test Statistic:            A = 0.6857121
  #      P-value:                   0.05 <= p < 0.10
  #    ProUCL Kolmogorov-Smirnov Gamma GOF
  #      Test Statistic:            D = 0.1830034
  #      P-value:                   >= 0.10
  #
  #  Lognormal
  #    Shapiro-Wilk GOF
  #      Test Statistic:            W = 0.969805
  #      P-value:                   0.7725528
  #    Lilliefors (Kolmogorov-Smirnov) GOF
  #      Test Statistic:            D = 0.138547
  #      P-value:                   0.4385195

  #====================================================================

  ## Not run: 
  # 1) Simulate 1000 trials where for each trial you:
  #    a) Generate 30 observations from a Gamma distribution with
  #       parameters mean = 10 and CV = 1.
  #    b) Censor observations less than 5 (the 39th percentile).
  #    c) Use distChooseCensored with the Shapiro-Francia method.
  #    d) Use distChooseCensored with the ProUCL method.
  #
  #  2) Compare the proportion of times the
  #     Normal vs. Gamma vs. Lognormal vs. Nonparametric distribution
  #     is chosen for c) and d) above.
  #------------------------------------------------------------------

  set.seed(58)
  N <- 1000

  Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric"))
  Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac)

  for(i in 1:N) {
    dat <- rgammaAlt(30, mean = 10, cv = 1)
    censored <- dat < 5
    dat[censored] <- 5
    Choose.df[i, "SW"]     <- distChooseCensored(dat, censored, method = "sw")$decision
    Choose.df[i, "ProUCL"] <- distChooseCensored(dat, censored, method = "proucl")$decision
  }

  summaryStats(Choose.df, digits = 0)

  #                ProUCL(N) ProUCL(Pct) SW(N) SW(Pct)
  #Normal              520          52   398      40
  #Gamma               336          34   375      38
  #Lognormal           105          10   221      22
  #Nonparametric        39           4     6       1
  #Combined           1000         100  1000     100

  #--------------------


  # Repeat above example for the Lognormal Distribution with mean=10 and CV = 1.
  # In this case, 5 is the 34th percentile.
  #-----------------------------------------------------------------------------

  set.seed(297)
  N <- 1000

  Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric"))
  Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac)

  for(i in 1:N) {
    dat <- rlnormAlt(30, mean = 10, cv = 1)
    censored <- dat < 5
    dat[censored] <- 5
    Choose.df[i, "SW"]     <- distChooseCensored(dat, censored, method = "sf")$decision
    Choose.df[i, "ProUCL"] <- distChooseCensored(dat, censored, method = "proucl")$decision
  }

  summaryStats(Choose.df, digits = 0)

  #              ProUCL(N) ProUCL(Pct) SW(N) SW(Pct)
  #Normal              277          28    92       9
  #Gamma               393          39   231      23
  #Lognormal           190          19   624      62
  #Nonparametric       140          14    53       5
  #Combined           1000         100  1000     100

  #--------------------


  # Clean up
  #---------

  rm(N, Choose.fac, Choose.df, i, dat, censored)
  
## End(Not run)

EnvStats documentation built on June 8, 2025, 11:37 a.m.