eqlnormCensored: Estimate Quantiles of a Lognormal Distribution Based on Type...
In EnvStats: Package for Environmental Statistics, Including US EPA Guidance

eqlnormCensored

R Documentation

Estimate Quantiles of a Lognormal Distribution Based on Type I Censored Data

Description

Estimate quantiles of a lognormal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for a quantile.

Usage

  eqlnormCensored(x, censored, censoring.side = "left", p = 0.5, method = "mle", 
    ci = FALSE, ci.method = "exact.for.complete", ci.type = "two-sided", 
    conf.level = 0.95, digits = 0, nmc = 1000, seed = NULL)

Arguments

`x`	a numeric vector of positive observations. Missing (`NA`), undefined (`NaN`), and infinite (`Inf`, `-Inf`) values are allowed but will be removed.
`censored`	numeric or logical vector indicating which values of `x` are censored. This must be the same length as `x`. If the mode of `censored` is `"logical"`, `TRUE` values correspond to elements of `x` that are censored, and `FALSE` values correspond to elements of `x` that are not censored. If the mode of `censored` is `"numeric"`, it must contain only `1`'s and `0`'s; `1` corresponds to `TRUE` and `0` corresponds to `FALSE`. Missing (`NA`) values are allowed but will be removed.
`censoring.side`	character string indicating on which side the censoring occurs. The possible values are `"left"` (the default) and `"right"`.
`p`	numeric vector of probabilities for which quantiles will be estimated. All values of `p` must be between 0 and 1. When `ci=TRUE`, `p` must be a scalar. The default value is `p=0.5`.
`method`	character string specifying the method of estimating the mean and standard deviation on the log-scale. For singly censored data, the possible values are: `"mle"` (maximum likelihood; the default), `"bcmle"` (bias-corrected maximum likelihood), `"qq.reg"` (quantile-quantile regression), `"qq.reg.w.cen.level"` (quantile-quantile regression including the censoring level), `"impute.w.qq.reg"` (moment estimation based on imputation using the `qq.reg method`), `"impute.w.qq.reg.w.cen.level"` (moment estimation based on imputation using the `qq.reg.w.cen.level method`), `"impute.w.mle"` (moment estimation based on imputation using the mle), `"iterative.impute.w.qq.reg"` (moment estimation based on iterative imputation using the `qq.reg method`), `"m.est"` (robust M-estimation), and `"half.cen.level"` (moment estimation based on setting the censored observations to half the censoring level). For multiply censored data, the possible values are: `"mle"` (maximum likelihood; the default), `"qq.reg"` (quantile-quantile regression), `"impute.w.qq.reg"` (moment estimation based on imputation using the `qq.reg method`), and `"half.cen.level"` (moment estimation based on setting the censored observations to half the censoring level). See the DETAILS section for more information.
`ci`	logical scalar indicating whether to compute a confidence interval for the quantile. The default value is `ci=FALSE`.
`ci.method`	character string indicating what method to use to construct the confidence interval for the quantile. The possible values are: `"exact.for.complete"` (exact method for complete (uncensored) data; the default), `"gpq"` (method based on generalized pivotal quantities), and `"normal.approx"` (normal approximation). See the DETAILS section for more information. This argument is ignored if `ci=FALSE`.
`ci.type`	character string indicating what kind of confidence interval for the quantile to compute. The possible values are `"two-sided"` (the default), `"lower"`, and `"upper"`. This argument is ignored if `ci=FALSE`.
`conf.level`	a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is `conf.level=0.95`. This argument is ignored if `ci=FALSE`.
`digits`	an integer indicating the number of decimal places to round to when printing out the value of `100*p`. The default value is `digits=0`.
`nmc`	numeric scalar indicating the number of Monte Carlo simulations to run when `ci=TRUE` and `ci.method="gpq"`. The default is `nmc=1000`.
`seed`	integer supplied to the function `set.seed` and used when `ci=TRUE` and `ci.method="gpq"`. The default value is `seed=NULL`, in which case the current value of `.Random.seed` is used.

Details

Quantiles and their associated confidence intervals are constructed by calling the function
eqnormCensored using the log-transformed data and then exponentiating the quantiles and confidence limits.

Value

eqlnormCensored returns a list of class "estimateCensored" containing the estimated quantile(s) and other information. See estimateCensored.object for details.

Note

Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.

A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.

Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.

Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.

In practice, it is better to use a confidence interval for a percentile, rather than rely on a single point-estimate of percentile. Confidence intervals for percentiles of a normal distribution depend on the properties of the estimators for both the mean and standard deviation.

Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring (see, for example, Singh et al., 2006). Studies to evaluate the performance of a confidence interval for a percentile include: Caudill et al. (2007), Hewett and Ganner (2007), Kroll and Stedinger (1996), and Serasinghe (2010).

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.

Caudill, S.P., L.-Y. Wong, W.E. Turner, R. Lee, A. Henderson, D. G. Patterson Jr. (2007). Percentile Estimation Using Variable Censored Data. Chemosphere 68, 169–180.

Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.

Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.

Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.

Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.

Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.

Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.

Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.

Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.

Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.

Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.

Hewett, P., and G.H. Ganser. (2007). A Comparison of Several Methods for Analyzing Censored Data. Annals of Occupational Hygiene 51(7), 611–632.

Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.

Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.

Kroll, C.N., and J.R. Stedinger. (1996). Estimation of Moments and Quantiles Using Censored Data. Water Resources Research 32(4), 1005–1012.

Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.

Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.

Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.

Serasinghe, S.K. (2010). A Simulation Comparison of Parametric and Nonparametric Estimators of Quantiles from Right Censored Data. A Report submitted in partial fulfillment of the requirements for the degree Master of Science, Department of Statistics, College of Arts and Sciences, Kansas State University, Manhattan, Kansas.

Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.

Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.

Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.

Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.

Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.

Examples

  # Generate 15 observations from a lognormal distribution with 
  # parameters meanlog=3 and sdlog=0.5, and censor observations less than 10. 
  # Then generate 15 more observations from this distribution and  censor 
  # observations less than 9.
  # Then estimate the 90th percentile and create a one-sided upper 95% 
  # confidence interval for that percentile. 
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(47) 

  x.1 <- rlnorm(15, meanlog = 3, sdlog = 0.5) 
  sort(x.1)
  # [1]  8.051717  9.651611 11.671282 12.271247 12.664108 17.446124
  # [7] 17.707301 20.238069 20.487219 21.025510 21.208197 22.036554
  #[13] 25.710773 28.661973 54.453557

  censored.1 <- x.1 < 10
  x.1[censored.1] <- 10

  x.2 <- rlnorm(15, meanlog = 3, sdlog = 0.5) 
  sort(x.2)
  # [1]  6.289074  7.511164  8.988267  9.179006 12.869408 14.130081
  # [7] 16.941937 17.060513 19.287572 19.682126 20.363893 22.750203
  #[13] 24.744306 28.089325 37.792873

  censored.2 <- x.2 < 9
  x.2[censored.2] <- 9

  x <- c(x.1, x.2)
  censored <- c(censored.1, censored.2)

  eqlnormCensored(x, censored, p = 0.9, ci = TRUE, ci.type = "upper")

  #Results of Distribution Parameter Estimation
  #Based on Type I Censored Data
  #--------------------------------------------
  #
  #Assumed Distribution:            Lognormal
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):               9 10 
  #
  #Estimated Parameter(s):          meanlog = 2.8099300
  #                                 sdlog   = 0.5137151
  #
  #Estimation Method:               MLE
  #
  #Estimated Quantile(s):           90'th %ile = 32.08159
  #
  #Quantile Estimation Method:      Quantile(s) Based on
  #                                 MLE Estimators
  #
  #Data:                            x
  #
  #Censoring Variable:              censored
  #  
  #Sample Size:                     30
  #
  #Percent Censored:                16.66667%
  #
  #Confidence Interval for:         90'th %ile
  #
  #Assumed Sample Size:             30
  #
  #Confidence Interval Method:      Exact for
  #                                 Complete Data
  #
  #Confidence Interval Type:        upper
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             LCL =  0.00000
  #                                 UCL = 41.38716

  #----------

  # Compare these results with the true 90'th percentile:

  qlnorm(p = 0.9, meanlog = 3, sd = 0.5)
  #[1] 38.1214

  #----------

  # Clean up
  rm(x.1, censored.1, x.2, censored.2, x, censored)
  
  #--------------------------------------------------------------------

  # Chapter 15 of USEPA (2009) gives several examples of estimating the mean
  # and standard deviation of a lognormal distribution on the log-scale using 
  # manganese concentrations (ppb) in groundwater at five background wells. 
  # In EnvStats these data are stored in the data frame 
  # EPA.09.Ex.15.1.manganese.df.

  # Here we will estimate the mean and standard deviation using the MLE, 
  # and then construct an upper 95% confidence limit for the 90th percentile. 

  # First look at the data:
  #-----------------------

  EPA.09.Ex.15.1.manganese.df

  #   Sample   Well Manganese.Orig.ppb Manganese.ppb Censored
  #1       1 Well.1                 <5           5.0     TRUE
  #2       2 Well.1               12.1          12.1    FALSE
  #3       3 Well.1               16.9          16.9    FALSE
  #...
  #23      3 Well.5                3.3           3.3    FALSE
  #24      4 Well.5                8.4           8.4    FALSE
  #25      5 Well.5                 <2           2.0     TRUE

  longToWide(EPA.09.Ex.15.1.manganese.df, 
    "Manganese.Orig.ppb", "Sample", "Well",
    paste.row.name = TRUE)  

  #         Well.1 Well.2 Well.3 Well.4 Well.5
  #Sample.1     <5     <5     <5    6.3   17.9
  #Sample.2   12.1    7.7    5.3   11.9   22.7
  #Sample.3   16.9   53.6   12.6     10    3.3
  #Sample.4   21.6    9.5  106.3     <2    8.4
  #Sample.5     <2   45.9   34.5   77.2     <2


  # Now estimate the mean, standard deviation, and 90th percentile 
  # on the log-scale using the MLE, and construct an upper 95% 
  # confidence limit for the 90th percentile:
  #---------------------------------------------------------------

  with(EPA.09.Ex.15.1.manganese.df,
    eqlnormCensored(Manganese.ppb, Censored, 
      p = 0.9, ci = TRUE, ci.type = "upper"))

  #Results of Distribution Parameter Estimation
  #Based on Type I Censored Data
  #--------------------------------------------
  #
  #Assumed Distribution:            Lognormal
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):              2 5 
  #
  #Estimated Parameter(s):          meanlog = 2.215905
  #                                 sdlog   = 1.356291
  #
  #Estimation Method:               MLE
  #
  #Estimated Quantile(s):           90'th %ile = 52.14674
  #
  #Quantile Estimation Method:      Quantile(s) Based on
  #                                 MLE Estimators
  #
  #Data:                            Manganese.ppb
  #
  #Censoring Variable:              censored
  #
  #Sample Size:                     25
  #
  #Percent Censored:                24%
  #
  #Confidence Interval for:         90'th %ile
  #
  #Assumed Sample Size:             25
  #
  #Confidence Interval Method:      Exact for
  #                                 Complete Data
  #
  #Confidence Interval Type:        upper
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             LCL =   0.0000
  #                                 UCL = 110.9305

EnvStats documentation built on June 8, 2025, 11:37 a.m.