Boxcox Power Transformation for Type I Censored Data

Share:

Description

Compute the value(s) of an objective for one or more Box-Cox power transformations, or to compute an optimal power transformation based on a specified objective, based on Type I censored data.

Usage

1
2
3
4
5
  boxcoxCensored(x, censored, censoring.side = "left", 
    lambda = {if (optimize) c(-2, 2) else seq(-2, 2, by = 0.5)}, optimize = FALSE, 
    objective.name = "PPCC", eps = .Machine$double.eps, 
    include.x.and.censored = TRUE, prob.method = "michael-schucany", 
    plot.pos.con = 0.375) 

Arguments

x

a numeric vector of positive numbers. Missing (NA), undefined (NaN), and infinite (-Inf, Inf) values are allowed but will be removed.

censored

numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to elements of x that are censored, and FALSE values correspond to elements of x that are not censored. If the mode of censored is "numeric", it must contain only 1's and 0's; 1 corresponds to TRUE and 0 corresponds to FALSE. Missing (NA) values are allowed but will be removed.

censoring.side

character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right".

lambda

numeric vector of finite values indicating what powers to use for the Box-Cox transformation. When optimize=FALSE, the default value is
lambda=seq(-2, 2, by=0.5). When optimize=TRUE, lambda must be a vector with two values indicating the range over which the optimization will occur and the range of these two values must include 1. In this case, the default value is lambda=c(-2, 2).

optimize

logical scalar indicating whether to simply evalute the objective function at the given values of lambda (optimize=FALSE; the default), or to compute the optimal power transformation within the bounds specified by lambda (optimize=TRUE).

objective.name

character string indicating what objective to use. The possible values are "PPCC" (probability plot correlation coefficient; the default), "Shapiro-Wilk" (the Shapiro-Wilk goodness-of-fit statistic), and "Log-Likelihood" (the log-likelihood function).

eps

finite, positive numeric scalar. When the absolute value of lambda is less than eps, lambda is assumed to be 0 for the Box-Cox transformation. The default value is eps=.Machine$double.eps.

include.x.and.censored

logical scalar indicating whether to include the finite, non-missing values of the argument x and the corresponding values of censored with the returned object. The default value is include.x.and.censored=TRUE.

prob.method

for multiply censored data, character string indicating what method to use to compute the plotting positions (empirical probabilities) when
objective.name="PPCC". Possible values are:

"kaplan-meier" (product-limit method of Kaplan and Meier (1958)),
"modified kaplan-meier" (same as "kaplan-meier" with the maximum value included),
"nelson" (hazard plotting method of Nelson (1972)),
"michael-schucany" (generalization of the product-limit method due to Michael and Schucany (1986)), and
"hirsch-stedinger" (generalization of the product-limit method due to Hirsch and Stedinger (1987)).

The default value is prob.method="michael-schucany".

The "nelson" method is only available for censoring.side="right", and the "modified kaplan-meier" is only available for censoring.side="left". See the DETAILS section for more explanation.

This argument is ignored if objective.name is not equal to "PPCC" and/or the data are singly censored.

plot.pos.con

for multiply censored data, numeric scalar between 0 and 1 containing the value of the plotting position constant when objective.name="PPCC". The default value is plot.pos.con=0.375. See the DETAILS section for more information. This argument is used only if prob.method is equal to "michael-schucany" or "hirsch-stedinger".

This argument is ignored if objective.name is not equal to "PPCC" and/or the data are singly censored.

Details

Two common assumptions for several standard parametric hypothesis tests are:

  1. The observations all come from a normal distribution.

  2. The observations all come from distributions with the same variance.

For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups.

When the original data do not satisfy the above assumptions, data transformations are often used to attempt to satisfy these assumptions. Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable X from some distribution with only positive values, the Box-Cox family of power transformations is defined as:

Y = \frac{X^λ - 1}{λ} λ \ne 0
log(X) λ = 0 \;\;\;\;\;\; (1)

where Y is assumed to come from a normal distribution. This transformation is continuous in λ. Note that this transformation also preserves ordering. See the help file for boxcoxTransform for more information on data transformations.

Box and Cox (1964) proposed choosing the appropriate value of λ based on maximizing the likelihood function. Alternatively, an appropriate value of λ can be chosen based on another objective, such as maximizing the probability plot correlation coefficient or the Shapiro-Wilk goodness-of-fit statistic.

Shumway et al. (1989) investigated extending the method of Box and Cox (1964) to the case of Type I censored data, motivated by the desire to produce estimated means and confidence intervals for air monitoring data that included censored values.

In the case when optimize=TRUE, the function boxcoxCensored calls the R function nlminb to minimize the negative value of the objective (i.e., maximize the objective) over the range of possible values of λ specified in the argument lambda. The starting value for the optimization is always λ=1 (i.e., no transformation).

The next section explains assumptions and notation, and the section after that explains how the objective is computed for the various options for objective.name.

Assumptions and Notation
Let \underline{x} denote a random sample of N observations from some continuous distribution. Assume n (0 < n < N) of these observations are known and c (c=N-n) of these observations are all censored below (left-censored) or all censored above (right-censored) at k fixed censoring levels

T_1, T_2, …, T_K; \; K ≥ 1 \;\;\;\;\;\; (2)

For the case when K ≥ 2, the data are said to be Type I multiply censored. For the case when K=1, set T = T_1. If the data are left-censored and all n known observations are greater than or equal to T, or if the data are right-censored and all n known observations are less than or equal to T, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let c_j denote the number of observations censored below or above censoring level T_j for j = 1, 2, …, K, so that

∑_{i=1}^K c_j = c \;\;\;\;\;\; (3)

Let x_{(1)}, x_{(2)}, …, x_{(N)} denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.

Note that in this case the quantity x_{(i)} does not necessarily represent the i'th “largest” observation from the (unknown) complete sample.

Finally, let Ω (omega) denote the set of n subscripts in the “ordered” sample that correspond to uncensored observations, and let Ω_j denote the set of c_j subscripts in the “ordered” sample that correspond to the censored observations censored at censoring level T_j for j = 1, 2, …, k.

We assume that there exists some value of λ such that the transformed observations

y_i = \frac{x_i^λ - 1}{λ} λ \ne 0
log(x_i) λ = 0 \;\;\;\;\;\; (4)

(i = 1, 2, …, n) form a random sample of Type I censored data from a normal distribution.

Note that for the censored observations, Equation (4) becomes:

y_{(i)} = T_j^* = \frac{T_j^λ - 1}{λ} λ \ne 0
log(T_j) λ = 0 \;\;\;\;\;\; (5)

where i \in Ω_j.

Computing the Objective

Objective Based on Probability Plot Correlation Coefficient (objective.name="PPCC")
When objective.name="PPCC", the objective is computed as the value of the normal probability plot correlation coefficient based on the transformed data (see the description of the Probability Plot Correlation Coefficient (PPCC) goodness-of-fit test in the help file for gofTestCensored). That is, the objective is the correlation coefficient for the normal quantile-quantile plot for the transformed data. Large values of the PPCC tend to indicate a good fit to a normal distribution.

Objective Based on Shapiro-Wilk Goodness-of-Fit Statistic (objective.name="Shapiro-Wilk")
When objective.name="Shapiro-Wilk", the objective is computed as the value of the Shapiro-Wilk goodness-of-fit statistic based on the transformed data (see the description of the Shapiro-Wilk test in the help file for gofTestCensored). Large values of the Shapiro-Wilk statistic tend to indicate a good fit to a normal distribution.

Objective Based on Log-Likelihood Function (objective.name="Log-Likelihood")
When objective.name="Log-Likelihood", the objective is computed as the value of the log-likelihood function. Assuming the transformed observations in Equation (4) above come from a normal distribution with mean μ and standard deviation σ, we can use the change of variable formula to write the log-likelihood function as follows.

For Type I left censored data, the likelihood function is given by:

log[L(λ, μ, σ)] = log[{N \choose c_1 c_2 … c_k n}] + ∑_{j=1}^k c_j log[F(T_j^*)] + ∑_{i \in Ω} log\{f[y_{(i)}]\} + (λ - 1) ∑_{i \in Ω} log[x_{(i)}] \;\;\;\;\;\; (6)

where f and F denote the probability density function (pdf) and cumulative distribution function (cdf) of the population. That is,

f(t) = φ(\frac{t-μ}{σ}) \;\;\;\;\;\; (7)

F(t) = Φ(\frac{t-μ}{σ}) \;\;\;\;\;\; (8)

where φ and Φ denote the pdf and cdf of the standard normal distribution, respectively (Shumway et al., 1989). For left singly censored data, Equation (6) simplifies to:

log[L(λ, μ, σ)] = log[{N \choose c}] + c log[F(T^*)] + ∑_{i = c+1}^N log\{f[y_{(i)}]\} + (λ - 1) ∑_{i = c+1}^N log[x_{(i)}] \;\;\;\;\;\; (9)

Similarly, for Type I right censored data, the likelihood function is given by:

log[L(λ, μ, σ)] = log[{N \choose c_1 c_2 … c_k n}] + ∑_{j=1}^k c_j log[1 - F(T_j^*)] + ∑_{i \in Ω} log\{f[y_{(i)}]\} + (λ - 1) ∑_{i \in Ω} log[x_{(i)}] \;\;\;\;\;\; (10)

and for right singly censored data this simplifies to:

log[L(λ, μ, σ)] = log[{N \choose c}] + c log[1 - F(T^*)] + ∑_{i = 1}^n log\{f[y_{(i)}]\} + (λ - 1) ∑_{i = 1}^n log[x_{(i)}] \;\;\;\;\;\; (11)

For a fixed value of λ, the log-likelihood function is maximized by replacing μ and σ with their maximum likelihood estimators (see the section Maximum Likelihood Estimation in the help file for enormCensored).

Thus, when optimize=TRUE, Equation (6) or (10) is maximized by iteratively solving for λ using the MLEs for μ and σ. When optimize=FALSE, the value of the objective is computed by using Equation (6) or (10), using the values of λ specified in the argument lambda, and using the MLEs of μ and σ.

Value

boxcoxCensored returns a list of class "boxcoxCensored" containing the results. See the help file for boxcoxCensored.object for details.

Note

Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used everyday, such as the pH scale for measuring acidity. Johnson and Wichern (2007, p.192) note that "Transformations are nothing more than a reexpression of the data in different units."

Shumway et al. (1989) investigated extending the method of Box and Cox (1964) to the case of Type I censored data, motivated by the desire to produce estimated means and confidence intervals for air monitoring data that included censored values.

Stoline (1991) compared the goodness-of-fit of Box-Cox transformed data (based on using the “optimal” power transformation from a finite set of values between -1.5 and 1.5) with log-transformed data for 17 groundwater chemistry variables. Using the Probability Plot Correlation Coefficient statistic for censored data as a measure of goodness-of-fit (see gofTest), Stoline (1991) found that only 6 of the variables were adequately modeled by a Box-Cox transformation (p >0.10 for these 6 variables). Of these variables, five were adequately modeled by a a log transformation. Ten of variables were “marginally” fit by an optimal Box-Cox transformation, and of these 10 only 6 were marginally fit by a log transformation. Based on these results, Stoline (1991) recommends checking the assumption of lognormality before automatically assuming environmental data fit a lognormal distribution.

One problem with data transformations is that translating results on the transformed scale back to the original scale is not always straightforward. Estimating quantities such as means, variances, and confidence limits in the transformed scale and then transforming them back to the original scale usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149; van Belle et al., 2004, p.400). For example, exponentiating the confidence limits for a mean based on log-transformed data does not yield a confidence interval for the mean on the original scale. Instead, this yields a confidence interval for the median (see the help file for elnormAltCensored). It should be noted, however, that quantiles (percentiles) and rank-based procedures are invariant to monotonic transformations (Helsel and Hirsch, 1992, p.12).

Finally, there is no guarantee that a Box-Cox tranformation based on the “optimal” value of λ will provide an adequate transformation to allow the assumption of approximate normality and constant variance. Any set of transformed data should be inspected relative to the assumptions you want to make about it (Johnson and Wichern, 2007, p.194).

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.

Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.

Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, pp.50–59.

Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.

Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.

Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.

Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.

Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.

Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.

Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.

Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.

Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.

Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.

van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.

Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 13.

See Also

boxcoxCensored.object, plot.boxcoxCensored, print.boxcoxCensored, boxcox, Data Transformations, Goodness-of-Fit Tests.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
  # Generate 15 observations from a lognormal distribution with 
  # mean=10 and cv=2 and censor the observations less than 2.  
  # Then generate 15 more observations from this distribution and 
  # censor the observations less than 4.  
  # Then Look at some values of various objectives for various transformations.  
  # Note that for both the PPCC objective the optimal value is about -0.3, 
  # whereas for the Log-Likelihood objective it is about 0.3.
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(250) 

  x.1 <- rlnormAlt(15, mean = 10, cv = 2) 
  censored.1 <- x.1 < 2
  x.1[censored.1] <- 2

  x.2 <- rlnormAlt(15, mean = 10, cv = 2) 
  censored.2 <- x.2 < 4
  x.2[censored.2] <- 4

  x <- c(x.1, x.2)
  censored <- c(censored.1, censored.2)

  #--------------------------
  # Using the PPCC objective:
  #--------------------------

  boxcoxCensored(x, censored) 

  #Results of Box-Cox Transformation
  #Based on Type I Censored Data
  #---------------------------------
  #
  #Objective Name:                  PPCC
  #
  #Data:                            x
  #
  #Censoring Variable:              censored
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):              2 4 
  #
  #Sample Size:                     30
  #
  #Percent Censored:                26.7%
  #
  # lambda      PPCC
  #   -2.0 0.8954683
  #   -1.5 0.9338467
  #   -1.0 0.9643680
  #   -0.5 0.9812969
  #    0.0 0.9776834
  #    0.5 0.9471025
  #    1.0 0.8901990
  #    1.5 0.8187488
  #    2.0 0.7480494


  boxcoxCensored(x, censored, optimize = TRUE)

  #Results of Box-Cox Transformation
  #Based on Type I Censored Data
  #---------------------------------
  #
  #Objective Name:                  PPCC
  #
  #Data:                            x
  #
  #Censoring Variable:              censored
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):              2 4 
  #
  #Sample Size:                     30
  #
  #Percent Censored:                26.7%
  #
  #Bounds for Optimization:         lower = -2
  #                                 upper =  2
  #
  #Optimal Value:                   lambda = -0.3194799
  #
  #Value of Objective:              PPCC = 0.9827546


  #-----------------------------------
  # Using the Log-Likelihodd objective
  #-----------------------------------

  boxcoxCensored(x, censored, objective.name = "Log-Likelihood") 

  #Results of Box-Cox Transformation
  #Based on Type I Censored Data
  #---------------------------------
  #
  #Objective Name:                  Log-Likelihood
  #
  #Data:                            x
  #
  #Censoring Variable:              censored
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):              2 4 
  #
  #Sample Size:                     30
  #
  #Percent Censored:                26.7%
  #
  # lambda Log-Likelihood
  #   -2.0      -95.38785
  #   -1.5      -84.76697
  #   -1.0      -75.36204
  #   -0.5      -68.12058
  #    0.0      -63.98902
  #    0.5      -63.56701
  #    1.0      -66.92599
  #    1.5      -73.61638
  #    2.0      -82.87970


  boxcoxCensored(x, censored, objective.name = "Log-Likelihood", 
    optimize = TRUE) 

  #Results of Box-Cox Transformation
  #Based on Type I Censored Data
  #---------------------------------
  #
  #Objective Name:                  Log-Likelihood
  #
  #Data:                            x
  #
  #Censoring Variable:              censored
  #
  #Censoring Side:                  left
  #
  #Censoring Level(s):              2 4 
  #
  #Sample Size:                     30
  #
  #Percent Censored:                26.7%
  #
  #Bounds for Optimization:         lower = -2
  #                                 upper =  2
  #
  #Optimal Value:                   lambda = 0.3049744
  #
  #Value of Objective:              Log-Likelihood = -63.2733

  #----------

  # Plot the results based on the PPCC objective
  #---------------------------------------------
  boxcox.list <- boxcoxCensored(x, censored)
  dev.new()
  plot(boxcox.list)

  #Look at QQ-Plots for the candidate values of lambda
  #---------------------------------------------------
  plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) 

  #==========

  # Clean up
  #---------
  rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list)
  graphics.off()

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.