fsrfan: Robust transformations for regression

View source: R/fsrfan.R

fsrfanR Documentation

Robust transformations for regression

Description

The transformations for negative and positive responses were determined by Yeo and Johnson (2000) by imposing the smoothness condition that the second derivative of zYJ (\lambda) with respect to y be smooth at y = 0. However some authors, for example Weisberg (2005), query the physical interpretability of this constraint which is oftern violated in data analysis. Accordingly, Atkinson et al. (2019) and (2020) extend the Yeo-Johnson transformation to allow two values of the transformations parameter: \lambda_N for negative observations and \lambda_P for non-negative ones.

FSRfan monitors:

  1. the t test associated with the constructed variable computed assuming the same transformation parameter for positive and negative observations fixed. In short we call this test, "global score test for positive observations".

  2. the t test associated with the constructed variable computed assuming a different transformation for positive observations keeping the value of the transformation parameter for negative observations fixed. In short we call this test, "test for positive observations".

  3. the t test associated with the constructed variable computed assuming a different transformation for negative observations keeping the value of the transformation parameter for positive observations fixed. In short we call this test, "test for negative observations".

  4. the F test for the joint presence of the two constructed variables described in points 2) and 3).

  5. the F likelihood ratio test based on the MLE of \lambda_P and \lambda_N. In this case the residual sum of squares of the null model bsaed on a single trasnformation parameter \lambda is compared with the residual sum of squares of the model based on data transformed data using MLE of \lambda_P and \lambda_N.

Usage

fsrfan(x, ...)

## S3 method for class 'formula'
fsrfan(
  formula,
  data,
  subset,
  weights,
  na.action,
  model = TRUE,
  x.ret = FALSE,
  y.ret = FALSE,
  contrasts = NULL,
  offset,
  ...
)

## Default S3 method:
fsrfan(
  x,
  y,
  intercept = TRUE,
  plot = FALSE,
  family = c("BoxCox", "YJ", "YJpn", "YJall"),
  la = c(-1, -0.5, 0, 0.5, 1),
  lms,
  alpha = 0.75,
  h,
  init,
  msg = FALSE,
  nocheck = FALSE,
  nsamp = 1000,
  conflev = 0.99,
  xlab,
  ylab,
  main,
  xlim,
  ylim,
  lwd = 2,
  lwd.env = 1,
  trace = FALSE,
  ...
)

## S3 method for class 'fsrfan'
plot(
  x,
  conflev = 0.99,
  xlim,
  ylim,
  xlab = "Subset of size m",
  ylab = "Score test statistic",
  main = "Fan plot",
  col,
  lty,
  lwd = 2.5,
  lwd.env = 1,
  ...
)

Arguments

x

An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.

...

potential further arguments passed to lower level functions.

formula

a formula of the form y ~ x1 + x2 + ....

data

data frame from which variables specified in formula are to be taken.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

weights

an optional vector of weights to be used NOT USED YET.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The “factory-fresh” default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.

model

logical indicating if the model frame, is to be returned.

x.ret

logical indicating if the the model matrixis to be returned.

y.ret

logical indicating if the response is to be returned.

contrasts

an optional list. See the contrasts.arg of model.matrix.default.

offset

this can be used to specify an a priori known component to be included in the linear predictor during fitting. An offset term can be included in the

y

Response variable. A vector with n elements that contains the response variable.

intercept

wheather to use constant term (default is intercept=TRUE

plot

If plot=FALSE (default) no plot is produced. If plot=TRUE a fan plot is shown.

family

string which identifies the family of transformations which must be used. Possible values are c('BoxCox', 'YJ', 'YJpn', 'YJall'). Default is 'BoxCox'. The Box-Cox family of power transformations equals (y^{\lambda}-1)/\lambda for \lambda not equal to zero, and \log(y) if \lambda = 0. The Yeo-Johnson (YJ) transformation is the Box-Cox transformation of y+1 for nonnegative values, and of |y|+1 with parameter 2-\lambda for y negative. Remember that BoxCox can be used only if input y is positive. Yeo-Johnson family of transformations does not have this limitation. If family='YJpn' Yeo-Johnson family is applied but in this case it is also possible to monitor (in the output arguments Scorep and Scoren) the score test for positive and negative observations respectively. If family='YJall', it is also possible to monitor the joint F test for the presence of the two constructed variables for positive and negative observations.

la

values of the transformation parameter for which it is necessary to compute the score test. Default value of lambda is la=c(-1, -0.5, 0, 0.5, 1), i.e., the five most common values of lambda.

lms

how to find the initlal subset to initialize the search. If lms=1 (default) Least Median of Squares (LMS) is computed, else Least Trimmed Squares (LTS) is computed. If, lms is matrix of size p - 1 + intercept X length(la) it contains in column j=1,..., lenght(la) the list of units forming the initial subset for the search associated with la(j). In this case the input option nsamp is ignored.

alpha

the percentage (roughly) of squared residuals whose sum will be minimized, by default alpha=0.5. In general, alpha must between 0.5 and 1.

h

The number of observations that have determined the least trimmed squares estimator, scalar. h is an integer greater or equal than p but smaller then n. Generally h=[0.5*(n+p+1)] (default value).

init

Search initialization. It specifies the initial subset size to start monitoring the value of the score test. If init is not specified it will be set equal to: p+1, if the sample size is smaller than 40 or min(3 * p + 1, floor(0.5 * (n+p+1))), otherwise.

msg

Controls whether to display or not messages on the screen. If msg==TRUE messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.

nocheck

Whether to check input arguments. If nocheck=TRUE no check is performed on matrix y and matrix X. Notice that y and X are left unchanged. In other words the additional column of ones for the intercept is not added. The default is nocheck=FALSE.

nsamp

number of subsamples which will be extracted to find the robust estimator. If nsamp=0 all subsets will be extracted. They will be n choose p.

Remark: if the number of all possible subset is <1000 the default is to extract all subsets otherwise just 1000. If nsamp is a matrix of size r-by-p, it contains in the rows the subsets which sill have to be extracted. For example, if p=3 and nsamp=c(2,4,9; 23, 45, 49; 90, 34, 1) the first subset is made up of units c(2, 4, 9), the second subset of units c(23, 45, 49) and the third subset of units c(90 34 1).

conflev

Confidence level for the bands (default is 0.99, that is, we plot two horizontal lines corresponding to values -2.58 and 2.58).

xlab

A label for the X-axis, default is 'Subset size m'

ylab

A label for the Y-axis, default is 'Score test statistic'

main

A label for the title, default is 'Fan plot'

xlim

Minimum and maximum for the X-axis

ylim

Minimum and maximum for the Y-axis

lwd

The line width of the curves which contain the score test, a positive number, default is lwd=2

lwd.env

The line width of the lines associated with the envelopes, a positive number, default is lwd.env=1

trace

Whether to print intermediate results. Default is trace=FALSE.

col

a vector specifying the colors for the lines, each one corresponding to a la value. if length(col) < length(la), the colors will be recycled.

lty

a vector specifying the line types for the lines, each one corresponding to a la value. if length(col) < length(la), the colors will be recycled.

Value

An S3 object of class fsrfan.object will be returned which is basically a list containing the following elements:

  1. la: vector containing the values of lambda for which fan plot is constructed

  2. bs: matrix of size p X length(la) containing the units forming the initial subset for each value of lambda

  3. Score: a matrix containing the values of the score test for each value of the transformation parameter:

    • 1st col = fwd search index;

    • 2nd col = value of the score test in each step of the fwd search for la[1]

    • ...

  4. Scorep: matrix containing the values of the score test for positive observations for each value of the transformation parameter.

    Note: this output is present only if input option family='YJpn' or family='YJall'.

  5. Scoren: matrix containing the values of the score test for negative observations for each value of the transformation parameter.

    Note: this output is present only if input option 'family' is 'YJpn' or 'YJall'.

  6. Scoreb: matrix containing the values of the score test for the joint presence of both constructed variables (associated with positive and negative observations) for each value of the transformation parameter. In this case the reference distribution is the F with 2 and subset_size - p degrees of freedom.

    Note: this output is present only if input option family='YJall'.

  7. Un: a three-dimensional array containing length(la) matrices of size retnUn=(n-init) X retpUn=11. Each matrix contains the unit(s) included in the subset at each step in the search associated with the corresponding element of la.

    REMARK: at each step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one.

Author(s)

FSDA team, valentin.todorov@chello.at

References

Atkinson, A.C. and Riani, M. (2000), Robust Diagnostic Regression Analysis Springer Verlag, New York.

Atkinson, A.C. and Riani, M. (2002), Tests in the fan plot for robust, diagnostic transformations in regression, Chemometrics and Intelligent Laboratory Systems, 60, pp. 87–100.

Atkinson, A.C. Riani, M. and Corbellini A. (2019), The analysis of transformations for profit-and-loss data, Journal of the Royal Statistical Society, Series C, "Applied Statistics", 69, pp. 251–275. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1111/rssc.12389")}

Atkinson, A.C. Riani, M. and Corbellini A. (2021), The Box-Cox Transformation: Review and Extensions, Statistical Science, 36(2), pp. 239–255. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1214/20-STS778")}.

Examples


## Not run: 
   data(wool)
   XX <- wool
   y <- XX[, ncol(XX)]
   X <- XX[, 1:(ncol(XX)-1), drop=FALSE]

   out <- fsrfan(X, y)                   # call 'fsrfan' with all default parameters
   out <- fsrfan(cycles~., data=wool)    # use the formula interface

   set.seed(10)
   out <- fsrfan(cycles~., data=wool, plot=TRUE) # call 'fsrfan' and produce the plot
   plot(out)                               # use the plot method on the fsrfan object
   plot(out, conflev=c(0.9, 0.95, 0.99))   # change the confidence leel in the plot method

 ##======================
 ##
 ##  fsrfan() with all default options.

 ##  Store values of the score test statistic for the five most common
 ##  values of $\lambda$. Produce also a fan plot and display it on the screen.
 ##  Common part to all examples: load 'wool' data set.
 data(wool)
 head(wool)
 dim(wool)
 ##  The function fsrfan() stores the score test statistic.
 ##  In this case we use the five most common values of lambda are considered

 out <- fsrfan(cycles~., data=wool)
 plot(out)
 ##  fanplot(out)                # Not yet implemented in fsdaR

 ##  The fan plot shows the log transformation is diffused throughout the data
 ##  and does not depend on the presence of particular observations.
 ##======================
 ##
 ##  Example specifying 'lambda'.
 ##      Produce a fan plot for each value of 'lambda' in the vector 'la'.
 ##      Extract in matrix 'Un' the units which entered the search in each step

     data(wool)
     out <- fsrfan(cycles~., data=wool, la=c(-1, -0.5, 0, 0.5), plot=TRUE)
     plot(out)

     out$Un[,2,]

 ##======================
 ##  Example specifying the confidence level and the initial starting point for monitoring.
 ##  Construct the fan plot specifying the confidence level and the initial starting point
 ##  for monitoring.
     data(wool)
     out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, conflev=0.95, plots=TRUE)
     plot(out, conflev=0.95)

 ##=====================
 ##  Example with starting point based on LTS.
 ##  Extract all subsamples, construct a fan plot specifying the confidence level
 ##  and the initial starting point for monitoring based on p+2 observations,
 ##  strong line width for lines associated with the confidence bands.
     data(wool)
     out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, lms=0,
         lwd.env=3, plot=TRUE)
     plot(out, lwd.env=3)

 ##=====================
 ##  Fan plot using the loyalty cards data.
 ##  In this example, 'la' is the vector contanining the most common values
 ##  of the transformation parameter.
 ##  Store the score test statistics for the specified values of lambda
 ##  and automatically produce the fan plot
     data(loyalty)
     head(loyalty)
     dim(loyalty)

 ##  la is a vector contanining the most common values of the transformation parameter
     out <- fsrfan(amount_spent~., data=loyalty, la=c(0, 1/3, 0.4, 0.5),
           init=ncol(loyalty)+1, plot=TRUE, lwd=3)
     plot(out, lwd=3)

 ##  The fan plot shows that even if the third root is the best value of the transformation
 ##  parameter at the end of the search, in earlier steps it lies very close to the upper
 ##  rejection region. The best value of the transformation parameter seems to be the one
 ##  associated with la=0.4, which is always the confidence bands but at the end of search,
 ##  due to the presence of particular observations it goes below the lower rejection line.

 ##=====================
 ##  Compare BoxCox with Yeo and Johnson transformation.
 ##  Store values of the score test statistic for the five most common
 ##  values of lambda. Produce also a fan plot and display it on the screen.
 ##  Common part to all examples: load wool dataset.

     data(wool)

     ##  Store the score test statistic using Box Cox transformation.
     outBC <- fsrfan(cycles~., data=wool, nsamp=0)

     ##  Store the score test statistic using Yeo and Johnson transformation.
     outYJ <- fsrfan(cycles~., data=wool, family="YJ", nsamp=0)

     ## Not yet fully implemented
     ##  fanplot(outBC, main="Box Cox")
     ##  fanplot(outYJ,main="Yeo and Johnson")

     plot(outBC, main="Box Cox")
     plot(outYJ, main="Yeo and Johnson")

     cat("\nMaximum difference in absolute value: ",
         max(max(abs(outYJ$Score - outBC$Score), na.rm=TRUE)), "\n")




 ##======================
   ## Call 'fsrfan' with Yeo-Johnson (YJ) transformation
   out <- fsrfan(cycles~., data=wool, family="YJ")
   plot(out)


## End(Not run)


fsdaR documentation built on March 31, 2023, 8:18 p.m.