fsrfan: Robust transformations for regression
In fsdaR: Robust Data Analysis Through Monitoring and Dynamic Visualization

fsrfan

R Documentation

Robust transformations for regression

Description

The transformations for negative and positive responses were determined by Yeo and Johnson (2000) by imposing the smoothness condition that the second derivative of zYJ (\lambda) with respect to y be smooth at y = 0. However some authors, for example Weisberg (2005), query the physical interpretability of this constraint which is oftern violated in data analysis. Accordingly, Atkinson et al. (2019) and (2020) extend the Yeo-Johnson transformation to allow two values of the transformations parameter: \lambda_N for negative observations and \lambda_P for non-negative ones.

FSRfan monitors:

the t test associated with the constructed variable computed assuming the same transformation parameter for positive and negative observations fixed. In short we call this test, "global score test for positive observations".
the t test associated with the constructed variable computed assuming a different transformation for positive observations keeping the value of the transformation parameter for negative observations fixed. In short we call this test, "test for positive observations".
the t test associated with the constructed variable computed assuming a different transformation for negative observations keeping the value of the transformation parameter for positive observations fixed. In short we call this test, "test for negative observations".
the F test for the joint presence of the two constructed variables described in points 2) and 3).
the F likelihood ratio test based on the MLE of \lambda_P and \lambda_N. In this case the residual sum of squares of the null model bsaed on a single trasnformation parameter \lambda is compared with the residual sum of squares of the model based on data transformed data using MLE of \lambda_P and \lambda_N.

Usage

fsrfan(x, ...)

## S3 method for class 'formula'
fsrfan(
  formula,
  data,
  subset,
  weights,
  na.action,
  model = TRUE,
  x.ret = FALSE,
  y.ret = FALSE,
  contrasts = NULL,
  offset,
  ...
)

## Default S3 method:
fsrfan(
  x,
  y,
  intercept = TRUE,
  plot = FALSE,
  family = c("BoxCox", "YJ", "YJpn", "YJall"),
  la = c(-1, -0.5, 0, 0.5, 1),
  lms,
  alpha = 0.75,
  h,
  init,
  msg = FALSE,
  nocheck = FALSE,
  nsamp = 1000,
  conflev = 0.99,
  xlab,
  ylab,
  main,
  xlim,
  ylim,
  lwd = 2,
  lwd.env = 1,
  trace = FALSE,
  ...
)

## S3 method for class 'fsrfan'
plot(
  x,
  conflev = 0.99,
  xlim,
  ylim,
  xlab = "Subset of size m",
  ylab = "Score test statistic",
  main = "Fan plot",
  col,
  lty,
  lwd = 2.5,
  lwd.env = 1,
  ...
)

Arguments

`x`	An `n x p` data matrix (`n` observations and `p` variables). Rows of `x` represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
`...`	potential further arguments passed to lower level functions.
`formula`	a `formula` of the form `y ~ x1 + x2 + ...`.
`data`	data frame from which variables specified in `formula` are to be taken.
`subset`	an optional vector specifying a subset of observations to be used in the fitting process.
`weights`	an optional vector of weights to be used NOT USED YET.
`na.action`	a function which indicates what should happen when the data contain `NA`s. The default is set by the `na.action` setting of `options`, and is `na.fail` if that is unset. The “factory-fresh” default is `na.omit`. Another possible value is `NULL`, no action. Value `na.exclude` can be useful.
`model`	`logical` indicating if the model frame, is to be returned.
`x.ret`	`logical` indicating if the the model matrixis to be returned.
`y.ret`	`logical` indicating if the response is to be returned.
`contrasts`	an optional list. See the `contrasts.arg` of `model.matrix.default`.
`offset`	this can be used to specify an a priori known component to be included in the linear predictor during fitting. An `offset` term can be included in the
`y`	Response variable. A vector with `n` elements that contains the response variable.
`intercept`	wheather to use constant term (default is `intercept=TRUE`
`plot`	If `plot=FALSE` (default) no plot is produced. If `plot=TRUE` a fan plot is shown.
`family`	string which identifies the family of transformations which must be used. Possible values are `c('BoxCox', 'YJ', 'YJpn', 'YJall')`. Default is `'BoxCox'`. The Box-Cox family of power transformations equals `(y^{\lambda}-1)/\lambda` for `\lambda` not equal to zero, and `\log(y)` if `\lambda = 0`. The Yeo-Johnson (YJ) transformation is the Box-Cox transformation of `y+1` for nonnegative values, and of `\|y\|+1` with parameter `2-\lambda` for `y` negative. Remember that BoxCox can be used only if input y is positive. Yeo-Johnson family of transformations does not have this limitation. If `family='YJpn'` Yeo-Johnson family is applied but in this case it is also possible to monitor (in the output arguments `Scorep` and `Scoren`) the score test for positive and negative observations respectively. If `family='YJall'`, it is also possible to monitor the joint F test for the presence of the two constructed variables for positive and negative observations.
`la`	values of the transformation parameter for which it is necessary to compute the score test. Default value of lambda is `la=c(-1, -0.5, 0, 0.5, 1)`, i.e., the five most common values of lambda.
`lms`	how to find the initlal subset to initialize the search. If `lms=1` (default) Least Median of Squares (LMS) is computed, else Least Trimmed Squares (LTS) is computed. If, `lms` is matrix of size `p - 1 + intercept X length(la)` it contains in column `j=1,..., lenght(la)` the list of units forming the initial subset for the search associated with `la(j)`. In this case the input option `nsamp` is ignored.
`alpha`	the percentage (roughly) of squared residuals whose sum will be minimized, by default `alpha=0.5`. In general, alpha must between 0.5 and 1.
`h`	The number of observations that have determined the least trimmed squares estimator, scalar. `h` is an integer greater or equal than `p` but smaller then `n`. Generally `h=[0.5*(n+p+1)]` (default value).
`init`	Search initialization. It specifies the initial subset size to start monitoring the value of the score test. If `init` is not specified it will be set equal to: `p+1`, if the sample size is smaller than 40 or `min(3 * p + 1, floor(0.5 * (n+p+1)))`, otherwise.
`msg`	Controls whether to display or not messages on the screen. If `msg==TRUE` messages are displayed on the screen. If `msg=2`, detailed messages are displayed, for example the information at iteration level.
`nocheck`	Whether to check input arguments. If `nocheck=TRUE` no check is performed on matrix `y` and matrix `X`. Notice that `y` and `X` are left unchanged. In other words the additional column of ones for the intercept is not added. The default is `nocheck=FALSE`.
`nsamp`	number of subsamples which will be extracted to find the robust estimator. If `nsamp=0` all subsets will be extracted. They will be `n choose p`. Remark: if the number of all possible subset is <1000 the default is to extract all subsets otherwise just 1000. If `nsamp` is a matrix of size `r-by-p`, it contains in the rows the subsets which sill have to be extracted. For example, if `p=3` and `nsamp=c(2,4,9; 23, 45, 49; 90, 34, 1)` the first subset is made up of units `c(2, 4, 9)`, the second subset of units `c(23, 45, 49)` and the third subset of units `c(90 34 1)`.
`conflev`	Confidence level for the bands (default is 0.99, that is, we plot two horizontal lines corresponding to values -2.58 and 2.58).
`xlab`	A label for the X-axis, default is 'Subset size m'
`ylab`	A label for the Y-axis, default is 'Score test statistic'
`main`	A label for the title, default is 'Fan plot'
`xlim`	Minimum and maximum for the X-axis
`ylim`	Minimum and maximum for the Y-axis
`lwd`	The line width of the curves which contain the score test, a positive number, default is `lwd=2`
`lwd.env`	The line width of the lines associated with the envelopes, a positive number, default is `lwd.env=1`
`trace`	Whether to print intermediate results. Default is `trace=FALSE`.
`col`	a vector specifying the colors for the lines, each one corresponding to a `la` value. if `length(col) < length(la)`, the colors will be recycled.
`lty`	a vector specifying the line types for the lines, each one corresponding to a `la` value. if `length(col) < length(la)`, the colors will be recycled.

Value

An S3 object of class fsrfan.object will be returned which is basically a list containing the following elements:

la: vector containing the values of lambda for which fan plot is constructed
bs: matrix of size p X length(la) containing the units forming the initial subset for each value of lambda
Score: a matrix containing the values of the score test for each value of the transformation parameter:
- 1st col = fwd search index;
- 2nd col = value of the score test in each step of the fwd search for la[1]
- ...
Scorep: matrix containing the values of the score test for positive observations for each value of the transformation parameter.

Note: this output is present only if input option family='YJpn' or family='YJall'.
Scoren: matrix containing the values of the score test for negative observations for each value of the transformation parameter.

Note: this output is present only if input option 'family' is 'YJpn' or 'YJall'.
Scoreb: matrix containing the values of the score test for the joint presence of both constructed variables (associated with positive and negative observations) for each value of the transformation parameter. In this case the reference distribution is the F with 2 and subset_size - p degrees of freedom.

Note: this output is present only if input option family='YJall'.
Un: a three-dimensional array containing length(la) matrices of size retnUn=(n-init) X retpUn=11. Each matrix contains the unit(s) included in the subset at each step in the search associated with the corresponding element of la.

REMARK: at each step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one.

Author(s)

FSDA team, valentin.todorov@chello.at

References

Atkinson, A.C. and Riani, M. (2000), Robust Diagnostic Regression Analysis Springer Verlag, New York.

Atkinson, A.C. and Riani, M. (2002), Tests in the fan plot for robust, diagnostic transformations in regression, Chemometrics and Intelligent Laboratory Systems, 60, pp. 87–100.

Atkinson, A.C. Riani, M. and Corbellini A. (2019), The analysis of transformations for profit-and-loss data, Journal of the Royal Statistical Society, Series C, "Applied Statistics", 69, pp. 251–275. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1111/rssc.12389")}

Atkinson, A.C. Riani, M. and Corbellini A. (2021), The Box-Cox Transformation: Review and Extensions, Statistical Science, 36(2), pp. 239–255. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1214/20-STS778")}.

Examples


## Not run: 
   data(wool)
   XX <- wool
   y <- XX[, ncol(XX)]
   X <- XX[, 1:(ncol(XX)-1), drop=FALSE]

   out <- fsrfan(X, y)                   # call 'fsrfan' with all default parameters
   out <- fsrfan(cycles~., data=wool)    # use the formula interface

   set.seed(10)
   out <- fsrfan(cycles~., data=wool, plot=TRUE) # call 'fsrfan' and produce the plot
   plot(out)                               # use the plot method on the fsrfan object
   plot(out, conflev=c(0.9, 0.95, 0.99))   # change the confidence leel in the plot method

 ##======================
 ##
 ##  fsrfan() with all default options.

 ##  Store values of the score test statistic for the five most common
 ##  values of $\lambda$. Produce also a fan plot and display it on the screen.
 ##  Common part to all examples: load 'wool' data set.
 data(wool)
 head(wool)
 dim(wool)
 ##  The function fsrfan() stores the score test statistic.
 ##  In this case we use the five most common values of lambda are considered

 out <- fsrfan(cycles~., data=wool)
 plot(out)
 ##  fanplot(out)                # Not yet implemented in fsdaR

 ##  The fan plot shows the log transformation is diffused throughout the data
 ##  and does not depend on the presence of particular observations.
 ##======================
 ##
 ##  Example specifying 'lambda'.
 ##      Produce a fan plot for each value of 'lambda' in the vector 'la'.
 ##      Extract in matrix 'Un' the units which entered the search in each step

     data(wool)
     out <- fsrfan(cycles~., data=wool, la=c(-1, -0.5, 0, 0.5), plot=TRUE)
     plot(out)

     out$Un[,2,]

 ##======================
 ##  Example specifying the confidence level and the initial starting point for monitoring.
 ##  Construct the fan plot specifying the confidence level and the initial starting point
 ##  for monitoring.
     data(wool)
     out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, conflev=0.95, plots=TRUE)
     plot(out, conflev=0.95)

 ##=====================
 ##  Example with starting point based on LTS.
 ##  Extract all subsamples, construct a fan plot specifying the confidence level
 ##  and the initial starting point for monitoring based on p+2 observations,
 ##  strong line width for lines associated with the confidence bands.
     data(wool)
     out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, lms=0,
         lwd.env=3, plot=TRUE)
     plot(out, lwd.env=3)

 ##=====================
 ##  Fan plot using the loyalty cards data.
 ##  In this example, 'la' is the vector contanining the most common values
 ##  of the transformation parameter.
 ##  Store the score test statistics for the specified values of lambda
 ##  and automatically produce the fan plot
     data(loyalty)
     head(loyalty)
     dim(loyalty)

 ##  la is a vector contanining the most common values of the transformation parameter
     out <- fsrfan(amount_spent~., data=loyalty, la=c(0, 1/3, 0.4, 0.5),
           init=ncol(loyalty)+1, plot=TRUE, lwd=3)
     plot(out, lwd=3)

 ##  The fan plot shows that even if the third root is the best value of the transformation
 ##  parameter at the end of the search, in earlier steps it lies very close to the upper
 ##  rejection region. The best value of the transformation parameter seems to be the one
 ##  associated with la=0.4, which is always the confidence bands but at the end of search,
 ##  due to the presence of particular observations it goes below the lower rejection line.

 ##=====================
 ##  Compare BoxCox with Yeo and Johnson transformation.
 ##  Store values of the score test statistic for the five most common
 ##  values of lambda. Produce also a fan plot and display it on the screen.
 ##  Common part to all examples: load wool dataset.

     data(wool)

     ##  Store the score test statistic using Box Cox transformation.
     outBC <- fsrfan(cycles~., data=wool, nsamp=0)

     ##  Store the score test statistic using Yeo and Johnson transformation.
     outYJ <- fsrfan(cycles~., data=wool, family="YJ", nsamp=0)

     ## Not yet fully implemented
     ##  fanplot(outBC, main="Box Cox")
     ##  fanplot(outYJ,main="Yeo and Johnson")

     plot(outBC, main="Box Cox")
     plot(outYJ, main="Yeo and Johnson")

     cat("\nMaximum difference in absolute value: ",
         max(max(abs(outYJ$Score - outBC$Score), na.rm=TRUE)), "\n")




 ##======================
   ## Call 'fsrfan' with Yeo-Johnson (YJ) transformation
   out <- fsrfan(cycles~., data=wool, family="YJ")
   plot(out)


## End(Not run)

fsdaR documentation built on May 29, 2024, 5:35 a.m.