# hotspots: Hot spots and outliers

## Description

Calculates a hot spot or outlier cutoff for a statistical population based on deviation from the normal or t distribution. For the hot spot cutoff, the relative magnitude of the values is also taken into account to determine whether values are disproportionately large relative to other values. Thus, a value that is a statistical outlier is not necessarily a hot spot if other values are similarly large.
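The difference between an outlier and a hot spot can be illustrated with the robust root mean square formula given under Details. The following is a minimal sketch in base R only; `rrms_sketch` is a hypothetical helper written for this illustration, not a function of the package. Because `mad` is unchanged by adding a constant while the median is not, shifting every value upward inflates the scaling factor, which in turn moves the hot spot cutoff further above the median.

```r
# Hypothetical helper (not part of the hotspots package), following the
# rrms formula in the Details section with var.est = "mad".
rrms_sketch <- function(v) sqrt(median(v)^2 + mad(v)^2)

set.seed(42)
x <- rlnorm(100)       # skewed data with a few disproportionately large values
shifted <- x + 10      # same spread, but all values are now similarly large

# The spread estimate is the same for both data sets ...
isTRUE(all.equal(mad(x), mad(shifted)))

# ... but the scaling factor grows with the median.
rrms_sketch(x)
rrms_sketch(shifted)
```

With a larger scaling factor, a value must lie further above the median before it counts as disproportionately large, which is why the examples contrast `rlnorm(100)` with `rlnorm(100) + 10`.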

## Usage

```r
hotspots(x, p = 0.99, tail = "positive", distribution = "t",
         var.est = "mad")
outliers(x, p = 0.99, tail = "positive", distribution = "t",
         var.est = "mad", center.est = "mean")
```

## Arguments

| Argument | Description |
|---|---|
| `x` | a numeric vector |
| `p` | probability level of chosen distribution used for calculation of cutoff (between 0 and 1) |
| `tail` | determines whether cutoffs are calculated for positive numbers within `x`, negative numbers, or both. Defaults to `"positive"` but can also be `"negative"` or `"both"`. |
| `distribution` | statistical distribution used to calculate the hot spot or outlier cutoff. Defaults to `"t"` but can also be `"normal"`. Other distributions could be implemented through simple modifications to the source code. |
| `var.est` | character vector indicating the function to be used to estimate the level of variation within the data. Defaults to the robust measure `"mad"`. Non-robust measures such as `"sd"` may also be used, but result in greater variation in cutoff location. |
| `center.est` | character vector indicating the function to be used to center the data for identification of outliers. Defaults to `"mean"`. |
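The remark about robust versus non-robust choices for `var.est` can be checked with base R alone. The sketch below uses made-up numbers chosen for illustration and compares how `mad` and `sd` respond when a single extreme value is appended to a small sample.

```r
# Illustrative data (invented for this sketch, not from the package).
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 7)
x_extreme <- c(x, 500)   # same sample plus one extreme value

# Ratio of each spread estimate with and without the extreme value.
mad_ratio <- mad(x_extreme) / mad(x)
sd_ratio  <- sd(x_extreme) / sd(x)

mad_ratio   # changes only modestly
sd_ratio    # changes by orders of magnitude
```

Since the spread estimate enters both the scaling factor and the cutoff, an estimate that swings this much with one observation makes the cutoff location correspondingly less stable.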

## Details

This function first scales the data by dividing them by a robust version of the root mean square. The robust root mean square (`rrms`) is calculated as:

`rrms = sqrt(med(x)^2 + var.est(x)^2)`

where `var.est` is the user-specified function for estimating the level of variation within the data. Scaling the data in this way allows the scaled values to be compared with a statistical distribution, which in turn makes it possible to distinguish outliers that substantially influence the data from those that do not. For the `outliers` function, the data are scaled after centering them with the user-specified `center.est` function, which defaults to the mean. The hot spot or outlier cutoff (for positive values, negative values, or both) is then calculated as:

`cutoff = (med(x/rrms) + F^-1(p))*rrms`

where `med` denotes the median, `F` is a cumulative distribution function for the t or normal distribution (its inverse `F^-1` being a quantile function; e.g., `qt`), and `p` is the user-defined probability level at which `F^-1` is evaluated to set the cutoff.
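The two formulas above can be sketched in base R as follows. This is a hedged illustration for the positive tail only, not the package's source code: `sketch_cutoff` is a hypothetical name, and the degrees of freedom used for the t distribution (`n - 1`) are an assumption, since this page does not state them.

```r
# Hypothetical helper, NOT part of the hotspots package.
sketch_cutoff <- function(x, p = 0.99, distribution = "t", var.est = "mad") {
  x <- x[!is.na(x)]                       # drop missing values
  spread <- match.fun(var.est)(x)         # e.g. mad(x) or sd(x)
  rrms <- sqrt(median(x)^2 + spread^2)    # robust root mean square
  # Quantile F^-1(p). ASSUMPTION: df = n - 1 for the t distribution.
  q <- if (distribution == "t") qt(p, df = length(x) - 1) else qnorm(p)
  cutoff <- (median(x / rrms) + q) * rrms # cutoff on the original scale
  list(rrms = rrms, cutoff = cutoff)
}

set.seed(1)
x <- rlnorm(100)
res <- sketch_cutoff(x)
res$cutoff
x[x > res$cutoff]   # values above the sketched positive cutoff
```

Because `rrms` is positive, `median(x / rrms) * rrms` equals `median(x)`, so the sketched cutoff is the median plus the chosen quantile multiplied by `rrms`. The sketch does not attempt to reproduce the package's handling of negative tails or of centering in `outliers`.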

## Value

Returns an object of class `"hotspots"`. The functions `summary` and `plot` can be used to examine the properties of the cutoff. The function `disprop` can be used to calculate the level of disproportionality for each value in the data. An object of class `"hotspots"` is a list containing some or all of the following components:

| Component | Description |
|---|---|
| `x` | numeric input vector |
| `data` | vector with missing values (`NA`) removed |
| `distribution` | statistical distribution used to calculate the hot spot or outlier cutoff |
| `var.est` | function used to estimate the level of variation within the data |
| `p` | probability level of chosen distribution used for calculation of cutoff |
| `tail` | tail(s) of data for which cutoffs were calculated |
| `dataset_name` | character vector with name of input data |
| `rrms` | robust root mean square |
| `positive.cut` | calculated hot spot or outlier cutoff for positive values |
| `negative.cut` | calculated hot spot or outlier cutoff for negative values |
| `center.est` | function used to center the data for identification of outliers (only for the `outliers` function) |

## Author(s)

Anthony Darrouzet-Nardi

## See Also

`summary.hotspots`, `plot.hotspots`, `disprop`

## Examples

```r
library(hotspots)

# basic operation on lognormal data
rln100 <- hotspots(rlnorm(100))
summary(rln100)
plot(rln100)

# greater skew in data
rln100sd2 <- hotspots(rlnorm(100, sd = 2))
print(summary(rln100sd2), top = 5)
plot(rln100sd2)

# both tails on normally distributed data
n100 <- hotspots(rnorm(100), tail = "both")
summary(n100)
plot(n100)

# both tails on skewed data
rln100pn <- hotspots(c(rlnorm(50), rlnorm(50) * -1), tail = "both")
summary(rln100pn)
plot(rln100pn)

# importance of disproportionality on normally distributed data
# contrast with n100
n100p3 <- hotspots(n100$x + 3, tail = "both")
summary(n100p3)
plot(n100p3)

# importance of disproportionality on skewed data
# contrast with rln100
rln100p10 <- hotspots(rlnorm(100) + 10)
summary(rln100p10)
plot(rln100p10)

# outliers function ignores disproportionality
rln100p10o <- outliers(rlnorm(100) + 10)
summary(rln100p10o)
plot(rln100p10o)

# some alternative parameters
rln100a <- hotspots(rlnorm(100), p = 0.9, distribution = "normal", var.est = "sd")
summary(rln100a)
plot(rln100a)
```