distribution quantiles

Description

Parametric quantiles of distributions fitted to a sample.

Usage

distLquantile(x = NULL, probs = c(0.8, 0.9, 0.99), truncate = 0,
  threshold = berryFunctions::quantileMean(dlf$dat_full, truncate),
  selection = NULL, dlf = NULL, order = TRUE, returnlist = FALSE,
  empirical = TRUE, weighted = empirical, gpd = empirical,
  addinfo = FALSE, speed = TRUE, plot = FALSE, plotargs = NULL,
  quiet = FALSE, ssquiet = quiet, ttquiet = quiet, ...)

Arguments

x

Sample for which parametric quantiles are to be calculated. If it is NULL (the default), dat from dlf is used. DEFAULT: NULL

probs

Numeric vector of probabilities with values in [0,1]. DEFAULT: c(0.8,0.9,0.99)

truncate

Number between 0 and 1 (proportion of sample discarded). Censored quantile: fit to highest values only (truncate lower proportion of x). Probabilities are adjusted accordingly. DEFAULT: 0

threshold

POT cutoff value. If you want correct percentiles, set this only via truncate, see Details of q_gpd. DEFAULT: quantileMean(x, truncate)

selection

Distribution type, e.g. "gev" or "wak", see dist.list in lmomco. Can be a vector. If NULL (the default), all types present in dlf$parameter are used. DEFAULT: NULL

dlf

dlf object described in extremeStat. Use this to save computing time for large datasets where you already have dlf. DEFAULT: NULL

order

Sort results by GOF? If TRUE (the default) and length(selection)>1, the output is ordered by dlf$gof, else by order of appearance in selection (or dlf$parameter). DEFAULT: TRUE

returnlist

Return full dlflist with output attached as element quant? If FALSE (the default), just the matrix with quantile estimates is returned. DEFAULT: FALSE

empirical

Add empirical quantileMean in the output matrix and vertical lines? DEFAULT: TRUE

weighted

Include weighted averages across distribution functions in the output? DEFAULT: empirical, so all additional options can be excluded with emp=FALSE.

gpd

Include GPD quantile estimation via q_gpd? DEFAULT: empirical

addinfo

Should information like sample size be rbinded to the output? DEFAULT: FALSE

speed

Compute q_gpd only for fast methods? Don't accidentally set this to FALSE in simulations or with large datasets! DEFAULT: TRUE

plot

Should distLplot be called? DEFAULT: FALSE

plotargs

List of arguments to be passed to distLplot like qlines, qheights, qrow, qlinargs, nbest, cdf, ...

quiet

Suppress notes? DEFAULT: FALSE

ssquiet

Suppress sample size notes? DEFAULT: quiet

ttquiet

Suppress truncation!=threshold note? Note that q_gpd is called with ttquiet=TRUE. DEFAULT: quiet

...

Arguments passed to distLfit (and distLplot if plot=TRUE).
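
To make the default of the threshold argument more concrete, here is a minimal sketch in base R. It assumes that berryFunctions::quantileMean is essentially an unweighted average over the nine quantile types of stats::quantile. The helper quantile_mean_sketch is hypothetical and not part of any package.

```r
# Hypothetical stand-in for berryFunctions::quantileMean (assumption:
# unweighted mean over the nine quantile types of stats::quantile).
quantile_mean_sketch <- function(x, prob) {
  mean(sapply(1:9, function(type)
    quantile(x, probs = prob, type = type, names = FALSE)))
}

set.seed(2)                      # arbitrary seed, for reproducibility only
x <- rexp(200)                   # illustrative sample, not package data
truncate  <- 0.8                 # discard the lower 80% of the sample
threshold <- quantile_mean_sketch(x, truncate)

# Share of values above the cutoff; should be close to 1 - truncate
mean(x > threshold)
```

Setting the threshold only via truncate keeps the cutoff and the discarded proportion consistent with each other, which is what the note on correct percentiles refers to.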

Details

Very high quantiles (99% and higher) need large sample sizes for quantile to yield a robust estimate. Theoretically, at least 1/(1-probs) values must be present, e.g. 10,000 for Q99.99%. With smaller sample sizes (e.g. n=35), empirical quantiles underestimate the actual (but unknown) quantile. Parametric quantiles need only small sample sizes. They have no systematic underestimation bias, but they have higher variability.
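
The underestimation of empirical quantiles at small sample sizes can be illustrated with a small simulation in base R. This is only a sketch under stated assumptions: the standard exponential distribution, the seed and the number of repetitions are arbitrary choices for illustration and do not come from the package.

```r
# Sketch: empirical high quantiles from small samples are biased low.
set.seed(1)                      # arbitrary seed, for reproducibility only
p        <- 0.999                # target probability
n        <- 35                   # small sample size
needed_n <- 1 / (1 - p)          # theoretical minimum sample size for p
true_q   <- qexp(p)              # known quantile of the standard exponential

# Empirical quantile of many small samples from the same distribution
emp_q <- replicate(2000, quantile(rexp(n), probs = p, names = FALSE))

# Compare the true quantile with the average empirical estimate
c(needed_n = needed_n, true = true_q, mean_empirical = mean(emp_q))
```

Because the empirical estimate can never exceed the sample maximum, its average stays well below the true quantile whenever n is much smaller than 1/(1-probs).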

Value

Matrix with distribution quantile values (with NAs for probs below truncate),
or, if returnlist=TRUE, a dlf list as described in extremeStat.
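
The NAs for probs below truncate follow from the probability adjustment mentioned under truncate. A minimal sketch of such a rescaling, assuming the usual formula (probs - truncate) / (1 - truncate), looks as follows. The helper adjust_probs is hypothetical and not the package's internal code.

```r
# Hypothetical helper, not part of extremeStat: rescale probabilities for a
# distribution fitted only to the upper (1 - truncate) share of the sample.
adjust_probs <- function(probs, truncate) {
  adj <- (probs - truncate) / (1 - truncate)
  adj[probs < truncate] <- NA    # no estimate possible below the cutoff
  adj
}

# With truncate = 0.8, the 90% quantile of the full sample corresponds to
# the median of the distribution fitted to the top 20% of the values.
adj <- adjust_probs(c(0.5, 0.8, 0.9, 0.99), truncate = 0.8)
adj
```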

Note

NAs are always removed from x in distLfit.

Author(s)

Berry Boessenkool, berry-b@gmx.de, March + July 2015, Feb 2016

References

On GPD: http://stats.stackexchange.com/questions/69438

See Also

q_gpd, distLfit.
Xian Zhou, Liuquan Sun and Haobo Ren (2000): Quantile estimation for left truncated and right censored data, Statistica Sinica 10, http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n411.pdf
require("truncdist")

Examples

data(annMax) # Annual Discharge Maxima (streamflow)

distLquantile(annMax, emp=FALSE) # several distribution functions in lmomco
distLquantile(annMax, truncate=0.8, probs=0.95) # POT (annMax already block maxima)
distLquantile(annMax, probs=0.95, plot=TRUE, qlinargs=list(lwd=3), nbest=5, breaks=10)
# Parametric 95% quantile estimates range from 92 to 111!
# But the best fitting distributions all lie around 103.

# Compare Generalized Pareto Distribution (GPD) fitting methods
# Theoretically, the tails of distributions converge to the GPD
# q_gpd compares several R packages for fitting and quantile estimation:
dlq <- distLquantile(annMax, weight=FALSE, quiet=TRUE, probs=0.97, returnlist=TRUE)
dlq$quant
distLplot(dlq, qlines=TRUE) # per default best fitting distribution functions
distLplot(dlq, qlines=TRUE, qrow=c("wak","q_gpd*"), nbest=14)
#pdf("dummy.pdf", width=9)
distLplot(dlq, qlines=TRUE, qrow="q_gpd*", nbest=13, xlim=c(102,110), 
          qlinargs=list(lwd=3), qheights=seq(0.02, 0.005, len=13))
#dev.off()


## Not run: 
## Taken out from CRAN package check because it's slow

# weighted distribution quantiles are calculated by different weighting schemes:
dlf <- distLfit(annMax)
distLgofPlot(dlf, ranks=FALSE, weights=TRUE)

# If speed is important and parameters are already available, pass them via dlf:
distLquantile(dlf=dlf, probs=0:5/5, selection=c("wak","gev","kap"), order=FALSE)
distLquantile(dlf=dlf, truncate=0.3, returnlist=TRUE)$truncate

# censored (truncated, trimmed) quantile, Peak Over Threshold (POT) method:
qwak <- distLquantile(annMax, sel="wak", prob=0.95, plot=TRUE, ylim=c(0,0.06), emp=FALSE)
qwak2 <- distLquantile(annMax, sel="wak", prob=0.95, truncate=0.6, plot=TRUE,
                       addinfo=FALSE, add=TRUE, coldist="blue", empirical=FALSE)

# Simulation of truncation effect
library(lmomco)
#set.seed(42)
rnum <- rlmomco(n=1e3, para=dlf$parameter$gev)
myprobs <- c(0.9, 0.95, 0.99, 0.999)
mytrunc <- seq(0, 0.9, length.out=20)
trunceffect <- sapply(mytrunc, function(mt) distLquantile(rnum, selection="gev",
                             probs=myprobs, truncate=mt, plot=FALSE, quiet=TRUE,
                             progbars=FALSE, empirical=FALSE)["gev",])
# If more values are truncated, the function runs faster

op <- par(mfrow=c(2,1), mar=c(2,4.5,2,0.5), cex.main=1)
distLquantile(rnum, sel="gev", probs=myprobs, emp=FALSE, ylab="", xlab="", plot=TRUE)
distLquantile(rnum, sel="gev", probs=myprobs, emp=FALSE, addinfo=FALSE,
              truncate=0.3, add=TRUE, coldist=4, plot=TRUE)
legend("right", c("fitted GEV", "fitted with truncate=0.3"), lty=1, col=c(2,4),
       bg="white")
par(mar=c(3,4.5,3,0.5))
plot(mytrunc, trunceffect[1,], ylim=range(trunceffect), las=1, type="l",
     main=c("High quantiles of 1000 random numbers from gev distribution",
           "Estimation based on proportion of lower values truncated"),
     xlab="", ylab="parametrical quantile")
title(xlab="Proportion censored", mgp=c(1.8,1,0))
for(i in 2:4) lines(mytrunc, trunceffect[i,])
library("berryFunctions")
textField(rep(0.5,4), trunceffect[,11], paste0("Q",myprobs*100,"%") )
par(op)


set.seed(3); rnum <- rlmomco(n=1e3, para=dlf$parameter$gpa)
qd99 <- evir::quant(rnum, p=0.99, start=15, end=1000, ci=0.5, models=30)
axis(3, at=seq(-1000,0, length=6), labels=0:5/5, pos=par("usr")[3])
title(xlab="Proportion truncated", line=-3)
mytrunc <- seq(0, 0.9, length.out=30)
trunceffect <- sapply(mytrunc, function(mt) distLquantile(rnum, selection="gpa",
                      probs=0.99, truncate=mt, plot=FALSE, quiet=TRUE,
                      empirical=FALSE, gpd=TRUE))
lines(-1000*(1-mytrunc), trunceffect[1,], col=4)
lines(-1000*(1-mytrunc), trunceffect[2,], col=3) # interesting...
for(i in 3:13) lines(-1000*(1-mytrunc), trunceffect[i,], col=3)

# If you want the estimates only for one single truncation, use
q_gpd(rnum, probs=myprobs, truncate=0.5)


## End(Not run) # end dontrun