Nothing
# Documentation of `as_*()` -----------------------------------------------
#' Convert to pdqr-function
#'
#' Convert some function to be a proper pdqr-function of specific
#' [class][meta_class()], i.e. a function describing distribution with finite
#' support and finite values of probability/density.
#'
#' @param f Appropriate function to be converted (see Details).
#' @param support Numeric vector with two increasing elements describing desired
#' support of output. If `NULL` or any its value is `NA`, detection is done
#' using specific algorithms (see Details).
#' @param ... Extra arguments to `f`.
#' @param n_grid Number of grid points at which `f` will be evaluated (see
#' Details). Bigger values lead to better approximation precision, but worse
#' memory usage and evaluation speed (direct and in `summ_*()` functions).
#' @param n_sample Number of points to sample from `f` inside `as_r()`.
#' @param args_new List of extra arguments for [new_d()] to control [density()]
#' inside `as_r()`.
#'
#' @details General purpose of `as_*()` functions is to create a proper
#' pdqr-function of desired class from input which doesn't satisfy these
#' conditions. Here is described sequence of steps which are taken to achieve
#' that goal.
#'
#' If **`f` is already a pdqr-function**, `as_*()` functions properly update it
#' to have specific class. They take input's ["x_tbl" metadata][meta_x_tbl()]
#' and [type][meta_type()] to use with corresponding [new_*()][new-pdqr]
#' function. For example, `as_p(f)` in case of pdqr-function `f` is essentially
#' the same as `new_p(x = meta_x_tbl(f), type = meta_type(f))`.
#'
#' If **`f` is a function describing "honored" distribution**, it is detected
#' and output is created in predefined way taking into account extra arguments
#' in `...`. For more details see "Honored distributions" section.
#'
#' If **`f` is some other unknown function**, `as_*()` functions use heuristics
#' for approximating input distribution with a "proper" pdqr-function. Outputs
#' of `as_*()` can be only pdqr-functions of type "continuous" (because of
#' issues with support detection). It is assumed that `f` returns values
#' appropriate for desired output class of `as_*()` function and output type
#' "continuous". For example, input for `as_p()` should return values of some
#' continuous cumulative distribution function (monotonically non-increasing
#' values from 0 to 1). To manually create function of type "discrete", supply
#' data frame input describing it to appropriate `new_*()` function.
#'
#' General algorithm of how `as_*()` functions work for unknown function is as
#' follows:
#' - **Detect support**. See "Support detection" section for more details.
#' - **Create data frame input for `new_*()`**. The exact process differs:
#' - In `as_p()` equidistant grid of `n_grid` points is created inside
#' detected support. After that, input's values at the grid is taken as
#' reference points of cumulative distribution function used to
#' *approximate* density at that same grid. This method showed to work more
#' reliably in case density goes to infinity. That grid and density values
#' are used as "x" and "y" columns of data frame input for `new_p()`.
#' - In `as_d()` "x" column of data frame is the same equidistant grid is
#' taken as in `as_p()`. "y" column is taken as input's values at this grid
#' after possibly imputing infinity values. This imputation is done by
#' taking maximum from left and right linear extrapolations on mentioned
#' grid.
#' - In `as_q()`, at first inverse of input `f` function is computed on \[0;
#' 1\] interval. It is done by approximating it with piecewise-linear
#' function on \[0; 1\] equidistant grid with `n_grid` points, imputing
#' infinity values (which ensures finite support), and computing inverse of
#' approximation. This inverse of `f` is used to create data frame input
#' with `as_p()`.
#' - In `as_r()` at first d-function with `new_d()` is created based on the
#' same sample used for support detection and extra arguments supplied as
#' list in `args_new` argument. In other words, density estimation is done
#' based on sample, generated from input `f`. After that, its values are
#' used to create data frame with `as_d()`.
#' - **Use appropriate `new_*()` function** with data frame from previous step
#' and `type = "continuous"`. This step implies that all tails outside detected
#' support are trimmed and data frame is normalized to represent proper
#' piecewise-linear density.
#'
#' @section Honored distributions:
#'
#' For efficient workflow, some commonly used distributions are recognized as
#' special ("honored"). Those receive different treatment in `as_*()` functions.
#'
#' Basically, there is a manually selected list of "honored" distributions with
#' all their information enough to detect them. Currently that list has all
#' common univariate distributions [from 'stats' package][stats::Distributions],
#' i.e. all except multinomial and "less common distributions of test
#' statistics".
#'
#' "Honored" distribution is **recognized only if `f` is one of `p*()`, `d*()`,
#' `q*()`, or `r*()` function describing honored distribution and is supplied as
#' variable with original name**. For example, `as_d(dunif)` will be treated as
#' "honored" distribution but `as_d(function(x) {dunif(x)})` will not.
#'
#' After it is recognized that input `f` represents "honored" distribution,
#' **its support is computed based on predefined rules**. Those take into
#' account special features of distribution (like infinite support or infinite
#' density values) and supplied extra arguments in `...`. Usually output support
#' "loses" only around `1e-6` probability on each infinite tail.
#'
#' After that, for "discrete" type output `new_d()` is used for appropriate data
#' frame input and for "continuous" - `as_d()` with appropriate `d*()` function
#' and support. D-function is then converted to desired class with `as_*()`.
#'
#' @section Support detection:
#'
#' In case input is a function without any extra information, `as_*()` functions
#' must know which finite support its output should have. User can supply
#' desired support directly with `support` argument, which can also be `NULL`
#' (mean automatic detection of both edges) or have `NA` to detect only those
#' edges.
#'
#' Support is detected in order to preserve as much information as practically
#' reasonable. Exact methods differ:
#' - In `as_p()` support is detected as values at which input function is equal
#' to `1e-6` (left edge detection) and `1 - 1e-6` (right edge), which means
#' "losing" `1e-6` probability on each tail. **Note** that those values are
#' searched inside \[-10^100; 10^100\] interval.
#' - In `as_d()`, at first an attempt at finding one point of non-zero density
#' is made by probing 10000 points spread across wide range of real line
#' (approximately from `-1e7` to `1e7`). If input's value at all of them is
#' zero, error is thrown. After finding such point, cumulative distribution
#' function is made by integrating input with [integrate()][stats::integrate()]
#' using found point as reference (without this there will be poor accuracy of
#' `integrate()`). Created CDF function is used to find `1e-6` and `1 - 1e-6`
#' quantiles as in `as_p()`, which serve as detected support.
#' - In `as_q()` quantiles for 0 and 1 are probed for being infinite. If they
#' are, `1e-6` and `1 - 1e-6` quantiles are used respectively instead of
#' infinite values to form detected support.
#' - In `as_r()` sample of size `n_sample` is generated and detected support is
#' its range stretched by mean difference of sorted points (to account for
#' possible tails at which points were not generated). **Note** that this means
#' that original input `f` "demonstrates its randomness" only once inside
#' `as_r()`, with output then used for approximation of "original randomness".
#'
#' @return A pdqr-function of corresponding [class][meta_class()].
#'
#' @seealso [pdqr_approx_error()] for computing approximation errors compared to
#' some reference function (usually input to `as_*()` family).
#'
#' @examples
#' # Convert existing "proper" pdqr-function
#' set.seed(101)
#' x <- rnorm(10)
#' my_d <- new_d(x, "continuous")
#'
#' my_p <- as_p(my_d)
#'
#' # Convert "honored" function to be a proper pdqr-function. To use this
#' # option, supply originally named function.
#' p_unif <- as_p(punif)
#' r_beta <- as_r(rbeta, shape1 = 2, shape2 = 2)
#' d_pois <- as_d(dpois, lambda = 5)
#'
#' ## `pdqr_approx_error()` computes pdqr approximation error
#' summary(pdqr_approx_error(as_d(dnorm), dnorm))
#'
#' ## This will work as if input is unkonw function because of unsupported
#' ## variable name
#' my_runif <- function(n) {
#' runif(n)
#' }
#' r_unif_2 <- as_r(my_runif)
#' plot(as_d(r_unif_2))
#'
#' # Convert some other function to be a "proper" pdqr-function
#' my_d_quadr <- as_d(function(x) {
#' 0.75 * (1 - x^2)
#' }, support = c(-1, 1))
#'
#' # Support detection
#' unknown <- function(x) {
#' dnorm(x, mean = 1)
#' }
#' ## Completely automatic support detection
#' as_d(unknown)
#' ## Semi-automatic support detection
#' as_d(unknown, support = c(-4, NA))
#' as_d(unknown, support = c(NA, 5))
#'
#' ## If support is very small and very distant from zero, it probably won't
#' ## get detected in `as_d()` (throwing a relevant error)
#' \dontrun{
#' as_d(function(x) {
#' dnorm(x, mean = 10000, sd = 0.1)
#' })
#' }
#'
#' # Using different level of granularity
#' as_d(unknown, n_grid = 1001)
#' @name as-pdqr
NULL
# Honored distributions ---------------------------------------------------
as_honored_distr <- function(pdqr_class, f_name, f, support, ..., n_grid) {
distr_info <- honored_distr_info(f_name, f)
if (is.null(distr_info)) {
return(NULL)
}
f_class <- distr_info[["class"]]
if (pdqr_class != f_class) {
warning_collapse(
"You supplied a ", f_class, "-function `", f_name, "` to `as_",
pdqr_class, "()`. ",
"A ", pdqr_class, "-function will be created."
)
}
distr_supp <- honored_distr_supp(
distr = distr_info[["distr"]], q_fun = distr_info[["q_fun"]], ...
)
supp <- coalesce_pair(format_support(support), distr_supp)
if (distr_info[["type"]] == "discrete") {
# This approach assumes that honored "discrete" pdqr-functions have only
# integer "x" values. If in future it is not true, some (serious)
# refactoring should be made here.
x_vec <- supp[1]:supp[2]
x_tbl <- data.frame(x = x_vec, prob = distr_info[["d_fun"]](x_vec, ...))
res <- new_d(x_tbl, type = "discrete")
} else {
# This is a shallow recursion for `as_d.default()`. However, during this run
# there will be no match for honored distribtuion (there shouldn't be any
# name `distr_info[["d_fun"]]` as distribution function) and there will be
# no support detection as `supp` shouldn't have any `NA`s.
res <- as_d.default(
f = distr_info[["d_fun"]], support = supp, ..., n_grid = n_grid
)
}
as_pdqr_by_class(pdqr_class)(res)
}
honored_distr_info <- function(f_name, f) {
# Allow call with `package::` prefix. Also using `[1]` indexing because input
# `f_name` can be a vector in case of anonymous function like
# `function(x) {dunif(x)}`.
f_name <- sub("^.*::", "", f_name[1])
f_env <- environment(f)
f_package <- environmentName(f_env)
is_honored <- (f_name == honored_distrs[["fun"]]) &
(f_package == honored_distrs[["package"]])
if (sum(is_honored) == 0) {
return(NULL)
}
list(
distr = honored_distrs[["distr"]][is_honored],
type = honored_distrs[["type"]][is_honored],
class = honored_distrs[["class"]][is_honored],
d_fun = get(honored_distrs[["d_fun"]][is_honored], envir = f_env),
q_fun = get(honored_distrs[["q_fun"]][is_honored], envir = f_env)
)
}
honored_distr_supp <- function(distr, q_fun, ..., p = 1e-6) {
if (distr %in% c("binom", "hyper", "unif")) {
distr_edge_quantiles <- c(0, 1)
} else if (distr %in% c("geom", "exp")) {
distr_edge_quantiles <- c(0, 1 - p)
} else if (
distr %in% c(
# "discrete"
"nbinom", "pois",
# "continuous"
"beta", "chisq", "f", "gamma", "lnorm", "norm", "t", "weibull"
)
) {
# Putting "beta" distribution here because with large `shape1` and/or
# `shape2` parameters most probability lies far away from 0 and/or 1.
# In cases of distributions "chisq", "f", "gamma", "lnorm", "weibull" `p` is
# used for the left edge quantile instead of zero because there are
# parameters with which most probability lies far away from zero.
distr_edge_quantiles <- c(p, 1 - p)
} else if (distr %in% c("cauchy")) {
# Cutting more tails from Cauchy distribution because they are long and
# heavy.
big_p <- 1e3 * p
distr_edge_quantiles <- c(big_p, 1 - big_p)
} else {
stop_collapse('Distribution "', distr, '" is not honored properly.')
}
# `q_fun` represents quantile function of distribution. It should take
# quantiles as first argument and `...` as others
supp <- q_fun(distr_edge_quantiles, ...)
# Stretching is needed for "beauty" purpose and to correctly handle density
# going to infinity. In latter case 'pdqr' "infinity imputation" works better
# than stepping a little bit aside from edge.
stretch_to_total_supp(supp, total_supp = q_fun(c(0, 1), ...))
}
stretch_to_total_supp <- function(supp, total_supp, h = 1e-2) {
delta <- h * diff(supp)
if (supp[1] - total_supp[1] <= delta) {
supp[1] <- total_supp[1]
}
if (total_supp[2] - supp[2] <= delta) {
supp[2] <- total_supp[2]
}
supp
}
# Other -------------------------------------------------------------------
y_from_p_grid <- function(x, p) {
# It is assumed that `x` has no duplicates and sorted increasingly
# Values of density are computed as numerical derivatives at `x` points
p_lin_derivs <- diff(p) / diff(x)
n_int <- length(p_lin_derivs)
y <- 0.5 * c(
# Boundary derivatives need special treatment for accuracy reasons
3 * p_lin_derivs[1] - p_lin_derivs[2],
p_lin_derivs[-n_int] + p_lin_derivs[-1],
-p_lin_derivs[n_int - 1] + 3 * p_lin_derivs[n_int]
)
# Extra cutoffs to ensure positive `y`. Sometimes it is needed when `y` is
# small near edges.
y <- pmax(y, 0)
# As `y` is not exact solution, total integral will not equal 1 so
# renormalization should be done.
y / trapez_integral(x, y)
}
assert_as_def_args <- function(f, support, n_grid) {
if (dont_assert()) {
return(TRUE)
}
assert_missing(f, "distribution function")
assert_type(f, is.function)
# `support` in `as_*.default()` is allowed to be `NULL` or have `NA`s
if (!is.null(support)) {
assert_support(support, allow_na = TRUE)
}
assert_type(
n_grid, is_single_number,
type_name = "single number more than 2", min_val = 3
)
TRUE
}
assert_tot_prob <- function(tot_prob) {
if (tot_prob <= 0) {
stop_collapse("Total probability on supplied `support` isn't positive.")
}
TRUE
}
# Removes redundant rows from `x_tbl` corresponding to zero `y` values near the
# edges (beginning or end of data frame) in a way depending on type of "x_tbl"'s
# pdqr-function. It doesn't affect the computation of future p- and d-functions.
# This is needed to ensure that q-function, created based on the `x_tbl`,
# returns not extreme results (withing `as_*.default()` functions), as it
# returns **the smallest** value with cumulative probability not more than
# input. For example, if first two rows of `x_tbl` have `x = c(-100, 0)` with
# zero `y` than future q-function `q()` would give `q(0) = -100`, which is not
# desirable because that `-100` is usually a result of too wide input `support`.
remove_zero_edge_y <- function(x_tbl) {
d_vals <- x_tbl[[get_x_tbl_sec_col(x_tbl)]]
type <- get_type_from_x_tbl(x_tbl)
is_to_remove <- is_zero_tail(d_vals, type)
# Check is done for performance reasons
if (any(is_to_remove)) {
x_tbl <- x_tbl[!is_to_remove, ]
}
x_tbl
}
#' Check if element belongs to zero probability tail
#'
#' This is a utility function for detecting zero probability tails. Its input
#' `vec` is supposed to be one of "prob" or "y" column of "x_tbl" metadata
#' (depending on `type`). For "discrete" type element belongs to zero
#' probability tail if it is a part of group of zeros at the beginning or end of
#' input vector `vec`. For "continuous" the most center elements of groups
#' doesn't belong to zero probability tail because it is used to define positive
#' probability interval (due to piecewise-linear nature of density). For
#' example, for `vec = c(0, 0, 1, 0, 0)`:
#' - In case of "discrete" `type` elements 1, 2, 4, 5 are within zero
#' probability tails. And so output will be `c(TRUE, TRUE, FALSE, TRUE, TRUE)`.
#' - In case of "continuous" `type` only elements 1 and 5 are within zero
#' probability tails. It defines a triangular density from second to fourth
#' element. Output will be `c(TRUE, FALSE, FALSE, FALSE, TRUE)`.
#'
#' @param vec A numeric vector of either "prob" or "y" column of "x_tbl"
#' metadata.
#' @param type Pdqr type. Should be one of "discrete" or "continuous".
#'
#' @return A logical vector indicating if element belonds to zero probability
#' tail.
#'
#' @noRd
is_zero_tail <- function(vec, type) {
n <- length(vec)
is_zero <- vec == 0
# Check for first element being zero is made for performance reasons
if (is_zero[1]) {
left_is_zero <- cumsum(is_zero) == 1:n
if (type == "continuous") {
# Exclude the "most center" zero
left_is_zero <- left_is_zero &
duplicated.default(left_is_zero, fromLast = TRUE)
}
} else {
left_is_zero <- rep(FALSE, n)
}
# Check for last element being zero is made for performance reasons
if (is_zero[n]) {
right_is_zero <- (cumsum(is_zero[n:1]) == 1:n)[n:1]
if (type == "continuous") {
# Exclude the "most center" zero
right_is_zero <- right_is_zero & duplicated.default(right_is_zero)
}
} else {
right_is_zero <- rep(FALSE, n)
}
left_is_zero | right_is_zero
}
format_support <- function(support) {
if (is.null(support)) {
c(NA_real_, NA_real_)
} else {
support
}
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.