```r
library(mixem)
```
Modelling losses is a problem of frequent interest in finance and actuarial science. Completely monotone densities have properties that make them well suited to modelling loss distributions. Parametric approaches are popular, but they require rather strong assumptions about the data, and model misspecification can lead to serious inferential errors. Strictly nonparametric approaches tend to fit the data well but do not provide an appropriate amount of smoothing. To strike a balance between these two extremes, one can take a semiparametric approach. It can be shown that the completely monotone densities are exactly the mixtures of exponential distributions; although this places a parametric restriction on the estimation problem, no parametric assumption is required about the mixing measure. `mixem` implements the expectation-maximization (EM) algorithm to compute maximum likelihood estimates (MLEs) of completely monotone densities whether the data is ungrouped, grouped, or right-censored. These methods extend readily, via Fast Fourier Transform techniques, to common aggregate loss models.
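The mixture representation is worth making concrete: a completely monotone density can be written as $f(x) = \sum_j p_j \lambda_j e^{-\lambda_j x}$ for mixing proportions $p_j$ and rates $\lambda_j$. A minimal base-R sketch (the helper `dexpmix` and the parameter values are illustrative, not part of `mixem`):

```r
# Density of a finite exponential mixture: f(x) = sum_j p_j * lambda_j * exp(-lambda_j * x)
dexpmix <- function(x, lambda, p) {
  vapply(x, function(xi) sum(p * lambda * exp(-lambda * xi)), numeric(1))
}

lambda <- c(1/500, 1/5000, 1/50000)  # illustrative rates
p      <- c(0.2, 0.4, 0.4)           # mixing proportions (sum to 1)

# The mixture is a proper density: it integrates to 1
integrate(dexpmix, 0, Inf, lambda = lambda, p = p)$value
```

Since each exponential component integrates to 1 and the $p_j$ sum to 1, the mixture is itself a proper density, which the `integrate` call confirms numerically.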
`mixem` provides three main functions, `em_fit`, `em_group_fit`, and `em_censored_fit`, which obtain the MLEs of completely monotone densities when the data is ungrouped, grouped, and right-censored, respectively. In addition to the MLEs, each function returns an estimate of the optimal number of mixture components: a component is added to the mixture as long as the log-likelihood continues to increase.
In the plain vanilla situation where no grouping or censoring of the data is present, `em_fit` can be used to obtain the MLEs as well as an estimate of the optimal number of mixture components. `em_fit` implements the EM algorithm outlined in doi:10.1214/aos/1176345789.
Consider the following scenario, where $n = 500$ data points are generated from an exponential mixture distribution with rate parameters $\lambda_1 = \frac{1}{500}, \lambda_2 = \frac{1}{5000}, \lambda_3 = \frac{1}{50000}$ and corresponding mixing proportions $p_1 = 0.2, p_2 = 0.4, p_3 = 0.4$:
```r
library(ggplot2)
library(scales)

set.seed(123)
n <- 500
components <- sample(1:3, prob = c(0.2, 0.4, 0.4), size = n, replace = TRUE)
lambdas <- c(1/500, 1/5000, 1/50000)
x <- rexp(n, rate = lambdas[components])

ggplot(data = data.frame(x), aes(x)) +
  geom_histogram(aes(y = ..density..), bins = 40) +
  ggtitle("Sample Distribution") +
  theme(plot.title = element_text(hjust = 0.5))
```
Now we can call `em_fit` to obtain the MLEs:

```r
fit <- em_fit(x)
fit
```
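As a point of reference when inspecting the fit, the mean of an exponential mixture is $\sum_j p_j / \lambda_j$, which equals $22{,}100$ for the true parameters above. A quick check against the sample (re-running the simulation so the snippet is self-contained):

```r
# Same simulation as above; the theoretical mixture mean is sum_j p_j / lambda_j.
set.seed(123)
components <- sample(1:3, prob = c(0.2, 0.4, 0.4), size = 500, replace = TRUE)
lambdas <- c(1/500, 1/5000, 1/50000)
x <- rexp(500, rate = lambdas[components])

theoretical_mean <- sum(c(0.2, 0.4, 0.4) / lambdas)
c(sample_mean = mean(x), theoretical_mean = theoretical_mean)
```

The heavy right tail of the mixture means the sample mean of any one sample can sit well away from $22{,}100$.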
When data is right-censored, `em_censored_fit` can be called to obtain the MLEs. `em_censored_fit` implements the EM algorithm in doi:10.1016/0378-3758(94)00097-f.
Consider the same sample, where $n = 500$ data points are generated from an exponential mixture distribution with rate parameters $\lambda_1 = \frac{1}{500}, \lambda_2 = \frac{1}{5000}, \lambda_3 = \frac{1}{50000}$ and corresponding mixing proportions $p_1 = 0.2, p_2 = 0.4, p_3 = 0.4$, but now suppose the data is right-censored at $50000$. To obtain the MLEs we need to pass the uncensored observations `x_uncensored`, the censoring point `censoring_point`, and the number of right-censored data points `n_censored`.
```r
x_uncensored <- x[x <= 50000]
censoring_point <- 50000
n_censored <- sum(x > 50000)
em_censored_fit(x_uncensored, censoring_point, n_censored)
```
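Under the hood, a censored-data likelihood combines density terms for the uncensored points with survival terms for the censored ones: $\ell = \sum_i \log f(x_i) + n_c \log S(c)$, where $S(c) = \sum_j p_j e^{-\lambda_j c}$ is the mixture survival function. A self-contained sketch of this objective (the helper `censored_loglik` is illustrative, not part of `mixem`):

```r
# Censored-data log-likelihood for an exponential mixture:
#   l = sum over uncensored points of log f(x_i) + n_censored * log S(c),
# where S(c) = sum_j p_j * exp(-lambda_j * c) is the survival function at c.
censored_loglik <- function(x_unc, cens_point, n_cens, lambda, p) {
  dens <- vapply(x_unc, function(xi) sum(p * lambda * exp(-lambda * xi)), numeric(1))
  surv <- sum(p * exp(-lambda * cens_point))
  sum(log(dens)) + n_cens * log(surv)
}

# Illustrative call with the true simulation parameters
censored_loglik(x_unc = c(100, 2500, 12000), cens_point = 50000, n_cens = 2,
                lambda = c(1/500, 1/5000, 1/50000), p = c(0.2, 0.4, 0.4))
```

`em_censored_fit` maximizes an objective of this form over the $p_j$ and $\lambda_j$ via EM.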
For situations when the data is grouped, you can call `em_group_fit` to obtain the MLEs. `em_group_fit` implements the EM algorithm outlined in doi:10.2307/2531869.
As an example, we'll group the same data we've been working with into 150 equal-sized groups. Since `max(x)` falls just below $322{,}000$, we can set the maximum upper boundary at $322{,}000$ and proceed with the grouping as follows:
```r
boundaries <- 322000/150 * 0:150
x_grp <- cut(x, breaks = boundaries)
```
We can call `em_group_fit` on this data directly by specifying `data_type = "interval"`:

```r
em_group_fit(x_grp = x_grp, data_type = "interval")
```
If the data is in a cross-tab format, then choose `data_type = "cross tab"`:

```r
x_tab <- table(x_grp)
em_group_fit(x_grp = x_tab, data_type = "cross tab")
```
Data in tabular format is also supported. The table needs to be formatted so that the first two columns give the lower and upper bounds of each group and the third column gives the count of observations in each group.
Lower group boundary | Upper group boundary | Count
-------------------- | -------------------- | -----
0                    | 2,150                | 169
2,150                | 4,290                | 51
...                  | ...                  | ...
320,000              | 322,000              | 1
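One way to assemble such a table from the grouped sample is sketched below; `table_data` here is simply a three-column data frame (lower bound, upper bound, count), and the simulation is re-run so the snippet is self-contained:

```r
# Recreate the grouped sample, then assemble the three-column table
# (lower bound, upper bound, count) used for the "tabular" data type.
set.seed(123)
components <- sample(1:3, prob = c(0.2, 0.4, 0.4), size = 500, replace = TRUE)
x <- rexp(500, rate = c(1/500, 1/5000, 1/50000)[components])

boundaries <- 322000/150 * 0:150
x_grp <- cut(x, breaks = boundaries)

table_data <- data.frame(lower = head(boundaries, -1),
                         upper = tail(boundaries, -1),
                         count = as.vector(table(x_grp)))
head(table_data)
```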
Then `em_group_fit` can be called with `data_type = "tabular"`:

```r
em_group_fit(x_grp = table_data, data_type = "tabular")
```
In aggregate loss models we are interested in the distribution of the sum $S = X_1 + X_2 + \dots + X_m$, where the $X_i$ are independent and identically distributed with probability density function $f$. To estimate $f_S$, the probability density function of the random variable $S$, we can use the methodology outlined in doi:10.1080/10485252.2015.1041944, which uses the Fast Fourier Transform to estimate the $m$-fold convolution of the density $f$.
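The FFT idea itself can be sketched in a few lines of base R: discretize the density into probability masses on a grid, raise the discrete Fourier transform of the masses to the $m$-th power, and invert. For a single exponential the $m$-fold convolution is a Gamma density, which provides a check; the grid parameters `h` and `N` below are illustrative:

```r
# FFT-based m-fold convolution sketch: for a single Exp(rate) density the
# m-fold convolution is Gamma(shape = m, rate), which we check against.
m    <- 5
rate <- 1
h    <- 0.01          # grid step
N    <- 2^13          # grid size; xmax = N * h must cover the convolution's support
grid <- (0:(N - 1)) * h

masses <- dexp(grid, rate) * h
masses <- masses / sum(masses)          # normalize the discretized masses

conv_masses  <- Re(fft(fft(masses)^m, inverse = TRUE)) / N
conv_density <- conv_masses / h

max(abs(conv_density - dgamma(grid, shape = m, rate = rate)))  # small discretization error
```

`m_fold_convolution` applies the same idea to a fitted exponential mixture, with `xmax` and `N` controlling the grid.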
Recall from earlier that we used `em_fit` to estimate the completely monotone density $f$ parametrized by $\lambda_1 = \frac{1}{500}, \lambda_2 = \frac{1}{5000}, \lambda_3 = \frac{1}{50000}$ and corresponding mixing proportions $p_1 = 0.2, p_2 = 0.4, p_3 = 0.4$. If we now want to estimate the $m$-fold convolution of $f$ with $m = 10$, we simply do the following:

```r
density <- m_fold_convolution(m = 10, lambda = 1/fit$mean, p = fit$p,
                              xmax = 1e+6, N = 2^21)

ggplot(data = data.frame(density), aes(x = x, y = convolution)) +
  geom_line() +
  ggtitle("Convolution Density Function") +
  theme(plot.title = element_text(hjust = 0.5))
```
With `m_fold_convolution` in hand, we can estimate other common distributions used to model aggregate losses, such as the compound Poisson distribution. The compound Poisson model again considers the sum $S = X_1 + X_2 + \dots + X_N$, where the $X_i$ are independent and identically distributed, but here $N$ is a random variable following a $\mathrm{Poisson}(\lambda)$ distribution with $\mathbb{E}(N) = \lambda$.
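In the same Fourier terms, if $\varphi$ is the transform of the severity distribution, the compound Poisson transform is $\exp(\lambda(\varphi - 1))$. A self-contained base-R sketch with an $\mathrm{Exp}(1)$ severity, checked against the known mean $\mathbb{E}(S) = \lambda\,\mathbb{E}(X_1)$; the grid parameters are illustrative:

```r
# Compound Poisson via FFT: if phi is the DFT of the severity masses, the DFT
# of the compound Poisson masses is exp(lambda * (phi - 1)).
lam  <- 12            # Poisson mean
h    <- 0.02          # grid step
N    <- 2^15          # grid size
grid <- (0:(N - 1)) * h

sev_masses <- dexp(grid, rate = 1) * h   # Exp(1) severity, discretized
sev_masses <- sev_masses / sum(sev_masses)

phi       <- fft(sev_masses)
cp_masses <- Re(fft(exp(lam * (phi - 1)), inverse = TRUE)) / N

sum(grid * cp_masses)   # close to lambda * E(X) = 12, up to discretization bias
```

`compound_poisson` packages this computation for a fitted exponential mixture severity.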
Again we consider the example where the $X_i$ have a completely monotone density $f$ parametrized by $\lambda_1 = \frac{1}{500}, \lambda_2 = \frac{1}{5000}, \lambda_3 = \frac{1}{50000}$ and corresponding mixing proportions $p_1 = 0.2, p_2 = 0.4, p_3 = 0.4$. If we are interested in estimating the compound Poisson distribution with $\lambda = 12$, then a call to `compound_poisson` as follows does the trick:

```r
compound_density <- compound_poisson(poisson_mean = 12, lambda = 1/fit$mean,
                                     p = fit$p, xmax = 1.5e+6, N = 2^21)

ggplot(data = data.frame(compound_density), aes(x = x, y = density)) +
  geom_line() +
  ggtitle("Compound Poisson Density") +
  theme(plot.title = element_text(hjust = 0.5))
```