The main task of the package dsdp
is to estimate probability density
functions from a data set using a maximum likelihood method.
The models of density functions in use are familiar Gaussian
or exponential distributions with polynomial correction terms.
We call Gaussian distribution with a polynomial Gaussian-based model and
an exponential distribution with a polynomial Exponential-based model,
respectively.
dsdp
seeks parameters of Gaussian or exponential distributions
together with degrees of polynomials using a grid search, and coefficients of
polynomials using a variant of semidefinite programming(SDP) problems.
Detailed discussions of SDP problem formulations and this type of SDP problems are
found in other vignettes.
The outline of estimation procedure is as follows.
We will see each process step by step in the next section. Before we move on, please install and import the package if you haven't yet. Installation is done by
## Install from CRAN install.packages("dsdp")
Importing the package is done by
## Import dsdp library(dsdp)
This package requires ggplot2
for displaying histograms and
density functions.
ggplot2
is a part of tidyverse
,
a de facto standard for data wrangling in R.
Our plot
method returns ggplot2
objects, so if you plan to add the
title or change the labels in the graphs, it is better to import ggplot2
too.
## Import ggplot2 if necessary. library(ggplot2)
In this section, we will see estimation procedures in Gaussian-based model and Exponential-based model in details. Essentially, they are same in computations, yet there are subtle differences in practice.
The density function of Gaussian-based model is $$ p(x; \boldsymbol{\alpha}) \cdot N(x; \mu, \sigma^2), $$ where $p(x; \boldsymbol{\alpha})$ is a polynomial with a coefficient vector $\boldsymbol{\alpha}$, and $N(x; \mu, \sigma^2)$ is Gaussian distribution with mean $\mu$ and variance $\sigma^2$: $$ N(x;\mu, \sigma^2) := \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). $$
The aim of estimation is to find a good set of parameters: $\boldsymbol{\alpha}$, $\mu$, and $\sigma$. To this end, we first provide a coarse set of parameters of base functions, namely, $\mu$ and $\sigma$, along with degrees of polynomials, and then compute the coefficients of polynomials $\boldsymbol{\alpha}$, to get a rough idea of the model. And subsequently, refine the set of parameters and repeat above process until sufficient estimate is obtained.
We first create Gaussian-based model from a data set.
The name of R's S3 class for Gaussian-based model is gaussmodel
,
and we refer to the method of gaussmodel
as method.gaussmodel,
for example, summmary.gaussmodel
, plot.gaussmodel
, estimate.gaussmodel
,
func.gaussmodel
.
There are two scenarios for model creations. One is to create a model from only a data set, and the other is to create a model from a data set and its corresponding frequency data. Let's see model creations in examples.
In the first case, we use a data set mix2gauss$n200
, which contains 200 realizations
of bimodal mixed Gaussian distributions, to create R's S3 class gaussmodel
object
gm1
.
## Create gaussmodel object from a data set mix2gauss$n200 gm1 <- gaussmodel(data=mix2gauss$n200)
The object gm1
, an instance of a S3 class gaussmodel
, contains the data
and parameters to be estimated.
Similarly, in the second case, we use mix2gaussHist$n200p
for data points
and mix2gaussHist$n200f
for their corresponding frequencies, to create
gaussmodel
object gm2
.
## Create gaussmodel object from a data set mix2gaussHist$n200p and ## its frequencies mix2gaussHist$n200f gm2 <- gaussmodel(mix2gaussHist$n200p, mix2gaussHist$n200f)
A summary of gm1
is displayed:
## Display the summary of a data set summary(gm1)
As a name suggests, summary.gaussmodel
shows the basic statistics of a data set.
It prints out the quantiles of standardized data as well as original data.
Here standardization means rescaling of data so as to have a mean 0 and
a standard deviation 1.
The histogram of the data is displayed:
## Draw a histogram of the data set plot(gm1)
plot.gaussmodel
can plot scaled data as well as original data by setting
scaling=TRUE
.
Providing these parameter sets, we are now ready to estimate the model.
## Do estimation ## Output messages are suppressed for brevity gm1 <- estimate(gm1, deglist=deglist, mulist=mulist, sdlist=sdlist, scaling=TRUE)
The computation of the coefficients of the polynomials is done
for all of the combinations of the parameter sets
deglist
, mulist
, and sdlist
, 9 cases in this example.
By setting scaling=TRUE
, estimation is done for scaled data,
not for original data, as mentioned before.
The result is sorted according to
Akaike information criterion(AIC).
AIC is widely used criterion for model selection so as to avoid overfitting
by penalizing the number of free parameters.
Let's see the result of estimation.
## Show the summary of results up to 5 summary(gm1, nmax=5, estonly=TRUE)
(nmax=5
limits top 5 estimates, and estonly=TRUE
suppresses the basic statistics.)
The columns of deg
, mu1
and sig1
indicate the degree of polynomials,
mean, and standard deviation, respectively.
And the columns of mu
and sig
indicate the scaled mean and standard deviation,
respectively.
The column of aic
indicates AIC and that of accuracy
does the accuracy
of the underlying SDP solver, whose value around $1.0^{-7}$ is
sufficient for estimation under IEEE 754
double precision.
If sufficient accuracy is not achieved because of numerical difficulty,
set recompute=TRUE
and stepsize=c(0.4, 0.2)
, for example, and try recomputation.
## This is demonstration for recomputation ## Not Executed gm1 <- estimate(gm1, deglist=deglist, mulist=mulist, sdlist=sdlist, scaling=TRUE, recompute=TRUE, stepsize=c(0.4, 0.2))
The flag stepsize
indicates the vector of step sizes of the underlying SDP solver.
The smaller the values are, the better the chances of successful estimation, but
the slower the computation is.
The default value of stepsize
is c(0.5, 0.3)
, which is enough in many cases.
We will not discuss the implementation details here, but if the user sets
stepsize
by oneself,
the user should set the values smaller than default values and the length of two is enough.
The numbers 1,2,3,... in the leftmost column indicate the indices of estimates
ordered by AIC.
First row is the best estimate, second row is the second best, so on,
and these numbers can be used in plot.gaussmodel
or func.gaussmodel
to designate the estimates to be plotted or evaluated, respectively.
To see the graphs of estimated densities along with the histogram, simply type:
plot(gm1)
By default, plot.gaussmodel
plots up to best 4 estimated density functions.
In the legend, the color of graphs are displayed in increasing order of AIC.
For example, in this graph, the estimation with the degree 6, mean 0.96151,
standard deviation 0.76680 is the best one.
By setting scaling=TRUE
, it can plot a scaled data, like scaled graphs.
plot(gm1, scaling=TRUE)
Similarly, in the legend, the color of graphs are displayed in increasing order of AIC. For example, in this graph, the estimation with the degree 6, mean 0.5, standard deviation 0.75 is the best one.
We continue to estimate further by adding the degree 8 and refining
mulist=seq(0, 0.5, by=0.1)
and sdlist=seq(0.6, 0.9, by=0.1)
.
Here seq
command generates the vector starting from 0, incrementing by 0.1,
and ending with 0.5,
in case of seq(0, 0.5, by=0.1)
.
## Do estimation ## Output messages are suppressed for brevity gm1 <- estimate(gm1, c(4, 6, 8), seq(0, 0.5, by=0.1), seq(0.5, 1, by=0.1), scaling=TRUE)
Note that parameters already estimated are skipped in estimate.gaussmodel
and we omit argument names.
The summary of estimation is displayed:
## Show the summary of results up to 5 summary(gm1, nmax=5, estonly=TRUE)
The graphs are displayed:
plot(gm1)
We continue to do estimation by refining parameters and checking estimates and graphs.
## Do estimation ## Output messages are suppressed for brevity gm1 <- estimate(gm1, c(4, 6, 8), seq(0, 0.2, by=0.05), seq(0.6, 0.8, by=0.05), scaling=TRUE)
## Show the summary of results up to 5 summary(gm1, nmax=5, estonly=TRUE)
plot(gm1)
According to above results, we confine the degrees of polynomials only to 4,
and set mulist=seq(0, 0.2, by=0.025)
and sdlist=seq(0.7, 0.8, by=0.01)
.
## Do estimation ## Output messages are suppressed for brevity gm1 <- estimate(gm1, c(4, 6, 8), seq(0, 0.2, by=0.025), seq(0.7, 0.8, by=0.01), scaling=TRUE)
## Show the summary of results up to 5 summary(gm1, nmax=5, estonly=TRUE)
plot(gm1)
We stop doing estimation here.
Using plot.gaussmodel
, we can plot cumulative distributions
by setting cum=TRUE
:
plot(gm1, cum=TRUE)
For more details, see ?plot.gaussmodel
.
func.gaussmodel
computes the values of density and cumulative distribution of
desired estimate. For example,
x <- seq(-4, 4, by=0.1) ## Compute the density of 1st estimate y_pdf <- func(gm1, x, n=1) ## Compute the cumulative distribution of 1st estimate y_cdf <- func(gm1, x, cdf=TRUE, n=1)
Of course, we can compute the desired estimate by designating n=k
for kth
estimate shown in summary.gaussmodel
.
The density function of Exponential-based model is $$ p(x; \boldsymbol{\alpha}) \cdot \mathrm{Exp}(x; \lambda), $$ where $p(x; \boldsymbol{\alpha})$ is a polynomial with a coefficient vector $\boldsymbol{\alpha}$, and $\mathrm{Exp}(x; \lambda)$ is an exponential distribution with rate parameter $\lambda$: $$ \mathrm{Exp}(x;\lambda) := \lambda e^{-\lambda x}, \quad x \in S = [0, \infty). $$
The aim of estimation is to find a good set of parameters: $\boldsymbol{\alpha}$, $\lambda$. To this end, we first provide a coarse set of parameters of base functions, namely, $\lambda$, and a degree of polynomials, and then compute the coefficients of polynomials $\boldsymbol{\alpha}$, to get a rough idea of the model.
The creation of Exponential-based model from a data set is same as that of Gaussian-based model. We will show the two scenarios, one is to create a model from only a data set, and the other is to create a model from a data set and its corresponding frequency data, in sequel.
In the first case, we use a data set mixexpgamma$n200
, which contains 200
realizations of mixture of an exponential distribution and a gamma distribution,
to create R's S3 class expmodel
object
em1
.
em1 <- expmodel(mixexpgamma$n200)
The object em1
of a S3 class expmodel
contains the data
and parameters to be estimated.
Similarly, in the second case, we use mixExpGammaHist$n800p
for data points
and mixExpGammaHist$n800f
for their corresponding frequencies, to create
expmodel
object em2
.
em2 <- expmodel(mixExpGammaHist$n800p, mixExpGammaHist$n800f)
A summary of em1
is displayed:
## Display the summary of a data set summary(em1)
As a name suggests, summary.expmodel
shows the basic statistics of a data set.
It prints out the quantiles of scaled data as well as original data.
Here the scaling is to divide the data by the mean of the data.
The histogram of the data is displayed:
## Draw a histogram of the data set plot(em1)
plot.expmodel
can plot scaled data as well as original data by setting
scaling=TRUE
.
Before estimation, we need to provide a set of rate parameters, and degrees of polynomials, to compute the coefficients of polynomials.
## A vector of degrees of polynomials deglist <- c(2, 3, 4) ## A vector of rate parameters of exponential distributions lmdlist <- c(0.5, 1, 2, 4)
deglist
is a vector of degrees of polynomials, in this case
2, 3, 4
. In Exponential-based model, a positive integer up to around
20 is okay. Note that large degrees can cause numerical difficulty.
lmdlist
is a vector of means of exponential distributions,
so the element of lmdlist
should be positive.
Also note that the rate parameters to be passed to
an estimate.expmodel
method are applied to
internally scaled data, not original data.
Providing these parameter sets, we are now ready to estimate the model.
## Do estimation ## Output messages are suppressed for brevity em1 <- estimate(em1, deglist=deglist, lmdlist=lmdlist)
The computation of the coefficients of the polynomials is done
for all of the combinations of the parameter sets
deglist
, lmdlist
, 12 cases in this example.
The result is sorted according to Akaike information criterion(AIC)
Let's see the result of estimation.
## Show the summary of results up to 5 summary(em1, nmax=5, estonly=TRUE)
(nmax=5
limits top 5 estimates, and estonly=TRUE
suppresses the basic statistics.)
Next see the histogram.
plot(em1)
We continue to estimate further by adding parameters.
As we see in estimate.gaussmodel
, paramters already estimated are
skipped.
## Do estimation ## Output messages are suppressed for brevity em1 <- estimate(em1, c(3, 4, 5, 6), c(1, 2, 4, 8))
The summary of the estimation is:
## Show the summary of results up to 5 summary(em1, nmax=5, estonly=TRUE)
The graphs are:
plot(em1)
We further confine parameters as follows:
## Do estimation ## Output messages are suppressed for brevity em1 <- estimate(em1, c(5, 6), seq(3, 4, by=0.25))
And see the summary of estimation:
## Show the summary of results up to 5 summary(em1, nmax=5, estonly=TRUE)
The graphs are as follows:
plot(em1)
Theoretically, we can refine estimation as you like, but we stop estimation here.
Using plot.expmodel
, we can plot cumulative distributions
by setting cum=TRUE
:
plot(em1, cum=TRUE)
For more details, see ?plot.expmodel
.
func.expmodel
computes the values of density and cumulative distribution of
desired estimate. For example,
x <- seq(0, 14, by=0.1) ## Compute the density of 1st estimate y_pdf <- func(em1, x, n=1) ## Compute the cumulative distribution of 1st estimate y_cdf <- func(em1, x, cdf=TRUE, n=1)
Of course, we can compute the desired estimate by designating n=k
for kth
estimate shown in summary.expmodel
.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.