Categorical Logspline Density

Share:

Description

clsd computes the logspline density, density derivative, distribution, and smoothed quantiles for a one (1) dimensional continuous variable using the approach of Racine (2013).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
clsd(x = NULL,
     beta = NULL,
     xeval = NULL,
     degree = NULL,
     segments = NULL,
     degree.min = 2,
     degree.max = 25,
     segments.min = 1,
     segments.max = 100,
     lbound = NULL,
     ubound = NULL,
     basis = "tensor",
     knots = "quantiles",
     penalty = NULL,
     deriv.index = 1,
     deriv = 1,
     elastic.max = TRUE,
     elastic.diff = 3,
     do.gradient = TRUE,
     er = NULL,
     monotone = TRUE,
     monotone.lb = -250,
     n.integrate = 500,
     nmulti = 1,
     method = c("L-BFGS-B", "Nelder-Mead", "BFGS", "CG", "SANN"),
     verbose = FALSE,
     quantile.seq = seq(.01,.99,by=.01),
     random.seed = 42,
     maxit = 10^5,
     max.attempts = 25,
     NOMAD = FALSE)

Arguments

x

a numeric vector of training data

beta

a numeric vector of coefficients (default NULL)

xeval

a numeric vector of evaluation data

degree

integer/vector specifying the polynomial degree of the B-spline basis for each dimension of the continuous x (default degree=2)

segments

integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous x (i.e. number of knots minus one) (default segments=1, i.e. Bezier curve)

segments.min,segments.max

when elastic.max=FALSE, the minimum/maximum segments of the B-spline basis for each of the continuous predictors (default segments.min=1,segments.max=100)

degree.min,degree.max

when elastic.max=FALSE the minimum/maximum degree of the B-spline basis for each of the continuous predictors (default degree.min=2, degree.max=25)

lbound,ubound

lower/upper bound for the support of the density. For example, if there is a priori knowledge that the density equals zero to the left of 0, and has a discontinuity at 0, the user could specify lbound = 0. However, if the density is essentially zero near 0, one does not need to specify lbound

basis

a character string (default basis="tensor") indicating whether the additive or tensor product B-spline basis matrix for a multivariate polynomial spline or generalized B-spline polynomial basis should be used

knots

a character string (default knots="quantiles") specifying where knots are to be placed. ‘quantiles’ specifies knots placed at equally spaced quantiles (equal number of observations lie in each segment) and ‘uniform’ specifies knots placed at equally spaced intervals

deriv

an integer l (default deriv=1) specifying whether to compute the univariate lth partial derivative for each continuous predictor (and difference in levels for each categorical predictor) or not and if so what order. Note that if deriv is higher than the spline degree of the associated continuous predictor then the derivative will be zero and a warning issued to this effect

deriv.index

an integer l (default deriv.index=1) specifying the index (currently only supports 1) of the variable whose derivative is requested

nmulti

integer number of times to restart the process of finding extrema of the cross-validation function from different (random) initial points (default nmulti=1)

penalty

the parameter to be used in the AIC criterion. The method chooses the number of degrees plus number of segments (knots-1) that maximizes 2*logl-penalty*(degree+segments). The default is to use the penalty parameter of log(n)/2 (2 would deliver standard AIC, log(n) standard BIC)

elastic.max,elastic.diff

a logical value/integer indicating whether to use ‘elastic’ search bounds such that the optimal degree/segment must lie elastic.diff units from the respective search bounds

do.gradient

a logical value indicating whether or not to use the analytical gradient during optimization (defaults to TRUE)

er

a scalar indicating the fraction of data range to extend the tails (default 1/log(n), see extendrange for further details)

monotone

a logical value indicating whether modify the standard B-spline basis function so that it is tailored for density estimation (default TRUE)

monotone.lb

a negative bound specifying the lower bound on the linear segment coefficients used when (monotone=FALSE)

n.integrate

the number of evenly spaced integration points on the extended range specified by er (defaults to 500)

method

see optim for details

verbose

a logical value which when TRUE produces verbose output during optimization

quantile.seq

a sequence of numbers lying in [0,1] on which quantiles from the logspline distribution are obtained

random.seed

seeds the random number generator for initial parameter values when optim is called

maxit

maximum number of iterations used by optim

max.attempts

maximum number of attempts to undertake if optim fails for any set of initial parameters for each value of nmulti

NOMAD

a logical value which when TRUE calls snomadr to determine the optimal degree and segments

Details

Typical usages are (see below for a list of options and also the examples at the end of this help file)

1
2
3
4
    
    model <- clsd(x)

    

clsd computes a logspline density estimate of a one (1) dimensional continuous variable.

The spline model employs the tensor product B-spline basis matrix for a multivariate polynomial spline via the B-spline routines in the GNU Scientific Library (http://www.gnu.org/software/gsl/) and the tensor.prod.model.matrix function.

When basis="additive" the model becomes additive in nature (i.e. no interaction/tensor terms thus semiparametric not fully nonparametric).

When basis="tensor" the model uses the multivariate tensor product basis.

Value

clsd returns a clsd object. The generic functions coef, fitted, plot and summary support objects of this type (er=FALSE plots the density on the sample realizations (default is 'extended range' data), see er above, distribution=TRUE plots the distribution). The returned object has the following components:

density

estimates of the density function at the sample points

density.er

the density evaluated on the ‘extended range’ of the data

density.deriv

estimates of the derivative of the density function at the sample points

density.deriv.er

estimates of the derivative of the density function evaluated on the ‘extended range’ of the data

distribution

estimates of the distribution function at the sample points

distribution.er

the distribution evaluated on the ‘extended range’ of the data

xer

the ‘extended range’ of the data

degree

integer/vector specifying the degree of the B-spline basis for each dimension of the continuous x

segments

integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous x

xq

vector of quantiles

tau

vector generated by quantile.seq or input by the user (lying in [0,1]) from which the quantiles xq are obtained

Usage Issues

This function should be considered to be in ‘beta’ status until further notice.

If smoother estimates are desired and degree=degree.min, increase degree.min to, say, degree.min=3.

The use of ‘regression’ B-splines can lead to undesirable behavior at the endpoints of the data (i.e. when monotone=FALSE). The default ‘density’ B-splines ought to be well-behaved in these regions.

Author(s)

Jeffrey S. Racine racinej@mcmaster.ca

References

Racine, J.S. (2013), “Logspline Mixed Data Density Estimation,” manuscript.

See Also

logspline

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
## Not run: 
## Old Faithful eruptions data histogram and clsd density

library(MASS)
data(faithful)
attach(faithful)

model <- clsd(eruptions)

ylim <- c(0,max(model$density,hist(eruptions,breaks=20,plot=FALSE)$density))

plot(model,ylim=ylim)

hist(eruptions,breaks=20,freq=FALSE,add=TRUE,lty=2)

rug(eruptions)

summary(model)

coef(model)

## Simulated data

set.seed(42)
require(logspline)

## Example - simulated data

n <- 250
x <- sort(rnorm(n))
f.dgp <- dnorm(x)

model <- clsd(x)

## Standard (cubic) estimate taken from the logspline package
## Compute MSEs

mse.clsd <- mean((fitted(model)-f.dgp)^2)

model.logspline <- logspline(x)

mse.logspline <- mean((dlogspline(x,model.logspline)-f.dgp)^2)

ylim <- c(0,max(fitted(model),dlogspline(x,model.logspline),f.dgp))

plot(model,
     ylim=ylim,
     sub=paste("MSE: logspline = ",format(mse.logspline),", clsd = ",
     format(mse.clsd)),
     lty=3,
     col=3)

xer <- model$xer

lines(xer,dlogspline(xer,model.logspline),col=2,lty=2)
lines(xer,dnorm(xer),col=1,lty=1)

rug(x)

legend("topright",c("DGP",
                    paste("Cubic Logspline Density (package `logspline', knots = ",
                          model.logspline$nknots,")",sep=""),
                    paste("clsd Density (degree = ", model$degree, ", segments = ",
                          model$segments,", penalty = ",round(model$penalty,2),")",sep="")),
       lty=1:3,
       col=1:3,
       bty="n",
       cex=0.75)

summary(model)

coef(model)

## Simulate data with known bounds

set.seed(42)
n <- 10000
x <- runif(n,0,1)

model <- clsd(x,lbound=0,ubound=1)

plot(model)

## End(Not run)