title: "The npfixedcomp Package" author: "Xiangjie Xue" date: "07 June, 2020" output: html_document: keep_md: yes
This is an R package that uses non-parametric methods to compute the mixing distribution.
Likelihood-based methods:

- NPMLE using a normal component with the support points fixed, estimating the proportion of 0 using a threshold function.
- NPMLE using a t component with or without the support points fixed, estimating the proportion of 0 using a threshold function.

Distance-based methods:

- Density estimation using the normal density under the squared error, Cramer-von Mises or Anderson-Darling criterion.
- Estimation of the proportion of 0 using the normal density under the Cramer-von Mises or Anderson-Darling criterion.
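A minimal usage sketch of both families of methods above, assuming the interface shown in the examples later in this document (the mu0/pi0 arguments and the returned mix component):

library(npfixedcomp)
# simulated two-component data: half null (mean 0), half non-null (mean 2)
set.seed(1)
x = rnorm(1000, mean = c(0, 2))
# likelihood-based: restricted NPMLE with the proportion at 0 fixed to 0.5
fit.ll = npnormfc.ll(x, mu0 = 0, pi0 = 0.5)
# distance-based: the same restriction under the Cramer-von Mises criterion
fit.cvm = npnormfc.cvm(x, pi0 = 0.5)
# estimate the proportion of 0 instead of fixing it
fit0 = findnpnormfc.ll(x)
fit0$mix$pr[fit0$mix$pt == 0]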
npfixedcomp-1.3.0: By default the family functions are not loaded into the global environment when the package is loaded. Moved all the R-based functions that depend on nspmix to ./inst/extdata/. Completed the large-scale functions for ll, ad and cvm.
npfixedcomp-1.2.5: Imported lsei.f and pnnls.f into the ./src/ directory. The full Rcpp implementation of the restricted NPMLE for the normal density is based on this NNLS code. The Rcpp version of the NPMLE solver is considerably faster than the other versions. See Latex (private git) commit 7600a93 on 16/03/2020 or later for details.
npfixedcomp-1.2.4: The direct implementation of npnormfc (makenpnormfc) is known to fail when G is generated from a Cauchy distribution. As a result, npnormfc.ll2 was written from scratch, and it works when G is generated from a Cauchy distribution. The new function has been observed to run slightly faster, but its numerical stability is yet to be determined.
npfixedcomp-1.2.3: Added the functions for estimating the proportion of 0 using the ad loss. The critical value in findnpnormdist1fc.ad is the p-value. The pad function is the direct integration in Anderson and Darling (1952), and the qad function uses a root-finding function. Adding the computations with weights is in the plan.
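A minimal sketch of the root-finding idea mentioned in the 1.2.3 entry above: a quantile function can be obtained by numerically inverting the corresponding distribution function, here with R's uniroot(). This is illustrative only and not the package's qad code; the commented pad() call is a hypothetical example.

# invert a generic cdf numerically to get quantiles (illustrative only)
quantile_from_cdf = function(p, cdf, interval = c(0, 1000)) {
  vapply(p, function(pp) {
    uniroot(function(q) cdf(q) - pp, interval = interval)$root
  }, numeric(1))
}
# hypothetical usage with the package's internal pad():
# quantile_from_cdf(0.95, function(q) npfixedcomp:::pad(q))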
npfixedcomp-1.2.2: Updated finding the proportion of 0 using the cvm loss. The crit.val in findnpnormdist1fc.cvm is the p-value thanks to [p/q]-cvm.
npfixedcomp-1.2.1: Added the non-parametric distance-based method for computing the mixing distribution under the Cramer-von Mises criterion. The npnormdist.cvm function is the distance-based equivalent of the likelihood-based npnorm. The fixed component is yet to be added.
npfixedcomp-1.2: Intends to add the capability to compute cnms with weights. A binning process is used; when binning is used, the NPMLE is only an approximation, so one should consider the trade-off between the accuracy of the solution and the speed.
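A minimal sketch of what such a binning could look like. This is not the package's internal binning code, and the relation between the bin width and the order argument used in the examples later (h = 2^order) is only an assumption.

# illustrative binning: snap observations to bin midpoints and count them,
# so the solver works with (midpoint, weight) pairs instead of raw data.
# the bin width h = 2^order is an assumed interpretation of the order argument.
bin_obs = function(x, order = -2) {
  h = 2^order
  mids = round(x / h) * h
  tab = table(mids)
  list(v = as.numeric(names(tab)),  # bin midpoints
       w = as.numeric(tab))         # weights (counts per bin)
}
b = bin_obs(rnorm(1e5, c(0, 2)))
length(b$v)  # far fewer points than observations, at the cost of approximation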
npfixedcomp-1.1: Since there is no cancellation error in addition, there is no need to use the LSE functions.
07/06/2020: Fixed errors in npnormadfcweighted and changed the code so that the compiler does not complain. Changes made for consistency.
28/05/2020: Fixed the code for npnormsq, npnormad and npnormllweighted. The strange behaviour of the ad function was solved by extending the range; in theory, the range of the support points under the distance-based methods is unknown.
27/05/2020: Added the family for computing the mixing distribution of sample correlation coefficients when the density is approximated by a normal density.
21/05/2020: Changed the stopping criterion for npnorm.ll and the algorithms for finding 0.
26/04/2020: The NPMLE for weighted observations is done via a discrete normal distribution. Changed the binning functions so that they are consistent.
24/04/2020: Moved all the functions with weighted observations. Trying to use a discrete normal distribution for this.
21/04/2020: Fixed the formulation of the ad functions. The LLT decomposition should be investigated. The weighted versions of the sq and cvm functions are done.
18/04/2020: Moved the solver to a header since it is easier to maintain. At the same time, solvers of order 0, 1 and 2 were written.
17/04/2020: Added gridpointsnpnorm for computing the local minima for the normal family.
14/04/2020: Finished writing the weighted versions of the maximum likelihood methods.
03/04/2020: Finished writing npnorm.ad, npnormfc.ad and findnpnorm.ad.
02/04/2020: Reverted the setting from 30/03/2020: it is not viable to find all the local minima in the optimisation. If no local minimum is found in one iteration, the algorithm terminates.
01/04/2020: Changed the optimisation tool in the cpp files so that it does not depend on RcppNumerical. For npt it uses the secant method, and for npnorm it uses Newton-Raphson.
30/03/2020: Across all the computing functions in Rcpp, the stopping criterion becomes: stop if the directional derivative is less than the tolerance. This makes more sense.
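For the likelihood-based families this directional derivative has the standard mixture-likelihood form $D(\theta; G) = \sum_i f(x_i; \theta)/f_G(x_i) - n$. A minimal R sketch of the check (illustrative only, not the package's internal C++ code):

# directional derivative of the log-likelihood at a mixing distribution
# (support points pt, probabilities pr) towards a point mass at theta
dir_deriv = function(theta, x, pt, pr, sd = 1) {
  fG = colSums(pr * outer(pt, x, function(p, xx) dnorm(xx, mean = p, sd = sd)))
  sum(dnorm(x, mean = theta, sd = sd) / fG) - length(x)
}
# stopping rule sketch: stop once the largest dir_deriv over candidate
# support points falls below the tolerance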
28/03/2020: Completed the function for finding the proportion at a given location for the log-likelihood methods.
27/03/2020: Wrote a function for estimating the initial mixing distribution.
26/03/2020: Renamed the original cvm functions to cvm2. Used the same template as npnormdist.sq and npnormdistfc.sq to write the corresponding cvm functions. Changed the inline function for collapsing. Fixed errors in the npnormdistfc.sq and npnormdistfc.cvm functions.
25/03/2020: Reshuffled code into place. Moved most functions from the previous version into the header and made them inline. Finished writing npnormdist.sq and npnormdistfc.sq. The performance can be seen below.
23/03/2020: Removed the dependency on Rcpp for the make functions. Removed the code for assigning the mixing distribution to a global variable in the find functions (this gave notes in R CMD check). Removed the R version of the restricted NPMLE (npnormfc.ll2).
22/03/2020: Added the nptfc.ll function for the t distribution.
18/03/2020: Massive improvements to sortmix and chklossfunction.
16/03/2020: Changed the version to 1.2.5. See the update notes for details.
14/03/2020: Copied lsei.f and pnnls.f from the package lsei to ./src/ for the C++ implementation of pnnls with a fixed sum.
05/03/2020: Made improvements to the functions that generate the initial mixing distribution when it is not specified via init. The new function first generates a mixing distribution according to nspmix and then subtracts the mixing density given mu0 and pi0 at the mid points. The new functions make the normal component functions faster; for the $t$ distribution, the improvement is to be determined, and the initial functions for the $t$ family are misspecified anyway.
03/03/2020: Changed the version to 1.2.4. See the update notes for details.
15/01/2020: Fixed the errors in the gradient functions for estimating the null proportions.
18/12/2019: Changed the version to 1.2.3. See the update notes for details.
13/12/2019: Suppressed the printing in the find* functions. Improvements made to the distance functions.
12/12/2019: The second version of the NPMLE is working, but considering changing the method for finding local extrema. Should the normal dnpnorm be used or the log-safe version?
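For reference, a minimal sketch of what a log-safe version of the mixture density could look like, using the log-sum-exp trick; this is illustrative only and not the package's dnpnorm.

# mixture density computed on the log scale via log-sum-exp to avoid
# underflow when every component density is tiny (illustrative only)
dnpnorm_logsafe = function(x, pt, pr, sd = 1, log = FALSE) {
  lw = outer(x, pt, function(xx, p) dnorm(xx, mean = p, sd = sd, log = TRUE)) +
    rep(log(pr), each = length(x))
  m = apply(lw, 1, max)
  out = m + log(rowSums(exp(lw - m)))
  if (log) out else exp(out)
}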
12/12/2019: Another version of npnormfc that does not depend on cnm. It is working in the sense that it gives a reasonable solution; however, this solution still has an issue: there is still a gap between the minimum directional derivative and 0. Possible reasons: the values passed into pnnqp or the condition checking.
10/12/2019: A slight change to npnormdist and npnormfc: the functions for plotting the gradient and for collapsing are now separate from the main function. Added the function for checking the updated pi0. More tests on these two functions are needed.
10/12/2019: Fixed the bug that the npnormfc and nptfc constructors cannot be used in cnm directly. makenpnormfc and makenptfc are still the preferred choice. The solution computed by cnm directly is not post-processed.
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_New Zealand.1252 LC_CTYPE=English_New Zealand.1252
## [3] LC_MONETARY=English_New Zealand.1252 LC_NUMERIC=C
## [5] LC_TIME=English_New Zealand.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.3 magrittr_1.5 tools_3.6.3 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.4.6 stringi_1.4.6 rmarkdown_2.2
## [9] knitr_1.28 stringr_1.4.0 xfun_0.14 digest_0.6.25
## [13] rlang_0.4.6 evaluate_0.14
For the prostate dataset, the degrees of freedom is 100, so it is plausible that the normal distribution can be used in the calculation.
library(npfixedcomp)
load(url("http://statweb.stanford.edu/~ckirby/brad/LSI/datasets-and-programs/data/prostz.RData"))
npnormfc.ll(prostz, pi0 = 0.9)
## $iter
## [1] 15
##
## $family
## [1] "npnorm"
##
## $max.gradient
## [1] 3.398953e-07
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -2.2733473 -0.8729892 0.0000000 1.1404941 2.7549436
##
## $mix$pr
## [1] 0.02212661 0.02801186 0.90000000 0.03652114 0.01334038
##
##
## $ll
## [1] -9286.442
##
## $dd0
## [1] -0.6736482
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
findnpnormfc.ll(prostz)
## $iter
## [1] 7
##
## $family
## [1] "npnorm"
##
## $max.gradient
## [1] 1.715528e-08
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -2.418859 0.000000 2.601970
##
## $mix$pr
## [1] 0.01902377 0.96298198 0.01799425
##
##
## $ll
## [1] -9290.353
##
## $dd0
## [1] -13.26656
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
A similar process can be done when the component density follows a $t$ distribution. In this example, 300 observations are generated, with 150 from ncp = 0 and 150 from ncp = 5. The degrees of freedom here is 10.
data = rt(300, df = 10, ncp = rep(c(0, 5), each = 150))
nptfc.ll(data, pi0 = 0.5, df = 10)
## $iter
## [1] 10
##
## $family
## [1] "npt"
##
## $max.gradient
## [1] 3.968128e-06
##
## $beta
## [1] 10
##
## $mix
## $mix$pt
## [1] -1.224998 0.000000 4.885200 5.915628 15.744576
##
## $mix$pr
## [1] 0.023867874 0.500000000 0.377036935 0.096742571 0.002352619
##
##
## $ll
## [1] -720.9072
##
## $dd0
## [1] -14.31792
##
## $convergence
## [1] 0
For the distance-based methods, the only distance losses available here are the Cramer-von Mises and Anderson-Darling criteria. Using the prostate data above,
npnormfc.ll(prostz)
## $iter
## [1] 23
##
## $family
## [1] "npnorm"
##
## $max.gradient
## [1] 7.725662e-06
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -2.3410759 -0.1570473 0.0000000 0.4348154 2.7338397
##
## $mix$pr
## [1] 0.02048200 0.69017611 0.00000000 0.27475588 0.01458601
##
##
## $ll
## [1] -9286
##
## $dd0
## [1] -0.03206196
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
npnormfc.cvm(prostz)
## $iter
## [1] 4
##
## $family
## [1] "npnormdist"
##
## $max.gradient
## [1] -1.435994e-12
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -4.4239178 -0.3811211 0.0000000 0.4106377 4.3783443
##
## $mix$pr
## [1] 0.003619654 0.516644580 0.000000000 0.474487217 0.005248548
##
##
## $ll
## [1] 0.05048828
##
## $dd0
## [1] 0.001208542
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
npnormfc.ad(prostz)
## $iter
## [1] 24
##
## $family
## [1] "npnormdist"
##
## $max.gradient
## [1] 1.127952e-05
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -3.1060327 -0.2749033 0.0000000 0.4678700 3.3436915
##
## $mix$pr
## [1] 0.008952014 0.613460823 0.000000000 0.369767435 0.007819728
##
##
## $ll
## [1] -6033.339
##
## $dd0
## [1] -0.005345347
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
The following example compares the efficiency of the NPMLE and the two distance-based methods. Only one sample with 1000 observations is calculated, hence the plot below might look weird in some way.
set.seed(1)
data = rnorm(1000, mean = c(0, 2))
x = qnorm(seq(1e-2, 1 - 1e-2, length = 101))
x = (x - min(x) * 1.01) / diff(range(x)) / 1.05
k1 = matrix(unlist(lapply(x, function(ddd){
result = npnormfc.ll(data, pi0 = ddd)
result2 = npnormfc.ad(data, pi0 = ddd)
result3 = npnormfc.cvm(data, pi0 = ddd)
c(result$ll,
-result2$ll,
result3$ll)
})), nrow = 3)
plot(x, exp(k1[1, ] - k1[1, 1]), type = "l", ylab = "Likelihood/p.value")
lines(x, npfixedcomp:::pad(k1[2, ] - length(data), lower.tail = FALSE), col = "red")
lines(x, npfixedcomp:::pcvm(k1[3, ] + 1 / 12 / length(data), lower = FALSE), col = "blue")
abline(v = 0.5)
The figure above shows the likelihood function for the NPMLE (black) and the p-values for the distance-based methods (red for Anderson-Darling and blue for Cramer-von Mises), plotted against the fixed probability at 0. The vertical line shows the true proportion of 0.
For version 1.2.5 and later, we can compare the computational speed of the restricted NPMLE among the implementation in the nspmix package, the direct implementation of the algorithm in R, and the Rcpp implementation.
mu0 = 0; pi0 = 0.4
set.seed(1)
data = rnorm(500, c(0, 4))
set.seed(2)
rdata = rt(500, ncp = c(0, 5), df = 13)
# comparison for finding restricted NPMLE (normal)
# nspmix reference : 109.89 (mean) and 108.77 (median) on macOS.
microbenchmark::microbenchmark(npnormfc.ll(data, mu0 = mu0, pi0 = pi0)) # Rcpp
## Unit: milliseconds
## expr min lq mean median
## npnormfc.ll(data, mu0 = mu0, pi0 = pi0) 64.2774 64.5123 64.8411 64.6829
## uq max neval
## 65.02435 66.9411 100
# comparison for finding pi0 (normal)
# nspmix reference : 1.681s (no binning)
datalarge = rnorm(1e5, c(0, 2))
system.time(r1 <- findnpnormfc.ll(datalarge))
## user system elapsed
## 191.65 24.25 216.08
# with order = -2 binning.
system.time(r2 <- findnpnormfc.ll(datalarge, order = -2))
## user system elapsed
## 1.77 0.00 1.77
r1$mix$pr[r1$mix$pt == 0];r2$mix$pr[r2$mix$pt == 0]
## [1] 0.5073351
## [1] 0.507552
system.time(r1 <- findnpnormfc.cvm(datalarge))
## user system elapsed
## 160.83 5.94 166.86
# with order = -2 binning.
system.time(r2 <- findnpnormfc.cvm(datalarge, order = -2))
## user system elapsed
## 1.43 0.00 1.42
r1$mix$pr[r1$mix$pt == 0];r2$mix$pr[r2$mix$pt == 0]
## [1] 0.506271
## [1] 0.6016159
system.time(r1 <- findnpnormfc.ad(datalarge))
## user system elapsed
## 397.88 19.16 417.37
# with order = -2 binning.
system.time(r2 <- findnpnormfc.ad(datalarge, order = -2))
## user system elapsed
## 1.60 0.00 1.61
r1$mix$pr[r1$mix$pt == 0];r2$mix$pr[r2$mix$pt == 0]
## [1] 0.506119
## [1] 0.5063576
# comparison for restricted NPMLE (t distribution)
# nspmix reference 7.649s (no binning)
system.time(nptfc.ll(rdata, mu0 = mu0, pi0 = pi0, df = 13)) # Rcpp
## user system elapsed
## 5.64 0.00 5.64
# comparisons for distance NPMDE (normal)
# R-implementation reference mean 26.21 (cvm) / 33.44 (cvmfc) / 134.53 (ad) / 123.83 (adfc) millisec.
# median 25.06 (cvm) / 32.23 (cvmfc) / 133.32 (ad) / 120.33 (adfc) millisec.
microbenchmark::microbenchmark(npnorm.sq(data),
npnorm.cvm(data),
npnorm.ad(data),
npnormfc.sq(data, pi0 = pi0),
npnormfc.cvm(data, pi0 = pi0),
npnormfc.ad(data, pi0 = pi0)) # Original version
## Unit: milliseconds
## expr min lq mean median uq
## npnorm.sq(data) 48.3884 48.48910 48.91688 48.61680 48.78815
## npnorm.cvm(data) 54.0211 54.14035 54.63181 54.31110 54.54665
## npnorm.ad(data) 152.7562 153.01590 154.49812 153.52695 154.15560
## npnormfc.sq(data, pi0 = pi0) 36.1619 36.26105 36.61753 36.33545 36.55695
## npnormfc.cvm(data, pi0 = pi0) 27.6373 27.70890 28.06326 27.78220 27.94250
## npnormfc.ad(data, pi0 = pi0) 111.9438 112.31260 113.87270 112.68595 113.68330
## max neval
## 54.8831 100
## 60.3656 100
## 167.7344 100
## 42.6405 100
## 32.3776 100
## 128.4893 100
microbenchmark::microbenchmark(npnorm.cvm(data, order = -2),
npnorm.ad(data, order = -2),
npnormfc.ll(data, order = -2),
npnormfc.cvm(data, pi0 = pi0, order = -2)) # Original version
## Unit: milliseconds
## expr min lq mean median
## npnorm.cvm(data, order = -2) 71.8238 72.07355 73.93795 72.82155
## npnorm.ad(data, order = -2) 72.5153 72.71870 74.38591 73.22060
## npnormfc.ll(data, order = -2) 39.8674 39.99315 40.89892 40.23375
## npnormfc.cvm(data, pi0 = pi0, order = -2) 23.1226 23.22900 23.81399 23.35400
## uq max neval
## 74.8706 85.6138 100
## 74.9092 85.1267 100
## 40.9892 48.2332 100
## 23.7634 28.8231 100