title: "The npfixedcomp Package" author: "Xiangjie Xue" date: "07 June, 2020" output: html_document: keep_md: yes
This is an R package that uses non-parametric methods to compute the mixing distribution.
Likelihood-based methods:

- NPMLE using a normal component with the support points fixed, estimating the proportion of 0 using a threshold function.
- NPMLE using a t component with or without the support points fixed, estimating the proportion of 0 using a threshold function.

Distance-based methods:

- Density estimation using the normal density under the squared error, Cramer-von Mises or Anderson-Darling criterion.
- Estimation of the proportion of 0 using the normal density under the Cramer-von Mises or Anderson-Darling criterion.
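A minimal usage sketch of both families of methods above, assuming the interface shown in the examples later in this document (the mu0/pi0 arguments and the returned mix component):

library(npfixedcomp)
# simulated two-component data: half null (mean 0), half non-null (mean 2)
set.seed(1)
x = rnorm(1000, mean = c(0, 2))
# likelihood-based: restricted NPMLE with the proportion at 0 fixed to 0.5
fit.ll = npnormfc.ll(x, mu0 = 0, pi0 = 0.5)
# distance-based: the same restriction under the Cramer-von Mises criterion
fit.cvm = npnormfc.cvm(x, pi0 = 0.5)
# estimate the proportion of 0 instead of fixing it
fit0 = findnpnormfc.ll(x)
fit0$mix$pr[fit0$mix$pt == 0]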
npfixedcomp-1.3.0: By default the family functions are not loaded into the global environment when the package is loaded. Moved all the R-based functions that depend on nspmix to ./inst/extdata/. Completed the large-scale functions for ll, ad and cvm.
npfixedcomp-1.2.5: Imported lsei.f and pnnls.f into the ./src/ directory. The full Rcpp implementation of the restricted NPMLE for the normal density is based on this NNLS code. The Rcpp version of the NPMLE solver is considerably faster than the other versions. See Latex (private git) commit 7600a93 on 16/03/2020 or later for details.
npfixedcomp-1.2.4: The direct implementation of npnormfc (makenpnormfc) is known to fail when G is generated from a Cauchy distribution. As a result, npnormfc.ll2 was written from scratch, and it works when G is generated from a Cauchy distribution. The new function has been observed to run slightly faster, but its numerical stability is yet to be determined.
npfixedcomp-1.2.3: Added the functions for estimating the proportion of 0 using the ad loss. The critical value in findnpnormdist1fc.ad is the p-value. The pad function is the direct integration in Anderson and Darling (1952), and the qad function uses a root-finding function. Adding the computations with weights is in the plan.
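A minimal sketch of the root-finding idea mentioned in the 1.2.3 entry above: a quantile function can be obtained by numerically inverting the corresponding distribution function, here with R's uniroot(). This is illustrative only and not the package's qad code; the commented pad() call is a hypothetical example.

# invert a generic cdf numerically to get quantiles (illustrative only)
quantile_from_cdf = function(p, cdf, interval = c(0, 1000)) {
  vapply(p, function(pp) {
    uniroot(function(q) cdf(q) - pp, interval = interval)$root
  }, numeric(1))
}
# hypothetical usage with the package's internal pad():
# quantile_from_cdf(0.95, function(q) npfixedcomp:::pad(q))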
npfixedcomp-1.2.2: Updated finding the proportion of 0 using the cvm loss. The crit.val in findnpnormdist1fc.cvm is the p-value thanks to [p/q]-cvm.
npfixedcomp-1.2.1: Added the non-parametric distance-based method for computing the mixing distribution under the Cramer-von Mises criterion. The npnormdist.cvm function is the distance-based equivalent of the likelihood-based npnorm. The fixed component is yet to be added.
npfixedcomp-1.2: Intends to add the capability to compute cnms with weights. A binning process is used; when binning is used, the NPMLE is only an approximation, so one should consider the trade-off between the accuracy of the solution and the speed.
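A minimal sketch of what such a binning could look like. This is not the package's internal binning code, and the relation between the bin width and the order argument used in the examples later (h = 2^order) is only an assumption.

# illustrative binning: snap observations to bin midpoints and count them,
# so the solver works with (midpoint, weight) pairs instead of raw data.
# the bin width h = 2^order is an assumed interpretation of the order argument.
bin_obs = function(x, order = -2) {
  h = 2^order
  mids = round(x / h) * h
  tab = table(mids)
  list(v = as.numeric(names(tab)),  # bin midpoints
       w = as.numeric(tab))         # weights (counts per bin)
}
b = bin_obs(rnorm(1e5, c(0, 2)))
length(b$v)  # far fewer points than observations, at the cost of approximation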
npfixedcomp-1.1: Since there is no cancellation error in addition, there is no need to use the LSE functions.
07/06/2020: Fixed errors in npnormadfcweighted and changed the code so that the compiler does not complain. Changes made for consistency.
28/05/2020: Fixed the code for npnormsq, npnormad and npnormllweighted. The strange behaviour of the ad function was solved by extending the range; in theory, the range of the support points under the distance-based methods is unknown.
27/05/2020: Added the family for computing the mixing distribution of sample correlation coefficients when the density is approximated by a normal density.
21/05/2020: Changed the stopping criterion for npnorm.ll and the algorithms for finding 0.
26/04/2020: The NPMLE for weighted observations is done via a discrete normal distribution. Changed the binning functions so that they are consistent.
24/04/2020: Moved all the functions with weighted observations. Trying to use a discrete normal distribution for this.
21/04/2020: Fixed the formulation of the ad functions. The LLT decomposition should be investigated. The weighted versions of the sq and cvm functions are done.
18/04/2020: Moved the solver to a header since it is easier to maintain. At the same time, solvers of order 0, 1 and 2 were written.
17/04/2020: Added gridpointsnpnorm for computing the local minima for the normal family.
14/04/2020: Finished writing the weighted versions of the maximum likelihood methods.
03/04/2020: Finished writing npnorm.ad, npnormfc.ad and findnpnorm.ad.
02/04/2020: Reverted the setting from 30/03/2020: it is not viable to find all the local minima in the optimisation. If no local minimum is found in one iteration, the algorithm terminates.
01/04/2020: Changed the optimisation tool in the cpp files so that it does not depend on RcppNumerical. For npt it uses the secant method, and for npnorm it uses Newton-Raphson.
30/03/2020: Across all the computing functions in Rcpp, the stopping criterion becomes: stop if the directional derivative is less than the tolerance. This makes more sense.
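For the likelihood-based families this directional derivative has the standard mixture-likelihood form $D(\theta; G) = \sum_i f(x_i; \theta)/f_G(x_i) - n$. A minimal R sketch of the check (illustrative only, not the package's internal C++ code):

# directional derivative of the log-likelihood at a mixing distribution
# (support points pt, probabilities pr) towards a point mass at theta
dir_deriv = function(theta, x, pt, pr, sd = 1) {
  fG = colSums(pr * outer(pt, x, function(p, xx) dnorm(xx, mean = p, sd = sd)))
  sum(dnorm(x, mean = theta, sd = sd) / fG) - length(x)
}
# stopping rule sketch: stop once the largest dir_deriv over candidate
# support points falls below the tolerance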
28/03/2020: Completed the function for finding the proportion at a given location for the log-likelihood methods.
27/03/2020: Wrote a function for estimating the initial mixing distribution.
26/03/2020: Renamed the original cvm functions to cvm2. Used the same template as npnormdist.sq and npnormdistfc.sq to write the corresponding cvm functions. Changed the inline function for collapsing. Fixed errors in the npnormdistfc.sq and npnormdistfc.cvm functions.
25/03/2020: Reshuffled code into place. Moved most functions from the previous version into the header and made them inline. Finished writing npnormdist.sq and npnormdistfc.sq. The performance can be seen below.
23/03/2020: Removed the dependency on Rcpp for the make functions. Removed the code for assigning the mixing distribution to a global variable in the find functions (this gave notes in R CMD check). Removed the R version of the restricted NPMLE (npnormfc.ll2).
22/03/2020: Added the nptfc.ll function for the t distribution.
18/03/2020: Massive improvements to sortmix and chklossfunction.
16/03/2020: Changed the version to 1.2.5. See the update notes for details.
14/03/2020: Copied lsei.f and pnnls.f from the package lsei to ./src/ for the C++ implementation of pnnls with a fixed sum.
05/03/2020: Made improvements to the functions that generate the initial mixing distribution when it is not specified via init. The new function first generates a mixing distribution according to nspmix and then subtracts the mixing density given mu0 and pi0 at the mid points. The new functions make the normal component functions faster; for the $t$ distribution, the improvement is to be determined, and the initial functions for the $t$ family are misspecified anyway.
03/03/2020: Changed the version to 1.2.4. See the update notes for details.
15/01/2020: Fixed the errors in the gradient functions for estimating the null proportions.
18/12/2019: Changed the version to 1.2.3. See the update notes for details.
13/12/2019: Suppressed the printing in the find* functions. Improvements made to the distance functions.
12/12/2019: The second version of the NPMLE is working, but considering changing the method for finding local extrema. Should the normal dnpnorm be used or the log-safe version?
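For reference, a minimal sketch of what a log-safe version of the mixture density could look like, using the log-sum-exp trick; this is illustrative only and not the package's dnpnorm.

# mixture density computed on the log scale via log-sum-exp to avoid
# underflow when every component density is tiny (illustrative only)
dnpnorm_logsafe = function(x, pt, pr, sd = 1, log = FALSE) {
  lw = outer(x, pt, function(xx, p) dnorm(xx, mean = p, sd = sd, log = TRUE)) +
    rep(log(pr), each = length(x))
  m = apply(lw, 1, max)
  out = m + log(rowSums(exp(lw - m)))
  if (log) out else exp(out)
}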
12/12/2019: Another version of npnormfc that does not depend on cnm. It is working in the sense that it gives a reasonable solution; however, this solution still has an issue: there is still a gap between the minimum directional derivative and 0. Possible reasons: the values passed into pnnqp or the condition checking.
10/12/2019: A slight change to npnormdist and npnormfc: the functions for plotting the gradient and for collapsing are now separate from the main function. Added the function for checking the updated pi0. More tests on these two functions are needed.
10/12/2019: Fixed the bug that the npnormfc and nptfc constructors cannot be used in cnm directly. makenpnormfc and makenptfc are still the preferred choice. The solution computed by cnm directly is not post-processed.
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_New Zealand.1252 LC_CTYPE=English_New Zealand.1252
## [3] LC_MONETARY=English_New Zealand.1252 LC_NUMERIC=C
## [5] LC_TIME=English_New Zealand.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.3 magrittr_1.5 tools_3.6.3 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.4.6 stringi_1.4.6 rmarkdown_2.2
## [9] knitr_1.28 stringr_1.4.0 xfun_0.14 digest_0.6.25
## [13] rlang_0.4.6 evaluate_0.14
For the prostate dataset, the degrees of freedom is 100, so it is plausible that the normal distribution can be used in the calculation.
library(npfixedcomp)
load(url("http://statweb.stanford.edu/~ckirby/brad/LSI/datasets-and-programs/data/prostz.RData"))
npnormfc.ll(prostz, pi0 = 0.9)
## $iter
## [1] 15
##
## $family
## [1] "npnorm"
##
## $max.gradient
## [1] 3.398953e-07
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -2.2733473 -0.8729892 0.0000000 1.1404941 2.7549436
##
## $mix$pr
## [1] 0.02212661 0.02801186 0.90000000 0.03652114 0.01334038
##
##
## $ll
## [1] -9286.442
##
## $dd0
## [1] -0.6736482
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
findnpnormfc.ll(prostz)
## $iter
## [1] 7
##
## $family
## [1] "npnorm"
##
## $max.gradient
## [1] 1.715528e-08
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -2.418859 0.000000 2.601970
##
## $mix$pr
## [1] 0.01902377 0.96298198 0.01799425
##
##
## $ll
## [1] -9290.353
##
## $dd0
## [1] -13.26656
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
A similar process can be done when the component density follows a $t$ distribution. In this example, 300 observations are generated, with 150 from ncp = 0 and 150 from ncp = 5. The degrees of freedom here is 10.
data = rt(300, df = 10, ncp = rep(c(0, 5), each = 150))
nptfc.ll(data, pi0 = 0.5, df = 10)
## $iter
## [1] 10
##
## $family
## [1] "npt"
##
## $max.gradient
## [1] 3.968128e-06
##
## $beta
## [1] 10
##
## $mix
## $mix$pt
## [1] -1.224998 0.000000 4.885200 5.915628 15.744576
##
## $mix$pr
## [1] 0.023867874 0.500000000 0.377036935 0.096742571 0.002352619
##
##
## $ll
## [1] -720.9072
##
## $dd0
## [1] -14.31792
##
## $convergence
## [1] 0
For the distance-based methods, the only distance losses available here are the Cramer-von Mises and Anderson-Darling criteria. Using the prostate data above,
npnormfc.ll(prostz)
## $iter
## [1] 23
##
## $family
## [1] "npnorm"
##
## $max.gradient
## [1] 7.725662e-06
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -2.3410759 -0.1570473 0.0000000 0.4348154 2.7338397
##
## $mix$pr
## [1] 0.02048200 0.69017611 0.00000000 0.27475588 0.01458601
##
##
## $ll
## [1] -9286
##
## $dd0
## [1] -0.03206196
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
npnormfc.cvm(prostz)
## $iter
## [1] 4
##
## $family
## [1] "npnormdist"
##
## $max.gradient
## [1] -1.435994e-12
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -4.4239178 -0.3811211 0.0000000 0.4106377 4.3783443
##
## $mix$pr
## [1] 0.003619654 0.516644580 0.000000000 0.474487217 0.005248548
##
##
## $ll
## [1] 0.05048828
##
## $dd0
## [1] 0.001208542
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
npnormfc.ad(prostz)
## $iter
## [1] 24
##
## $family
## [1] "npnormdist"
##
## $max.gradient
## [1] 1.127952e-05
##
## $beta
## [1] 1
##
## $mix
## $mix$pt
## [1] -3.1060327 -0.2749033 0.0000000 0.4678700 3.3436915
##
## $mix$pr
## [1] 0.008952014 0.613460823 0.000000000 0.369767435 0.007819728
##
##
## $ll
## [1] -6033.339
##
## $dd0
## [1] -0.005345347
##
## $convergence
## [1] 0
##
## attr(,"class")
## [1] "nspmix"
The following example compares the efficiency of the NPMLE and the two distance-based methods. Only one sample with 1000 observations is calculated, hence the plot below might look weird in some way.
set.seed(1)
data = rnorm(1000, mean = c(0, 2))
x = qnorm(seq(1e-2, 1 - 1e-2, length = 101))
x = (x - min(x) * 1.01) / diff(range(x)) / 1.05
k1 = matrix(unlist(lapply(x, function(ddd){
result = npnormfc.ll(data, pi0 = ddd)
result2 = npnormfc.ad(data, pi0 = ddd)
result3 = npnormfc.cvm(data, pi0 = ddd)
c(result$ll,
-result2$ll,
result3$ll)
})), nrow = 3)
plot(x, exp(k1[1, ] - k1[1, 1]), type = "l", ylab = "Likelihood/p.value")
lines(x, npfixedcomp:::pad(k1[2, ] - length(data), lower.tail = FALSE), col = "red")
lines(x, npfixedcomp:::pcvm(k1[3, ] + 1 / 12 / length(data), lower = FALSE), col = "blue")
abline(v = 0.5)
The figure above shows the likelihood function for the NPMLE (black) and the p-values for the distance-based methods (red for Anderson-Darling and blue for Cramer-von Mises), plotted against the fixed probability at 0. The vertical line shows the true proportion of 0.
For version 1.2.5 and later, we can compare the computational speed of the restricted NPMLE among the implementation in the nspmix package, the direct implementation of the algorithm in R, and the Rcpp implementation.
mu0 = 0; pi0 = 0.4
set.seed(1)
data = rnorm(500, c(0, 4))
set.seed(2)
rdata = rt(500, ncp = c(0, 5), df = 13)
# comparison for finding restricted NPMLE (normal)
# nspmix reference : 109.89 (mean) and 108.77 (median) on macOS.
microbenchmark::microbenchmark(npnormfc.ll(data, mu0 = mu0, pi0 = pi0)) # Rcpp
## Unit: milliseconds
## expr min lq mean median
## npnormfc.ll(data, mu0 = mu0, pi0 = pi0) 64.2774 64.5123 64.8411 64.6829
## uq max neval
## 65.02435 66.9411 100
# comparison for finding pi0 (normal)
# nspmix reference : 1.681s (no binning)
datalarge = rnorm(1e5, c(0, 2))
system.time(r1 <- findnpnormfc.ll(datalarge))
## user system elapsed
## 191.65 24.25 216.08
# with order = -2 binning.
system.time(r2 <- findnpnormfc.ll(datalarge, order = -2))
## user system elapsed
## 1.77 0.00 1.77
r1$mix$pr[r1$mix$pt == 0];r2$mix$pr[r2$mix$pt == 0]
## [1] 0.5073351
## [1] 0.507552
system.time(r1 <- findnpnormfc.cvm(datalarge))
## user system elapsed
## 160.83 5.94 166.86
# with order = -2 binning.
system.time(r2 <- findnpnormfc.cvm(datalarge, order = -2))
## user system elapsed
## 1.43 0.00 1.42
r1$mix$pr[r1$mix$pt == 0];r2$mix$pr[r2$mix$pt == 0]
## [1] 0.506271
## [1] 0.6016159
system.time(r1 <- findnpnormfc.ad(datalarge))
## user system elapsed
## 397.88 19.16 417.37
# with order = -2 binning.
system.time(r2 <- findnpnormfc.ad(datalarge, order = -2))
## user system elapsed
## 1.60 0.00 1.61
r1$mix$pr[r1$mix$pt == 0];r2$mix$pr[r2$mix$pt == 0]
## [1] 0.506119
## [1] 0.5063576
# comparison for restricted NPMLE (t distribution)
# nspmix reference 7.649s (no binning)
system.time(nptfc.ll(rdata, mu0 = mu0, pi0 = pi0, df = 13)) # Rcpp
## user system elapsed
## 5.64 0.00 5.64
# comparisons for distance NPMDE (normal)
# R-implementation reference mean 26.21 (cvm) / 33.44 (cvmfc) / 134.53 (ad) / 123.83 (adfc) millisec.
# median 25.06 (cvm) / 32.23 (cvmfc) / 133.32 (ad) / 120.33 (adfc) millisec.
microbenchmark::microbenchmark(npnorm.sq(data),
npnorm.cvm(data),
npnorm.ad(data),
npnormfc.sq(data, pi0 = pi0),
npnormfc.cvm(data, pi0 = pi0),
npnormfc.ad(data, pi0 = pi0)) # Original version
## Unit: milliseconds
## expr min lq mean median uq
## npnorm.sq(data) 48.3884 48.48910 48.91688 48.61680 48.78815
## npnorm.cvm(data) 54.0211 54.14035 54.63181 54.31110 54.54665
## npnorm.ad(data) 152.7562 153.01590 154.49812 153.52695 154.15560
## npnormfc.sq(data, pi0 = pi0) 36.1619 36.26105 36.61753 36.33545 36.55695
## npnormfc.cvm(data, pi0 = pi0) 27.6373 27.70890 28.06326 27.78220 27.94250
## npnormfc.ad(data, pi0 = pi0) 111.9438 112.31260 113.87270 112.68595 113.68330
## max neval
## 54.8831 100
## 60.3656 100
## 167.7344 100
## 42.6405 100
## 32.3776 100
## 128.4893 100
microbenchmark::microbenchmark(npnorm.cvm(data, order = -2),
npnorm.ad(data, order = -2),
npnormfc.ll(data, order = -2),
npnormfc.cvm(data, pi0 = pi0, order = -2)) # Original version
## Unit: milliseconds
## expr min lq mean median
## npnorm.cvm(data, order = -2) 71.8238 72.07355 73.93795 72.82155
## npnorm.ad(data, order = -2) 72.5153 72.71870 74.38591 73.22060
## npnormfc.ll(data, order = -2) 39.8674 39.99315 40.89892 40.23375
## npnormfc.cvm(data, pi0 = pi0, order = -2) 23.1226 23.22900 23.81399 23.35400
## uq max neval
## 74.8706 85.6138 100
## 74.9092 85.1267 100
## 40.9892 48.2332 100
## 23.7634 28.8231 100