R/cramer.R

Defines functions phiFracB phiFracA phiLog phiBahr phiCramer print.cramertest .cramer.kritwertfft .cramer.characteristicfunction .cramer.statistic cramer.test

Documented in cramer.test

#require(boot)
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#' Cramer-Test for uni- and multivariate two-sample-problem 
#' 
#' @author Carsten Franz <carsten.franz@gmail.com>
#' @useDynLib cramer
#' @importFrom Rcpp evalCpp
#' @import boot
#' @importFrom stats fft
#' @aliases cramer-package
"_PACKAGE"

#' Perform Cramer-Test for uni- and multivariate two-sample-problem 
#'
#' @description
#' Perform Cramér-test for two-sample-problem. 
#'
#' Both univariate and multivariate data is possible. For calculation of the critical value
#'    Monte-Carlo-bootstrap-methods and eigenvalue-methods are available. For the bootstrap
#'    access ordinary and permutation methods can be chosen as well as the number
#'    of bootstrap-replicates taken.
#' @details
#'    The Cramér-statistic is given by
#'       \deqn{T_{m,n} = \frac{mn}{m+n}\biggl(\frac{2}{mn}\sum_{i,j}^{m,n}\phi(\|\vec{X}_i-\vec{Y}_j\|^2)-\frac{1}{m^2}\sum_{i,j=1}^m\phi(\|\vec{X}_{i}-\vec{X}_{j}\|^2)}{T=mn/(m+n) ( 2/(mn) Sum[i=1..m,j=1..n] phi(||X_i-Y_j||^2) - 1/(m^2) Sum[i=1..m,j=1..m] phi(||X_i-X_j||^2) - 1/(n^2) Sum[i=1..n,j=1..n] phi(||Y_i-Y_j||^2) )}
#'       \deqn{-\frac{1}{n^2}\sum_{i,j=1}^n\phi(\|\vec{Y}_{i}-\vec{Y}_{j}\|^2)\biggr),}{\code{}}
#'    The function \eqn{\phi}{phi} is the kernel function mentioned in the Parameters section.
#'    The proof that the Monte-Carlo-Bootstrap and eigenvalue methods work is given in the reference given below. Other build-in kernel functions are
#'    \deqn{\phi_{Cramer}(z)=\sqrt{z}/2}{phiCramer(z)=z^(1/2)/2} (recommended for location alternatives),
#'    \deqn{\phi_{Bahr}(z)=1-\exp(-z/2)}{phiBahr(z)=1-exp(-z/2)} (recommended for dispersion as well as location alternatives),
#'    \deqn{\phi_{log}(z)=\log(1+z)}{phiLog(z)=log(1+z)} (preferrably for location alternatives),
#'    \deqn{\phi_{FracA}(z)=1-\frac{1}{1+z}}{phiFracA(z)=1-1/(1+z)} (preferrably for dispersion alternatives) and
#'    \deqn{\phi_{FracB}(z)=1-\frac{1}{(1+z)^2}.}{phiFracA(B)=1-1/(1+z)^2.} (also for dispersion alternatives). 
#'    Test performance was investigated in the below referenced 2010 publication. The idea of using this statistic is 
#'    due to L. Baringhaus, University of Hanover.
#' @param x First set of observations. Either in vector form (univariate) or in a matrix with 
#'          one observation per row (multivariate).
#' @param y Second set of observations. Same dimension as \code{x}.
#' @param conf.level Confidence level of test. The default is \code{conf.level=0.95}.
#' @param sim Type of Monte-Carlo-bootstrap method or eigenvalue method. Possible values are
#'          \code{"ordinary"} (default) for normal Monte-Carlo-bootstrap, \code{"permutation"}
#'          for a permutation Monte-Carlo-bootstrap or \code{"eigenvalue"} for bootstrapping
#'          the limit distribution, evaluating the (approximate) eigenvalues being the weights
#'          of the limiting chisquared-distribution and using the critical value of this
#'          approximation (calculated via fast fourier transform). This method is especially good 
#'          if the dataset is too large to perform Monte-Carlo-bootstrapping (although it must not
#'          be too large so the matrix eigenvalue problem can still be solved).
#' @param replicates Number of bootstrap-replicates taken to obtain critical value. The default 
#'          is \code{replicates=1000}. When using the eigenvalue method, this variable is unused.
#' @param maxM Gives the maximum number of points used for the fast fourier transform. When using
#'          Monte-Carlo-bootstrap methods, this variable is unused.
#' @param K Gives the upper value up to which the integral for the calculation of the
#'          distribution function out of the characteristic function (Gurlands formula) is
#'          evaluated. The default ist 160. Careful: When increasing \code{K} it is necessary to 
#'          increase \code{maxM} as well since the resolution of the points where the distribution 
#'          function is calculated is \deqn{\frac{2\pi}{K}.}{2 PI/K.}
#'          Thus, if just \code{K} is increased the maximum value, where the distribution function
#'          is calculated is lower. 
#'          When using Monte-Carlo-bootstrap methods, this variable is unused.
#' @param just.statistic Boolean variable. If \code{TRUE} just the value of the Cram\eqn{\mbox{\'e}}{e}r-statistic
#'          is calculated and no bootstrap-replicates are produced.
#' @param kernel Character-string giving the name of the kernel function. The default is \code{"phiCramer"} which
#'          is the Cram\eqn{\mbox{\'e}}{e}r-test included 
#'          in earlier versions of this package and which is used in the paper of Baringhaus and the author mentioned 
#'          below. It is possible to use user-defined kernel functions here. The functions needs to be able to deal with
#'          matrix arguments. Kernel functions need to be defined on 
#'          the positive real line with value 0 at 0 and have a non-constant completely monotone first 
#'          derivative. An example is show in the Examples section below. Build-in functions are \code{"phiCramer"},
#'          \code{"phiBahr"}, \code{"phiLog"}, \code{"phiFracA"} and \code{"phiFracB"}.
#' @returns The returned value is an object of class \code{"cramertest"}, containing the following components:
#'          \item{method}{Describing the test in words.}
#'          \item{d}{Dimension of the observations.}
#'          \item{m}{Number of \code{x} observations.}
#'          \item{n}{Number of \code{y} observations.} 
#'          \item{statistic}{Value of the Cram\eqn{\mbox{\'e}}{e}r-statistic for the given observations.}
#'          \item{conf.level}{Confidence level for the test.}
#'          \item{crit.value}{Critical value calculated by bootstrap method, eigenvalue method, respectively. When using 
#'                            the eigenvalue method, the distribution under the hypothesis will be interpolated linearly.}
#'          \item{p.value}{Estimated p-value of the test.}
#'          \item{result}{Contains \code{1} if the hypothesis of equal distributions should not be accepted and \code{0} otherwise.}
#'          \item{sim}{Method used for obtaining the critical value.}
#'          \item{replicates}{Number of bootstrap-replicates taken.}
#'          \item{ev}{Contains eigenvalues and eigenfunctions when using the eigenvalue-method to obtain the critical value}
#'          \item{hypdist}{Contains the via fft reconstructed distribution function under the hypothesis. \code{$x}
#'                         contains the x-values and \code{$Fx} the values of the distribution function at the positions.}
#' @section References:
#'    The test and its properties is described in:
#'    \itemize{
#'      \item{Baringhaus, L. and Franz, C. (2004) \emph{On a new multivariate two-sample test}, Journal of Multivariate Analysis, 88, p. 190-206}
#'      \item{Baringhaus, L. and Franz, C. (2010) \emph{Rigid motion invariant two-sample tests}, Statistica Sinica 20, 1333-1361}
#'    }
#'    The test of Bahr is also discussed in:
#'    \itemize{  
#'      \item{Bahr, R. (1996) \emph{Ein neuer Test fuer das mehrdimensionale Zwei-Stichproben-Problem bei allgemeiner Alternative}, German, Ph.D. thesis, University of Hanover}
#'    }  
#' @examples
#' # comparison of two univariate normal distributions
#' x<-rnorm(20,mean=0,sd=1)
#' y<-rnorm(50,mean=0.5,sd=1)
#' cramer.test(x,y)
#' 
#' # comparison of two multivariate normal distributions with permutation test:
#' # library "MASS" for multivariate routines (included in package "VR")
#'  
#' # library(MASS)
#' # x<-mvrnorm(n=20,mu=c(0,0),Sigma=diag(c(1,1)))
#' # y<-mvrnorm(n=50,mu=c(0.3,0),Sigma=diag(c(1,1)))
#' # cramer.test(x,y,sim="permutation")
#' 
#' # comparison of two univariate normal distributions with Bahrs Kernel
#' phiBahr<-function(x) return(1-exp(-x/2))
#' x<-rnorm(20,mean=0,sd=1)
#' y<-rnorm(50,mean=0,sd=2)
#' cramer.test(x,y,sim="eigenvalue",kernel="phiBahr")
#'
#' @export
cramer.test<-function(x,y,conf.level=0.95,replicates=1000,sim="ordinary",just.statistic=FALSE,kernel="phiCramer",maxM=2^14,K=160) {    
    RVAL <- list(method = paste("nonparametric Cramer-Test with kernel",kernel,"\n(on equality of two distributions)"),
                 d = 0,
                 m = 0,
                 n = 0,
                 statistic = 0,
                 conf.level = conf.level,
                 crit.value = 0,
                 p.value = 0,                 
                 result = 0,
                 sim = sim,
                 replicates = replicates,
		 hypdist = 0,
		 ev = 0) 
    if ((is.vector(x))&&(is.vector(y))) RVAL$d<-1
    if ((is.matrix(x))&&(is.matrix(y))) if (ncol(x)==ncol(y)) RVAL$d<-ncol(x)
    if (RVAL$d==0) stop("types of x and y incompatible or inappropriate.")    
    if (RVAL$d==1) {
        RVAL$m<-length(x)
        RVAL$n<-length(y)
        daten<-matrix(c(x,y),ncol=1,byrow=TRUE)
    } else {
        RVAL$m<-nrow(x)
        RVAL$n<-nrow(y)
        daten<-matrix(c(t(x),t(y)),ncol=ncol(x),byrow=TRUE)
    }
    lookup<-matrix(rep(0,(RVAL$m+RVAL$n)^2),ncol=(RVAL$m+RVAL$n))
#    for (i in 2:(RVAL$m+RVAL$n))
#         for (j in 1:(i-1)) {
#             lookup[i,j]<-sum((daten[i,]-daten[j,])^2)
#             lookup[j,i]<-lookup[i,j]
#         }
    lookup<-calculateLookupMatrix(RVAL$m+RVAL$n,daten,lookup) # proposed by Markus Schaffer to speed up, thanks!
    lookup<-eval(call(kernel,lookup))
#print(date()) #!!!
    if (just.statistic) {
        RVAL$statistic<-.cramer.statistic(daten,1:(RVAL$m+RVAL$n),RVAL$m,RVAL$n,lookup)
    } else if (sim!="eigenvalue") {
        b<-boot(data=daten,statistic=.cramer.statistic,mm=RVAL$m,nn=RVAL$n,lookup=lookup,sim=RVAL$sim,stype="i",R=RVAL$replicates)
        RVAL$statistic<-b$t0
        RVAL$p.value<-1-rank(c(b$t0,b$t))[1]/(replicates+1)
        RVAL$crit.value<-sort(b$t)[round(RVAL$conf.level*RVAL$replicates)]
        if (RVAL$statistic>RVAL$crit.value) RVAL$result<-1
	RVAL$hypdist.x<-sort(b$t)
	RVAL$hypdist.Fx<-(1:RVAL$replicates)/RVAL$replicates
    } else {
	RVAL$statistic<-.cramer.statistic(daten,1:(RVAL$m+RVAL$n),RVAL$m,RVAL$n,lookup)
	N<-RVAL$m+RVAL$n
	# alternative Berechnung von B (DEUTLICH schneller)
	C1<-rep(0,N)
	for (i in 1:N)     
	    for (j in 1:N) C1[i]<-C1[i]+lookup[i,j] 
	C1<-C1/N
	C2<-0
	for (i in 1:N)     
	    for (j in 1:N) C2<-C2+lookup[i,j] 
	C2<-C2/N^2
	
	B<-matrix(rep(0,N^2),ncol=N)
	for (i in 1:N) 
	    for (j in 1:N) B[i,j]<-C1[i]+C1[j]-C2-lookup[i,j] 
	B<-B/N

	RVAL$ev<-eigen(B,FALSE) # <--only.values   TRUE)
	kritwert<-.cramer.kritwertfft(Re((RVAL$ev)$values),conf.level,maxM,K)

	RVAL$p.value<-1-(kritwert$hypdist.Fx)[which.min(abs((kritwert$hypdist.x)[1:(3*length(kritwert$hypdist.x)/4)]-RVAL$statistic))];
	RVAL$crit.value<-kritwert$quantile
	RVAL$hypdist.x<-kritwert$hypdist.x
	RVAL$hypdist.Fx<-kritwert$hypdist.Fx
	if (RVAL$statistic>RVAL$crit.value) RVAL$result<-1
    }
    class(RVAL) <- "cramertest"
#print(date()) #!!!
    return(RVAL)
}

#' @keywords internal
# .cramer.statistic calculates the value of the test statistic T
# it is build so it can be called by "boot"
#   daten   is the original data set (is just given since boot wants it like that))
#   indexe  is the bootstrapped vector of indices (first m are observations of X, the next n of Y)
#   m       is the number of observations of X
#   n       is the number of observations of Y
#   lookup  is a matrix containing the distances of points i und j
.cramer.statistic<-function(daten,indexe,mm,nn,lookup) {
    xind<-indexe[1:mm]
    yind<-indexe[(mm+1):(mm+nn)]
    mm*nn/(mm+nn)*(2*sum(lookup[xind,yind])/(mm*nn)-sum(lookup[xind,xind])/(mm^2)-sum(lookup[yind,yind])/(nn^2))
}

#' @keywords internal
.cramer.characteristicfunction<-function(lambdasquare,t) { z<--0.5*log(1-2i*lambdasquare%*%t(t));return(exp(complex(length(t),rowsum(Re(z),rep(1,length(lambdasquare))),rowsum(Im(z),rep(1,length(lambdasquare)))))) }

#' @keywords internal
.cramer.kritwertfft<-function(lambdasquare,conf.level,maxM,K) {
	M<-2^11
	while (150*pi*M/K^2<(2*sum(lambdasquare)+lambdasquare[1])) M<-M*2
	M<-min(c(M,maxM))
	goodlimit<-150*pi*M/K^2
	a<-0                    
	t<-0:(M-1)*K/M                    
	x<-0:(M-1)*2*pi/K
	#Verteilungsfunktion
	t[1]<-1                    
	h<-.cramer.characteristicfunction(lambdasquare,t)/t*exp(-a*1i*t);
	h[1]<-complex(1,0,1)*sum(lambdasquare)
	Fx<-1/2-Im(K/(M*pi)*fft(h,inverse=FALSE))+K/(2*M*pi)*(sum(lambdasquare)+x+a)
	#quantile
        xindex<-which.min(abs(Fx[1:(M/2)]-conf.level))       
        #linear interpolation between two evaluated points of the distribution function (much better than just taking x[xindex]
        if (Fx[xindex]>conf.level) xindex<-xindex-1
        if (xindex<1) xindex<-1
        quantile<-x[xindex]+(conf.level-Fx[xindex])*(x[xindex+1]-x[xindex])/(Fx[xindex+1]-Fx[xindex])
	#warnings?
	if (Fx[M/2]<conf.level) warning("Quantile calculation discrepance. Try to increase K!")
	if (quantile>goodlimit) warning("Quantile beyond good approximation limit. Try to increase maxM or decrease K!")
	kritwert <- list(quantile=quantile,hypdist.x=x,hypdist.Fx=Fx)
	return(kritwert)
}	

#' @export
print.cramertest<-function(x,...) {
    cat("\n",x$d,"-dimensional ",x$method,"\n\n")
    cat("\tx-sample: ",x$m," values        ")
    cat("y-sample: ",x$n," values\n\n")
    if (x$crit.value>0) {
        cat("critical value for confidence level ",format(100 * x$conf.level),"% : ",x$crit.value,"\n")
        cat("observed statistic ",x$statistic,", so that\n\t hypothesis (\"x is distributed as y\") is ")
        cat(ifelse(x$result==0," ACCEPTED"," REJECTED"),".\n")
        cat("estimated p-value = ",x$p.value,"\n\n")
	if (x$sim!="eigenvalue") {
          cat("\t[result based on ",x$replicates," ",x$sim," bootstrap-replicates]\n\n");
	} else {
          cat("\t[result based on eigenvalue decomposition and inverse fft]\n\n");
	}
    } else {
        cat("observed statistic ",x$statistic,"\n\n")
    }
    invisible(x)
}


phiCramer<-function(x) return(sqrt(x)/2)
phiBahr<-function(x) return(1-exp(-x/2))
phiLog<-function(x) return(log(1+x))
phiFracA<-function(x) return(1-1/(1+x))
phiFracB<-function(x) return(1-1/(1+x)^2)

Try the cramer package in your browser

Any scripts or data that you put into this service are public.

cramer documentation built on May 29, 2024, 9:31 a.m.