#' Bayesian Distribution-Free Correlation and Concordance
#'
#' Given bivariate data, computes the sample number of concordant changes \code{nc}
#' between the two variates and the number of discordant changes \code{nd}.
#' Provides the frequentist \code{tau_A} correlation coefficient
#' \code{(nc-nd)/(nc+nd)}, and provides a Bayesian analysis of the population
#' concordance parameter \code{phi}: the limit of the proportion of concordance
#' changes between the variates.
#' For goodness-of-fit applications, provides a concordance measure that
#' corrects for the number of fitting parameters.
#'
#'
#' @param x Vector of x variable values
#' @param y Vector of y variable values
#' @param a0 First shape parameter for the prior beta distribution (default is 1)
#' @param b0 Second shape parameter for the prior beta distribution (default is 1)
#' @param prob_interval Desired width for interval estimates (default is .95)
#' @param fitting.parameters (Optional) If either x or y values are generated by a predictive model, the number of free parameters in the model (default is NULL)
#'
#' @return A list containing the following components:
#' @return \item{tau}{Nonparametric Tau-A correlation}
#' @return \item{sample_p}{Sample concordance proportion}
#' @return \item{nc}{Number of concordant comparisons}
#' @return \item{nd}{Number of discordant comparisons}
#' @return \item{a_post}{The first shape parameter for the posterior beta distribution for the concordance proportion}
#' @return \item{b_post}{The second shape parameter for the posterior beta distribution for the concordance proportion}
#' @return \item{a0}{The first shape parameter for the prior beta distribution for the concordance proportion}
#' @return \item{b0}{The second shape parameter for the prior beta distribution for the concordance proportion}
#' @return \item{prob_interval}{The probability within the interval estimates for the phi parameter}
#' @return \item{post_median}{Median of posterior distribution on phi}
#' @return \item{eti_lower}{Lower limit of the equal-tail interval with width specified by prob_interval}
#' @return \item{eti_upper}{Upper limit of the equal-tail interval with width specified by prob_interval}
#' @return \item{tau_star}{Corrected tau_A to account for the number of free fitting parameter in goodness-of-fit applications}
#' @return \item{nc_star}{The corrected number of concordant comparisons for a goodness-of-fit application when there is an integer value for \code{fitting.parameters}}
#' @return \item{nd_star}{The number of discordant comparison when there is an integer value for \code{fitting.parameters}}
#' @return \item{sample_p_star}{Correct proportion of concordant comparisons to account for free-fitting parameter for goodness-of-fit applications}
#' @return \item{a_post_star}{Corrected value for the first shape parameter for the posterior for the concordance proportion when there are free fitting parameter for goodness-of-fit applications}
#' @return \item{b_post_star}{The second shape parameter for the posterior distribution for the concordance proportion when there is a goodness-of-fit application}
#' @return \item{post_median_star}{The posterior median for the concordance proportion when there is a goodness-of-fit application}
#' @return \item{eti_lower_star}{Lower limit for the interval estimate when there is a goodness-of-fit application}
#' @return \item{eti_upper_star}{Upper limt for the interval estimate when there is a goodness-of-fit application}
#'
#' @details
#'
#' The product-moment correlation depends on Gaussian assumptions about the
#' residuals in a regression analysis. It is not robust because it is strongly
#' influenced by any extreme outlier scores for either of the two variates. A
#' rank-based analysis can avoid both of these limitations. The \code{dfba_bivariate_concordance()}
#' function is focused on a nonparametric concordance metric for characterizing
#' the association between the two bivariate measures.
#'
#' To illustrate the nonparametric concepts of concordance and discordance,
#' consider a specific example where there are five paired scores with
#' \deqn{x = {3.8, 4.7, 4.7, 4.7, 11.8}} and \deqn{y = [5.9, -4.1, 7.3, 7.3, 38.9].}
#' The ranks for the \eqn{x} variate are \eqn{1, 3, 3, 3, 5} and the corresponding
#' ranks for \eqn{y} are \eqn{2, 1, 3.5, 3.5, 5}, so the five points in terms of
#' their ranks are \eqn{P_1 = (1, 2)}, \eqn{P_2 = (3, 1)}, \eqn{P_3 = (3, 3.5)},
#' \eqn{P_4 = (3, 3.5)} and \eqn{P_5 = (5,5)}. The relationship between any two
#' of these points \emph{Pi} and \emph{Pj}, is either (1) concordant if the
#' sign of \eqn{R_{xi} - R_{xj}} is the same as the sign of
#' \eqn{R_{yi} - R_{yj}}, (2) discordant if signs are
#' different between \eqn{R_{xi}-R_{xj}} and \eqn{R_{yi}-R_{yj}}, or (3) null if
#' either \eqn{R_{xi}=R_{xj}} or if \eqn{R_{yi}=R_{yj}}. For the above example,
#' there are ten possible comparisons among the five points; six are concordant,
#' one is discordant, and there are three comparisons lost due to ties. In
#' general, given \eqn{n} bivariate scores there are \eqn{n(n-1)/2} total
#' possible comparisons. When there are ties in the \eqn{x} variate, there is
#' a loss of \eqn{T_x} comparisons, when there are ties in the \eqn{y} variate,
#' there are \eqn{T_y} lost comparisons. Ties in both \eqn{x} and \eqn{y} are denoted
#' \eqn{T_{xy}}. The total number of possible comparisons,
#' accounting for ties, is therefore: \deqn{n(n-1)/2-T_x-T_y+T_{xy},} where \eqn{T_{xy}}
#' is added to avoid double-counting of lost comparisons.
#'
#' In the above example, there are three lost comparisons due to ties in \eqn{x},
#' one lost comparison due to a tie in \eqn{y}, and one comparison lost to a tie
#' in both the \eqn{x} and \eqn{y} variates. Thus, there are \eqn{[(5*4)/2]-3-1+1=7}
#' comparisons for the above example. The \eqn{\tau_A} correlation is defined as
#' \eqn{(n_c-n_d)/(n_c+n_d)}, which is a value on the \eqn{[-1,1]} interval. However,
#' it is important to note the original developer of the frequentist \eqn{\tau}
#' correlation used a different coefficient that has come to be called
#' \eqn{\tau_B}, which is given as
#' \eqn{(n_c-n_d)/([(n*(n-1)/2)-Tx][(n*(n-1)/2)-Ty])^{.5}}. However, \eqn{\tau_B}
#' does not properly correct for tied scores, which is unfortunate
#' because \eqn{\tau_B} is the value returned by the \code{stats} function
#' \code{cor(..., method = "kendall")}. If there are no ties, then
#' \eqn{T_x = T_y = T_{xy} = 0} and \eqn{\tau_A = \tau_B}. But if there are ties,
#' then the proper coefficient is given by \eqn{\tau_A}. The \code{dfba_bivariate_concordance()}
#' function provides the proper correction for tied scores and outputs a sample
#' estimate for \eqn{\tau_A}.
#'
#' The focus for the Bayesian analysis is on the population proportion
#' of concordance, which is the limit of the ratio \eqn{n_c/(n_c+n_d)}. This
#' proportion is a value on the \eqn{[0,1]} interval, and it is called \eqn{\phi} (Phi).
#' \eqn{\phi} is also connected to the population \eqn{\tau_A} because
#' \eqn{\tau_A=(2\phi -1)}. Moreover, Chechile (2020) showed that the
#' likelihood function for observing \eqn{n_c} concordant changes and \eqn{n_d}
#' discordant changes is a censored Bernoulli process, so the likelihood is
#' proportional to \eqn{(\phi^{n_c})((1-\phi)^{n_d})}. In Bayesian statistics, the
#' likelihood function is only specified as a proportional function because,
#' unlike in frequentist statistics, the likelihood of unobserved and more
#' extreme events are not computed. This idea is the \emph{likelihood principle},
#' and its violation can lead to paradoxes (Lindley & Phillips, 1976). Also, the
#' likelihood need only be a proportional function because the proportionality
#' constant appears in both the numerator and denominator of Bayes theorem, so
#' it cancels out. If the prior for \eqn{\phi} is a beta distribution, then it
#' follows that the posterior is also a beta distribution (\emph{i.e.}, the beta
#' is a natural Bayesian conjugate function for Bernoulli processes). The
#' default prior for the \code{dfba_bivariate_concordance()} function is the flat prior (\emph{i.e.},
#' \code{a0 = 1} and \code{b0 = 1}).
#'
#' In the special case where the user has a model for predicting a variate in
#' terms of known quantities and where there are free-fitting parameters, the
#' \code{dfba_bivariate_concordance()} function's concordance parameter is a goodness-of-fit measure
#' for the scientific model. Thus, the bivariate pair are the observed value of
#' a variate along with the corresponding predicted score from the model. The
#' concordance proportion must be adjusted in these goodness-of-fit applications
#' to take into account the number of free parameters that were used
#' in the prediction model. Chechile and Barch (2021) argued that the fitting
#' parameters increases the number of concordant changes. Consequently, the
#' value for \eqn{n_c} is downward-adjusted as a function of the number of free
#' parameters. The Chechile-Barch adjusted \eqn{n_c} value for a case where there
#' are \eqn{m} free fitting parameters is \eqn{n_c-(n*m)+[m*(m+1)/2]}. As an example,
#' suppose that there are \eqn{n = 20} scores, and the prediction equation has
#' \eqn{m = 2} free parameters that result in creating a prediction for each
#' observed score (\emph{i.e.}, there are 20 paired values of observed score \code{x}
#' and predicted score \code{y}), and further suppose that this model results in
#' \eqn{n_c = 170} and \eqn{n_d = 20}. The value of \code{n_d} is kept at 20, but
#' the number of concordant changes is reduced to \eqn{170-(20*2)+(2*3/2) = 133.}
#'
#' @references
#' Chechile, R.A. (2020). Bayesian Statistics for Experimental Scientists: A
#' General Introduction Using Distribution_Free Statistics. Cambridge: MIT Press.
#'
#' Chechile, R.A., & Barch, D.H. (2021). A distribution-free, Bayesian
#' goodness-of-fit method for assessing similar scientific prediction equations.
#' Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2021.102638
#'
#' Lindley, D. V., & Phillips, L. D. (1976). Inference for a Bernoulli process
#' (a Bayesian view). The American Statistician, 30, 112-119.
#'
#' @importFrom stats qbeta
#' @importFrom stats dbeta
#' @importFrom stats complete.cases
#'
#' @examples
#'
#'
#' x <- c(47, 39, 47, 42, 44, 46, 39, 37, 29, 42, 54, 33, 44, 31, 28, 49, 32, 37, 46, 55, 31)
#' y <- c(36, 40, 49, 45, 30, 38, 39, 44, 27, 48, 49, 51, 27, 36, 30, 44, 42, 41, 35, 49, 33)
#'
#' dfba_bivariate_concordance(x, y)
#'
#' ## A goodness-of-fit example for a hypothetical case of fitting data in a
#' ## yobs vector with prediction model
#'
#' p = seq(.05,.95,.05)
#' ypred= 17.332 - (50.261*p) + (48.308*p^2)
#'
#' # Note the coefficients in the ypred equation were found first via a
#' # polynomial regression
#'
#' yobs<-c(19.805, 10.105, 9.396, 8.219, 6.110, 4.543, 5.864, 4.861, 6.136,
#' 5.789, 5.443, 5.548, 4.746, 6.484, 6.185, 6.202, 9.804, 9.332,
#' 14.408)
#'
#' dfba_bivariate_concordance(x = yobs,
#' y = ypred,
#' fitting.parameters = 3)
#'
#' @export
dfba_bivariate_concordance<-function(x,
y,
a0=1,
b0=1,
prob_interval=0.95,
fitting.parameters=NULL){
xy_input<-data.frame(x,y) #append x and y vectors
xy<-xy_input[complete.cases(xy_input),] #keep only rows with valid x and y values
removed_rows<-nrow(xy_input)-nrow(xy) #count removed rows
if (removed_rows>0){ #warning message if rows are removed
warning(paste0(removed_rows, " row(s) removed due to missing data"))
}
if(a0 <= 0|
a0 == Inf|
is.na(a0)|
b0 <= 0|
b0 == Inf|
is.na(b0)){
stop("Both the a0 and b0 shape parameters must be positive and finite")
}
#Replace x and y vectors with complete-case-restricted data
x1 <- xy$x
y1 <- xy$y
t_xi <- unname(table(x1)[table(x1)>1]) #Counting T_x sizes of ties
t_yi <- unname(table(y1)[table(y1)>1]) #Counting T_y sizes of ties
Tx<-sum((t_xi*(t_xi-1))/2) #Calculating Tx
Ty<-sum((t_yi*(t_yi-1))/2) #Calculating Ty
t_xyi<-unname(table(xy)[table(xy)>1]) #Calculating txyi
Txy<-sum(t_xyi*(t_xyi-1)/2) #Calculating Txy
n<-length(x1)
n_max<-n*(n-1)/2-Tx-Ty+Txy
xy_ranks<-data.frame(xrank=rank(x1, ties.method="average"),
yrank=rank(y1, ties.method="average"))
xy_c<-xy_ranks[order(x1, -y1),] # for n_c, sort on ascending x then descending y
xy$concordant<-rep(NA, nrow(xy))
for (i in seq_len(nrow(xy-1))){
xy$concordant[i]<-sum(xy_c$yrank[(i+1):length(xy_c$yrank)]>xy_c$yrank[i])
}
nc <- sum(xy$concordant, na.rm=TRUE)
nd <- n_max-nc
Tau <- (nc-nd)/n_max
pc <- (Tau+1)/2
a_post <- a0+nc
b_post <- b0+nd
post_median<-qbeta(0.5,
a_post,
b_post)
eti_lower<-qbeta((1-prob_interval)/2,
a_post,
b_post)
eti_upper<-qbeta(1-(1-prob_interval)/2,
a_post,
b_post)
dfba_bivariate_concordance_list<-list(tau = Tau,
nc = nc,
nd = nd,
sample_p = pc,
a0 = a0,
b0 = b0,
a_post = a_post,
b_post = b_post,
post_median = post_median,
prob_interval = prob_interval,
eti_lower = eti_lower,
eti_upper = eti_upper)
if (is.null(fitting.parameters) == FALSE){ #Calculate phi_star if fitting.parameters are present
Lc<-n*fitting.parameters-(fitting.parameters*(fitting.parameters+1)/2)
nc_star<-nc-Lc
Tau_star<-(nc_star-nd)/(nc_star+nd)
pc_star<-(Tau_star+1)/2
a_post_star<-a0+nc_star
post_median_star<-qbeta(0.5, a_post_star, b_post)
eti_lower_star<-qbeta((1-prob_interval)/2, a_post_star, b_post)
eti_upper_star<-qbeta(1-(1-prob_interval)/2, a_post_star, b_post)
dfba_bivariate_concordance_star_list<-append(dfba_bivariate_concordance_list,
list(tau_star=Tau_star,
nc_star=nc_star,
nd_star=nd,
sample_p_star=pc_star,
a_post_star=a_post_star,
b_post_star=b_post,
post_median_star=post_median_star,
prob_interval=prob_interval,
eti_lower_star=eti_lower_star,
eti_upper_star=eti_upper_star))
}
if(is.null(fitting.parameters)==TRUE){
new("dfba_bivariate_concordance_out", dfba_bivariate_concordance_list)
} else {
new("dfba_bivariate_concordance_star_out", dfba_bivariate_concordance_star_list)
}
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.