Principal Balances

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 4)

Principal balances are defined as follows [@pawlowsky2011]:

Given an $n$-sample of a $D$-part random composition, the set of Principal Balances (PB) is a set of $D-1$ balances satisfying the following conditions:

  • Each sample PB is obtained as the projection of a sample composition on a unitary composition, the balancing element associated with the PB.
  • The first PB is the balance with maximum sample variance.
  • The $i$-th PB has maximum variance conditional on its balancing element being orthogonal to the 1st, 2nd, ..., $(i-1)$-th balancing elements.
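For reference, a balance separating a group of $r$ parts $x_+$ from a group of $s$ parts $x_-$ is the normalized logratio of their geometric means,

$$b = \sqrt{\frac{rs}{r+s}} \, \ln \frac{g(x_+)}{g(x_-)},$$

where $g(\cdot)$ denotes the geometric mean of the parts in each group.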

Because of the large number of possible partitions, finding the PB of a given compositional sample is a computationally demanding task. @fernandez2017 proposed different approaches to deal with the calculation and approximation of this type of basis. coda.base implements all methods presented in [@fernandez2017].

To illustrate coda.base's functionality, we use the following dataset:

library(coda.base)
X = parliament2017[,-1]

consisting of the results obtained by the parties running in the 2017 Catalan Parliament Elections (the first, non-compositional column is dropped).
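Before computing balances, it can help to glance at the sample; a minimal look at its dimensions and first rows:

dim(X)
head(X, 3)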

Exact Principal Balances

coda.base calculates the PB with the function pb_basis(), which requires the parameter method to be set. To obtain the exact PB, users need to set method = "exact".

B1 = pb_basis(X, method = "exact")

The obtained sequential binary partition can be summarized with the following binary tree:

plot_balance(B1)
apply(coordinates(X, B1), 2, var)
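By definition, the first PB has maximum variance and each subsequent PB can capture at most as much variance as the previous one, so the variances above should be non-increasing; a quick check:

pb_var = apply(coordinates(X, B1), 2, var)
all(diff(pb_var) <= 0)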

When method is set to "exact", an exhaustive search is performed to obtain the PB. The time needed to calculate the PB grows exponentially with the number of parts of X. To iterate through all the possibilities, coda.base uses the algorithm of @ruskey1993. Currently, exhaustive search can find the PB of a compositional data set with 20 parts in a reasonable amount of time (approximately 5 minutes).
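One way to see this growth is to count the candidate balances for $D$ parts: each part can go to the numerator, the denominator, or be left out, which after discarding empty groups and symmetric duplicates gives $(3^D - 2^{D+1} + 1)/2$ balances.

n_balances = function(D) (3^D - 2^(D+1) + 1) / 2
setNames(n_balances(c(5, 10, 15, 20)), c(5, 10, 15, 20))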

library(ggplot2)

# Timing code (not run); requires the tictoc package:
# library(tictoc)
# time_ = sapply(5:20, function(D){
#   print(D)
#   x = matrix(rlnorm(100*D), ncol=D)
#   tic()
#   r = pb_basis(x, method = 'exact')
#   b = toc()
#   b$toc - b$tic
# })
# dplot = data.frame(
#   parts = tail(5:20, 11),
#   time = tail(time_, 11)
# )

dplot = data.frame(
  parts = 10:20,
  time = c(0.005, 0.012, 0.026, 0.075, 0.243, 0.832,
           2.644, 9.562, 28.787, 95.456, 311.133)
)

ggplot(data=dplot) +
  geom_point(aes(x=parts, y=time)) +
  geom_segment(aes(x=parts, xend =parts, y = 0, yend=time)) +
  labs(x = 'Number of parts', y = 'Time in seconds (logarithmic scale)', title = 'Time needed to calculate Principal Balances',
       caption = "Times measured on a MacBook Pro (13-inch, 2017),\n2.3 GHz Dual-Core Intel Core i5, 16 GB 2133 MHz LPDDR3") +
  scale_x_continuous(breaks = 10:20) +
  scale_y_continuous(trans = 'log', breaks = c(0.004, 0.04, 0.4, 4, 40, 400), labels = c("0.004", "0.04", "0.4", "4", "40", "400")) +
  theme_minimal()

When the number of parts is larger, the exact method should be discarded in favor of the alternatives described in @pawlowsky2011 and @fernandez2017. Among these, coda.base implements the agglomerative partition approach based on the Ward method [@pawlowsky2011; @fernandez2017] and an approximation based on constrained principal components [@fernandez2017].

Ward Method for Parts

For a composition $X$, @pawlowsky2011 proposed to build the variation array and use it as a distance matrix. The entry $(i,j)$ of the variation array is the variance of the logratio between parts $i$ and $j$, $t_{ij} = \mathrm{var}\left(\ln(x_i/x_j)\right)$.

D = as.dist(variation_array(X))
D
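Each entry of this matrix is therefore just the variance of a pairwise logratio; for instance, for the first two parts (a minimal check, assuming variation_array() returns exactly these variances):

var(log(X[,1] / X[,2]))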

By merging first the pairs of parts whose logratio has low variance, the final balances are expected to capture most of the variability.

B2 = pb_basis(X, method = 'cluster')
plot_balance(B2)
apply(coordinates(X, B2), 2, var)

Constrained PCs Algorithm

At each step, the constrained approach looks for the balance that best approximates the corresponding principal component, which gives a fast approximation to the PB [@fernandez2017].

B3 = pb_basis(X, method = 'constrained')
plot_balance(B3)
apply(coordinates(X, B3), 2, var)
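Since each of the three methods returns a full basis of $D-1$ balances, a simple way to compare them is the sample variance captured by their first balance; the exact method attains the maximum (a small comparison sketch):

sapply(list(exact = B1, cluster = B2, constrained = B3),
       function(B) var(coordinates(X, B)[,1]))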

# Alternative dendrogram representation of the exact PB (requires the
# ggdendro and dplyr packages)
library(ggdendro)
library(dplyr)
hc = hclust_dendrogram(B1)
hcd = as.dendrogram(hc)
dd = dendro_data(hcd)
ggdendrogram(dd)
# Flag the segments ending at a leaf to draw them with a different line type
dd$segments = dd$segments %>%
  mutate(
    end_node = yend == 0
  )

p <- ggplot(dd$segments) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend, linetype = end_node)) +
  geom_label(data = dd$labels, aes(x, y, label = label),
             hjust = 0.5, angle = 90, size = 4) +
  theme_void() + scale_linetype_discrete(guide = "none")
p

## Build tree
# Helper: return the balance columns of B in bottom-up order, starting from
# the balance in column ibalance and recursing into its two subgroups
build_tree_order = function(B, ibalance){
  balance = B[,ibalance]
  if(sum(balance != 0) == 2){
    return(ibalance)
  }
  L = NULL; R = NULL
  left_ = balance < 0
  if(sum(left_) > 1){
    L = Recall(B, which(apply((B != 0 & left_) | (B == 0 & !left_), 2, all)))
  }
  right_ = balance > 0
  if(sum(right_) > 1){
    R = Recall(B, which(apply((B != 0 & right_) | (B == 0 & !right_), 2, all)))
  }
  return(unname(c(L, R, ibalance)))
}
# Assumptions (left undefined in the original draft): B is the exact PB
# basis with parts as row names, ORDER the leaf ordering of its dendrogram
B = B1
rownames(B) = names(X)
ORDER = hc$order
ord_ = build_tree_order(B, which(apply(B != 0, 2, all)))
# Initialise the leaves with their x-coordinate, then add one internal node
# per balance, centred between its left and right subgroups
TREE = setNames(lapply(1:ncol(X), function(i) list('c' = i)), names(X)[ORDER])
for(i in ord_){
  balance = B[,i]
  name_ = paste(sort(names(balance)[balance != 0]), collapse = '_')
  name_left = paste(sort(names(balance)[balance < 0]), collapse = '_')
  name_right = paste(sort(names(balance)[balance > 0]), collapse = '_')
  l_node = list(
    list(
      name = name_,
      left = name_left,
      right = name_right,
      l = TREE[[name_left]]$c,
      r = TREE[[name_right]]$c,
      c = (TREE[[name_left]]$c + TREE[[name_right]]$c)/2
    ))
  names(l_node) = name_
  TREE = c(TREE, l_node)
}
nodes = sapply(ord_, function(i){
  balance = B[,i]
  paste(sort(names(balance)[balance != 0]), collapse = '_')
})
TREE = TREE[nodes]
library(tibble)
lapply(TREE, as_tibble) %>% bind_rows()

# Tidy summary of the tree nodes (sketch; uses H, defined below):
# lapply(seq_along(nodes), function(i){
#   name_ = nodes[i]
#   tibble(
#     name = name_,
#     name_l = TREE[[name_]]$left,
#     name_r = TREE[[name_]]$right,
#     l = TREE[[name_]]$l,
#     r = TREE[[name_]]$r,
#     c = TREE[[name_]]$c,
#     var = var(H[,ord_[i]])
#   )
# }) %>% bind_rows()


H = coordinates(X, B)
H_mean = colMeans(H)
H_var = apply(H, 2, var)

# Leaf labels in dendrogram order
names(X)[ORDER]

l_nodes = lapply(rev(ord_), function(i){
  list(
    balance = B[,i],
    mean = boot::inv.logit(H_mean[i]),
    var = H_var[i]
  )
})
names(l_nodes) = sapply(l_nodes, function(x) paste(sort(names(x$balance)[x$balance != 0]), collapse = '_'))
l_nodes

# A second example: exact principal balances of the iris measurements
library(coda.base)
X = iris[,1:4]
B = pb_basis(X, method = 'exact')
rownames(B) = names(X)
H = coordinates(X, B)
apply(H, 2, var)
plot(H[,1], H[,2])
# Build a 3-part composition from class membership densities under a
# Gaussian model; a common covariance matrix S is used for all species
lX = split(X, iris$Species)
m_ = lapply(lX, colMeans)
s_ = lapply(lX, cov)  # per-group covariances (computed but not used below)
S = cov(X)
S_ = replicate(3, S, simplify = FALSE)
Prob = mapply(function(m, s){
  mvtnorm::dmvnorm(X, m, s)
}, m_, S_)
B_1 = pb_basis(Prob, method = 'exact')
H_1 = coordinates(Prob, B_1)
plot(H_1, col = iris$Species)
B_2 = pc_basis(Prob)
H_2 = coordinates(Prob, B_2)
plot(-H_2, col = iris$Species)
# Compare with a classical discriminant projection
library(fpc)
dp = discrproj(X, iris$Species)
plot(dp$proj[,1:2], col = iris$Species)

References

@inproceedings{pawlowsky2011,
  author = {Pawlowsky-Glahn, V. and Egozcue, J. J. and Tolosana-Delgado, R.},
  title = {Principal Balances},
  booktitle = {Proceedings of the 4th International Workshop on Compositional Data Analysis (CoDaWork 2011)},
  editor = {Egozcue, J. J. and Tolosana-Delgado, R. and Ortego, M. I.},
  address = {Girona, Spain},
  pages = {1--10},
  year = {2011}
}

@article{fernandez2017,
  author = {Mart{\'\i}n-Fern{\'a}ndez, J. A. and Pawlowsky-Glahn, V. and Egozcue, J. J. and Tolosana-Delgado, R.},
  title = {Advances in Principal Balances for Compositional Data},
  journal = {Mathematical Geosciences},
  volume = {50},
  number = {3},
  pages = {273--298},
  year = {2018},
  doi = {10.1007/s11004-017-9712-z}
}

@article{ruskey1993,
  author = {Ruskey, F.},
  title = {Simple Combinatorial Gray Codes Constructed by Reversing Sublists},
  journal = {Lecture Notes in Computer Science},
  volume = {762},
  pages = {201--208},
  year = {1993}
}

