knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Simeng Shao

February 25, 2021

The hat package implements a hierarchical aggregation testing algorithm, as described in the paper "Controlling the False Split Rate in Tree-Based Aggregation". The package provides a hierarchical testing function that determines how to aggregate leaves while controlling the false split rate (FSR), an error measure defined in that paper. There are two primary use cases considered in the paper:

  1. aggregating observations with the same means and

  2. aggregating features with the same coefficients in linear regression.

We provide examples of these two use cases and then demonstrate a more general example of the hierarchical aggregation procedure.

Application to observation aggregation

Brief look at the data

This section is an example of using the package to aggregate observations based on their means. We compute the volatility of daily stock price data derived from the US Stock Database ©2021 Center for Research in Security Prices (CRSP), The University of Chicago Booth School of Business @CRSP.

The tree structure is constructed from the North American Industry Classification System (NAICS), an industry classification system that uses a six-digit code: the first two digits designate the largest sector; the third, fourth, fifth, and sixth digits designate the subsector, industry group, industry, and national industry, respectively.
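
To make this nesting concrete, here is how an illustrative six-digit code decomposes level by level (the code value is only an example, not necessarily one present in the data):

naics <- "511210"    # an illustrative NAICS code
substr(naics, 1, 2)  # sector
substr(naics, 1, 3)  # subsector
substr(naics, 1, 4)  # industry group
substr(naics, 1, 5)  # industry
substr(naics, 1, 6)  # national industry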

We can load the volatility data and the pre-built tree structure into our environment:

library(hat)
nrow(stocks_volatility)
colnames(stocks_volatility)

The stocks_volatility data set contains `r nrow(stocks_volatility)` stocks, with each company's name and its volatility over the 5-year period.

The distribution of volatility is right-skewed, so in our analysis we take the logarithm of volatility to reduce the skewness.
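
As a quick sanity check (a sketch using base R; the number of breaks is arbitrary), one can compare the two distributions:

hist(stocks_volatility$volatility, breaks = 50, main = "Volatility", xlab = "volatility")
hist(log(stocks_volatility$volatility), breaks = 50, main = "Log volatility", xlab = "log(volatility)")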

The tree structure is saved in stocks_tree, which is formed according to the NAICS hierarchy. We could look at the tree structure with the function plot_tree(); we omit this for now because the tree reappears very soon.
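
For reference, the omitted call would simply be:

# plot_tree(stocks_tree)  # omitted; the same tree appears in plot_aggregation below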

Aggregating observations

The idea of aggregation is based on the fact that some companies might share the same mean stock volatility if they are "similar enough" in the tree. We assume the model

$$ y_i = \theta_{k(i)} + \epsilon_i, \qquad k(i) \in \{1, \dots, K\}, \; i \in \{1, \dots, p\}, $$

where, in the stocks example, $y_i$ is the volatility of the $i$-th stock. We will use the tree as a guide as it describes the similarity among companies and therefore helps narrow down reasonable aggregations to consider.
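
As a toy instance of this model (illustrative values only, unrelated to the stocks data):

k_of_i <- c(1, 1, 1, 2, 2, 2)       # group membership k(i) for p = 6 leaves
theta  <- c(-1, 2)                  # K = 2 distinct group means
y_toy  <- theta[k_of_i] + rnorm(6)  # observed values y_i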

Our algorithm achieves aggregation in two main steps:

  1. Generate a p-value by an ANOVA test for each interior node (a minimal sketch of such a test follows this list).

  2. Use the hierarchical aggregation testing procedure to sequentially go down the tree in a fashion that will control the FSR of the overall aggregation of leaves that is produced.
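
The following sketch is not the package's internal code; it only illustrates the kind of nodewise test that step 1 refers to, using base R's oneway.test on made-up values:

# For one interior node: do the leaves under its children share a mean?
anova_pvalue <- function(y_values, child_labels) {
  oneway.test(y_values ~ factor(child_labels), var.equal = TRUE)$p.value
}
# Two children whose leaves have clearly different means give a small p-value:
anova_pvalue(c(1.0, 1.2, 0.9, 3.1, 2.8), c(1, 1, 1, 2, 2))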

The function aggregate_observations performs both of these steps:

result = aggregate_observations(y = log(stocks_volatility$volatility), 
                       sigma = NULL, 
                       tree = stocks_tree, 
                       alpha = 0.4)

The output result is a list with four components: alpha, the target FSR level; groups, the group assignments of the observations; rejections, indicating whether each interior node was rejected (this comes in handy for plotting the aggregation result on a tree); and p_vals, all the p-values computed internally.
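
We can inspect this structure with standard R tools:

str(result, max.level = 1)  # the four components listed above
head(result$groups)         # group labels for the first few stocks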

In this example, the achieved aggregation contains `r length(unique(result$groups))` groups. We plot the achieved aggregation on the tree using the function plot_aggregation (a bit busy, since there are more than 2,000 stocks...).

plot_aggregation(rejections = result$rejections, tree = stocks_tree)

Application to feature aggregation

set.seed(1234)
n = 100  # number of observations
p = 50   # number of features
k = 10   # number of true coefficient groups
hc = hclust(dist((1:p) + runif(p)/p), method = "complete")  # tree over the p features
groups = cutree(hc, k)            # true group of each feature
coeffs = runif(k, 0, 18)[groups]  # one coefficient value per group
X = matrix(rnorm(p * n), nrow = n, ncol = p)
X = X * matrix(rbinom(n * p, size = 1, prob = 0.6), nrow = n, ncol = p)  # sparsify entries
y = X %*% coeffs + rnorm(n)       # response from the linear model

For this application, we synthesize data for feature aggregation. We randomly generate a tree with `r p` leaves by hierarchical clustering and cut it into `r k` disjoint groups. We then simulate `r k` coefficient values corresponding to the `r k` groups. To mimic the scenario of rare features (@Yan2018RareFS), we generate a design matrix $X$ from a Gaussian-Bernoulli distribution. Finally, we form the response vector $y$ from a linear model with Gaussian noise.
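
A quick look at the simulated ground truth:

table(groups)           # sizes of the true feature groups
length(unique(coeffs))  # number of distinct coefficient values (equals k)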

To perform aggregation, we use the function aggregate_features. Its inputs are the response y, the design matrix X, the tree, and the target FSR level alpha:

agg_result = aggregate_features(y = y, X = X,
                          tree = hc, alpha = 0.2)

To calculate the false split proportion (FSP) of the achieved aggregation against the true grouping, we can use

calculate_fsp(theta_est = agg_result$groups, theta_true = groups)
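
As a complementary check, a cross-tabulation of estimated against true groups shows how the achieved aggregation aligns with the truth:

table(estimated = agg_result$groups, true = groups)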

General example

The functions aggregate_observations and aggregate_features shown above both make use of the hierarchical aggregation testing framework introduced in the paper "Controlling the False Split Rate in Tree-Based Aggregation". Internally, both functions call a more general function named hierarchical_test. If a user has a different way to construct nodewise p-values for aggregation, then the user can call hierarchical_test directly, as shown here.

This function takes a vector of p-values whose length equals the number of interior nodes of the tree.

Here is a small example:

set.seed(123)
n <- 20
hc <- hclust(dist(runif(n)))
de <- as.dendrogram(hc)
hc_list <- dend_as_hclist(de)
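
Before writing the p-value function, it helps to peek at hc_list; in this representation (which the check in compute_my_pvalue below relies on), negative entries of a node's element denote leaves:

length(hc_list)  # number of interior nodes
hc_list[[1]]     # children of the first node; negative entries are leaves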

compute_my_pvalue <- function(u) {
  # Example of a customized p-value function; u indexes an element of hc_list.
  # In practice one would compute something more interesting than this:
  # if the node has a leaf as a child (a negative entry), return 1
  # (do not reject); otherwise return 0 (reject).
  ifelse(any(hc_list[[u]] < 0), 1, 0)
}

result_small <- hierarchical_test(tree = hc_list,
                                  p_vals = sapply(seq_along(hc_list), compute_my_pvalue),
                                  alpha = 0.4,
                                  independent = FALSE)

plot_aggregation(rejections = result_small$rejections, tree = hc_list)

References


