In tweed1e/idiosyncratics:

knitr::opts_chunk$set(fig.width = 7, fig.height = 5, message = FALSE, warning = FALSE)

Steps:

Create fake firm data with some time series properties
Process the data:
- add (lagged) firm weights,
- add growth rates,
- add upper/lower growth rate quantiles,
- deal with outliers based on those quantiles (drop or winsorize)
- demean the data (find idiosyncratic shocks $e_{it}$)
Pick subsample, predict aggregate growth, evaluate
- given $k$, pick $k$ largest firms in each year
- calculate aggregate growth rate for those firms
- use that to predict aggregate growth
- evaluate using root-mean-squared-error (RMSE)
Compare models for a range of $k$s and datasets

Step 0: Load the libraries.

``` {r echo = TRUE} library(idiosyncratics) library(tidyverse) library(MLmetrics) # for root-mean-squared-error (RMSE) calculation.

via devtools::install_github('tweed1e/fakedata')

library(fakedata)

## Step 1: Create the fake data.
``` {r echo = TRUE}
# Use three datasets to see if the results differ for different datasets.
l <- 4
tbl <- tibble(N = floor(seq(100, 1000, length.out = l)), 
              T = rep_len(50, l))

# Keep all the fake datasets in one dataframe so it's easy 
# to apply operations to all of them at once.
a <- bind_cols(tbl, tbl %>% { 
  map2(.$N, .$T, 
       .f = fakedata::fake_panel, 
                I = 2, # These are all arguments to the fake_panel function
                mean_sales = 30, 
                sd_sales = 0.75 * 30, 
                model = list(order = c(1,0,1), ar = 0.01, ma = 0.01), 
                missing = 0)
  } %>% enframe()) %>% select(-name) %>% rename(data = value)

# Add aggregate growth rate to the datasets.
add_aggregate <- . %>% group_by(year) %>% mutate(g_a = sum(w * g, na.rm = TRUE))

# Clean the data:
b <- a %>%
  mutate(data = data %>% map(rename, industry = I)) %>%
  mutate(data = map(data, add_weights, x_var = sales)) %>%
  mutate(data = map(data, lag_panel, x_var = w)) %>%
  mutate(data = map(data, add_aggregate)) %>%
  mutate(data = map(data, add_quantiles, g_var = 'g')) %>%
  mutate(data = map(data, drop_outliers, g_var = 'g', method = 'drop'))

Step 3: Pick subsamples and evaluate their predictions

At the end a given year, suppose we want to estimate the aggregate growth rate for that year. Our strategy is to ask a small sample of firms in Canada what each one of their growth rates were, and then use that to estimate the aggregate growth rate.

We know that firm sizes in the economy are skewed, which means large firms have a much larger effect on the aggregate growth rate, so maybe if we select the largest $k$ firms we'll get a good estimate of the aggregate growth rate. How do we choose $k$?

First, write a function that selects the top $k$ firms in a given year, for a given k, and use that sample calculate the weighted average growth rate in that year. Compare that estimate to the actual aggregate growth rate in that year. (Note: the $k$ sample doesn't have to be the same set in every year---the idea is, for a given year, pick the firms that were in the top $k$ in size and use those to predict next year. The sample doesn't need to have any memory of previous firms selected.)

The metric I use to evaluate the predictions is root-mean-squared-error (RMSE), relative to the RMSE of a naive model that uses the previous year's aggregate growth as an estimate of this year's.

``` {r echo = TRUE}

Function to calculate relative MAE

slice_k <- function(k, tbl, weight = TRUE, relative = TRUE, ...) { grouping <- quos(...)

tbl <- tbl %>% group_by(!!!grouping) %>% # group filter(!is.na(w)) %>% # filter NA weights arrange(year, -w) %>% # sort by weight, descending slice(1:k) %>% # pick top k by group group_by(year) %>% # group by year to calculate aggregate growth rates mutate(w = w / sum(w, na.rm = TRUE)) %>% # re-weight summarize(g = ifelse(weight == TRUE, sum(w * g, na.rm = TRUE), mean(g, na.rm = TRUE)), g_a = mean(g_a, na.rm = TRUE)) # aggregate growth rate.

# Return the MAE relative to a model that uses last year's # aggregate growth rate to predict this year's # Change this to use MAE instead. function MAE() from MLmetrics mae <- MAE(tbl$g, tbl$g_a) if (relative) { mae <- mae / MAE(lag(tbl$g_a)[-1], tbl$g_a[-1]) } mae }

A wrapper function to loop over k for each dataset, and return a dataframe

slice_1_k <- function(data, relative = TRUE) { k <- 10 # size of the sample. l <- 1:k
names(l) <- l map_dbl(l, slice_k, tbl = data, weight = TRUE, relative = relative, year) %>% enframe() }

add the relative RMSE to each dataset, for each k from 1...k.

c <- b %>% mutate(rmae = map(data, slice_1_k, relative = TRUE)) c

## Step 4: Graph the results:

``` {r echo = TRUE}
gr <- c[, c('N','T','rmae')] %>% 
  unnest() %>% # unnest converts the list-col dataframes into the regular long format we know
  mutate(k = factor(as.numeric(name)), rmae = value, N = factor(N)) 
ggplot(gr, aes(x = k, y = rmae, colour = N, group = N)) + 
  geom_line() + 
  geom_point()

Relative to the naive model (using last year's growth rate to predict this year's), picking the top $k$ firms each year seems to work slightly better, for $k>1$ (with this fake data), depending on the details of the dataset.

For a certain $k$, say $k=10$, the mean absolute error is: {r echo = TRUE} b %>% mutate(mae = map_dbl(data, slice_k, k = 10, weight = TRUE, relative = FALSE, year) %>% round(4)) %>% select(N, T, mae)

tweed1e/idiosyncratics documentation built on May 29, 2019, 10:51 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com