knitr::opts_chunk$set(fig.width = 7, fig.height = 5, message = FALSE, warning = FALSE)
Steps:
``` {r echo = TRUE} library(idiosyncratics) library(tidyverse) library(MLmetrics) # for root-mean-squared-error (RMSE) calculation.
## Step 1: Create the fake data. ``` {r echo = TRUE} # Use three datasets to see if the results differ for different datasets. l <- 4 tbl <- tibble(N = floor(seq(100, 1000, length.out = l)), T = rep_len(50, l)) # Keep all the fake datasets in one dataframe so it's easy # to apply operations to all of them at once. a <- bind_cols(tbl, tbl %>% { map2(.$N, .$T, .f = fakedata::fake_panel, I = 2, # These are all arguments to the fake_panel function mean_sales = 30, sd_sales = 0.75 * 30, model = list(order = c(1,0,1), ar = 0.01, ma = 0.01), missing = 0) } %>% enframe()) %>% select(-name) %>% rename(data = value) # Add aggregate growth rate to the datasets. add_aggregate <- . %>% group_by(year) %>% mutate(g_a = sum(w * g, na.rm = TRUE)) # Clean the data: b <- a %>% mutate(data = data %>% map(rename, industry = I)) %>% mutate(data = map(data, add_weights, x_var = sales)) %>% mutate(data = map(data, lag_panel, x_var = w)) %>% mutate(data = map(data, add_aggregate)) %>% mutate(data = map(data, add_quantiles, g_var = 'g')) %>% mutate(data = map(data, drop_outliers, g_var = 'g', method = 'drop'))
At the end a given year, suppose we want to estimate the aggregate growth rate for that year. Our strategy is to ask a small sample of firms in Canada what each one of their growth rates were, and then use that to estimate the aggregate growth rate.
We know that firm sizes in the economy are skewed, which means large firms have a much larger effect on the aggregate growth rate, so maybe if we select the largest $k$ firms we'll get a good estimate of the aggregate growth rate. How do we choose $k$?
First, write a function that selects the top $k$ firms in a given year, for a given k, and use that sample calculate the weighted average growth rate in that year. Compare that estimate to the actual aggregate growth rate in that year. (Note: the $k$ sample doesn't have to be the same set in every year---the idea is, for a given year, pick the firms that were in the top $k$ in size and use those to predict next year. The sample doesn't need to have any memory of previous firms selected.)
The metric I use to evaluate the predictions is root-mean-squared-error (RMSE), relative to the RMSE of a naive model that uses the previous year's aggregate growth as an estimate of this year's.
``` {r echo = TRUE}
slice_k <- function(k, tbl, weight = TRUE, relative = TRUE, ...) { grouping <- quos(...)
tbl <- tbl %>% group_by(!!!grouping) %>% # group filter(!is.na(w)) %>% # filter NA weights arrange(year, -w) %>% # sort by weight, descending slice(1:k) %>% # pick top k by group group_by(year) %>% # group by year to calculate aggregate growth rates mutate(w = w / sum(w, na.rm = TRUE)) %>% # re-weight summarize(g = ifelse(weight == TRUE, sum(w * g, na.rm = TRUE), mean(g, na.rm = TRUE)), g_a = mean(g_a, na.rm = TRUE)) # aggregate growth rate.
# Return the MAE relative to a model that uses last year's # aggregate growth rate to predict this year's # Change this to use MAE instead. function MAE() from MLmetrics mae <- MAE(tbl$g, tbl$g_a) if (relative) { mae <- mae / MAE(lag(tbl$g_a)[-1], tbl$g_a[-1]) } mae }
slice_1_k <- function(data, relative = TRUE) {
k <- 10 # size of the sample.
l <- 1:k
names(l) <- l
map_dbl(l, slice_k, tbl = data, weight = TRUE, relative = relative, year) %>% enframe()
}
c <- b %>% mutate(rmae = map(data, slice_1_k, relative = TRUE)) c
## Step 4: Graph the results: ``` {r echo = TRUE} gr <- c[, c('N','T','rmae')] %>% unnest() %>% # unnest converts the list-col dataframes into the regular long format we know mutate(k = factor(as.numeric(name)), rmae = value, N = factor(N)) ggplot(gr, aes(x = k, y = rmae, colour = N, group = N)) + geom_line() + geom_point()
Relative to the naive model (using last year's growth rate to predict this year's), picking the top $k$ firms each year seems to work slightly better, for $k>1$ (with this fake data), depending on the details of the dataset.
For a certain $k$, say $k=10$, the mean absolute error is:
{r echo = TRUE}
b %>%
mutate(mae = map_dbl(data, slice_k,
k = 10, weight = TRUE,
relative = FALSE, year) %>% round(4)) %>%
select(N, T, mae)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.