```{r}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The purpose of this vignette is to provide an overview of the prediction.game package. I will also intersperse notes about future plans, for the reference of both potential collaborators and myself.
The only function is play(), which allows for playing the "prediction game," especially in class. Imagine that we are having a contest to determine who can come up with a number which is closest to a random draw from the mpg variable in the mtcars data frame. (There are `r length(mtcars$mpg)` values for mpg, with a mean of `r round(mean(mtcars$mpg), 1)`.)
```{r}
suppressMessages(library(tidyverse))
library(prediction.game)

play(n = 1000, guess_1 = 20.1, guess_2 = 19.2,
     formula = ~ sample(mtcars$mpg, size = 1)) %>%
  count(winner)
```
You might think that the mean should win more contests than some other number but, as you can see above, 19.2 is a "better" guess, meaning that it wins more contests. The "trick" is that mpg has a right skew, with a median, at `r median(mtcars$mpg)`, lower than the mean. And, for getting closest to a random draw from a distribution, the median is better than the mean, when the two differ.
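A quick base R check, outside the game itself, of why this happens: a (slight) majority of the 32 mpg values lie closer to 19.2 than to 20.1, so 19.2 wins most single-draw contests.

```{r}
# Fraction of mpg values closer to 19.2 (the median) than to 20.1 (roughly the mean).
mean(abs(mtcars$mpg - 19.2) < abs(mtcars$mpg - 20.1))
```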
At the other extreme, we can "sample" the entire distribution.
```{r}
play(n = 1000, guess_1 = 20.1, guess_2 = 19.2,
     formula = ~ mean(sample(mtcars$mpg, size = 32))) %>%
  count(winner)
```
guess_1 wins them all. This is unsurprising: with size = 32 we sample the entire vector without replacement, so every one of the 1,000 experiments produces the same value, the mean. Things get more interesting when we sample somewhere between 1 and 32 observations.
```{r}
play(n = 1000, guess_1 = 20.1, guess_2 = 19.2,
     formula = ~ mean(sample(mtcars$mpg, size = 5))) %>%
  count(winner)
```
Interesting. If the prediction game involves just sampling one value, the median is better. If we sample them all, the mean is (obviously) best. But, if we are taking the average of a sample of size 5, the mean is better, although not by much. If, instead of taking a sample of size 5, we shrink it to size 2, the median is a better guess.
```{r}
play(n = 1000, guess_1 = 20.1, guess_2 = 19.2,
     formula = ~ mean(sample(mtcars$mpg, size = 2))) %>%
  count(winner)
```
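To see roughly where the crossover happens, here is a base R sketch, independent of play(), that repeats the same contest (which guess is closer to the mean of a random sample?) across a few sample sizes:

```{r}
# Share of contests won by guess_1 (20.1) at several sample sizes.
# Larger samples pull the sample mean toward 20.1; a single draw favors the median.
set.seed(7)
sizes <- c(1, 2, 5, 10, 32)
win_share <- sapply(sizes, function(k) {
  draws <- replicate(1000, mean(sample(mtcars$mpg, size = k)))
  mean(abs(draws - 20.1) < abs(draws - 19.2))
})
setNames(win_share, sizes)
```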
Need a function, show(), which takes, via %>%, the result of play() and creates a cool animation using d3rain.
The loss function which decides which guess wins is very limited. First, it requires that the formula return a single value to which it can compare the guesses. Do I need to fix this? Soon? Second, the loss function is currently hard-coded as the absolute value of the differences.
We want to generalize this somehow by considering one guess to be better than another, not just because it wins a majority of distinct contests, but because the sum of some increasing function of its errors is smaller. For example, the median is a better guess if we are considering the sum of absolute errors, while the mean is better for the sum of squared errors. How can we implement that case? I think we would need to allow the formula to return a vector, and then ensure that the loss function can deal with a vector. Is that really so hard? Want to be able to pass in a loss function . . .
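As a quick illustration of that distinction in plain base R (using the full mpg vector as the target), the median-based guess wins under the sum of absolute errors, while the guess near the mean wins under the sum of squared errors:

```{r}
# Sum of absolute errors: smaller for 19.2, the median.
sum(abs(mtcars$mpg - 19.2))
sum(abs(mtcars$mpg - 20.1))

# Sum of squared errors: smaller for 20.1, which is close to the mean.
sum((mtcars$mpg - 19.2)^2)
sum((mtcars$mpg - 20.1)^2)
```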
Should we connect all this to bootstrapping? What we have, implicitly (?), is a competition over sort-of bootstrapped samples . . .
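For what it's worth, a true bootstrap flavor would presumably just resample with replacement, something like the following (the only change from the chunks above is replace = TRUE in sample()):

```{r}
# Bootstrap-style contest: resample all 32 values *with* replacement,
# so the mean of each resample varies from experiment to experiment.
play(n = 1000, guess_1 = 20.1, guess_2 = 19.2,
     formula = ~ mean(sample(mtcars$mpg, size = 32, replace = TRUE))) %>%
  count(winner)
```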
Should we connect all this to the concept of cross-validation? That is, the same (sort of?) framework as what we have here --- lots of experiments, each using a different sample of the data --- is like cross-validation. Instead of taking a sample, you divide the data into 10 folds of training and test data, run the two competing processes on the training data, and then compare how they do on the test data.
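Here is a rough base R sketch of that idea, not using the package at all; the 10-fold split and the mean/median "processes" are just for illustration.

```{r}
# Cross-validation flavor of the contest: "fit" the mean and the median on
# the training folds, then compare mean absolute error on the held-out fold.
set.seed(42)
folds <- sample(rep(1:10, length.out = nrow(mtcars)))

cv_errors <- sapply(1:10, function(k) {
  train <- mtcars$mpg[folds != k]
  test  <- mtcars$mpg[folds == k]
  c(mean_guess   = mean(abs(test - mean(train))),
    median_guess = mean(abs(test - median(train))))
})

rowMeans(cv_errors)
```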
Do we need to allow the guesses to be formulas? Not today. Calculate your guess however you want, using whatever statistical tricks you want. But, at that point, your guess is just a number, as is mine, and we are going to see who wins. Any randomness comes from the formula.
Maybe this should be a Shiny app? Would then need to pre-load all the data sets we care about.