Here are a few use cases that should be demonstrated in the paper.
Machine learning models tend to produce a single value, typically a mean, to use as a prediction. With distplyr, we can easily produce a predictive distribution. By way of demonstration, consider a kernel smoothing method with one predictor -- although the concept extends to multiple predictors and other machine learning methods.
Here is a predictive distribution of a penguin's flipper length given that its bill length is 30mm, using a Gaussian kernel with a standard deviation of 2.5mm. Here's the cdf and a 90% prediction interval:
``` r
library(palmerpenguins)
library(distplyr)
yhat <- dst_empirical(flipper_length_mm, data = penguins,
                      weights = dnorm(bill_length_mm - 30, sd = 2.5))
plot(yhat, "cdf")
```
``` r
eval_quantile(yhat, at = c(0.05, 0.95))
#> [1] 178 198
```
Created on 2021-07-02 by the reprex package (v0.3.0)
TO DO: a similar demonstration using the `mtcars` dataset.
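As a starting point, here's a minimal sketch of what that could look like, following the `dst_empirical()` call above (the choice of `mpg` vs. `disp`, the evaluation point of 200, and the kernel SD of 25 are arbitrary placeholders):

``` r
# Sketch: the same kernel-weighted idea as above, applied to mtcars.
# Predictive distribution of mpg for a car with disp = 200;
# the evaluation point and kernel SD are arbitrary choices.
yhat_mt <- dst_empirical(mpg, data = mtcars,
                         weights = dnorm(disp - 200, sd = 25))
plot(yhat_mt, "cdf")
```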
I've encountered a situation where a company's first priority was to predict median house price as accurately as possible, so they fit a machine learning model that predicted the median. By way of demonstration, we can do the same thing with a simple machine learning method like kNN (which takes the k nearest neighbours to some x value of interest, and then takes the median y value). We don't have to worry about optimizing k, because it's just a demo.
TO DO (this part does not require distplyr):
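Here's a minimal base-R sketch of that kNN idea (the `mtcars` data, k = 5, and the evaluation point are arbitrary choices for illustration):

``` r
# Sketch: kNN median prediction in base R (no distplyr needed).
# Predict the median mpg among the k cars whose disp is nearest to x0.
k  <- 5
x0 <- 200
nbrs <- order(abs(mtcars$disp - x0))[seq_len(k)]  # indices of k nearest points
median(mtcars$mpg[nbrs])                          # kNN prediction of the median
```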
As a secondary goal, the company wanted a prediction interval, and it was determined that fitting a lognormal distribution for Y (given X) was desirable. Although distplyr currently isn't fully up to this task, it can handle a simple version: the case where the second parameter of the lognormal distribution (the log-scale variance) is constant, which we can estimate as the variance of the residuals on the log scale. Here's roughly how:
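A rough sketch of the idea, assuming a lognormal constructor along the lines of `dst_lnorm()` (named here by analogy with `dst_pois()`; check the actual distplyr API), and hypothetical `houses` data with `price` and `sqft` columns:

``` r
# Sketch: lognormal predictive distribution with constant log-scale variance.
# `houses`, `price`, and `sqft` are hypothetical stand-ins.
fit <- lm(log(price) ~ sqft, data = houses)  # model for the mean of log(Y)
s   <- sd(residuals(fit))                    # constant log-scale SD
mu  <- predict(fit, newdata = data.frame(sqft = 1500))
yhat <- dst_lnorm(meanlog = mu, sdlog = s)   # assumed constructor and signature
eval_quantile(yhat, at = c(0.05, 0.95))      # 90% prediction interval
```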
TO DO: use `geom_ribbon()` to plot a 90% prediction interval.

There's a technique called Poisson regression, useful whenever the Y variable is a count variable. It fits the mean of Y as an exponential function of X, while also assuming that the distribution of Y (given X) is a Poisson distribution. To demonstrate a predictive distribution here, we'd just have to use a Poisson regression model to make predictions of the mean, and plug those mean predictions into `dst_pois()`.
TO DO: demonstrate with the `glm()` function:

``` r
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome <- gl(3, 1, 9)
treatment <- gl(3, 3)
data.frame(treatment, outcome, counts) # showing data
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
```
We'd just have to call the `predict()` function with `type = "response"` as an argument to get mean predictions, which can then go into `dst_pois()`. Again, demonstrate maybe two predictive distributions (maybe just by plotting their PMFs).
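For instance, a sketch continuing from `glm.D93` above (this assumes `dst_pois()` takes the mean as its argument, and that the `plot()` method accepts `"pmf"` the way it accepts `"cdf"` earlier):

``` r
# Sketch: one predictive Poisson distribution per observation.
dat <- data.frame(treatment, outcome, counts)
lambda_hat <- predict(glm.D93, newdata = dat[1:2, ], type = "response")
yhat1 <- dst_pois(lambda_hat[1])
yhat2 <- dst_pois(lambda_hat[2])
plot(yhat1, "pmf")  # "pmf" assumed to be accepted, like "cdf" above
plot(yhat2, "pmf")
```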
Some other ideas: `mix()`ing the predictive distributions contained in the bin being predicted on. (I encountered this in a house price prediction problem: training data had exact X values -- something like square footage or list price -- but sometimes only a "bin" was available for prediction, such as "square footage between 1000 and 2000".)
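A sketch of that `mix()` idea (the hypothetical `houses` data with `price` and `sqft`, the representative points, the kernel SD, and `mix()`'s exact interface are all assumptions):

``` r
# Sketch: predictive distribution for a binned X, by mixing the
# kernel-weighted predictive distributions at a few representative
# points inside the bin.
xs <- c(1250, 1500, 1750)  # representative square footages in (1000, 2000)
components <- lapply(xs, function(x0) {
  dst_empirical(price, data = houses,
                weights = dnorm(sqft - x0, sd = 100))
})
yhat_bin <- do.call(mix, components)  # exact mix() signature may differ
plot(yhat_bin, "cdf")
```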