# Monte-Carlo Methods for Prediction Functions In mmpf: Monte-Carlo Methods for Prediction Functions

```knitr::opts_chunk\$set(error = TRUE)
```

This packages allows you to to marginalize arbitrary prediction functions using Monte-Carlo integration. Since many prediction functions cannot be easily decomposed into a sum of low dimensional components marginalization can be helpful in making these functions interpretable.

`marginalPrediction` does this computation and then evaluates the marginalized function at a set grid points, which can be uniformly created, subsampled from the training data, or explicitly specified via the `points` argument.

The create of a uniform grid is handled by the `uniformGrid` method. If `uniform = FALSE` and the `points` argument isn't used to specify what points to evaluate, a sample of size `n` is taken from the data without replacement.

The function is integrated against a sample of size `n` taken without replacement from the `data`. The argument `int.points` can be used to override this (in which case you can specify `n = NA`). `int.points` is a vector of integerish indices which specify rows of the `data` to use instead.

```library(mmpf)
library(randomForest)
library(ggplot2)
library(reshape2)

data(swiss)

fit = randomForest(Fertility ~ ., swiss)
mp = marginalPrediction(swiss[, -1], "Education", c(10, nrow(swiss)), fit)
mp

ggplot(data.frame(mp), aes(Education, preds)) + geom_point() + geom_line()
```

The output of `marginalPrediction` is a `data.table` which contains the marginalized predictions and the grid points of the `vars`.

By default the Monte-Carlo expectation is computed, which is set by the `aggregate.fun` argument's default value, the `mean` function. Substituting, say, the median, would give a different output.

By passing the identity function to `aggregate.fun`, which simply returns its input exactly, the integration points are returned directly so that the `prediction` element of the return is a matrix of dimension `n`. `n`, although it is an argument, can be larger or smaller depending on the interaction between the input arguments `n` and `data`. For example if a uniform grid of size 10 is requested (via `n`) from a factor with only 5 levels, a uniform grid of size 5 is created. If `vars` is a vector of length greater than 1, then `n` becomes the size of the Cartesian product of the grids created for each element of `vars`, which can be at most `n^length(vars)`.

```mp = marginalPrediction(swiss[, -1], "Education", c(10, 5), fit, aggregate.fun = identity)
mp

ggplot(melt(data.frame(mp), id.vars = "Education"), aes(Education, value, group = variable)) + geom_point() + geom_line()
````

`predict.fun` specifies a prediction function to apply to the `model` argument. This function must take two arguments, `object` (where `model` is inserted) and `newdata`, which is a `data.frame` to compute predictions on, which is generated internally and is controlled by the other arguments. This allows `marginalPrediction` to handle cases in which predictions for a single data point are vector-valued. That is, classification tasks where probabilities are output, and multivariate regression and/or classification. In these cases `aggregate.fun` is applied separately to each column of the prediction matrix. `aggregate.fun` must take one argument `x`, a vector output from `predict.fun` and return a vector of no greater dimension than that of `x`.

```r
data(iris)

fit = randomForest(Species ~ ., iris)
mp = marginalPrediction(iris[, -ncol(iris)], "Petal.Width", c(10, 25), fit,
predict.fun = function(object, newdata) predict(object, newdata = newdata, type = "prob"))
mp

plt = melt(data.frame(mp), id.vars = "Petal.Width", variable.name = "class",
value.name = "prob")

ggplot(plt, aes(Petal.Width, prob, color = class)) + geom_line() + geom_point()
```

As mentioned before, `vars` can include multiple variables.

```mp = marginalPrediction(iris[, -ncol(iris)], c("Petal.Width", "Petal.Length"), c(10, 25), fit,
predict.fun = function(object, newdata) predict(object, newdata = newdata, type = "prob"))
mp

plt = melt(data.frame(mp), id.vars = c("Petal.Width", "Petal.Length"),
variable.name = "class", value.name = "prob")

ggplot(plt, aes(Petal.Width, Petal.Length, fill = prob)) + geom_raster() + facet_wrap(~ class)
```

Permutation importance is a Monte-Carlo method which estimates the importance of variables in determining predictions by computing the change in prediction error from repeatedly permuting the values of those variables.

`permutationImportance` can compute this type of importance under arbitrary loss functions and contrast (between the loss with the unpermuted and permuted data).

```permutationImportance(iris, "Sepal.Width", "Species", fit)
```

For methods which generate predictions which are characters or unordered factors, the default loss function is the mean misclassification error. For all other types of predictions mean squared error is used.

It is, for example, possible to compute the expected change in the mean misclassification rate by class. The two arguments to `loss.fun` are the permuted predictions and the target variable. In this case they are both vectors of factors.

`contrast.fun` takes the output of `loss.fun` on both the permuted and unpermuted predictions (`x` corresponds to the permuted predictions and `y` the unpermuted predictions).

This can, for example, be used to compute the mean misclassification error change on a per-class basis.

```permutationImportance(iris, "Sepal.Width", "Species", fit,
loss.fun = function(x, y) {
mat = table(x, y)
n = colSums(mat)
diag(mat) = 0
rowSums(mat) / n
},
contrast.fun = function(x, y) x - y)
```

## Try the mmpf package in your browser

Any scripts or data that you put into this service are public.

mmpf documentation built on Oct. 24, 2018, 9:04 a.m.