#' match_forecast_model
#'
#' Builds match_forecast_model to forecast match outcomes by performing bagging
#' using two level modeling. The approach takes into account the uncertainty in the inputs
#' to avoid overconfidence of ML models. Since the inputs to a match outcome prediction
#' problem are often uncertain the approach can enhance the predictive performance of ML models.
#'
#'
#'
#' For detailed explanation see the introduction vignette by running:
#' \code{vignette("introduction", package = "matchForecast")}
#'
#' @export
#'
#' @param data \code{data.frame} in \strong{match-objects}
#' format required by this package. Columns have to
#' include object ID's in the format: \strong{ID_<number>} and attributes in
#' the format \strong{<attr_name>_<number>}. <number> suffix connects
#' objects to attributes. The suffixed number of corresponding object and
#' attributes should be the same. Column for the outcome (dependant variable)
#' of the match should be denoted with \strong{y}. It can also include an
#' attribute \strong{TIME}.
#'
#' An example of the columns for basketball data:
#' \tabular{ccccccccc}{
#' TIME \tab y \tab ID_1 \tab ID_2 \tab P2M_1 \tab P3M_1 \tab P2M_2 \tab P3M_2 \tab.. \cr
#' }
#' In the basketball case ID_1 refers to ID of the home team and P2M_1, P3M_1 to the
#' attributes of the home team. ID_2 refers to the ID of away team and P2M_2, P3M_2 to its
#' attributes. y is the outcome of the match.
#'
#' @param input_model_specification Specifies the parametric assumptions of object
#' attributes. We can model attributes of objects as independent with uivariate models or as
#' dependent with multivariate models.
#'
#' Four univariate Bayesian models are supported out of the box: \itemize{
#' \item Poisson (Poisson-Gamma model), appropriate for count data,
#' see \code{\link{input_model_poisson}}
#' \item Bernoulli (Bernoulli-Beta model), appropriate for binary data,
#' see \code{\link{input_model_bernoulli}}
#' \item Normal (Normal model with sample variance), appropriate for numeric data when
#' modeling only mean, see \code{\link{input_model_normal}}
#' \item normal (Normal-Inverse Gamma model), appropriate for jointly modeling mean and variance,
#' see \code{\link{input_model_normal_ig}}
#' }
#'
#' One multivariate Bayesian model is supported:
#' Multivariate normal model with inverse Wishart prior for covariance matrix
#' (see \code{\link{input_model_mvnormal_iw}}). It is appropriate for jointly modeling numeric
#' attributes of objects. A custom model can also be used if a function is passed as
#' \code{input_model_specification} parameter. See the source code of other input models
#' for an example of how to write a custom model.
#'
#' \strong{How to invoke the models described above?}
#' For independent Bayesian models a string or a list should be passed in as
#' \code{input_model_specification} parameter. If it is a string, all attributes have the
#' same parametric assumption. If it is a list, the keys should correspond to
#' columns (attributes) and values to parametric assumptions. If a list of
#' name-value pairs is used, different attributes can have different
#' parametric assumptions. The strings corresponding to supported parametric assumptions described
#' above are 'poisson', 'bernoulli', 'normal', 'normal_ig'. Multivariate normal model with
#' inverse Wishart prior is invoked by passing in
#' \code{input_model_specification = list(dependent = T, type = "mvnormal_iw")}.
#' See examples below and input models linked above for more details.
#'
#' @param num_models Number of models (distributions) to obtain per attribute. Also matches the
#' number of resulting bagged ML models, since each bagged model is obtained on
#' one set of distributions.
#' @param transformation Specifies how attribute distributions are transformed
#' into actual features being fed into ML algorithm. For each of the parametric assumptions
#' mentioned above mean transformation is supported out of the box. It is invoked by
#' passing \code{transformation = "means"} as parameter. A custom function can also be
#' passed in. In this case it gets called for every object with a list of distributions
#' that were fitted to object's attributes by the package. See the source code of
#' \code{\link{object_model}} and \code{\link{transformation_means}} for more information.
#' @param weighting Optional parameter. A function that uses the \strong{TIME} attribute if
#' present to weight the prior matches by importance when obtaining attribute
#' distributions.
#' @param get_model Function that should build a ML model on one of \code{num_models} datasets
#' generated by the package. It receives match data with features of objects that are the output of
#' \code{transformation} function. The ML model returned needs to support the standard predict()
#' notation.
#' @param priors Optional parameter. Specifies conjugate priors for supported Bayesian models.
#' List of lists, one for each object. Names in the outer list correspond to object ID's.
#' The values in the outer lists are lists with names corresponding to attributes and values
#' to specifications of parameters of attribute prior distributions. The values in the inner lists
#' depend on the parametric assumption used. See the documentation of supported parametric assumptions
#' (e.g. \code{\link{input_model_poisson}}) for details on how to specify priors and examples below
#' on how to use them.
#' @param report_time Boolean denoting whether to report the execution time. Default is \code{FALSE}.
#' @return Structure of class \code{match_forecast_model} containing
#' properties:
#' \itemize{
#' \item first_level_model: list with keys being instance ID's and values corresponding first level
#' object models
#' \item second_level_model: list of length \code{num_models} with each entry being one of the bagged
#' predictive models
#' \item some other implicit parameters (inspect the object for more info)
#' }
#' @example R/examples/example_match_forecast_model.R
match_forecast_model <- function(data,
input_model_specification,
num_models,
transformation = NULL,
weighting = NULL,
get_model,
priors = NULL,
report_time = F) {
start_time <- Sys.time()
transformation_function <- get_transformation(transformation)
data_specification <- list(
cols_ids = get_id_colnames(data),
cols_static = get_static_colnames(data),
cols_measurements = get_measurement_colnames(data)
)
cols_measurements_suffixed <- NULL
for (i in 1:length(data_specification$cols_ids)) {
cols_measurements_suffixed <- c(
cols_measurements_suffixed,
paste(data_specification$cols_measurements, "_", i, sep = "")
)
}
data_specification[["cols_measurements_suffixed"]] <- cols_measurements_suffixed
first_level_model <- model_first_level(
data,
data,
input_model_specification,
num_models,
weighting = weighting,
priors = priors,
data_specification
)
second_level_model <- list()
for (i in 1:num_models) {
data_train_current <- build_dataset_from_fl_model(data, data_specification, first_level_model, i, transformation_function)
second_level_model[[i]] <- get_model(data_train_current)
}
model <- structure(
list(
first_level_model = first_level_model,
second_level_model = second_level_model,
input_model_specification = input_model_specification,
num_models_train = num_models,
data = data,
priors = priors,
transformation_function = transformation_function,
data_specification = data_specification,
weighting_train = weighting
),
class = "match_forecast_model"
)
if (report_time) {
print(paste("Model building time:", Sys.time() - start_time))
}
return(model)
}
#' predict.match_forecast_model
#'
#' Predict function for the match_forecast_model. Predicts the outcome for test
#' matches using each of the bagged predictive models built with match_forecast_model.
#' Unless a different \code{weighting} or \code{num_samples} is used for \code{predict}
#' and building the match_forecast_model, the same first level model is used to
#' construct test feature vectors.
#'
#' @export
#'
#' @param mf_model An instance of class match_forecast_model
#' @param data_new \code{data.frame} or \code{data.table} with match data
#' containing involved object's ID's and possibly TIME attribute.
#'
#' Example of columns for test set:
#' \tabular{ccc}{\tab TIME \tab ID_1 \tab ID_2\cr }
#' Note that the exact attributes of objects in new events are unknown and are therefore
#' estimated from data used for training.
#' @param predict_fun Function that gets as an argument one of the bagged models and
#' transformed test data (e.g. using "means" transformation). It should use the ML model's
#' predict function to obtain the outcome of new events.
#' @param num_models Optional parameter. If set, a different first level model is used for test
#' data set. Specifies how many distributions are fitted to each attribute of an object.
#' Analogous to \code{num_models} in \code{\link{match_forecast_model}}.
#' @param weighting Same as in \code{\link{match_forecast_model}}.
#' @param report_time Boolean denoting whether to report the execution time.
#' @return List of lists of predictions. The size of the outer list is \code{num_models^2},
#' since each of the bagged models is predicting on each of the generated test vectors
#' Each prediction (inner list) contains fields:
#' \itemize{
#' \item predictions: output of predict_fun for corresponding model test vector
#' \item idx_of_bagged_model: index of the bagged model
#' \item idx_of_test_set: index of the test set
#' }
#' @example R/examples/example_predict.R
predict.match_forecast_model <- function(
mf_model,
data_new,
num_models = NULL,
weighting = NULL,
predict_fun,
report_time = F) {
start_time <- Sys.time()
# If we want to weight matches differently for the test set, we need to specify number of models
if (toString(weighting) != toString(mf_model$weighting_train) || !is.null(num_models)) {
num_models <- if (is.null(num_models)) mf_model$num_models_train else num_models
first_level_model <- model_first_level(
data_new,
mf_model$data,
mf_model$input_model_specification,
num_models,
if (!is.null(weighting)) weighting else mf_model$weighting_train,
mf_model$priors,
mf_model$data_specification
)
} else {
first_level_model <- mf_model$first_level_model
num_models <- mf_model$num_models_train
}
predictions <- NULL
for (i in 1:num_models) {
data_test_current <- build_dataset_from_fl_model(
data_new,
mf_model$data_specification,
first_level_model,
i,
mf_model$transformation_function
)
for (j in 1:length(mf_model$second_level_model)) {
predictions <- c(
predictions,
list(
list(
idx_of_bagged_model = i,
idx_of_test_set = j,
predictions = predict_fun(mf_model$second_level_model[[j]], data_test_current)
)
)
)
}
}
if (report_time) {
print(paste("Model prediction time:", Sys.time() - start_time))
}
return(predictions)
}
#' get_transformation
#'
#' Returns a function representing the object transformation (i.e. way of obtaining
#' feature vectors from first level models of objects). Returns the function itself
#' if a function is passed in as parameter. If a string is passed, it has to be one
#' of the supported transformations (currently only "means").
get_transformation <- function(transformation) {
if (typeof(transformation) == "closure") {
return(transformation)
}
supported_transformations <- list(
means = transformation_means
)
if (typeof(transformation) != "character" || !transformation %in% names(supported_transformations)) {
stop("The transformation needs to be a function or one of supported transformations.")
}
return(supported_transformations[[transformation]])
}
#' model_first_level
#'
#' Builds a first level model for objects in the data set. Result is a list of
#' object models.
model_first_level <- function(
data_to_model,
data_all,
input_model_specification,
num_models,
weighting = NULL,
priors = NULL,
data_specification) {
# dataset description object just to avoid recalculation and
# reduce number of passing parameters
cols_ids <- data_specification$cols_ids
cols_measurements <- data_specification$cols_measurements
object_data_modeled <- list()
for (i in 1:nrow(data_to_model)) {
for (j in 1:length(cols_ids)) {
object_id <- data_to_model[i, ][[cols_ids[[j]]]]
if (is.null(object_data_modeled[[object_id]])) {
object_measurements <- get_object_measurements(
object_id,
data_all,
cols_ids,
cols_measurements
)
if (typeof(weighting) == "closure") {
object_measurements <- weighting(object_measurements)
} else if (typeof(weighting) == "character" && weighting == 'gaussian') {
object_measurements <- weighting_gaussian(object_measurements, 1)
}
object_measurements$TIME <- NULL
object_data_modeled[[object_id]] <- model_first_level_object(object_measurements, input_model_specification, num_models, priors[[object_id]])
}
}
}
return(object_data_modeled)
}
#' model_first_level_object
#'
#' Model the attributes of an object according to the input specification.
model_first_level_object <- function(data, input_model_specification, num_models, prior = NULL) {
# Models supported by default
supported_models <- list(
poisson = input_model_poisson,
normal = input_model_normal,
normal_ig = input_model_normal_ig,
bernoulli = input_model_bernoulli,
mvnormal_iw = input_model_mvnormal_iw
)
models <- list()
# Support for custom models
if (typeof(input_model_specification) == 'closure') {
return(input_model_specification(data, num_models))
}
# Simple types of models (each column modeled independently)
if (!is.list(input_model_specification) || is.null(input_model_specification$dependent) || !input_model_specification$dependent) {
column_models <- list()
for (colname in colnames(data)) {
if (typeof(input_model_specification) == 'character') {
if (is.null(supported_models[[input_model_specification]])) {
stop(paste(input_model_specification, "model not supported.", "Supported models are:", toString(names(supported_models))))
}
column_models[[colname]] <- supported_models[[input_model_specification]](data[[colname]], num_models, prior[[colname]])
} else {
if (is.null(input_model_specification[[colname]])) {
stop(paste("Model for attribute", colname, "not specified."))
} else if (is.null(supported_models[[input_model_specification[[colname]]]])) {
stop(paste(input_model_specification[[colname]], "model not supported.", "Supported models are:", toString(names(supported_models))))
}
column_models[[colname]] <- supported_models[[input_model_specification[[colname]]]](data[[colname]], num_models, prior[[colname]])
}
}
for (i in 1:num_models) {
obj_model <- object_model()
for (colname in colnames(data)) {
obj_model$models[[colname]] <- column_models[[colname]][[i]]
}
models[[i]] <- obj_model
}
}
# Modeling attributes as interdependent
# Currently only modelling all attributes with mvnormal is supported
else {
if (is.null(supported_models[[input_model_specification$type]])) {
stop(paste(input_model_specification$type, "model not supported.", "Supported models are:", toString(names(supported_models))))
}
models <- supported_models[[input_model_specification$type]](data, num_models, prior)
}
return(models)
}
#' get_object_measurements
#'
#' Retrieves measurements of all attributes of an object. One row contains data with measurements
#' for a single match.
get_object_measurements <- function(object_id, data_all, cols_ids, cols_measurements) {
data_object_measurements <- data.table(matrix(ncol = length(cols_measurements) + 1, nrow = nrow(data_all)))
colnames(data_object_measurements) <- c("TIME", cols_measurements)
for (col in colnames(data_object_measurements)) {
data_object_measurements[[col]] <- as.double(data_object_measurements[[col]])
}
data_object_measurements[["TIME"]] <- as.integer(data_object_measurements[["TIME"]])
for (i in 1:nrow(data_all)) {
for (j in 1:length(cols_ids)) {
if (object_id == data_all[i, ][[cols_ids[[j]]]]) {
set(data_object_measurements, i, "TIME", data_all[i, ][["TIME"]])
for (col_measurement in cols_measurements) {
set(
data_object_measurements,
i,
col_measurement,
data_all[i, ][[paste(col_measurement, "_", j, sep = "")]]
)
}
}
}
}
return(data_object_measurements[complete.cases(data_object_measurements), ])
}
#' build_dataset_from_fl_model
#'
#' Builds a dataset using the first level model and the transformation.
build_dataset_from_fl_model <- function(data, data_specification, first_level_model, num_of_model, transformation_function) {
cols_ids <- data_specification$cols_ids
cols_measurements <- data_specification$cols_measurements
cols_static <- data_specification$cols_static
dataset_new <- list()
for (i in 1:nrow(data)) {
dataset_row <- list()
for (col in names(data[i, ])) {
dataset_row[[col]] <- data[i, col]
}
for (j in 1:length(cols_ids)) {
transformed_values <- transformation_function(first_level_model[[data[i, cols_ids[[j]]]]][[num_of_model]])
for (attr in names(transformed_values)) {
dataset_row[[
paste(attr, "_", j, sep = "")
]] <- transformed_values[[attr]]
}
}
dataset_new[[i]] <- dataset_row
}
new_colnames <- names(dataset_new[[1]])
dataset_new <- data.frame(matrix(unlist(dataset_new), nrow = length(dataset_new), byrow = T))
colnames(dataset_new) <- new_colnames
for (col in data_specification$cols_measurements_suffixed) {
dataset_new[[col]] <- as.numeric(as.character(dataset_new[[col]]))
}
dataset_new$y <- as.numeric(as.character(dataset_new$y))
return(dataset_new)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.