R/match_forecast.R

Defines functions match_forecast_model predict.match_forecast_model get_transformation model_first_level model_first_level_object get_object_measurements build_dataset_from_fl_model

Documented in build_dataset_from_fl_model get_object_measurements get_transformation match_forecast_model model_first_level model_first_level_object predict.match_forecast_model

#' match_forecast_model
#'
#' Builds match_forecast_model to forecast match outcomes by performing bagging
#' using two level modeling. The approach takes into account the uncertainty in the inputs
#' to avoid overconfidence of ML models. Since the inputs to a match outcome prediction
#' problem are often uncertain the approach can enhance the predictive performance of ML models.
#'
#'
#'
#' For detailed explanation see the introduction vignette by running:
#' \code{vignette("introduction", package = "matchForecast")}
#'
#' @export
#'
#' @param data \code{data.frame} in \strong{match-objects}
#'   format required by this package. Columns have to
#'   include object ID's in the format: \strong{ID_<number>} and attributes in
#'   the format \strong{<attr_name>_<number>}. <number> suffix connects
#'   objects to attributes. The suffixed number of corresponding object and
#'   attributes should be the same. Column for the outcome (dependant variable)
#'   of the match should be denoted with \strong{y}. It can also include an
#'   attribute \strong{TIME}.
#'
#'   An example of the columns for basketball data:
#'   \tabular{ccccccccc}{
#'    TIME \tab y \tab ID_1 \tab ID_2 \tab P2M_1 \tab P3M_1 \tab P2M_2 \tab P3M_2 \tab.. \cr
#'   }
#'   In the basketball case ID_1 refers to ID of the home team and P2M_1, P3M_1 to the
#'   attributes of the home team. ID_2 refers to the ID of away team and P2M_2, P3M_2 to its
#'   attributes. y is the outcome of the match.
#'
#' @param input_model_specification Specifies the parametric assumptions of object
#'   attributes. We can model attributes of objects as independent with uivariate models or as
#'   dependent with multivariate models.
#'
#'  Four univariate Bayesian models are supported out of the box: \itemize{
#'   \item Poisson (Poisson-Gamma model), appropriate for count data,
#'   see \code{\link{input_model_poisson}}
#'   \item Bernoulli (Bernoulli-Beta model), appropriate for binary data,
#'   see \code{\link{input_model_bernoulli}}
#'   \item Normal (Normal model with sample variance), appropriate for numeric data when
#'   modeling only mean, see \code{\link{input_model_normal}}
#'   \item normal (Normal-Inverse Gamma model), appropriate for jointly modeling mean and variance,
#'   see \code{\link{input_model_normal_ig}}
#'   }
#'
#'   One multivariate Bayesian model is supported:
#'   Multivariate normal model with inverse Wishart prior for covariance matrix
#'   (see \code{\link{input_model_mvnormal_iw}}). It is appropriate for jointly modeling numeric
#'   attributes of objects. A custom model can also be used if a function is passed as
#'    \code{input_model_specification} parameter. See the source code of other input models
#'    for an example of how to write a custom model.
#'
#'   \strong{How to invoke the models described above?}
#'   For independent Bayesian models a string or a list should be passed in as
#'   \code{input_model_specification} parameter. If it is a string, all attributes have the
#'   same parametric assumption. If it is a list, the keys should correspond to
#'   columns (attributes) and values to parametric assumptions. If a list of
#'   name-value pairs is used, different attributes can have different
#'   parametric assumptions. The strings corresponding to supported parametric assumptions described
#'   above are 'poisson', 'bernoulli', 'normal', 'normal_ig'. Multivariate normal model with
#'   inverse Wishart prior is invoked by passing in
#'   \code{input_model_specification = list(dependent = T, type = "mvnormal_iw")}.
#'   See examples below and input models linked above for more details.
#'
#' @param num_models Number of models (distributions) to obtain per attribute. Also matches the
#'   number of resulting bagged ML models, since each bagged model is obtained on
#'   one set of distributions.
#' @param transformation Specifies how attribute distributions are transformed
#'   into actual features being fed into ML algorithm. For each of the parametric assumptions
#'   mentioned above mean transformation is supported out of the box. It is invoked by
#'   passing \code{transformation = "means"} as parameter. A custom function can also be
#'   passed in. In this case it gets called for every object with a list of distributions
#'   that were fitted to object's attributes by the package. See the source code of
#'   \code{\link{object_model}} and \code{\link{transformation_means}} for more information.
#' @param weighting Optional parameter. A function that uses the \strong{TIME} attribute if
#'   present to weight the prior matches by importance when obtaining attribute
#'   distributions.
#' @param get_model Function that should build a ML model on one of \code{num_models} datasets
#'  generated by the package. It receives match data with features of objects that are the output of
#'  \code{transformation} function. The ML model returned needs to support the standard predict()
#'   notation.
#' @param priors Optional parameter. Specifies conjugate priors for supported Bayesian models.
#' List of lists, one for each object. Names in the outer list correspond to object ID's.
#' The values in the outer lists are lists with names corresponding to attributes and values
#' to specifications of parameters of attribute prior distributions. The values in the inner lists
#' depend on the parametric assumption used. See the documentation of supported parametric assumptions
#' (e.g. \code{\link{input_model_poisson}}) for details on how to specify priors and examples below
#' on how to use them.
#' @param report_time Boolean denoting whether to report the execution time. Default is \code{FALSE}.
#' @return Structure of class \code{match_forecast_model} containing
#'   properties:
#'   \itemize{
#'   \item first_level_model: list with keys being instance ID's and values corresponding first level
#'   object models
#'   \item second_level_model: list of length \code{num_models} with each entry being one of the bagged
#'   predictive models
#'   \item some other implicit parameters (inspect the object for more info)
#'   }
#' @example R/examples/example_match_forecast_model.R
match_forecast_model <- function(data,
                                 input_model_specification,
                                 num_models,
                                 transformation = NULL,
                                 weighting = NULL,
                                 get_model,
                                 priors = NULL,
                                 report_time = F) {
  start_time <- Sys.time()

  transformation_function <- get_transformation(transformation)

  data_specification <- list(
    cols_ids = get_id_colnames(data),
    cols_static = get_static_colnames(data),
    cols_measurements = get_measurement_colnames(data)
  )

  cols_measurements_suffixed <- NULL
  for (i in 1:length(data_specification$cols_ids)) {
    cols_measurements_suffixed <- c(
      cols_measurements_suffixed,
      paste(data_specification$cols_measurements, "_", i, sep = "")
    )
  }

  data_specification[["cols_measurements_suffixed"]] <- cols_measurements_suffixed

  first_level_model <- model_first_level(
    data,
    data,
    input_model_specification,
    num_models,
    weighting = weighting,
    priors = priors,
    data_specification
  )

  second_level_model <- list()
  for (i in 1:num_models) {
    data_train_current <- build_dataset_from_fl_model(data, data_specification, first_level_model, i, transformation_function)
    second_level_model[[i]] <- get_model(data_train_current)
  }

  model <- structure(
    list(
      first_level_model = first_level_model,
      second_level_model = second_level_model,
      input_model_specification = input_model_specification,
      num_models_train = num_models,
      data = data,
      priors = priors,
      transformation_function = transformation_function,
      data_specification = data_specification,
      weighting_train = weighting
    ),
    class = "match_forecast_model"
  )

  if (report_time) {
    print(paste("Model building time:", Sys.time() - start_time))
  }
  return(model)
}

#' predict.match_forecast_model
#'
#' Predict function for the match_forecast_model. Predicts the outcome for test
#' matches using each of the bagged predictive models built with match_forecast_model.
#' Unless a different \code{weighting} or \code{num_samples} is used for \code{predict}
#' and building the match_forecast_model, the same first level model is used to
#' construct test feature vectors.
#'
#' @export
#'
#' @param mf_model An instance of class match_forecast_model
#' @param data_new \code{data.frame} or \code{data.table} with match data
#'  containing involved object's ID's and possibly TIME attribute.
#'
#'  Example of columns for test set:
#'  \tabular{ccc}{\tab TIME \tab ID_1 \tab ID_2\cr }
#'  Note that the exact attributes of objects in new events are unknown and are therefore
#'  estimated from data used for training.
#' @param predict_fun Function that gets as an argument one of the bagged models and
#' transformed test data (e.g. using "means" transformation). It should use the ML model's
#' predict function to obtain the outcome of new events.
#' @param num_models Optional parameter. If set, a different first level model is used for test
#' data set. Specifies how many distributions are fitted to each attribute of an object.
#' Analogous to \code{num_models} in \code{\link{match_forecast_model}}.
#' @param weighting Same as in \code{\link{match_forecast_model}}.
#' @param report_time Boolean denoting whether to report the execution time.
#' @return List of lists of predictions. The size of the outer list is \code{num_models^2},
#' since each of the bagged models is predicting on each of the generated test vectors
#' Each prediction (inner list) contains fields:
#'  \itemize{
#'  \item predictions: output of predict_fun for corresponding model test vector
#'  \item idx_of_bagged_model: index of the bagged model
#'  \item idx_of_test_set: index of the test set
#'  }
#' @example R/examples/example_predict.R
predict.match_forecast_model <- function(
  mf_model,
  data_new,
  num_models = NULL,
  weighting = NULL,
  predict_fun,
  report_time = F) {
  start_time <- Sys.time()

  # If we want to weight matches differently for the test set, we need to specify number of models
  if (toString(weighting) != toString(mf_model$weighting_train) || !is.null(num_models)) {
    num_models <- if (is.null(num_models)) mf_model$num_models_train else num_models
    first_level_model <- model_first_level(
      data_new,
      mf_model$data,
      mf_model$input_model_specification,
      num_models,
      if (!is.null(weighting)) weighting else mf_model$weighting_train,
      mf_model$priors,
      mf_model$data_specification
    )
  }  else {
    first_level_model <- mf_model$first_level_model
    num_models <- mf_model$num_models_train
  }

  predictions <- NULL
  for (i in 1:num_models) {

    data_test_current <- build_dataset_from_fl_model(
      data_new,
      mf_model$data_specification,
      first_level_model,
      i,
      mf_model$transformation_function
    )

    for (j in 1:length(mf_model$second_level_model)) {
      predictions <- c(
        predictions,
        list(
          list(
            idx_of_bagged_model = i,
            idx_of_test_set = j,
            predictions = predict_fun(mf_model$second_level_model[[j]], data_test_current)
          )
        )
      )
    }
  }

  if (report_time) {
    print(paste("Model prediction time:", Sys.time() - start_time))
  }
  return(predictions)
}

#' get_transformation
#'
#' Returns a function representing the object transformation (i.e. way of obtaining
#' feature vectors from first level models of objects). Returns the function itself
#' if a function is passed in as parameter. If a string is passed, it has to be one
#' of the supported transformations (currently only "means").
get_transformation <- function(transformation) {
  if (typeof(transformation) == "closure") {
    return(transformation)
  }
  supported_transformations <- list(
    means = transformation_means
  )
  if (typeof(transformation) != "character" || !transformation %in% names(supported_transformations)) {
    stop("The transformation needs to be a function or one of supported transformations.")
  }
  return(supported_transformations[[transformation]])
}

#' model_first_level
#'
#' Builds a first level model for objects in the data set. Result is a list of
#' object models.
model_first_level <- function(
  data_to_model,
  data_all,
  input_model_specification,
  num_models,
  weighting = NULL,
  priors = NULL,
  data_specification) {

  # dataset description object just to avoid recalculation and
  #   reduce number of passing parameters
  cols_ids <- data_specification$cols_ids
  cols_measurements <- data_specification$cols_measurements

  object_data_modeled <- list()
  for (i in 1:nrow(data_to_model)) {
    for (j in 1:length(cols_ids)) {
      object_id <- data_to_model[i, ][[cols_ids[[j]]]]
      if (is.null(object_data_modeled[[object_id]])) {
        object_measurements <- get_object_measurements(
          object_id,
          data_all,
          cols_ids,
          cols_measurements
        )

        if (typeof(weighting) == "closure") {
          object_measurements <- weighting(object_measurements)
        } else if (typeof(weighting) == "character" && weighting == 'gaussian') {
          object_measurements <- weighting_gaussian(object_measurements, 1)
        }
        object_measurements$TIME <- NULL
        object_data_modeled[[object_id]] <- model_first_level_object(object_measurements, input_model_specification, num_models, priors[[object_id]])
      }
    }
  }
  return(object_data_modeled)
}

#' model_first_level_object
#'
#' Model the attributes of an object according to the input specification.
model_first_level_object <- function(data, input_model_specification, num_models, prior = NULL) {
  # Models supported by default
  supported_models <- list(
    poisson = input_model_poisson,
    normal = input_model_normal,
    normal_ig = input_model_normal_ig,
    bernoulli = input_model_bernoulli,
    mvnormal_iw = input_model_mvnormal_iw
  )

  models <- list()
  # Support for custom models
  if (typeof(input_model_specification) == 'closure') {
    return(input_model_specification(data, num_models))
  }


  # Simple types of models (each column modeled independently)
  if (!is.list(input_model_specification) || is.null(input_model_specification$dependent) || !input_model_specification$dependent) {
    column_models <- list()
    for (colname in colnames(data)) {
      if (typeof(input_model_specification) == 'character') {
        if (is.null(supported_models[[input_model_specification]])) {
          stop(paste(input_model_specification, "model not supported.", "Supported models are:", toString(names(supported_models))))
        }
        column_models[[colname]] <- supported_models[[input_model_specification]](data[[colname]], num_models, prior[[colname]])
      } else {
        if (is.null(input_model_specification[[colname]])) {
          stop(paste("Model for attribute", colname, "not specified."))
        } else if (is.null(supported_models[[input_model_specification[[colname]]]]))  {
          stop(paste(input_model_specification[[colname]], "model not supported.", "Supported models are:", toString(names(supported_models))))
        }
        column_models[[colname]] <- supported_models[[input_model_specification[[colname]]]](data[[colname]], num_models, prior[[colname]])
      }
    }

    for (i in 1:num_models) {
      obj_model <- object_model()
      for (colname in colnames(data)) {
        obj_model$models[[colname]] <- column_models[[colname]][[i]]
      }
      models[[i]] <- obj_model
    }
  }
  # Modeling attributes as interdependent
  # Currently only modelling all attributes with mvnormal is supported
  else {
    if (is.null(supported_models[[input_model_specification$type]])) {
      stop(paste(input_model_specification$type, "model not supported.", "Supported models are:", toString(names(supported_models))))
    }
    models <- supported_models[[input_model_specification$type]](data, num_models, prior)
  }
  return(models)
}

#' get_object_measurements
#'
#' Retrieves measurements of all attributes of an object. One row contains data with measurements
#' for a single match.
get_object_measurements <- function(object_id, data_all, cols_ids, cols_measurements) {
  data_object_measurements <- data.table(matrix(ncol = length(cols_measurements) + 1, nrow = nrow(data_all)))
  colnames(data_object_measurements) <- c("TIME", cols_measurements)
  for (col in colnames(data_object_measurements)) {
    data_object_measurements[[col]] <- as.double(data_object_measurements[[col]])
  }
  data_object_measurements[["TIME"]] <- as.integer(data_object_measurements[["TIME"]])

  for (i in 1:nrow(data_all)) {
    for (j in 1:length(cols_ids)) {
      if (object_id == data_all[i, ][[cols_ids[[j]]]]) {
        set(data_object_measurements, i, "TIME", data_all[i, ][["TIME"]])
        for (col_measurement in cols_measurements) {
          set(
            data_object_measurements,
            i,
            col_measurement,
            data_all[i, ][[paste(col_measurement, "_", j, sep = "")]]
          )
        }
      }
    }
  }

  return(data_object_measurements[complete.cases(data_object_measurements), ])
}

#' build_dataset_from_fl_model
#'
#' Builds a dataset using the first level model and the transformation.
build_dataset_from_fl_model <- function(data, data_specification, first_level_model, num_of_model, transformation_function) {
  cols_ids <- data_specification$cols_ids
  cols_measurements <- data_specification$cols_measurements
  cols_static <- data_specification$cols_static

  dataset_new <- list()
  for (i in 1:nrow(data)) {
    dataset_row <- list()
    for (col in names(data[i, ])) {
      dataset_row[[col]] <- data[i, col]
    }
    for (j in 1:length(cols_ids)) {
      transformed_values <- transformation_function(first_level_model[[data[i, cols_ids[[j]]]]][[num_of_model]])
      for (attr in names(transformed_values)) {
        dataset_row[[
          paste(attr, "_", j, sep = "")
          ]] <- transformed_values[[attr]]
      }
    }
    dataset_new[[i]] <- dataset_row
  }
  new_colnames <- names(dataset_new[[1]])
  dataset_new <- data.frame(matrix(unlist(dataset_new), nrow = length(dataset_new), byrow = T))
  colnames(dataset_new) <- new_colnames

  for (col in data_specification$cols_measurements_suffixed) {
    dataset_new[[col]] <- as.numeric(as.character(dataset_new[[col]]))
  }
  dataset_new$y <- as.numeric(as.character(dataset_new$y))

  return(dataset_new)
}
fortunar/input_uncertainty_model documentation built on May 30, 2019, 6:26 a.m.