# R/olr_function.R

#' olr: Optimal Linear Regression
#'
#' The olr function systematically evaluates multiple linear regression models by exhaustively fitting all possible combinations of independent variables against the specified dependent variable.
#' It selects the model that yields the highest adjusted R-squared (by default) or R-squared, depending on user preference.
#' In model evaluation, both R-squared and adjusted R-squared are key metrics:
#' R-squared measures the proportion of variance explained, but it never decreases when a predictor is added, whether or not that predictor is relevant, which can lead to overfitting.
#' Adjusted R-squared compensates for this by penalizing the number of predictors, so it increases only when an added predictor improves the fit enough to offset the penalty.
#' The goal of olr is to identify the model that best captures the underlying structure of the data while avoiding unnecessary complexity.
#' Selecting on adjusted R-squared favors parsimony, whereas selecting on R-squared will generally pick the model with every predictor, because R-squared cannot decrease as predictors are added.
#' Example Analogy:
#' Imagine a gardener trying to understand what influences plant growth (the dependent variable).
#' They might consider variables like sunlight, watering frequency, soil type, and nutrients (independent variables).
#' Instead of manually guessing which combination works best, the olr function automatically tests every possible combination of predictors and identifies the model with the highest adjusted R-squared or R-squared value.
#' This saves the user from trial-and-error modeling and highlights only the most meaningful variables for explaining the outcome.
#'
#' Complementary functions below follow the format: function(dataset, responseName = NULL, predictorNames = NULL) \cr \cr
#' \strong{olrmodels:} Returns the list of all evaluated models (invisibly). Use \code{summary(olrmodels(dataset, responseName, predictorNames)[[x]])} to inspect a specific model, where \code{x} is the model index. \cr \cr
#' \strong{olrformulas:} Returns the list of all regression formulas generated by \code{olr()}, each regressing the dependent variable on a unique combination of the specified predictor variables, in the order created. \cr \cr
#' \strong{olrformulasorder:} Returns the same set of regression formulas as \code{olrformulas}, but sorted alphabetically by the full formula string. This helps users more easily locate or compare specific combinations of predictors. \cr \cr
#' \strong{adjr2list:} Returns the highest adjusted R-squared value among all models and reports the corresponding formula in a message. \cr \cr
#' \strong{r2list:} Returns the highest R-squared value among all models and reports the corresponding formula in a message. \cr \cr
#' 
#' \emph{Tip: To avoid errors from non-numeric columns (e.g., dates), remove them using \code{dataset <- dataset[, -1]}. Or use \code{load_custom_data(..., exclude_first_column = TRUE)}.}
#'
#' When \code{responseName} and \code{predictorNames} are both \code{NULL}, the function treats the first column of \code{dataset} as the response variable and all remaining columns as predictors.
#' \strong{If the first column contains non-numeric or irrelevant data (e.g., a Date column), you must exclude it first: \code{dataset <- crudeoildata[, -1]}}.
#'
#' Alternatively, \strong{load_custom_data(data = "crudeoildata.csv", custom_path = NULL, exclude_first_column = TRUE)} loads the data (crudeoildata) with the first column already removed.
#'
#'
#' @param dataset A data frame containing the response variable and the candidate predictor variables.
#' @param responseName The name of the response variable as a string; it must match a column header in \code{dataset}.
#' @param predictorNames A character vector of column headers from \code{dataset} naming the predictors on which \code{responseName} is regressed.
#' @param adjr2 If \code{TRUE} (the default), the model with the maximum adjusted R-squared is returned; if \code{FALSE}, the model with the maximum R-squared is returned.
#' @keywords olr optimal linear regression linear-model model model-selection adjusted r-squared combination
#' @return The best-fitting \code{lm} model object, returned invisibly, selected by adjusted R-squared (default) or R-squared. The call and coefficients are printed; call \code{summary()} on the result to view full regression statistics.
#' @examples
#' # Please allow time for rendering after clicking "Run Examples"
#' crudeoildata <- read.csv(system.file("extdata", "crudeoildata.csv", package = "olr"))
#' dataset <- crudeoildata[, -1]
#' 
#' responseName <- 'CrudeOil'
#' predictorNames <- c('RigCount', 'API', 'FieldProduction', 'RefinerNetInput',
#'   'OperableCapacity', 'Imports', 'StocksExcludingSPR', 'NonCommercialLong',
#'   'NonCommercialShort', 'CommercialLong', 'CommercialShort', 'OpenInterest')
#'
#' olr(dataset, responseName, predictorNames, adjr2 = TRUE)
#'
#' @import plyr
#' @import stats
#' @importFrom utils combn
#' @export
olr <- function(dataset, responseName = NULL, predictorNames = NULL, adjr2 = TRUE) {
  
  if (is.null(responseName) && is.null(predictorNames)) {
    predictorNames <- colnames(dataset[-1])
    responseName <- colnames(dataset[1])
  }
  
  combine <- function(x, y) combn(y, x, paste, collapse = '+')
  combination_mat <- unlist(lapply(seq_along(predictorNames), combine, predictorNames))
  combination_mat <- as.matrix(combination_mat)
  olrformulas <- lapply(combination_mat, function(v) paste(responseName, '~', v))
  
  # Use bquote to ensure the formula appears properly in model$call
  olrmodels <- lapply(olrformulas, function(f) {
    eval(bquote(lm(.(as.formula(f)), data = dataset)))
  })
  
  summarylist <- lapply(olrmodels, summary)
  
  # Accept a logical flag or a string spelling of one ("TRUE", "true", "False", ...)
  use_adjr2 <- as.logical(adjr2)
  if (length(use_adjr2) != 1 || is.na(use_adjr2)) {
    stop("Invalid value for 'adjr2'. Must be TRUE or FALSE.")
  }
  
  if (use_adjr2) {
    adjr2_vals <- sapply(summarylist, function(x) x$adj.r.squared)
    best_index <- which.max(adjr2_vals)
    cat("Returning model with max adjusted R-squared.\n\n")
  } else {
    r2_vals <- sapply(summarylist, function(x) x$r.squared)
    best_index <- which.max(r2_vals)
    cat("Returning model with max R-squared.\n\n")
  }
  
  # Print like standard lm() summary
  model <- olrmodels[[best_index]]
  cat("Call:\n")
  print(model$call)
  cat("\nCoefficients:\n")
  print(coef(model))
  
  invisible(model)
}
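
# Illustrative sketch (hypothetical helper, not part of the original package
# API): olr() fits one model per non-empty subset of the predictors, so p
# predictors produce 2^p - 1 candidate models. The name `count_olr_models`
# is an assumption made for this example; it only makes the growth in run
# time concrete before a large search is started.
count_olr_models <- function(n_predictors) {
  # Sum of choose(p, k) for k = 1..p, which equals 2^p - 1
  sum(choose(n_predictors, seq_len(n_predictors)))
}
# For the 12 predictors in the crudeoildata example, count_olr_models(12)
# gives 4095 model fits.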




#' @rdname olr
#' @import plyr
#' @import stats
#' @importFrom utils combn
#' @export
olrmodels <- function(dataset, responseName = NULL, predictorNames = NULL){

  # Default: first column is the response, all remaining columns are predictors
  if (is.null(responseName) && is.null(predictorNames)) {
    predictorNames <- colnames(dataset[-1])
    responseName <- colnames(dataset[1])
  }
  
  # Build one formula string per non-empty combination of predictors
  combine <- function(x, y) combn(y, x, paste, collapse = '+')
  combination_mat <- unlist(lapply(seq_along(predictorNames), combine, predictorNames))
  combination_mat <- as.matrix(combination_mat)
  olrformulas <- lapply(combination_mat, function(v) paste(responseName, '~', v))
  
  # Use bquote so the formula appears properly in each model$call
  olrmodels <- lapply(olrformulas, function(f) {
    eval(bquote(lm(.(as.formula(f)), data = dataset)))
  })
  
  message("To view a specific model's summary, use: summary(olrmodels(dataset, responseName, predictorNames)[[x]]) or summary(olrmodels(dataset)[[x]]), where x is the model index.")
  message("Note: If too many predictors are included implicitly (e.g., using dataset without specifying predictorNames), some models may fail to generate due to formula size or data constraints.")
  message("For more consistent results, explicitly specify predictorNames.")
  invisible(olrmodels)

}

#' @rdname olr
#' @import plyr
#' @import stats
#' @importFrom utils combn
#' @export
olrformulas <- function(dataset, responseName = NULL, predictorNames = NULL){

  # Default: first column is the response, all remaining columns are predictors
  if (is.null(responseName) && is.null(predictorNames)) {
    predictorNames <- colnames(dataset[-1])
    responseName <- colnames(dataset[1])
  }
  
  # Build one formula string per non-empty combination of predictors
  combine <- function(x, y) combn(y, x, paste, collapse = '+')
  combination_mat <- unlist(lapply(seq_along(predictorNames), combine, predictorNames))
  combination_mat <- as.matrix(combination_mat)
  olrformulas <- lapply(combination_mat, function(v) paste(responseName, '~', v))
  print(olrformulas)

}

#' @rdname olr
#' @import plyr
#' @import stats
#' @importFrom utils combn
#' @export
olrformulasorder <- function(dataset, responseName = NULL, predictorNames = NULL){

  # Default: first column is the response, all remaining columns are predictors
  if (is.null(responseName) && is.null(predictorNames)) {
    predictorNames <- colnames(dataset[-1])
    responseName <- colnames(dataset[1])
  }
  
  # Build one formula string per non-empty combination of predictors
  combine <- function(x, y) combn(y, x, paste, collapse = '+')
  combination_mat <- unlist(lapply(seq_along(predictorNames), combine, predictorNames))
  combination_mat <- as.matrix(combination_mat)
  olrformulas <- lapply(combination_mat, function(v) paste(responseName, '~', v))
  
  # Sort alphabetically by the full formula string
  olrformulaorder <- olrformulas[order(unlist(olrformulas))]
  print(olrformulaorder)

}
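
# Illustrative sketch (hypothetical helper, not exported by the package):
# it builds formula strings from names alone, using the same combn() approach
# as the functions above, so the creation order and the alphabetical order
# can be compared on a small input without a dataset. The name
# `sketch_formula_strings` and the `sorted` argument are assumptions made
# for this example.
sketch_formula_strings <- function(responseName, predictorNames, sorted = FALSE) {
  # All combinations of size k, joined with '+', for k = 1..p
  combine <- function(k) utils::combn(predictorNames, k, paste, collapse = '+')
  rhs <- unlist(lapply(seq_along(predictorNames), combine))
  formulas <- paste(responseName, '~', rhs)
  # Creation order lists single predictors first, then pairs, and so on;
  # sorted = TRUE orders the full formula strings alphabetically instead
  if (sorted) formulas[order(formulas)] else formulas
}
# sketch_formula_strings("y", c("b", "a")) gives
#   "y ~ b" "y ~ a" "y ~ b+a"
# and with sorted = TRUE
#   "y ~ a" "y ~ b" "y ~ b+a"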

#' @rdname olr
#' @import plyr
#' @import stats
#' @importFrom utils combn
#' @export
adjr2list <- function(dataset, responseName = NULL, predictorNames = NULL) {
  
  # Set default response and predictors
  if (is.null(responseName) && is.null(predictorNames)) {
    predictorNames <- colnames(dataset[-1])
    responseName <- colnames(dataset[1])
  }
  
  # Generate formula combinations
  combine <- function(x, y) combn(y, x, paste, collapse = '+')
  combination_mat <- unlist(lapply(seq_along(predictorNames), combine, predictorNames))
  combination_mat <- as.matrix(combination_mat)
  olrformulas <- lapply(combination_mat, function(v) paste(responseName, '~', v))
  
  # Fit all models; bquote keeps the formula readable in each model$call
  olrmodels <- lapply(olrformulas, function(f) {
    eval(bquote(lm(.(as.formula(f)), data = dataset)))
  })
  
  # Get adjusted R-squared values
  adjr2_vals <- sapply(olrmodels, function(model) summary(model)$adj.r.squared)
  
  # Find best model; the formula string is reported as-is, without added quotes
  best_index <- which.max(adjr2_vals)
  best_formula <- olrformulas[[best_index]]
  best_value <- adjr2_vals[best_index]
  
  # Report result
  message("Highest adjusted R^2: ", round(best_value, 5), " for model: ", best_formula)
  
  return(best_value)
}
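
# Illustrative sketch (hypothetical helper, not exported by the package):
# for a model with an intercept, the adjusted R-squared reported by
# summary.lm() can be recovered from R-squared, the number of observations n,
# and the number of predictors p as 1 - (1 - R^2) * (n - 1) / (n - p - 1).
# The name `adj_r2_from_r2` is an assumption made for this example; it shows
# why adding a predictor (larger p) lowers the value unless R^2 rises enough.
adj_r2_from_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}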





#' @rdname olr
#' @import plyr
#' @import stats
#' @importFrom utils combn
#' @export
r2list <- function(dataset, responseName = NULL, predictorNames = NULL) {
  
  # Set default response and predictors
  if (is.null(responseName) && is.null(predictorNames)) {
    predictorNames <- colnames(dataset[-1])
    responseName <- colnames(dataset[1])
  }
  
  # Generate formula combinations
  combine <- function(x, y) combn(y, x, paste, collapse = '+')
  combination_mat <- unlist(lapply(seq_along(predictorNames), combine, predictorNames))
  combination_mat <- as.matrix(combination_mat)
  olrformulas <- lapply(combination_mat, function(v) paste(responseName, '~', v))
  
  # Fit all models; bquote keeps the formula readable in each model$call
  olrmodels <- lapply(olrformulas, function(f) {
    eval(bquote(lm(.(as.formula(f)), data = dataset)))
  })
  
  # Get R-squared values
  r2_vals <- sapply(olrmodels, function(model) summary(model)$r.squared)
  
  # Find best model; the formula string is reported as-is, without added quotes
  best_index <- which.max(r2_vals)
  best_formula <- olrformulas[[best_index]]
  best_value <- r2_vals[best_index]
  
  # Report result
  message("Highest R^2: ", round(best_value, 5), " for model: ", best_formula)
  
  return(best_value)
}