estimate_runtime: Estimate runtime

View source: R/estimate_runtime.R

estimate_runtime    R Documentation

Estimate runtime

Description

Estimate how long it will take to fit a computationally intensive model to a large dataset before committing to the full run, which in some cases may take hours or days. The estimate is extrapolated from the best-fitting model (power, exponential or linear) fitted to runtimes measured on a small range of subset sizes.
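The extrapolation idea can be sketched in base R as follows. This is a minimal illustration, not the package's actual implementation: the timings are synthetic values generated from an exact power law, the helper predict_runtime() is hypothetical, and comparing candidates by residual sum of squares on the original scale is an assumed selection rule.

```r
# Synthetic timings (seconds) that follow an exact power law; not real measurements.
sizes    <- c(2500, 5000, 10000, 15000, 25000, 50000)
runtimes <- 2e-6 * sizes^1.5
timings  <- data.frame(size = sizes, runtime = runtimes)

# Candidate models: linear, exponential (log-linear) and power (log-log).
fits <- list(
  linear      = lm(runtime ~ size, data = timings),
  exponential = lm(log(runtime) ~ size, data = timings),
  power       = lm(log(runtime) ~ log(size), data = timings)
)

# Hypothetical helper: predict runtime in seconds on the original scale.
predict_runtime <- function(name, size) {
  p <- predict(fits[[name]], newdata = data.frame(size = size))
  if (name == "linear") unname(p) else unname(exp(p))
}

# Assumed selection rule: smallest residual sum of squares on the original scale.
rss <- sapply(names(fits), function(nm) {
  sum((timings$runtime - predict_runtime(nm, timings$size))^2)
})
best <- names(which.min(rss))

# Extrapolate the best candidate to the full dataset size.
full_size <- 1e6
estimate  <- predict_runtime(best, full_size)
```

Because the synthetic timings were generated as 2e-6 * size^1.5, the power model is selected and the extrapolated value at 1e6 rows is 2000 seconds. With real, noisy timings the choice between candidates is less clear-cut.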

Usage

estimate_runtime(code, subset_sizes, full_size)

Arguments

code

String: a single line of code that fits the model. The code should run on a subset of the full dataset whose size is controlled by the iterator i. Usually, this is done by setting the data argument of the model function to, for example, DT[1:i] or DT[sample(.N, i)] for data.tables, or df[1:i,] for data.frames.

subset_sizes

Numeric vector: subset sizes (row counts) of the full dataset that have manageable running times (e.g. from several seconds to several minutes) and extend as far as practical towards the full size. Some trial and error may be needed to find a good trade-off between the time it takes to produce an estimate and its accuracy. Because the aim is usually to estimate long runtimes quickly, the accuracy will be limited, but the estimate is still useful as a ballpark figure.

full_size

Numeric value: full size of the dataset, e.g. nrow(DT). It has to be set manually, because the full dataset object is not passed to the function. It can also be set to any other number to estimate the runtime of the model on a similar dataset of that size.
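To illustrate how the three arguments relate, the sketch below times a one-line code string once per subset size, with i bound to the current size. It is an assumption about how such a string could be evaluated, not a description of the function's internals: time_subsets() is a hypothetical helper, and a small lm() fit on a base data.frame stands in for an expensive model.

```r
# Hypothetical helper: time a one-line code string for each subset size i.
time_subsets <- function(code, subset_sizes, envir = parent.frame()) {
  force(envir)                      # fix the caller's environment up front
  expr <- parse(text = code)
  elapsed <- vapply(subset_sizes, function(i) {
    env <- new.env(parent = envir)  # data objects are found via the caller's environment
    assign("i", i, envir = env)     # expose the current subset size as i
    unname(system.time(eval(expr, envir = env))["elapsed"])
  }, numeric(1))
  data.frame(size = subset_sizes, runtime = elapsed)
}

# Small stand-in dataset; a real use case would involve a far larger one.
set.seed(1)
df <- data.frame(y = rnorm(2000), x = rnorm(2000))

timings <- time_subsets("lm(y ~ x, data = df[1:i, ])", c(500, 1000, 2000))
```

The resulting data frame has one row per subset size. Only the code string and the sizes are needed to collect timings, which is why the full dataset's size has to be supplied separately when extrapolating.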

Value

An annotated ggplot2 graph showing the estimated runtime extrapolated to the full dataset's size.

Examples

## Not run: 
library(data.table)
library(randomForest)

# Number of rows in the simulated full dataset
n <- 1e6

# Simulated data: a binary outcome, four categorical and four numeric features
DT <- data.table(OUTCOME = sample(c(0L,1L), n, replace = TRUE) |> as.factor(),
                 FEATURE1 = sample(LETTERS[1:4], n, replace = TRUE),
                 FEATURE2 = sample(LETTERS[5:8], n, replace = TRUE),
                 FEATURE3 = sample(LETTERS[9:15], n, replace = TRUE),
                 FEATURE4 = sample(LETTERS[1:10], n, replace = TRUE),
                 FEATURE5 = runif(n, 1, 10) |> round(2),
                 FEATURE6 = runif(n, 20, 40) |> round(2),
                 FEATURE7 = rnorm(n, 50, 20) |> round(2),
                 FEATURE8 = rnorm(n, 100, 40) |> round(2))

# Time the model on random subsets of i rows, then extrapolate to n rows
estimate_runtime(
  code = "randomForest(OUTCOME ~ ., data = DT[sample(.N, i)])",
  subset_sizes = c(2500,5000,10000,15000,25000,50000),
  full_size = n
)

## End(Not run)


nhsbsa-data-analytics/nhsbsaR documentation built on Jan. 25, 2025, 8:54 a.m.