CompEst: Complexity Estimation and Prediction
In GuessCompx: Empirically Estimates Algorithm Complexity

Description Usage Arguments Details Value Examples

_Main function for the complexity estimation of an algorithm

1
2
3

CompEst(d, f, random.sampling = FALSE, max.time = 30,
  start.size = NULL, replicates = 4, strata = NULL,
  power.factor = 2, alpha.value = 0.005, plot.result = TRUE)

`d`	the data.frame, vector or matrix on which the algorithm is to be tested
`f`	a user-defined function that runs the algorithm, taking d as first argument. No return value is needed.
`random.sampling`	boolean; if TRUE a random sample is taken at each step, if FALSE the first N observations are taken at each step. Choosing a random sampling is relevant whith the use of replicates to help the discrimination power for complexity functions.
`max.time`	maximum time in seconds allowed for each step of the analysis. The function will stop once this time limit has been reached. Default is 30 seconds. There is no such limitation regarding memory.
`start.size`	the size in rows of the first sample to run the algorithm. Default is 'floor(log2(nrow(d)))'. If strata is not NULL, we recommend to enter a multiple of the number of categories.
`replicates`	the number of replicated runs of the algorithm for a specific sample size. Allows to better discriminate the complexity functions. Default is 2.
`strata`	a string, name of the categorical column of d that must be used for stratified sampling. A fixed proportion of the categories will be sampled, always keeping at least one observation per category.
`power.factor`	the common ratio of the geometric progression of the sample sizes. Default is 2, and will make sample sizes double every step. Decimal numbers are allowed.
`alpha.value`	the alpha risk of the test whether the model is significantly different from a constant relation. Default is 0.005.
`plot.result`	boolean indicatif if a summary plot of all the complexity functions is to be displayed

The fit of a complexity function is one among: constant, linear, quadratic, cubic, logarithmic, square.root, n.log(n). Model comparison is achieved using Leave-One-Out error minimisation of the MSE (see ‘boot::cv.glm' doc). Note that when a CONSTANT relationship is predicted, it might simply mean that the max.time value is too low to show any tendency. For time series, the sampling removes the ts attribute to the input vector, so the user’s function shall include again this ts() if a frequency is needed; also remind to avoid random sampling for it will break the series.

A list with the best complexity function and the computation time on the whole dataset, for both time and memory complexity (Windows) and time complexity only (all other OS).

# Dummy function that mimics a constant time complexity and
# N.log(N) memory complexity:
f1 = function(df){
  Sys.sleep(rnorm(1, 0.1, 0.02))
  v = rnorm(n = nrow(df)*log(nrow(df))*(runif(1, 1e3, 1.1e3)))
}
out = CompEst(d = mtcars, f = f1, replicates=2, start.size=2, max.time = 1)
# Raises an alert for TIME complexity.
# Sometimes confuses MEMORY complexity with linear:
print(out)

# Real dist function analysis (everything is quadratic here):
f2 = dist
d  = ggplot2::diamonds[, 5:8]
CompEst(d = d, f = f2, replicates = 1, max.time = 1)

# For time series functions, your `f` argument may include ts()
# to avoid loosing this ts attribute at sampling
# It is also recommended to set `start.size` argument to 3 periods at least.
f = function(d) arima(ts(d, freq = 12), order=c(1,0,1), seasonal = c(0,1,1))
d = ggplot2::txhousing$sales
# Should return a linear trend for TIME:
CompEst(d, f, start.size = 4*12, random.sampling = FALSE)