knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(knapsack)
options(rmarkdown.html_vignette.check_title = FALSE)

This vignette will examine the knapsack package, that provides three different functions for solving the well-known discrete optimization problem called the knapsack problem, which can be formulated as follows: Imagine a knapsack that can carry a bunch of objects $i = 1,...,n$ up to a limited weight $W$, where each object has a weight $w_i$ and a value $v_i$. The task is to add objects to the knapsack and maximize the value without exceeding the limited weight. These three approaches whose functionality will be demonstrated are brute force search, dynamic programming and greedy heuristics, taking a data.frame with knapsack objects called x and a knapsack weight W as parameters.

Settings

The knapsack package contains a preloaded data set with 2000 knapsack objects, ready to be optimized by one of the three functions.

head(knapsack_objects)

However for this purpose, we will generate three different data frames containing different number of knapsack objects $n$, where each row corresponds to an object with columns representing the weight w and value v. To be more precise, we will generate three data frames with $n=16$, $n=500$ and $n=1000000$ number of objects, respectively.

# Function to generate knapsack objects
knapsack_obj <- function(n){
  data.frame(w=sample(1:4000, size = n, replace = TRUE), v=runif(n = n, 0, 10000))
}

set.seed(42)
# The three data sets
obj_16 <- knapsack_obj(16)
obj_500 <- knapsack_obj(500)
obj_1000000 <- knapsack_obj(1000000)

In the upcoming sections, we will set the limited weight to W=3500. Also, we will be using the proc.time function to measure the running time of each function.

Brute force search

Brute force search goes through every single combination of objects to be able to maximize the value. The method guarantees that the value is maximized, but then you have to accept the complexity $0(2^n)$. Let us start by using the function brute_force_knapsack() with $n=16$ knapsack objects.

# Brute force search with n = 16
ptm <- proc.time()  
brute_force_knapsack(x = obj_16, W = 3500)
t1 <- proc.time() - ptm
t1

The total value of the five elements r brute_force_knapsack(x = obj_16, W = 3500)$elements give the maximum value of r format(brute_force_knapsack(x = obj_16, W = 3500)$value, scientific=FALSE). We can also see that the elapsed time is r t1[3] for $n=16$.

Dynamic programming

We can benefit from the fact that the weights are discrete in this setup. This enables us to iterate over all possible values of $w$, to obtain the same results as the brute force, but with a better complexity $O(Wn)$. Now we increase the number of objects to $n=500$.

# Dynamic programming n = 500
ptm <- proc.time()
knapsack_dynamic(x = obj_500, W = 3500)
t2 <- proc.time() - ptm

There are r length(knapsack_dynamic(x = obj_500, W = 3500)$elements) elements that maximize the value. The elapsed time is r t2[3] for $n=500$.

Greedy heuristics

In contrast to the already explained algorithms, the greedy heuristics will return at least $50%$ of the true maximum value, but reduces the complexity to $O(n \log n)$.

# Greedy heuristics with n = 1000000
ptm <- proc.time()
gr <- greedy_knapsack(x = obj_1000000 , W = 3500)
gr$value
gr$elements[1:20]
t3 <- proc.time() - ptm

In the final example we used $n = 1000000$ number of objects and the elapsed time is r t3[3]. From this we can conclude that it is much more time efficient compared to the other algorithms. The greedy heuristics performed a bit slower for $n = 1000000$ than brute force search for $n=16$.

Comparison

We suggest to use the greedy heuristics if you have a larger data set and accept not being able to get maximum value. In contrast, if you are interested in maximizing the value and if the weights take discrete values, then we recommend to use dynamic programming. However, brute force search will always return the maximum value and can be used unless the time efficiency is of the highest priority.

Test suite

The implementation of the greedy algorithm returns the same solution as the test suites. Should the implementation of the greedy algorithm allow for the selection of the same item multiple times, the maximum value may be greater than in the 0-1 case.

Profiling and optimizing the code

To increase the speed we will profile and optimize the code by using the package lineprof to identify bottlenecks. By identifying bottlenecks, code can be rewritten to be more efficient. E.g for loops have been rewritten to apply functions to make the brute force function slightly faster - in some cases the speed increases about x3.

Rprof()
brute_force_knapsack(obj_16, 3500, parallel = FALSE)
Rprof(NULL)
summaryRprof()

# $by.self
#                        self.time self.pct total.time total.pct
# "<Anonymous>"               0.52    23.21       1.94     86.61
# "[[.data.frame"             0.46    20.54       0.74     33.04
# "mapply"                    0.40    17.86       1.94     86.61
# "[["                        0.32    14.29       1.06     47.32
# "brute_force_knapsack"      0.30    13.39       2.24    100.00
# "%in%"                      0.24    10.71       0.24     10.71

Parallelize Brute Force Search

Unfortunately, attempting to parallelize the brute force search does not yield great results regarding the time required to run the function.

# Brute force search with n = 8 and parallel = TRUE
ptm <- proc.time()  
brute_force_knapsack(x = knapsack_objects[1:8,], W = 3500, parallel = TRUE)
t1 <- proc.time() - ptm
t1

By profiling the code, we can identify the bottleneck as functions related to the use of clusters - the more the calculations are split, the higher the overhead.

# Profiling for brute force search with n = 8 and parallel = TRUE
Rprof()
brute_force_knapsack(knapsack_objects[1:8,], 3500, parallel = TRUE)
Rprof(NULL)
summaryRprof()

# $by.self
#                     self.time self.pct total.time total.pct
# "socketConnection"       1.06    80.30       1.06     80.30
# "sendData.SOCKnode"      0.20    15.15       0.20     15.15
# "newPSOCKnode"           0.04     3.03       1.12     84.85
# "system"                 0.02     1.52       0.02      1.52
# 
# $by.total
#                         total.time total.pct self.time self.pct
# "brute_force_knapsack"        1.32    100.00      0.00     0.00
# "newPSOCKnode"                1.12     84.85      0.04     3.03
# "makePSOCKcluster"            1.12     84.85      0.00     0.00
# "parallel::makeCluster"       1.12     84.85      0.00     0.00
# "socketConnection"            1.06     80.30      1.06    80.30
# "sendData.SOCKnode"           0.20     15.15      0.20    15.15
# "<Anonymous>"                 0.20     15.15      0.00     0.00
# "mapply"                      0.20     15.15      0.00     0.00
# "outer"                       0.20     15.15      0.00     0.00
# "parallel::clusterMap"        0.20     15.15      0.00     0.00
# "postNode"                    0.20     15.15      0.00     0.00
# "sendCall"                    0.20     15.15      0.00     0.00
# "sendData"                    0.20     15.15      0.00     0.00
# "staticClusterApply"          0.20     15.15      0.00     0.00
# "system"                      0.02      1.52      0.02     1.52
# 
# $sample.interval
# [1] 0.02
# 
# $sampling.time
# [1] 1.32

With a different approach or a more efficient transfer of data to the clusters, the calculations could be made faster. For n up to 20, the parallel computation does not beat the serial one. Beyond that, both variants are too slow to be practically used.

See the source code for the brute force algorithm for information on the exact implementation.



sofiejorgensen/knapsack documentation built on Oct. 16, 2020, 2:51 a.m.