The knapsack problem, in its simplest form, asks: given a sack with a maximum carrying capacity, and a set of items of varying weights and values, take a subset of the items that has the highest total value but that still fits in the sack. The puzzlr package includes functions to create instances of the knapsack problem:
library(puzzlr)

set.seed(97317)
ks <- weakly_correlated_instance(n = 50)
ks
Each instance consists of a capacity and a set of items:

capacity(ks)
items(ks)
By default, items in knapsack instances are ordered in decreasing order of value per unit weight, or density. The top item for a given knapsack instance can be accessed with the next_item function:

next_item(ks)
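As a quick illustration of what density ordering means, here is a sketch on plain made-up vectors; it does not use puzzlr's internals, and the variable names are mine:

values <- c(30, 14, 16)
weights <- c(6, 5, 4)

# value per unit weight, i.e. density
values / weights
#> 5.0 2.8 4.0

# items reordered by decreasing density: item 1, then item 3, then item 2
order(values / weights, decreasing = TRUE)
#> 1 3 2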
Given a knapsack instance, a new instance can be created by either taking or leaving the next item; either choice leaves a smaller sub-problem:
take_next(ks)
leave_next(ks)
sub_problems(ks)
In this way, one can exhaustively enumerate all subsets of items: with the original problem as a root node, create two branches based on the sub-problems take_next(ks) and leave_next(ks), then construct each of their two sub-problems, and so on until there are no more items left to consider. The optimal solution is found in one of the leaf nodes of the tree -- among those leaf nodes whose aggregate weight does not exceed capacity(ks), it is the one that maximizes the total value. Given n items, there are $2^n$ leaf nodes. So how can one find the optimal solution in a reasonable amount of time?
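To make the size of that search space concrete, here is a small brute-force enumeration written against plain value and weight vectors rather than the puzzlr API (brute_force and its arguments are illustrative names of my own); it visits all $2^n$ leaves, which is only practical for very small n:

brute_force <- function(values, weights, cap) {
  # no items left: nothing more to add
  if (length(values) == 0) return(0)

  # branch 1: leave the first item
  best <- brute_force(values[-1], weights[-1], cap)

  # branch 2: take the first item, if it fits
  if (weights[1] <= cap) {
    best <- max(best,
                values[1] + brute_force(values[-1], weights[-1], cap - weights[1]))
  }
  best
}

brute_force(values = c(30, 14, 16), weights = c(6, 5, 4), cap = 10)
#> 46 (take the first and third items)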
Branch and bound is a family of algorithms for finding the provably optimal solution of a problem that, like the knapsack, can be represented in terms of branching sub-problems. In the case of the knapsack problem, given a fast way to estimate upper and lower bounds on the true optimum for a sub-problem, the search tree can be pruned during its construction: whenever a sub-problem's upper bound falls below the value of the best complete solution found so far, nothing in that branch can improve on the current best, so the entire branch can be discarded without being explored.
The lazily evaluated lists from the lazylist package provide an appealing way to implement branch-and-bound, due to the ability to define data structures recursively. I'll start with the high-level implementation and then fill in the details. I use empty_stream() to prune or terminate a search path, and merge multiple search paths into one by using a search_strategy that prioritizes search nodes:
library(lazylist)

solution_tree <- function(root, branch, search_strategy, best_so_far) {
  branches <- branch(root)
  if (length(branches) == 0) return(empty_stream())

  explore_and_prune <- function(node) {
    if (upper_bound(node) < best_so_far) return(empty_stream())
    new_best <- max(best_so_far, lower_bound(node))
    cons_stream(node,
                solution_tree(node,
                              branch = branch,
                              search_strategy = search_strategy,
                              best_so_far = new_best))
  }

  branches <- purrr::map(branches, explore_and_prune)
  purrr::reduce(branches, search_strategy)
}
A search strategy should be a function that takes two lazy lists (or "streams") and merges them into one. Since it's being applied repeatedly at every branch, I like to visualize its effect as braiding or interleaving all of the branches together. The result is a single stream of search nodes, with ordering dependent on the search strategy. There are multiple search strategies I might implement this way: breadth-first, depth-first, etc. As a first example, I'll implement best-first search, in which, at each stage of the search, the "best" node is considered, with "best" defined as the node with the highest upper_bound. Again, the ability to define things recursively helps -- best_first is defined as taking the better of the two head elements of two streams, along with the best_first of the remaining elements of both streams:
best_first <- function(s1, s2) {
  if (is_emptystream(s1)) return(s2)
  if (is_emptystream(s2)) return(s1)

  node1 <- stream_car(s1)
  node2 <- stream_car(s2)

  if (upper_bound(node1) >= upper_bound(node2)) {
    return(cons_stream(node1, best_first(stream_cdr(s1), s2)))
  }
  cons_stream(node2, best_first(s1, stream_cdr(s2)))
}
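The prose above mentions depth-first search as another strategy that could be expressed the same way. As a rough sketch (depth_first is my own illustrative helper, not part of lazylist or puzzlr), a depth-first strategy exhausts the first stream before touching the second, which amounts to lazily appending one stream to the other:

depth_first <- function(s1, s2) {
  # exhaust the first branch entirely before considering the second
  if (is_emptystream(s1)) return(s2)
  cons_stream(stream_car(s1), depth_first(stream_cdr(s1), s2))
}

In principle it could be passed wherever best_first is used below, though with depth-first order the first leaf reached is no longer guaranteed to be the optimum, so the stopping rule used later would have to change.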
I still need a way to create search nodes that have upper_bound and lower_bound methods. I'll define a search node as a knapsack instance with some additional information tacked on.
search_node <- function(ks, bounds) {
  # if there are no items left, then we can't add any more value
  if (n_items(ks) == 0) return(
    list(problem = ks,
         upper_bound = total_value(ks),
         lower_bound = total_value(ks))
  )

  b <- bounds(ks)
  list(problem = ks,
       upper_bound = b$upper_bound,
       lower_bound = b$lower_bound)
}

upper_bound <- function(node) node$upper_bound
lower_bound <- function(node) node$lower_bound
I've so far been working under the assumption that there is a quick way to estimate the upper and lower bounds of a knapsack instance. There are various trivial bounds for a problem, for instance a lower bound of 0, or an upper bound equal to the sum of the values of all items. I'm looking for two qualities from a bounding function, and they are in conflict with one another: the bounds should be as tight as possible, and they should be cheap to compute.
Since items are already sorted in order of density, I can take items one at a time until the knapsack is full, giving a decent lower bound. To get an upper bound, I consider an easier-to-solve problem: what is the most value possible if I'm allowed to take fractional items? Taking items in order of density and then taking a fraction of the next item results in a total value that is no less than the true optimum. The greedy function calculates both bounds, and also returns the remaining sub-problem after achieving the lower bound:
greedy <- function(ks) {
  item <- next_item(ks)

  # if the only remaining item doesn't fit
  if (n_items(ks) == 1 && item$weight > capacity(ks)) return(
    list(lower_bound = total_value(ks),
         upper_bound = total_value(ks),
         remaining = leave_next(ks))
  )

  remaining <- ks
  while (!is.null(item) && item$weight <= capacity(remaining)) {
    remaining <- take_next(remaining)
    item <- next_item(remaining)
  }

  lower_bound <- total_value(remaining)
  upper_bound <- lower_bound

  if (capacity(remaining) > 0 && !is.null(item)) {
    partial_amount <- capacity(remaining) / item$weight
    upper_bound <- upper_bound + (partial_amount * item$value)
  }

  list(lower_bound = lower_bound,
       upper_bound = upper_bound,
       remaining = remaining)
}

greedy(ks)

# the root of the search tree for ks:
search_node(ks, bounds = greedy)
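To see the fractional-relaxation bound in isolation, here is a self-contained sketch of the same idea on plain vectors, outside of puzzlr (fractional_bound and its arguments are illustrative names of my own); it assumes the items are already sorted by decreasing density:

fractional_bound <- function(values, weights, cap) {
  bound <- 0
  for (i in seq_along(values)) {
    if (weights[i] <= cap) {
      # the whole item fits: take all of it
      bound <- bound + values[i]
      cap <- cap - weights[i]
    } else {
      # take just the fraction that fills the remaining capacity, then stop
      bound <- bound + values[i] * cap / weights[i]
      break
    }
  }
  bound
}

# items sorted by density (5, 4, 2.8), capacity 8
fractional_bound(values = c(30, 16, 14), weights = c(6, 4, 5), cap = 8)
#> 38: all of item 1 plus half of item 2, while the best all-or-nothing solution is worth 30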
I now have all of the necessary components of a knapsack solver:
solve_knapsack <- function(ks, bounds, search_strategy) {
  root <- search_node(ks, bounds = bounds)

  # branch function takes a search node and returns its children
  branch <- function(node) {
    subprobs <- sub_problems(node$problem)
    purrr::map(subprobs, search_node, bounds = bounds)
  }

  cons_stream(root,
              solution_tree(root,
                            branch = branch,
                            search_strategy = search_strategy,
                            best_so_far = root$lower_bound))
}

solution <- solve_knapsack(ks, bounds = greedy, search_strategy = best_first)
solution[50]
Recall that the leaf nodes collectively hold all possible solutions. Since I'm using best-first search, the first leaf node that is processed must be the optimal solution: if there were a better solution than the first leaf node, it would have to have a higher upper bound, so it would have been processed earlier.

A leaf node is one where the sub-problem has no remaining items to consider.
optimal <- stream_which(solution, function(x) upper_bound(x) == lower_bound(x))
optimal <- optimal[1]
optimal

solution[optimal]

# view which items were taken
taken_items(solution[optimal]$problem)