Huffman coding

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Huffman coding is a compression approach based on character frequencies. To code a string, we work out the frequency of each letter in the string and then build a tree where we put the letters in the leaves and structure the tree such that the most frequent letters are closest to the root. We then encode each letter by the path we must take in that tree to get to the leaf that contain that letter. We need one bit for each edge we follow; zero, say, if we go left and one if we go right.

In Haskell: the Craft of Functional Programming, there is an example of implementing Huffman coding in Haskell. In this vignette, I will adapt that to R with pmatch patterns. I have structured the vignette such that I switch between showing the original Haskell code and then the corresponding R code. If you are not familiar with Haskell, I think you should still be able to follow along.

Data types

The data structures used in the Haskell implementation are these:

data Tree = Leaf Char Int | Node Int Tree Tree
data Bit = L | R deriving (Eq, Show)
type HCode = [ Bit ]
type Table = [ (Char, HCode) ]

The first two are data types created from constructors, and we can define the corresponding R data structures using a := declaration:

library(pmatch)

Tree := Leaf(char : character, count : integer) | 
        Node(count : integer, left : Tree, right : Tree)
Bit := L | R

The other two, HCode and Table are lists over Bit and pairs of characters and HCode, respectively. We won't define types for these. In R, we don't have static types so we don't really name types. We need a list representation, however, and to make one that matches lists in Haskell, we define a linked list:

llist := NIL | CONS(car, cdr : llist)

On occasions, we have to create singleton lists, and rather than writing CONS(val,NIL) for those, I will define a function for it. I will define a type to handle pairs in pattern matching.

single <- function(val) CONS(val, NIL)
pair := PAIR(x, y)

Computing frequencies

Before we can encode a string we need to calculate the frequency of the characters in it. In the Haskell implementation this is done using two merge sorts, one that sorts the string characters alphabetically and count the number of occurrences of each and then another sort that orders the characters according to their frequency. Later on, we will build a tree from a list of characters, and by having the least frequent characters at the front of the list, and building from that end, we will propagate the less frequent characters to the bottom of the tree and the more frequent characters to the top.

The Haskell mergeSort function is implemented like this:

mergeSort :: ([a] -> [a] -> [a]) -> [a] -> [a]
mergeSort merge xs
  | length xs < 2 = xs
  | otherwise
      = merge (mergeSort merge first) (mergeSort merge second)
        where
          first = take half xs
          second = drop half xs
          half = (length xs) `div` 2

This function takes another function, merge, as a parameter. It is via this function we will implement the two different sort functions we need.

We don't have where blocks in R, so we need to compute first, second and half before the recursion, but otherwise the the R implementation follows the Haskell implementation. We need some list functions first, though, for length, take and drop. We can implement them like this:

llength <- function(llist, acc = 0) {
    cases(
        llist,
        NIL -> acc,
        CONS(car, cdr) -> llength(cdr, acc + 1)
    )
}

llrev <- function(llist, acc = NIL) {
    cases(
        llist,
        NIL -> acc,
        CONS(car, cdr) -> llrev(cdr, CONS(car, acc))
    )
}

lltake <- function(llist, k, acc = NIL) {
    if (k == 0) return(llrev(acc))
    cases(
        llist,
        NIL -> stop("There were less than k elements in the list"),
        CONS(car, cdr) -> lltake(cdr, k - 1, CONS(car, acc))
    )
}
lldrop <- function(llist, k, acc = NIL) {
    if (k == 0) return(llist)
    cases(
        llist,
        NIL -> stop("There were less than k elements in the list"),
        CONS(car, cdr) -> lldrop(cdr, k - 1)
    )
}

The llrev function, for reversing a linked list, is a helper function that lets us write tail-recursive take and drop functions. It isn't terribly important here since we do not use any tail-recursion optimisation, but once you get into the habit of writing your functions tail recurse it is hard not to. Later on in the vignette I will use simpler functions rather than tail-recursive ones to make the examples easier to follow.

Anyway, with the list functions in place, the merge sort can be implemented like this:

merge_sort <- function(merge, xs) {
    n <- llength(xs)
    if (n < 2) return(xs)
    half <- n %/% 2 
    first <- lltake(xs, half)
    second <- lldrop(xs, half)
    merge(merge_sort(merge, first), merge_sort(merge, second))
}

Sorting alphabetically and counting characters

To sort characters alphabetically and at the same time count the number of occurrences for each, we need a merge function. The Haskell function looks like this:

alphaMerge :: [(Char,Int)] -> [(Char,Int)] -> [(Char,Int)]
alphaMerge xs [] = xs
alphaMerge [] ys = ys
alphaMerge ((p,n) : xs) ((q,m) : ys)
  | (p == q)  = (p,n+m) : alphaMerge xs ys
  | (p < q)   = (p,n) : alphaMerge xs ((q,m) : ys)
  | otherwise = (q,m) : alphaMerge ((p,n) : xs) ys

We do not have conditions like | (p == q) in R, so we handle those using if-else-statements, but otherwise we translate the function statement by statement into this:

alpha_merge <- function(xs, ys) {
    cases(
        ..(xs, ys),
        ..(xs, NIL) -> xs,
        ..(NIL, ys) -> ys,
        ..(CONS(PAIR(p,n), xs), CONS(PAIR(q,m), ys)) -> {
            if (p == q) {
                CONS(PAIR(p,n+m), alpha_merge(xs, ys))
            } else if (p < q) {
                CONS(PAIR(p,n), alpha_merge(xs, CONS(PAIR(q,m), ys)))
            } else {
                CONS(PAIR(q,m), alpha_merge(CONS(PAIR(p,n), xs), ys))
            }
        }
    )
}

To test the function I want to make it a little easier to build linked lists, so I have written this helper function:

llist_from_list <- function(x) {
    llist <- NIL
    n <- length(x)
    while (n > 0) {
        llist <- CONS(x[[n]], llist)
        n <- n - 1
    }
    llist
}

Ok, let us see if we can merge two lists:

xs <- llist_from_list(list(
    PAIR("a", 1), PAIR("b", 2)
))
ys <- llist_from_list(list(
    PAIR("a", 3), PAIR("c", 3)
))
alpha_merge(xs, ys)

Yup, that went okay.

What about sorting? First I want a function for concatenating two linked lists, and then I will test it on the concatenation of xs and ys from above.

Concatenation can look like this:

llconcat <- function(xs, ys) {
    cases(xs, 
          NIL -> ys,
          CONS(x,xs) -> CONS(x, llconcat(xs, ys)))
}

(Here, I did choose a simple version over a tail-recursive one).

Onwards to the test:

zs <- llconcat(xs, ys)
merge_sort(alpha_merge, zs)
merge_sort(alpha_merge, llrev(zs))

Yup, that works.

Sorting based on character frequency

To compute the character frequency we need to write this function:

freqMerge :: [(Char,Int)] -> [(Char,Int)] -> [(Char,Int)]
freqMerge xs [] = xs
freqMerge [] ys = ys
freqMerge ((p,n):xs) ((q,m):ys)
  | (n < m || (n == m && p < q)) = (p,n) : freqMerge xs ((q,m):ys)
  | otherwise                    = (q,m) : freqMerge((p,n):xs) ys

The translation is done statement by statement and with if-statements instead of | conditions:

freq_merge <- function(xs, ys) {
    cases(
        ..(xs, ys),
        ..(xs, NIL) -> xs,
        ..(NIL, ys) -> ys,
        ..(CONS(PAIR(p,n), xs), CONS(PAIR(q,m), ys)) -> {
            if (n < m || (n == m && p < q)) {
                CONS(PAIR(p,n), freq_merge(xs, CONS(PAIR(q,m), ys)))
            } else {
                CONS(PAIR(q,m), freq_merge(CONS(PAIR(p,n), xs), ys))
            }      
        }
    )
}
xs <- llist_from_list(list(
    PAIR("a", 5), PAIR("b", 3), PAIR("c", 2)
))
merge_sort(freq_merge, xs)

Computing frequencies

To compute a list of character frequencies for a string, we need to combine the sort and merge functions. In Haskell, the frequency function looks like this:

frequency :: [Char] -> [(Char,Int)]
frequency
  = mergeSort freqMerge . mergeSort alphaMerge . map start
    where
      start ch = (ch, 1)

We see that we need another list function, map. I have implemented it this way:

llmap <- function(llist, f, acc = NIL) {
    cases(
        llist,
        NIL -> llrev(acc),
        CONS(car, cdr) -> llmap(cdr, f, CONS(f(car), acc))
    )
}

I want to work on lists of characters, so as a helper function I have written string_to_list that translates a string into a list containing the characters in the string:

string_to_list <- function(x)
    llist_from_list(strsplit(x, "")[[1]])

string_to_list("foobar")

For composing functions, we can use magrittr pipelines. If we do, we just have to remember that the functions are evaluated from left-to-right, whereas in the Haskell code, they are evaluated right-to-left. So

mergeSort freqMerge . mergeSort alphaMerge . map start

becomes

library(magrittr)
frequency = . %>% 
    string_to_list() %>% 
    llmap(function(char) PAIR(char, 1L)) %>%
    merge_sort(alpha_merge, .) %>%
    merge_sort(freq_merge, .)

If we call frequency on a string, we get a list of pairs of characters and integers. We should interpret that as a list of characters and the number of occurrences each have in the string.

frequency("foo")

Building trees

From a frequency list we need to build tree. This is the Haskell code:

makeTree :: [ (Char,Int) ] -> Tree
makeTree = makeCodes . toTreeList

toTreeList :: [ (Char,Int) ] -> [Tree]
toTreeList = map (uncurry Leaf)

The uncurl Leaf stuff is a technicality needed in Haskell. To translate a list of character-count pairs into a list of trees, we simply do this:

to_tree_list <- . %>% llmap(
    function(x) cases(x, PAIR(char,count) -> Leaf(char, count))
)
xs <- "foo" %>% string_to_list() %>% llmap(function(char) PAIR(char, 1L))
to_tree_list(xs)

The makeCodes function looks like this in Haskell:

makeCodes :: [ Tree ] -> Tree
makeCodes [t] = t
makeCodes ts = makeCodes (amalgamate ts)

and we can translate it almost mechanically:

make_codes <- function(ts) {
    cases(
        ts,
        CONS(t, NIL) -> t,
        ts -> make_codes(amalgamate(ts))
    )
}

The same goes for amalgamate:

amalgamate :: [ Tree ] -> [ Tree ]
amalgamate (t1:t2:ts) = insTree (pair t1 t2) ts
amalgamate <- function(ts) {
    cases(
        ts,
        CONS(t1, CONS(t2, ts)) -> ins_tree(pair(t1, t2), ts)
    )
}

We can see that we need functions pair and ins_tree for these functions. The pair function shouldn't be confused with the PAIR constructor—one of the reasons I put the latter in all uppercase. Other than that, the entire translation of the Haskell code is just done statement by statement:

pair :: Tree -> Tree -> Tree
pair t1 t2 = Node (v1 + v2) t1 t2
             where v1 = value t1
                   v2 = value t2

value :: Tree -> Int
value (Leaf _ n)   = n
value (Node n _ _) = n
value <- function(tree) {
    cases(
        tree,
        Leaf(., n) -> n,
        Node(n, ., .) -> n
    )
}
pair <- function(t1, t2) Node(value(t1) + value(t2), t1, t2)
insTree :: Tree -> [ Tree ] -> [ Tree ]
insTree t1 []          = [t1]
insTree t1 (t2 : rest) =
  if v1 < v2 then t1 : t2 : rest else t2 : insTree t1 rest
  where v1 = value t1
        v2 = value t2
ins_tree <- function(t, ts) {
    cases(
        ..(t, ts),
        ..(t, NIL) -> single(t),
        ..(t1, CONS(t2, rest)) -> {
            v1 <- value(t1)
            v2 <- value(t2)
            if (v1 < v2) {
                CONS(t1, CONS(t2, rest))
            } else {
                CONS(t2, ins_tree(t1, rest))
            }
        }
    )
}

The makeTree function was defined as a composition:

makeTree :: [ (Char,Int) ] -> Tree
makeTree = makeCodes . toTreeList

We do the same using a pipeline:

make_tree <- . %>%
    to_tree_list() %>%
    make_codes()
tree <- "foo" %>% frequency() %>% make_tree()
tree

I have written some code for plotting trees, in the function plot_tree. I don't show it here, it is a simple translation of the data structure into a table I can plot using tidy graph and ggraph, but we can use it to display trees:

library(tidygraph)
library(ggraph)

plot_tree <- function(tree) {
    dfn <- function(tree, n) {
        cases(
            tree,
            Leaf(char,count) ->
                list(structure(Leaf(char,count), number = n), n + 1L),
            Node(count, left, right) -> {
                bind[left,n] <- dfn(left, n)
                bind[right,n] <- dfn(right, n)
                list(structure(Node(count, left, right), number = n), n + 1L)
            }
        )
    }
    bind[tree,size] <- dfn(tree, 0L)

    node_label <- rep("", size)

    from <- vector("integer", length = size - 1)
    to <-  vector("integer", length = size - 1)
    edge_label <- vector("character", length = size - 1)

    index <- 1
    populate_table <- function(tree) {
        cases(
            tree,
            Leaf(char, count) ~ {
                node_label[attr(tree, "number") + 1] <<- char
            },
            Node(., left, right) ~ {
                this_node_n <- attr(tree, "number")
                left_node_n <- attr(left, "number")
                right_node_n <- attr(right, "number")
                from[index] <<- this_node_n ; to[index] <<- left_node_n
                edge_label[index] <<- "L"
                index <<- index + 1
                from[index] <<- this_node_n ; to[index] <<- right_node_n
                edge_label[index] <<- "R"

                index <<- index + 1
                populate_table(left)
                populate_table(right)
            }
        )
    }
    populate_table(tree)

    nodes <- data.frame(node_label = node_label, 
                        stringsAsFactors = FALSE)
    # add one because ggraph doesn't like zero
    edges <- data.frame(from = from + 1, to = to + 1, 
                        edge_label = edge_label,
                        stringsAsFactors = FALSE)
    tidygraph::tbl_graph(nodes, edges) %>%
        ggraph(layout = "tree") +
        geom_node_point(aes(filter = (node_label == "")), size = 5) +
        geom_node_text(aes(filter = (node_label != ""), label = node_label)) +
        geom_edge_link(aes(label = edge_label, hjust = 2),
                       arrow = arrow(length = unit(4, 'mm')), 
                       end_cap = circle(3, 'mm')) +
        theme_graph()
}
"foobarbaz" %>% frequency() %>% make_tree() %>% plot_tree()

Code tables

The result of an encoding will be a sequence of L and R bits. We will represent these in linked lists, so to make it simpler to display these I have written these helper functions:

as.list.llist <- function(x, all.names = FALSE, sorted = FALSE, ...) {
    n <- llength(x)
    v <- vector("list", length = n)
    i <- 1
    while (i <= n) {
        v[i] <- x$car
        i <- i + 1
        x <- x$cdr
    }
    v
}
as.vector.llist <- function(x, mode = "any") {
    unlist(as.list(x))
}

toString.code <- function(x, ...) {
    x %>% llmap(toString) %>% as.vector() %>% paste0(collapse = "")
}

code <- llist_from_list(
    list(L, L, R, L)
)
code
class(code) <- c("code", class(code))
code %>% toString() %>% cat()

I do not simply set the class of a list to "code". I could, and most of the time that wouldn't be a problem. However, in the specification of linked lists we required that cdr should be of type llist, so if we want to concatenate two codes—and I can reveal that we do later—then we need codes to have that type. I therefore make codes specialisations of llist with c("code", class(code)).

The encoding of a string will be implemented as follows: for each character in the string, we look up the code for that character, i.e. the sequence of left and right edges we need to reach the character in a tree, and we concatenate all these. In the implementation of the encoding we want a table that maps characters to their code, and in Haskell that table is computed using this function:

convert :: HCode -> Tree -> Table
convert cd (Leaf c n) = [(c,cd)]
convert cd (Node n t1 t2) = (convert (cd++[L]) t1) ++
                            (convert (cd++[R]) t2)

codeTable :: Tree -> Table
codeTable = convert []

We can implement it in R like this:

convert <- function(code, tree) {
    cases(
        tree,
        Leaf(char, count) -> single(
            PAIR(char,structure(code, class = c("code", class(code))))
        ),
        Node(count, t1, t2) -> llconcat(
            convert(llconcat(code, single(L)), t1),
            convert(llconcat(code, single(R)), t2)
        )
    )
}

code_table <- . %>% convert(NIL, .) %>% 
    structure(class = c("code_table", class(.)))

I have added a class to tables to make them easier to print.

toString.code_table <- function(x, ...) {
    x %>% llmap(function(p) 
        cases(p, PAIR(x,y) ~ paste0(x, " => ", toString(y)))
    ) %>% as.vector() %>% paste0(collapse = "\n")
}
tree <- "foobaar" %>% frequency() %>% make_tree()
table <- tree %>% code_table()
tree %>% plot_tree()
table %>% toString() %>% cat()

tree <- "foobarbaz" %>% frequency() %>% make_tree()
table <- tree %>% code_table()
tree %>% plot_tree()
table %>% toString() %>% cat()

Coding and decoding

Now, we have everything we need in order to encode and decode strings/codes. Both encoding and decoding requires a tree, so we define a function for building a tree from a string:

codes :: [Char] -> Tree
codes = makeTree . frequency
codes <- . %>% frequency() %>% make_tree()
tree <- codes("foobarbaz")
tree %>% code_table() %>% toString() %>% cat()

As I have already explained, encoding combines looking up the code for individual characters and concatenating all of them. The lookup is implemented in Haskell like this:

lookupTable :: Table -> Char -> HCode
lookupTable []            c = error "lookupTable"
lookupTable ((ch,n) : tb) c
  | ch == c                 = n
  | otherwise               = lookupTable tb c

And in R like this:

lookup_table <- function(table, char) {
   cases(
       table,
       NIL -> stop(paste(char, "was not found in the table")),
       CONS(PAIR(ch,n), tb) -> {
           if (ch == char) {
               n
           } else {
               lookup_table(tb, char)
           }
       }
   ) 
}

The encoding combines that with concatenation. We map over the characters and look up their code. We then concatenate the result.

codeMessage :: Table -> [Char] -> HCode
codeMessage tbl = concat . map (lookupTable tbl)

Our llconcat function only works on two linked lists, however, so we need to handle concatenation slightly differently. I have handled it by writing a reduce function that I can use to concatenate lists pairwise:

llreduce <- function(llist, f, acc) {
    cases(
        llist,
        NIL -> acc,
        CONS(val, rest) -> llreduce(rest, f, f(acc, val))  
    )
}

The code_message is very similar to the Haskell version, but I have added a translation from strings to lists before the main code (this isn't needed in Haskell because strings are already lists of characters) and I set the class of the result to make it easier to print coded messages.

code_message <- function(message, table) {
    message %>% string_to_list() %>%
        llmap(function(char) lookup_table(table, char)) %>%
        llreduce(llconcat, NIL) %>% 
        structure(class = c("code", class(.)))
}

tree <- codes("foobarbaz")
table <- tree %>% code_table()
code <- "foobar" %>% code_message(table)
code %>% toString() %>% cat()

Decoding works by following paths down the tree until we get to a character, which we then add to the list as the decoding of a part of the code, and doing this as long as we have path-bits in our encoded message.

decodeMessage :: Tree -> HCode -> [Char]
decodeMessage tr =
  decodeByTree tr
  where
    decodeByTree (Node n t1 t2) (L : rest) = decodeByTree t1 rest
    decodeByTree (Node n t1 t2) (R : rest) = decodeByTree t2 rest
    decodeByTree (Leaf c n)     rest       = c : decodeByTree tr rest
    decodeByTree t              []         = []

For this function, the R code closely follows the Haskell code, except that we have to put the decode_by_tree function at the beginning of the function instead of in a where block:

decode_message <- function(tree, code) {
    decode_by_tree <- function(tr, code) {
        cases(
            ..(tr, code),
            ..(Node(n, t1, t2), CONS(L, rest)) -> decode_by_tree(t1, rest),
            ..(Node(n, t1, t2), CONS(R, rest)) -> decode_by_tree(t2, rest),
            ..(Leaf(char, n), rest) -> CONS(char, decode_by_tree(tree, rest)),
            ..(t, NIL) -> NIL,
        )
    }
    decode_by_tree(tree, code) %>% as.vector() %>% paste0(collapse = "")
}

decode_message(tree, code)

That's it. Huffman coding in R.



Try the pmatch package in your browser

Any scripts or data that you put into this service are public.

pmatch documentation built on Oct. 19, 2018, 5:04 p.m.