In nick-ulle/ast: Low-level Static Analysis Tools for R Code

This vignette introduces the rstatic API for accessing information about an program.

Getting Started With Code Analysis

The rstatic package is all about extracting information from R code. We can use this information in various ways: to check for programming errors, to estimate hardware requirements, to gain a better understanding of how the code is organized (especially for unfamiliar code), to generate additional code, or to compile.

Code is typically written and stored as a string of text. Strings don't expose any information about structures within the code, so the first step in analysis is to parse the code into an intermediate representation. Since the purpose of an intermediate representation is to make code easier to explore and manipulate, many different intermediate representations exist. Choosing the right intermediate representation for a given task can make the task much easier, so many code tools, including this package, support more than one intermediate representation.

The rstatic package can convert R code to two different intermediate representations: abstract syntax trees and control flow graphs. In an abstract syntax tree, each node in the tree corresponds to an expression. When an expression is composed of subexpressions, these appear as its children in the tree.

This is similar to a parse tree, the R interpreter's built-in representation for code. The main difference is that abstract syntax trees abstract away some of the syntactic details that get retained in a parse tree. Syntax is usually not as important for analysis as the meaning, or semantics, of the code. In an R parse tree, the class of each node is determined by the syntax of the expression the node corresponds to. For example, any expression with call syntax f() is represented by a node with class call. This includes the return() statement, even though the meaning of the return statement is quite different from an ordinary function call.

Example 1: Interacting With Code

Let's start by examining the AST for an R implementation of the quadratic formula. The to_ast() function converts functions and quoted R expressions to abstract syntax trees.

quad = function(a, b, c) {
  rad = sqrt(b^2 - 4*a*c)
  roots = (-b + c(rad, -rad)) / 2*a

  return (roots)
}

library(rstatic)

ast = to_ast(quad)

ast

Printing ast reveals that it's a Function object and shows the code inside. For convenience, descendants of ASTNode list all field and method names when printed.

class(ast)

The contents of the function are in the body field. Since the function has multiple lines, they're wrapped in curly braces. Thus the body field holds a Brace object.

braces = ast$body

braces

Brace objects also have a body field, which lists the lines being wrapped. For this example, there are three lines: two assignments and a return statement, represented by Assign and Return objects, respectively.

braces$body

Inspecting the first Assign, we see that it has members read and write in addition to the inherited ASTNode members. The write field holds the variable being assigned (that is, the left side of the assignment).

lhs = braces$body[[1]]$write

lhs

Drilling down further, lhs holds a Symbol, which has fields basename, ssa_number, and name in addition to the inherited members. The basename field is original name of the variable in the R program, as a string.

lhs$basename

The ssa_number and name fields are used for SSA form, which is discussed in a subsequent section of this vignette. We can change basename to change the name of the variable in the code.

lhs$basename = "radical"

ast

ASTs are cumbersome to manipulate by hand. In practice, the easiest way to apply operations to a whole AST is to write a recursive S3 method. For instance, if we wanted to rename all instances of a variable, we could use the method below:

replace_var = function(node, old, new) {
  UseMethod("replace_var")
}

replace_var.Symbol = function(node, old, new) {
  if (node$basename == old)
    node$basename = new
  invisible(NULL)
}

replace_var.Brace =
function(node, old, new) {
  lapply(node$body, replace_var, old, new)
  invisible(NULL)
}

replace_var.Application = function(node, old, new) {
  lapply(node$args, replace_var, old, new)
  invisible(NULL)
}

replace_var.Function = function(node, old, new) {
  replace_var(node$body, old, new)
  invisible(NULL)
}

replace_var.Assign = function(node, old, new) {
  replace_var(node$read, old, new)
  replace_var(node$write, old, new)
  invisible(NULL)
}

replace_var.Literal = function(node, old, new) { }

This makes it easy to change rad to radical globally:

replace_var(ast, "rad", "radical")

ast

There are a few things worth noting in this example. First, the AST is updated in place rather than returned by replace_var. The NULL return values are not strictly necessary, but emphasize that replace_var is used only for its side effects. Although it wasn't necessary here, it's often useful to define a helper R6 class to hold additional state during recursive tree traversal.

Second, the method had to be defined for every class of node that could appear in the tree. This boilerplate is a tradeoff for flexibility. Intermediate classes such as Application (for expressions that have arguments) and Literal (for literals) help to reduce the boilerplate. For convenience, the package may have more intermediate classes in the future.

Third, in this example there's no real benefit, other than clarity, in traversing the AST rather than the parse tree for the code. This is usually the case for operations that don't use information about each node's context in the tree. On the other hand, for operations that do use contextual information, the AST makes it possible to inspect and even modify ancestors the current node.

Finally, the method could be vectorized in old and new by changing replace_var.Symbol().

Control Flow Graphs

The benefits of working with ASTs over parse trees are relatively small. Most of the analysis passes in rstatic operate on control flow graphs, which are convenient because they make control flow explicit and use a common language (graph edges) for all control flow structures.

The function shown below simulates a two-state Markov chain for n steps. The control flow is relatively simple, with just one for-loop and one return statement.

markov2 = function(n, p, init = 0) {
  x = c(init, numeric(n))
  for (i in 1:n) {
    idx = x[i] + 1
    x[i + 1] = rbinom(1, 1, p[idx])
  }

  return (x[-1])
}

We can convert the markov2 function to a control flow graph with the to_blocks() method. When given a function as input, to_blocks() will automatically convert it to an AST and then convert the AST to a CFG.

markov2 = to_blocks(markov2)

cfg = markov2$cfg
class(cfg)

As you might expect, the ControlFlowGraph class represents a control flow graph. In rstatic, each control flow graph corresponds to one function; to_blocks() automatically wraps code in a function definition as needed. ControlFlowGraph is an R6 class, so the cfg variable is a reference, as discussed in the previous section.

ControlFlowGraph is a subclass of FlowGraph, which represents flow graphs in general. The FlowGraph class has several fields, but the most important ones are blocks and graph. The blocks field holds a list with one element, or block, for each vertex in the graph. The names on the list match the vertex names on the graph. Each block holds metadata about the corresponding node. For convenience, elements of the blocks field can be accessed by name or index with the extraction operator [[ on the FlowGraph object.

For instance, the code below gets the first block in cfg.

cfg[[1]]

The graph field holds the actual graph, as an igraph graph. Calling plot() on a FlowGraph object displays a visualization of the graph.

cfg$graph

plot(cfg)

In a control flow graph, the vertices are blocks of code that execute linearly, without any branches in control flow. These vertices are sometimes called basic blocks. At the end of each basic block, control flows to another basic block, which may depend on a condition. The directed edges in the control flow graph show how control flows from block to block.

The blocks in the ControlFlowGraph class are basic blocks, which are represented by the BasicBlock class. The BasicBlock class has fields body, name, phi, and terminator. The name field holds the block's name, which is the same as the vertex name. By default, rstatic uses a "%" followed by an integer for block names. The "%" helps to avoid confusion with integer indices.

cfg[[1]]$name

The body field holds a list of ASTNode objects, one for each ordinary expression in the block.

cfg[[1]]$body

The terminator field holds the final expression in the block, which is always a branch to a different block or a return from the function the CFG represents. These are represented by subclasses of the Terminator class, which are discussed in more detail in a subsequent section.

cfg[[1]]$terminator

The phi field holds a list of Phi objects. These are related to SSA and appear at the beginning of the basic block when the block is printed. The Phi class is discussed in the "Static Single-Assignment Form" section.