This vignette introduces the rstatic API for accessing information about an program.
The rstatic package is all about extracting information from R code. We can use this information in various ways: to check for programming errors, to estimate hardware requirements, to gain a better understanding of how the code is organized (especially for unfamiliar code), to generate additional code, or to compile.
Code is typically written and stored as a string of text. Strings don't expose any information about structures within the code, so the first step in analysis is to parse the code into an intermediate representation. Since the purpose of an intermediate representation is to make code easier to explore and manipulate, many different intermediate representations exist. Choosing the right intermediate representation for a given task can make the task much easier, so many code tools, including this package, support more than one intermediate representation.
The rstatic package can convert R code to two different intermediate representations: abstract syntax trees and control flow graphs. In an abstract syntax tree, each node in the tree corresponds to an expression. When an expression is composed of subexpressions, these appear as its children in the tree.
This is similar to a parse tree, the R interpreter's built-in representation
for code. The main difference is that abstract syntax trees abstract away some
of the syntactic details that get retained in a parse tree. Syntax is usually
not as important for analysis as the meaning, or semantics, of the code. In an
R parse tree, the class of each node is determined by the syntax of the
expression the node corresponds to. For example, any expression with call
syntax f()
is represented by a node with class call
. This includes the
return()
statement, even though the meaning of the return statement is quite
different from an ordinary function call.
Let's start by examining the AST for an R implementation of the quadratic
formula. The to_ast()
function converts functions and quoted R expressions to
abstract syntax trees.
quad = function(a, b, c) { rad = sqrt(b^2 - 4*a*c) roots = (-b + c(rad, -rad)) / 2*a return (roots) } library(rstatic) ast = to_ast(quad) ast
Printing ast
reveals that it's a Function
object and shows the code inside.
For convenience, descendants of ASTNode
list all field and method names when
printed.
class(ast)
The contents of the function are in the body
field. Since the function has
multiple lines, they're wrapped in curly braces. Thus the body
field holds a
Brace
object.
braces = ast$body braces
Brace
objects also have a body
field, which lists the lines being wrapped.
For this example, there are three lines: two assignments and a return
statement, represented by Assign
and Return
objects, respectively.
braces$body
Inspecting the first Assign
, we see that it has members read
and write
in
addition to the inherited ASTNode
members. The write
field holds the
variable being assigned (that is, the left side of the assignment).
lhs = braces$body[[1]]$write lhs
Drilling down further, lhs
holds a Symbol
, which has fields basename
,
ssa_number
, and name
in addition to the inherited members. The basename
field is original name of the variable in the R program, as a string.
lhs$basename
The ssa_number
and name
fields are used for SSA form, which is discussed in
a subsequent section of this vignette. We can change basename
to change the
name of the variable in the code.
lhs$basename = "radical" ast
ASTs are cumbersome to manipulate by hand. In practice, the easiest way to apply operations to a whole AST is to write a recursive S3 method. For instance, if we wanted to rename all instances of a variable, we could use the method below:
replace_var = function(node, old, new) { UseMethod("replace_var") } replace_var.Symbol = function(node, old, new) { if (node$basename == old) node$basename = new invisible(NULL) } replace_var.Brace = function(node, old, new) { lapply(node$body, replace_var, old, new) invisible(NULL) } replace_var.Application = function(node, old, new) { lapply(node$args, replace_var, old, new) invisible(NULL) } replace_var.Function = function(node, old, new) { replace_var(node$body, old, new) invisible(NULL) } replace_var.Assign = function(node, old, new) { replace_var(node$read, old, new) replace_var(node$write, old, new) invisible(NULL) } replace_var.Literal = function(node, old, new) { }
This makes it easy to change rad
to radical
globally:
replace_var(ast, "rad", "radical") ast
There are a few things worth noting in this example. First, the AST is updated
in place rather than returned by replace_var
. The NULL
return values are
not strictly necessary, but emphasize that replace_var
is used only for its
side effects. Although it wasn't necessary here, it's often useful to define a
helper R6 class to hold additional state during recursive tree traversal.
Second, the method had to be defined for every class of node that could appear
in the tree. This boilerplate is a tradeoff for flexibility. Intermediate
classes such as Application
(for expressions that have arguments) and
Literal
(for literals) help to reduce the boilerplate. For convenience, the
package may have more intermediate classes in the future.
Third, in this example there's no real benefit, other than clarity, in traversing the AST rather than the parse tree for the code. This is usually the case for operations that don't use information about each node's context in the tree. On the other hand, for operations that do use contextual information, the AST makes it possible to inspect and even modify ancestors the current node.
Finally, the method could be vectorized in old
and new
by changing
replace_var.Symbol()
.
The benefits of working with ASTs over parse trees are relatively small. Most of the analysis passes in rstatic operate on control flow graphs, which are convenient because they make control flow explicit and use a common language (graph edges) for all control flow structures.
The function shown below simulates a two-state Markov chain for n
steps. The
control flow is relatively simple, with just one for-loop and one return
statement.
markov2 = function(n, p, init = 0) { x = c(init, numeric(n)) for (i in 1:n) { idx = x[i] + 1 x[i + 1] = rbinom(1, 1, p[idx]) } return (x[-1]) }
We can convert the markov2
function to a control flow graph with the
to_blocks()
method. When given a function as input, to_blocks()
will
automatically convert it to an AST and then convert the AST to a CFG.
markov2 = to_blocks(markov2) cfg = markov2$cfg class(cfg)
As you might expect, the ControlFlowGraph
class represents a control flow
graph. In rstatic, each control flow graph corresponds to one function;
to_blocks()
automatically wraps code in a function definition as needed.
ControlFlowGraph
is an R6 class, so the cfg
variable is a reference, as
discussed in the previous section.
ControlFlowGraph
is a subclass of FlowGraph
, which represents flow graphs
in general. The FlowGraph
class has several fields, but the most important
ones are blocks
and graph
. The blocks
field holds a list with one
element, or block, for each vertex in the graph. The names on the list match
the vertex names on the graph. Each block holds metadata about the
corresponding node. For convenience, elements of the blocks
field can be
accessed by name or index with the extraction operator [[
on the FlowGraph
object.
For instance, the code below gets the first block in cfg
.
cfg[[1]]
The graph
field holds the actual graph, as an igraph graph. Calling
plot()
on a FlowGraph
object displays a visualization of the graph.
cfg$graph plot(cfg)
In a control flow graph, the vertices are blocks of code that execute linearly, without any branches in control flow. These vertices are sometimes called basic blocks. At the end of each basic block, control flows to another basic block, which may depend on a condition. The directed edges in the control flow graph show how control flows from block to block.
The blocks in the ControlFlowGraph
class are basic blocks, which are
represented by the BasicBlock
class. The BasicBlock
class has fields
body
, name
, phi
, and terminator
. The name
field holds the block's
name, which is the same as the vertex name. By default, rstatic uses a "%"
followed by an integer for block names. The "%" helps to avoid confusion with
integer indices.
cfg[[1]]$name
The body
field holds a list of ASTNode
objects, one for each ordinary
expression in the block.
cfg[[1]]$body
The terminator
field holds the final expression in the block, which is always
a branch to a different block or a return from the function the CFG represents.
These are represented by subclasses of the Terminator
class, which are
discussed in more detail in a subsequent section.
cfg[[1]]$terminator
The phi
field holds a list of Phi
objects. These are related to SSA and
appear at the beginning of the basic block when the block is printed. The Phi
class is discussed in the "Static Single-Assignment Form" section.
The Terminator
class has only one method, successors()
, which returns the
indices of the basic blocks that control flows to from the terminator.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.