\DeclareMathOperator{\vector}{vec} \DeclareMathOperator{\Numeric}{numeric} \DeclareMathOperator{\Integer}{integer}
library(RTypeInference)
The purpose of this vignette is to describe how the goals of the RTypeInference package.
Type inference is the process of extracting information about the types of variables in a script. The type information is important because it reveals which operations are valid for a given variable and how much memory needs to be allocated for that variable. In some languages, programmers are required to annotate variables with types, but this is not the case in R. Instead, the R interpreter keeps track of types at run-time, so programmers don't have to provide any annotations. The downside of this is that tracking types has a memory cost and checking types has a time cost.
When we compile a script, we can examine more than just the current line (in contrast to interpreting a script). Often there are clues in the code that reveal the types of variables. We can use these clues to infer the types of some or even all variables. This process is called type inference.
Type inference is an area of active research in the computer science community. An early paper on type inference was "A Theory of Type Polymorphism in Programming" (Milner 1977). Milner formally defines a type system and provides an algorithm for inferring types under the system. The system is polymorphic, which means that it allows functions that can accept arguments of any type.
For instance, append()
is an example of a polymorphic function in R. The
first argument is typically a list, but the second argument can have any type.
The type of the second argument has no bearing on how it is appended to the
list.
Milner's type system is designed for functional programming languages with no imperative expressions (i.e., assignments or other expressions that have side effects). Although R encourages a functional style of programming, it also has imperative expressions. This makes adapting Milner's system to R more difficult. Fortunately, there is a way to work around assignments, and since the R community actively discourages writing code with side effects, other side effects may be rare in practice. Strategies for dealing with imperative code are discussed in a later section.
In Milner's system, types that are fully determined are called monotypes. These are atomic types like integers, real numbers, and Booleans. The set of monotypes also includes tagged unions, Cartesian products, functions, and lists of the atomic types. When a function is polymorphic, the parameterized types are called polytypes or type variables.
Example
The numeric()
function has exactly one parameter, length
, which specifies
the length of the created vector. The argument can be any scalar that can be
coerced to an integer; the function checks the argument type at run-time to
perform the coercion. In other words, the type of length
is a tagged union.
Several different types are permitted, but at run-time the argument must be
passed with a tag identifying the type. We denote tagged unions with $+$. Thus
we can type length
as
$$
\Integer + \Numeric,
$$
although technically other types types are also allowed. They've been omitted
here to keep the example succinct.
The value returned from numeric()
is always a numeric vector, which we denote
by
$$
\vector[\Numeric].
$$
Using $\to$ to denote the function type, it follows that the type of
numeric()
is
$$
(\Integer + \Numeric) \to \vector[\Numeric]
$$
Example
As an example of the type system, consider the lapply()
function, with the
simplifying assumption that the first argument is restricted to homogeneous
vectors. That is, the first argument must have type $\vector[\alpha]$ where
$\alpha$ is a type variable and $\vector[\cdot]$ denotes a vector of the given
type. The second argument must be a function type, which we denote with $\to$.
Since the second argument must be a function of elements of the first argument,
it has type $\alpha \to \beta$, where $\beta$ is a new, unconstrained type
variable. Combining all of this information, the type of the lapply()
function must be
$$ (\vector[\alpha] \times (\alpha \to \beta)) \to \vector[\beta] $$
That is, lapply()
takes a vector of some type $\alpha$ and transforms it into
a vector of some type $\beta$ based on the return type of the applied function.
The types $\alpha$ and $\beta$ are not fully determined until the function is
called, and can differ from one call to the next.
Since lapply()
works with non-homogeneous lists and allows passing
additional, fixed arguments to the applied function, this example is not
completely general. We will revisit this example later to explore how the type
system can be generalized.
ones = function(n) { rep.int(1, n) }
infer_types()
To get the types for the variables in an R program, we use the infer_types()
function. The input is the control flow graph for the R program, in SSA form.
Computing the CFG is handled by the rstatic package, and the details are
discussed in that package's documentation. Here we will take those details for
granted.
The idea of type inference is to generate a system of constraints on the types
used in the program, and then solve the system to get the types. The
infer_types()
function performs both of these steps, but they can also be
carried out separately with constrain()
and solve()
, respectively.
Each constraint comes from a different piece of evidence in the code about the
types. For instance, an assignment x = "hi"
signals that the variable x
is
a string. Many expressions allow us to narrow down the set of valid types for a
variable, but don't specify the exact type. When we see x + y
, we know that
unless +
has been redefined by the user, x
and y
must both be numbers.
However, we cannot tell from this expression alone whether they are logical
values, integers, floating point numbers, or complex numbers.
R allows variables to be redefined with a different type at any point in a program, so most variables are associated with a sequence of types rather than a single type. Knowing the sequence of types for a variable is only useful if we also know where in the program each type occurs.
To deal with this problem, constraints always refer to single static assignment (SSA) names rather than variable names. In SSA form, each (re)definition of a variable is given a unique name. Because of this, each SSA name can be unambiguously associated with a single type in a variable's sequence of types.
SSA names are not the only strategy for keeping track of constraints and types. We could think of types as being associated with specific expressions within a program rather than variables. This is a conceptually different approach to the problem. In this approach, type inference must find a type for every expression in the program. This is not any more difficult, but does require additional bookkeeping with associated memory costs.
In order to generate constraints, constrain()
traverses the program's CFG and
calls the S3 generic constrain_ast()
on each expression. For now, we can
customize how constraints are generated for arbitrary expressions by redefining
the constrain_ast()
methods as needed.
Consider a function that finds the area of an annulus when given the inner and outer radius
circle_area = function(r_inner, r_outer) { outer = pi * r_outer^2 inner = pi * r_inner^2 return (outer - inner) }
We can compute the constraints for this with
cfg = rstatic::to_cfg(circle_area) cons = constrain(cfg)
Constraints are collected in a ConstraintSet
, which has reference semantics
to simplify writing recursive constrain_ast()
methods. The order of the
constraints within the set does not matter. This is important because it allows
us to set constraints based on how a variable is used after the point where
that variable was defined.
In the example, we can see that the right-hand side of each constraint is printed like a function call. For now, types are not computed for calls during constraint generation; the computation is deferred until the system of constraints is solved.
By using constrain()
and solve()
separately, we can add to and modify the
constraints before solving. Constraints can be accessed by index with [[
(although they cannot yet be assigned with [[
). The method
The returned SolutionSet
lists the inferred type of each SSA name defined in
the function.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.