Using RTypeInference

The RTypeInference package provides type inference for R code.

The type inference strategy is based on Damas-Milner type inference. This strategy collects information about the type of each expression in the source code, combining information about related expressions.

As an example, consider the expression x <- 3.14. The value 3.14 has type double in R. Since the expression assigns 3.14 to the variable x, we can conclude that the variable x must also have type numeric.

Let's try out RTypeInference for this example.

First, we need to use rstatic's quote_blocks function to quote the expression. The quote_blocks function converts the expression into static single-assignment (SSA) form. A key feature of SSA form is that it numbers each variable assignment, so that different definitions of the same variable have distinct names. This is convenient for type inference because R allows variables to be assigned different types at different points in the source code.

Second, we call RTypeInference's constrain function on the quoted expression in order to collect information about the types.

Here's the complete code for the two steps described:

library(rstatic)
library(RTypeInference)
ex = quote_blocks(x <- 3.14)
constrain(ex)

The constrain function returns a Result object, an S4 object that represents intermediate results in RTypeInference. The object has two slots. We'll examine the first slot, constraints in a later example. The second slot, map, contains a mapping between variables in the source code and types.

The mapping reports that the variable x_1 is assigned ("defined as") a numeric value. Here x_1 corresponds to the variable x in the original expression. The _1 is the number added by the SSA form.

The mapping also reports that the variable x_1 is used (that is, read from) later in the code. This is a limitation of the current version of RTypeInference, likely to be fixed in later versions. For the curious reader, here's what's happening: RTypeInference assumes that any source code passed in for type inference is part of a function definition. The SSA form makes R's implicit return rule for functions explicit by inserting a variable ._return_ into the code and assigning the last expression to ._return_. Thus type inference processes the assignment x <- 3.14 and also the assignment ._return_ <- x. That's why there's an extra entry for ._return_ in the mapping.

The mapping is an S4 object of class SymbolMap. The class inherits list, and uses one named element for each variable in the source code.

Let's extend the example with another variable. Consider this code:

x <- 3.14
y <- x

The steps to run inference on this code are the same as before. Since the code has multiple lines, it needs to be surrounded by curly braces when passed to quote_blocks:

ex = quote_blocks({
  x <- 3.14
  y <- x
})
ans = constrain(ex)
ans

Let's look at how the results correspond to the code. As before, the map reports that x_1 is defined as RNumeric. The map also reports that the type of x_1 is used somewhere, as t1. The t1 is a placeholder, called a type variable, for the type of x_1. Type variables provide the type inference system with three kinds of flexibility:

  1. As placeholders for types that haven't been inferred yet. For instance, if the type of x was not clear from the assignment in the code above, we might still want to record the fact that y has the same type as x. These type variables are the ones type inference ultimately tries to "solve" for, replacing them with concrete types.

  2. As indicators of polymorphism. A polymorphic function is one that accepts more than one type of argument. In Damas-Milner type systems, polymorphism is all or nothing: a polymorphic function must be able to accept any type, and type variables are used to indicate polymorphism. Type variables that indicate polymorphism remain even after type inference is complete. In R, polymorphic functions are often restricted to a specific set of types rather than accepting any type.

  3. As indirection that makes the type inference system more robust. RTypeInference generates a new type variable each time a program variable is used, even if an earlier assignment establishes the type of that program variable. By not propagating the type information gathered from the assignment immediately, the step in type inference that collects information from source code is decoupled from the step in type inference that "solves" for the types by propagating the collected information. As a result, the information collection step always analyzes the entire source code, even if the source code contains type errors.

A constraint is a relationship between a pair of types (including type variables). Through the constrain function, type inference collects constraints based on the expressions in the source code. These appear in the constraints slot of the result.

In the example code, the assignment y_1 = x_1 causes type inference to generate a type variable t1 as a placeholder for the type of x_1. Since the earlier assignment x_1 = 3.14 reveals that the type of x_1 is numeric, the type inference system also generates a constraint that t1 is equivalent to the type RNumeric. This constraint is the first element of the constraints slot. Let's examine the constraint:

ans@constraints[[1]]

Each constraint records a relationship between a pair of types; these two elements correspond to slots t1 and t2 (no relation the type variable t1) on the constraint object. The typesys class Equivalence represents an equivalence constraint: the two types must be the same. Here we can see that t1 and t2 contain the type variable t1 and the type RNumeric, respectively.

Constraints also have a slot src with a reference to the source code expression that caused the constraint to be generated. In this case, the source code expression that caused the constraint to be generated is the expression x_1, which is part of the assignment y_1 <- x_1:

con1 = ans@constraints[[1]]
con1@src
con1@src$parent

As mentioned earlier, RTypeInference assumes the source code being analyzed is part of a function definition, so we also see constraints related to the return value (constraints 2 and 3 here). For instance, consider the second constraint:

con2 = ans@constraints[[2]]
con2

con2@src$parent

This constraint shows that RTypeInference thinks y_1 is the return value (since it is the value of the last expression). The type inference system generates another type variable, t2, as a placeholder for the type of y_1 in the return call. The earlier assignment y_1 <- x_1 established that t2 must have the same type as t1, so the inference system also generates this constraint.

Notice that the constraint is not an Equivalence but instead an ImplicitInstance. This kind of constraint allows for polymorphism. From the perspective of constrain, the type variable t1 is an unknown type, to be established later on by solving for the type variables. As a consequence, t1 could be a polymorphic function type. In that case, t2 is an instance of t1 (instantiated with concrete argument types, to be inferred based on how y_1 is called). This distinction between an Equivalence and an ImplicitInstance is only important for function types; if t2 is not a function type, then the inference system will treat the ImplicitInstance the same way it would an Equivalence.

The result from constrain is an intermediate result in type inference. We've presented it here because it is important to understand for anyone that wants to tinker with the type inference system, and because even on its own it contains useful information, such as where program variables are used.

To get to the final result from type inference, we need to have RTypeInference solve the system of constraints. At a high level, the strategy to solve the system is to treat each constraint as a substitution: given a constraint, the type variables on one side can be replaced with the types (or type variables) on the other side in all other constraints, and the original constraint removed. Repeating this process eventually yields a composition of substitutions that "solves" the constraint system, so that all type variables equivalent to a concrete type are replaced by that type.

The RTypeInference function csolve computes the substitution that solves the type inference problem. This function should be called directly on the result from constrain. For instance, in the example:

sub = csolve(ans)
sub

The result is a typesys Substitution object, which records the equivalence between each type variable and type. Recall that the SymbolMap in the result from constrain is our mapping between program variables and type information. Thus to relate the substitution from csolve back to the program variables in the source code, we need to call the substitution on the SymbolMap. For convenience, we can also call the substitution on the entire Result object. Here's the code:

# sub(ans@map)
# OR
sub(ans)

The substituted SymbolMap is the final result from type inference. In this case, inference is able to determine that x_1, y_1, and the return value are all numeric.

This section described the basic workflow for RTypeInference: call constrain to collect type information from the code, call csolve to use the constraints to solve for the type variables, and finally, apply the resulting substitution to the SymbolMap from constrain to relate the type variables back to the program variables.

A More Sophisticated Example

Now let's look at an example with a variable whose type is not made obvious from an assignment. The code is a function that computes the area of an annulus:

circle_area = function(r_inner, r_outer) {
  outer = pi * r_outer^2
  inner = pi * r_inner^2

  return (outer - inner)
}

We can use the to_blocks function to convert the function code to SSA form (rather than retyping the whole thing for quote_blocks). Then we can call constrain as before:

ex = to_blocks(circle_area)
ans = constrain(ex)
ans

As in the previous example, the result is a Result object with slots constraints and map. All symbols in the source code, including names of functions, are recorded in the symbol map, so it has many more entries here. Variables in the global environment, like *, pi, and ^ here, are recorded as having no definition in the provided source code.

Note: A bug in the current version of RTypeInference causes the parameters to be left out of the SymbolMap. This is related to trying to support nested function definitions and multiple scopes in type inference. I think the right way to handle this is to have the constraint generator create a new SymbolMap for each nested function, with a parent field (mimicking R's environments), but haven't tested this.

In order to make type inference more accurate, we can provide the constrain function with type information about variables up front (a future version of the package could optionally inspect an environment to collect some of this information automatically, or have a database of type information for built-in functions).

For instance, suppose we want to specify that pi is a numeric value. We can do this by creating a SymbolMap before calling constrain, and passing the map to constrain. Here's the code to create the symbol map and add a definition for pi:

m = SymbolMap()
m = set_defined_as(m, "pi", RNumeric)

The set_defined_as function assigns a type to a specific program variable in the SymbolMap. The corresponding get_defined_as retrieves the type a program variable is defined as from a SymbolMap.

With the map defined, we can pass this information into the type inference system when we call constrain:

constrain(ex, m)

Similarly, we might want let the type inference system know that (in this case) * is a function that takes two numeric values and returns a numeric value. We can do that by adding another entry to the symbol map before calling constrain:

m = set_defined_as(m, "*", RFunction(RNumeric, RNumeric, RNumeric))

The typesys RFunction constructor creates a function type; the last argument is always treated as the return type and the others as the argument. In the current version of typesys, argument types are only matched positionally, not by name. Let's generate the constraints with the updated symbol map:

ans = constrain(ex, m)

Although we haven't provided complete information to the type inference system (for instance, we did not provide a type for ^), the system can still recover a lot of information from the code and what we did provide. Let's solve for the type variables to see what we get:

sub = csolve(ans)
sub(ans)

In the result, we can see that the system determined that ^ is a function, and that its second argument and result must be numeric. The system was unable to determine the type of the first argument to ^, because the parameters r_outer and r_inner could potentially have any type.



duncantl/RTypeInference documentation built on Jan. 16, 2021, 12:30 a.m.