# Specializing Grammars for Inducing Faults

**Important:** Pyodide takes time to initialize.
Initialization completion is indicated by a red border around *Run all* button.

This post is the implementation of my paper *Input Algebras*

In my previous post on DDSet I explained how
one can abstract failure inducing inputs. The idea is that given an input such
as `'1+((2*3/4))'`

which induces a
failure in the program, one can extract a minimal
failure inducing input such as say `'((4))'`

, which can again be abstracted to
produce the expression `((<expr>))`

which precisely defines the fault
inducing input. Such an input is quite useful for the developers to understand
why a failure occurred.

However, such inputs are insufficient from a testing perspective. For example,
such a pattern fails to capture the contexts in which the doubled parenthesis
can appear, and hence induce the failure. That is, how do we use the
pattern `((<expr>))`

to produce inputs such as `'1+((2*3/4))'`

? This is what
this post will discuss.

Note that the guarantee of inducing failures is statistical. That is, the
abstract input is mined by evaluating produced
inputs for failure a fixed (configurable) number of times. It is possible that
some rare inputs may still fail to induce the failure. Since abstract possibly
failure inducing inputs is a mouthful, let us call these abstract failure
inducing inputs **evocative patterns** for short. In this post, we will see
how to transform such an **evocative pattern** to an **evocative grammar** that is
guaranteed to produce *evocative inputs* (that is, each input contains an
**evocative fragment** and the derivation tree of the input contains an
**evocative subtree**) in all contexts that are guaranteed (statistically) to
induce failures.

As before, let us start with importing our required modules.

We import the following modules

We have our input that causes the failure.

We first parse the input

Then reduce input

Finally, extract the abstract pattern.

However, it does not actually tell you in what all
circumstances the failure can be reproduced. The problem is that it
represents a *complete* abstract input. The only general part here is
`<expr>`

between the doubled parenthesis. So, it tells us that we can
produce `((1))`

, `((2 + 3))`

etc. but these are not the only possible
errors. Indeed, our original error: ‘1+((2*3/4))’ does not fit this
template. So, how can we rectify this limitation?

# A grammar that produces at least one evocative fragment per input generated.

The basic idea is that if we can find an *abstract subtree* of the
evocative pattern derivation tree, such that the presence of this abstract
subtree in the input guarantees the failure, then we can modify the grammar
such that this abstract subtree is always present (we call the abstract
subtree an **evocative subtree**). That is, for any input
from such a grammar, at least one instance of the evocative
subtree will be present. This is fairly easy to do if the generated tree
contains a nonterminal of the same kind as that of the top node of the evocative subtree.
Simply replace that node with the evocative subtree, and fill in the
abstract nonterminals in the evocative subtree with concrete expansions.

## Reachable Grammar

On the other hand, this gives us an idea. What if we modify the grammar
such that at least one instance of such a nonterminal is present? Such
a grammar is called the `reachable_grammar`

.
To produce a reaching grammar, first we need to find what nonterminals are
reachable from the expansion of a given nonterminal.
A nonterminal `<A>`

is reachable from another nonterminal `<B>`

if and only
if one of the expansion rules for `<B>`

contains the nonterminal `<A>`

or
it is reachable from one of the nonterminals in the expansion rules of `<B>`

Note that it is not enough to be the same nonterminal. That is, (for e.g.)
`<A>`

is not # reachable from `<A>`

if expansion rules of `<A>`

contain only
terminal symbols.

We can now collect it in a data structure for easier access

That is, if we want to know the nonterminals that are reachable from `<integer>`

,
we can simply do

That is, only `<digit>`

and `<integer>`

are reachable from the expansion of
nonterminal `<integer>`

## Reachable positions.

Next, given an evocative subtree, we want to find what tokens of the grammar can actually embed such a subtree. We call the top node of an evocative subtree its characterizing node, and the nonterminal the characterizing nonterminal of the evocative subtree.

Say we assume that `<factor>`

is the characterizing nonterminal. Here are the
locations in grammar where one can embed the evocative subtree.

## Insertion Grammar

Given the insertion locations, can we produce a grammar such that we can
*guarantee* that *at least one* instance of evocative subtree can be inserted?
To do that, all we need to guarantee is that the start node in any derivation
tree can *reach* the characterizing nonterminal.
For now, let us call our fault `F1`

. Let us indicate that our start symbol
guarantees reachability of characterizing nonterminal by specializing it. So, our new
start symbol is `<start F1>`

Next, for the start symbol to guarantee reachability to characterizing nonterminal,
all that we need to ensure is that *all* the expansion rules of start can
reach the characterizing nonterminal. On the other hand, for a guarantee that an
expansion rule can reach the characterizing nonterminal, all that is required is that
one of the nonterminals in that rule guarantees reachability.
We start with a few helper functions

Defining the `reachable_key()`

It is used as follows

## Pattern Grammar

Next, we need to ensure that our evocative subtree can form a unique subtree. For that, all we need to do is that all nodes in the subtree are named uniquely. Not all nodes in the evocative subtree needs unique names however. DDSet produces trees such that some nodes in the tree are left abstract. We leave these with the original node names.

Using

We can extract a unique pattern grammar from this tree.

We can convert this to grammar, but first, to display the grammar properly, we
define `display_grammar()`

We define `pattern_grammar()`

that wraps both calls.

Using it.

We define a procedure to reverse our pattern grammar to ensure we can get back our tree.

The tree can be recovered thus.

Given the reaching grammar and the pattern grammar, we can combine them to produce the complete grammar

Using it

Here, you will notice a problem:
There are a few nonterminals such as `<integer F1>`

that do not have a
definition. That is, it has no expansion (not even empty expansion). So,
any rule that uses it will also by definition have no possible expansions.
This has consequences during generation, forcing us to abandon partially
constructed trees. Hence, we define a `grammar_gc()`

## Cleanup of the grammar

Let us make sure that we are operating on a copy of the grammar.

Next, we find the empty keys in the grammar that do not have a definition.

Now, we need to remove such empty definitions, which also means any rules that
refer to the corresponding nonterminals also have to be removed. The
`remove_nonterminal`

function takes a nonterminal, and removes its references
from the given grammar.

When removing a rule, it can also happen that the corresponding nonterminal may be left with no further rules. Hence, we need to define finding and removing empty nonterminals from the grammar recursively.

These gives us the `grammar_gc()`

At this point we are ready to define our `atleast_one_fault_grammar()`

The new grammar is as follows

This grammar is now guaranteed to produce at least one instance of the characterizing node.

This seems to work. However, there is a wrinkle. Note that we set the
evocative subtree as this subtree `pattern[1][0][1][0][1][0]`

, which is a
`<factor>`

node. But why not any of the other subtrees? for example, why not
`evocative_pattern`

itself? Indeed, the effectiveness of
our specialized grammar is dependent on finding as small a evocative subtree
as possible that fully captures the fault. So, how do we automate it?
The idea is fairly simple. We start from the root of the evocative pattern.
We know that this pattern fully captures the predicate. That is, any valid input
generated from this pattern by filling in the abstract nodes is (statistically)
guaranteed to produce the failure. Next, we move to the children of this node.
If the node is abstract, we return immediately. If however, the child is not
abstract, we let the child be the evocative subtree, and try to reproduce the
failure. If the failure is not reproduced, we move to the next child. If the
failure is reproduced we recurse deeper into the current child.

Usage

That is, we found the correct evocative subtree. We can confirm this by comparing it against the subtree we identified manually.

We save this subtree for later use.

Here is another evocative pattern, but we define a different predicate.

Check the assertion

We first parse the input as before

Then reduce input

Finally, extract the evocative pattern.

Then extract the evocative subtree

prettyprinting

using

We save this evocative pattern for later use.

The new grammar is as follows

This grammar is now guaranteed to produce at least one instance of a divide by zero.

At this point, we have the ability to guarantee that a evocative
fragment is present in any inputs produced. In later posts I will discuss how
combine multiple such fragments together using `and`

, `or`

or `negation`

. I
will also discuss how to ensure `at most`

one fragment in the input and
`exactly`

one fragment or `exactly n`

fragments.
As before, the runnable source of this notebook can be found here