Empirical Evaluation of "SAT Solving for Termination Proofs with Recursive Path Orders and Dependency Pairs"

In the paper SAT Solving for Termination Proofs with Recursive Path Orders and Dependency Pairs we present a propositional encoding for syntactic path orders on terms, in connection with dependency pairs. We consider all common instances of syntactic recursive path orders (RPO), i.e., lexicographic path orders (LPO), multiset path orders (MPO), and path orders with status. This facilitates the application of SAT solvers for termination analysis of term rewrite systems (TRSs).

We address four main inter-related issues and show how to encode them as satisfiability problems of propositional formulas that can be efficiently handled by SAT solving: (A) the lexicographic comparison w.r.t. a permutation of the arguments; (B) the multiset extension of a base order; (C) the combined search for a path order together with an argument filter to orient a set of inequalities; and (D) how the choice of the argument filter influences the set of inequalities that have to be oriented (so-called usable rules).

In our experiments, we measure the power and the efficiency of our new SAT-based implementation in order to assess the impact of our contributions.


Implementation in AProVE

We integrated our approach in the termination tool AProVE which implements the DP framework. A special version of AProVE that allows to repeat the experiments below can be accessed via a special web interface.

The implementation can be parameterized by the following choices:

Most SAT solvers require a propositional formula in conjunctive normal form (CNF) as input. When transforming propositional formulas into CNF, we used the implementation of Tseitin's algorithm offered by the Java SAT library SAT4J. The CNF formula is then given to a state-of-the-art SAT solver.


Tools

To evaluate the usefulness of SAT solvers for termination analysis with RPO, we compared the new implementation with a version of AProVE where the search for RPO and argument filters is done by AProVE's own dedicated solver, which can be parameterized in the same way as our SAT-based implementation. To search for argument filters, the dedicated solver uses a vastly improved version of the algorithm described in the paper "Mechanizing and Improving Dependency Pairs".

To measure the performance of termination tools, since 2004 there is an annual International Termination Competition. Here, the tools are tested against each other on a large data base of examples (the so-called Termination Problem Data Base, TPDB). Since AProVE was the most powerful tool for proving termination of TRSs in all competitions, it provides the ideal setting to evaluate the impact of our new SAT encoding.


Settings

We ran AProVE in four configurations:

  1. In the first configuration, we try to prove termination by orienting rules repeatedly with RPO and by removing those rules that are strictly decreasing. This configuration is used to evaluate Contributions (A) and (B), i.e., the propositional encoding of RPO without argument filters.
  2. In the second configuration, we first build the initial DP problem and then apply a reduction pair processor based on RPO repeatedly. With this configuration we evaluate Contribution (C). Apart from the reduction pair processor, here we also use the dependency graph processor, which is the other main processor of the DP framework. It is used to treat sets of "mutually recursive dependency pairs" separately.
  3. The third configuration corresponds to the second one, but now the reduction pair processor only has to orient the usable rules. Here we evaluate our encoding of RPO with argument filters and usable rules, i.e., Contribution (D).
  4. In contrast to the first three configurations, in the fourth configuration we essentially used the AProVE version that participated in the last termination competition 2008. In this way we evaluate the impact of our contributions in a full version of AProVE. We experimented with three variants of this AProVE version: one variant uses our SAT-based contributions, one uses the dedicated solver to search for RPOs, and one does not use RPO at all.

For our new SAT-based implementation, we coupled AProVE with two of the best SAT solvers from the application section of the SAT Competition 2009: glucose, and PrecoSAT. For comparison, we also ran experiments with the popular solver MiniSAT2 as well as with the Java-based SAT solver available in the SAT4J library.


Examples

In our experiments, we tested the new SAT-based and the old non-SAT-based implementation on the examples of the Termination Problem Data Base (TPDB). This is the collection of problems used in the Termination Competition. It contains 1391 TRSs from different sources as a benchmark for termination analysis of term rewriting.


Experiments

The experiments were run on a 2.67 GHz Intel Core i7 and as in the Termination Competition, we used a time-out of 60 seconds for each example.

The experiments listed in the following table use the first configuration described above, i.e., the repeated application of rule removal using monotonic reduction pairs based on RPO. The column "dedicated" contains the results when using AProVE with the dedicated non-SAT-based solver while the subsequent four columns show the results for our SAT encoding using AProVE with different SAT solvers. As indicated in the column "order", we performed experiments for LPO with strict and non-strict precedence (denoted LPO/QLPO), for LPO with status (LPOS/QLPOS), for MPO (MPO/QMPO), and for RPO (RPO/QRPO). For each experiment, we give the number of TRSs which could be proved terminating ("proved"), the number of time-outs ("time-out"), and the analysis time in seconds for running AProVE on all 1391 TRSs ("runtime"). The "best" numbers are always printed in bold. You can get detailed information for each row by clicking on "Details" in the respective row leading to a comparison of the dedicated solver to the four SAT-based solvers. If you want to compare different restrictions of RPO, you can do so by clicking the header of a column.

order dedicated SAT4J MiniSAT2 PrecoSAT glucose
LPO proved 147147 147147147Details
time-out51 0 0 0 0
runtime 4158.3 26.7 190.3 165.5 170.8
QLPO proved 169169 169169169Details
time-out141 0 0 0 0
runtime 9597.6 36.8 210.3 193.5 193.0
LPOS proved 167167 167167167Details
time-out62 0 0 0 0
runtime 5148.9 26.9 194.2 166.4 175.3
QLPOSproved 193 194 194194194Details
time-out212 0 0 0 0
runtime 14528.6 62.3 225.3 221.9 204.1
MPO proved 107107 107107107Details
time-out77 0 0 0 0
runtime 6048.9 33.9 203.4 178.4 183.3
QMPO proved 120120 120120120Details
time-out176 0 0 0 0
runtime 12308.2 78.9 244.8 234.2 208.4
RPO proved 172172 172172172Details
time-out82 0 0 0 0
runtime 7160.6 38.3 205.8 187.9 187.2
QRPO proved 200 202 202202202Details
time-out234 1 1 1 1
runtime 16632.4 157.5308.4 347.4 287.3

The table shows that with our new SAT encoding, performance improves by orders of magnitude over the existing dedicated solver. Note that without a time-out, this effect would be aggravated. By using SAT, the number of time-outs reduces dramatically from up to 234 to at most 1. The only example that still produces a time-out has function symbols of high arity and AProVE can only show its termination by further sophisticated termination techniques in addition to RPO. Apart from this example, when using the SAT4J solver, no example takes longer than six seconds and only 21 TRSs take longer than one second each.

Comparing the variants of AProVE which use different SAT solvers, we see that their results are all similar with the exception of the SAT4J variant, which is on average 100 ms faster than the others. The reason is that SAT4J is called directly from AProVE since both tools are written in Java. For the other solvers, there is an overhead for creating and communicating with an external process.

It is easy to see that RPO is significantly more powerful than (Q)LPO. With RPO, one can prove termination of 202 instead of 169 examples. This clearly shows the impact of our Contributions (A) and (B), i.e., of the new encodings for those aspects of RPO which go beyond LPO (viz. lexicographic comparisons w.r.t. permutations and multiset comparisons).

The implementation of our SAT-based encoding uses sharing of subformulas to keep the size of the resulting propositional formulas polynomial. To demonstrate the effect of this sharing, we also performed experiments where we disabled sharing and therefore produced encodings of exponential size. The next table shows the results for RPO with non-strict precedence. The version "SAT" uses sharing of subformulas (i.e., it is identical to the best version from the first table). The version "no sharing" does not use sharing and as a reference, we also give the performance of the version with the dedicated solver in the column labeled "dedicated". Note that the dedicated solver avoids redundant computations by caching the relation between terms that have already been considered.

dedicatedno sharingSAT
proved200200202Details
time-out234291
runtime16632.43265.4157.5

The table shows that even in the rather easy setting of the first configuration, without sharing of subformulas we obtain 28 more time-outs while runtimes are increased by a factor of more than 20. Nevertheless, even without sharing, the SAT-based encoding is clearly favorable to the dedicated solver.

Still, using the first configuration does not lead to a powerful termination approach. The best instance just manages to show termination of 202 out of 1391 TRSs. Of course, one cannot expect to prove termination of all 1391 TRSs, as there are many non-terminating TRSs in the TPDB. But the full version of AProVE can show termination of 1008 TRSs. Thus, when using the first configuration which is restricted to rule removal with RPO, we only reach 20.0% of the power of the full system. Thus, next we consider the second configuration, i.e., the DP framework with the reduction pair processor. Here, we repeatedly use a reduction pair based on RPO with argument filters. In the following table we again see the numbers of proved TRSs and of time-outs as well as the total runtimes in seconds.

order dedicated SAT4J MiniSAT2 PrecoSAT glucose
LPOproved264304305303305Details
time-out46496176
runtime29508.01269.11954.22621.51545.2
QLPOproved284362363362363Details
time-out50224152215
runtime31885.82811.23066.73520.82479.4
LPOSproved275316316316317Details
time-out48177166
runtime30751.81194.21885.22662.91536.5
QLPOSproved298412413412413Details
time-out52422272415
runtime33717.42719.94303.33671.32507.0
MPOproved231261261261261Details
time-out49119122111
runtime31501.01876.61901.03116.81886.0
QMPOproved251335335335335Details
time-out52136272822
runtime33724.43768.83117.94193.93155.1
RPOproved280324325324325Details
time-out50317112310
runtime32714.61831.41914.83259.01914.4
QRPOproved297429430429432Details
time-out56236283822
runtime36418.04014.63468.95002.03305.2

The table shows that also when combining RPO with argument filters, our SAT-based approach is much more efficient than the dedicated solver. If we compare the results with those of the first table, we can make two observations.

First, the use of the DP framework results in a much more powerful approach. Now the best instance can prove termination of 432 out of 1391 TRSs. This means that we can now handle 42.9% of those examples that could be handled by the full version of AProVE. In other words, the best instance from the third table can prove 113.9% more examples than the best instance from the first table. This clearly shows the impact of our Contribution (C), i.e., of the new encoding for the combination of RPO with argument filters.

Second, the SAT-based approach now also leads to a number of time-outs. In general, the generated propositional formulas seem to be much harder to solve. In this setting, SAT4J's advantage of avoiding the overhead of external processes diminishes and using external solvers solvers like glucose is often more efficient.

Compared to termination proofs with the first configuration, by using the DP framework in the second configuration, power has been more than doubled. On the other hand, total runtime increases by a factor of more than 20 for the best instance. To reduce the size of the propositional formulas, next we use the third configuration, i.e., the reduction pair processor with usable rules. Here, one has to solve constraints based on RPO with argument filters. In the following table we once more see the numbers of proved TRSs and of time-outs as well as the total runtimes in seconds.

order dedicated SAT4J MiniSAT2 PrecoSAT glucose
LPOproved421435435435435Details
time-out1870212
runtime12653.2202.11769.41359.51544.2
QLPOproved453489489489489Details
time-out2419977
runtime15898.71503.62661.21936.22205.1
LPOSproved427440440440440Details
time-out1920212
runtime13354.0196.31839.51252.81499.3
QLPOSproved467530532532532Details
time-out275131096
runtime18307.11554.52515.32151.12251.0
MPOproved425433433433433Details
time-out2170222
runtime14643.8232.51752.41472.71568.6
QMPOproved467502502502502Details
time-out25915131311
runtime17186.41840.42683.02385.72444.4
RPOproved441454454454454Details
time-out2190222
runtime14972.6216.81755.01523.81612.6
QRPOproved485552554554555Details
time-out2921714139
runtime19489.41999.12718.32658.32413.8

The table shows that both the AProVE variant with the dedicated solver and the variants with SAT-based solvers are more efficient and significantly more powerful than the respective variants in the third table. The maximum number of time-outs is down to 9 for the most powerful variant. When comparing the most powerful variants from the third and the fourth table, the runtime decreases by 27.0%. When regarding the most powerful variants with strict precedences, the runtime even reduces by 88.2%. This indicates that the propositional formulas obtained in the third configuration are significantly easier to solve than the ones for the second configuration. The power of the termination approach resulting from the third configuration clearly shows the benefits of our Contribution (D) from the introduction, i.e., of the new encoding for usable rules. With respect to power, the best instance in the fourth table is able to show termination for 555 out of 1513 TRSs. This means that we now succeed on 55.1% of those TRSs that can be solved by the full version of AProVE. When using RPO with strict precedence, we can show termination of 45.0% of these TRSs within 217 seconds, i.e., with an average time of 143 ms for each example. When using RPO with non-strict precedence, more than 91% of the TRSs in the TPDB can be analyzed in less than one second for each example.

In order to assess the impact of our contributions in a full version of AProVE that is allowed to use all techniques available in the tool, we look at three AProVE versions which correspond to the fourth configuration. The version named "SAT" uses the contributions of our paper while "dedicated" uses RPO with the dedicated non-SAT-based solver and "no RPO" does not use any recursive path orders at all. The following table gives the numbers of proved TRSs (first row) and of time-outs (second row) as well as total runtimes in seconds (third row).

no RPOdedicatedSAT
proved9779841008Details
disproved234226234
time-out175177144
runtime14626.415981.513410.2

The table clearly shows that adding a SAT-based implementation of RPO is beneficial for the efficiency and power of a full version of AProVE. Using the dedicated solver, one can show termination of 7 TRSs where the "no RPO" version failed with a time-out. On the other hand, the "dedicated" variant yields 9 new time-outs. Thus, the total runtime is 1355.1 seconds (9.3%) more than when using no RPO. In other words, without SAT solving the use of RPO increases runtime substantially whereas power is only increased mildly. Furthermore, the full version of AProVE can also disprove termination. But when using RPO with the dedicated solver, we lose 8 disproofs of termination due to the runtime consumed by the search for RPO. So without our contributions, it is not clear whether the use of RPO is really recommendable in a full-scale termination prover.

In contrast, using the SAT-based variant, 31 examples are gained compared to the variant without RPO while we obtain no additional time-outs and keep all disproofs. The total runtime of the version using SAT-based RPO is 1216.2 seconds (8.3%) less than when using no RPO. So due to our contributions, now it really pays off to add RPO to a full termination prover.