Empirical Evaluation of "SAT Solving for Termination Proofs with Recursive Path Orders and Dependency Pairs"

In the paper SAT Solving for Termination Proofs with Recursive Path Orders and Dependency Pairs we present a propositional encoding for syntactic path orders on terms, in connection with dependency pairs. We consider all common instances of syntactic recursive path orders (RPO), i.e., lexicographic path orders (LPO), multiset path orders (MPO), and path orders with status. This facilitates the application of SAT solvers for termination analysis of term rewrite systems (TRSs).

We address four main inter-related issues and show how to encode them as satisfiability problems of propositional formulas that can be efficiently handled by SAT solving: (A) the lexicographic comparison w.r.t. a permutation of the arguments; (B) the multiset extension of a base order; (C) the combined search for a path order together with an argument filter to orient a set of inequalities; and (D) how the choice of the argument filter influences the set of inequalities that have to be oriented (so-called usable rules).

In our experiments, we measure the power and the efficiency of our new SAT-based implementation in order to assess the impact of our contributions.

Implementation in AProVE

We integrated our approach in the termination tool AProVE which implements the DP framework. A special version of AProVE that allows to repeat the experiments below can be accessed via a special web interface.

The implementation can be parameterized by the following choices:

When comparing tuples of terms, does one just permit lexicographic extensions (leading to LPO(S)), just multiset extensions (leading to MPO), or both (leading to RPO)?
In the case of lexicographic extensions, does one just use left to right comparisons (leading to LPO) or arbitrary permutations (leading to LPO with status, LPOS)?
Does one just allow strict precedences (leading to LPO, LPOS, MPO, RPO) or does one also permit non-strict precedences (leading to QLPO, QLPOS, QMPO, QRPO)?

Most SAT solvers require a propositional formula in conjunctive normal form (CNF) as input. When transforming propositional formulas into CNF, we used the implementation of Tseitin's algorithm offered by the Java SAT library SAT4J. The CNF formula is then given to a state-of-the-art SAT solver.

Tools

To evaluate the usefulness of SAT solvers for termination analysis with RPO, we compared the new implementation with a version of AProVE where the search for RPO and argument filters is done by AProVE's own dedicated solver, which can be parameterized in the same way as our SAT-based implementation. To search for argument filters, the dedicated solver uses a vastly improved version of the algorithm described in the paper "Mechanizing and Improving Dependency Pairs".

To measure the performance of termination tools, since 2004 there is an annual International Termination Competition. Here, the tools are tested against each other on a large data base of examples (the so-called Termination Problem Data Base, TPDB). Since AProVE was the most powerful tool for proving termination of TRSs in all competitions, it provides the ideal setting to evaluate the impact of our new SAT encoding.

Settings

We ran AProVE in four configurations:

In the first configuration, we try to prove termination by orienting rules repeatedly with RPO and by removing those rules that are strictly decreasing. This configuration is used to evaluate Contributions (A) and (B), i.e., the propositional encoding of RPO without argument filters.
In the second configuration, we first build the initial DP problem and then apply a reduction pair processor based on RPO repeatedly. With this configuration we evaluate Contribution (C). Apart from the reduction pair processor, here we also use the dependency graph processor, which is the other main processor of the DP framework. It is used to treat sets of "mutually recursive dependency pairs" separately.
The third configuration corresponds to the second one, but now the reduction pair processor only has to orient the usable rules. Here we evaluate our encoding of RPO with argument filters and usable rules, i.e., Contribution (D).
In contrast to the first three configurations, in the fourth configuration we essentially used the AProVE version that participated in the last termination competition 2008. In this way we evaluate the impact of our contributions in a full version of AProVE. We experimented with three variants of this AProVE version: one variant uses our SAT-based contributions, one uses the dedicated solver to search for RPOs, and one does not use RPO at all.

For our new SAT-based implementation, we coupled AProVE with two of the best SAT solvers from the application section of the SAT Competition 2009: glucose, and PrecoSAT. For comparison, we also ran experiments with the popular solver MiniSAT2 as well as with the Java-based SAT solver available in the SAT4J library.

Examples

In our experiments, we tested the new SAT-based and the old non-SAT-based implementation on the examples of the Termination Problem Data Base (TPDB). This is the collection of problems used in the Termination Competition. It contains 1391 TRSs from different sources as a benchmark for termination analysis of term rewriting.

Experiments

The experiments were run on a 2.67 GHz Intel Core i7 and as in the Termination Competition, we used a time-out of 60 seconds for each example.

The experiments listed in the following table use the first configuration described above, i.e., the repeated application of rule removal using monotonic reduction pairs based on RPO. The column "dedicated" contains the results when using AProVE with the dedicated non-SAT-based solver while the subsequent four columns show the results for our SAT encoding using AProVE with different SAT solvers. As indicated in the column "order", we performed experiments for LPO with strict and non-strict precedence (denoted LPO/QLPO), for LPO with status (LPOS/QLPOS), for MPO (MPO/QMPO), and for RPO (RPO/QRPO). For each experiment, we give the number of TRSs which could be proved terminating ("proved"), the number of time-outs ("time-out"), and the analysis time in seconds for running AProVE on all 1391 TRSs ("runtime"). The "best" numbers are always printed in bold. You can get detailed information for each row by clicking on "Details" in the respective row leading to a comparison of the dedicated solver to the four SAT-based solvers. If you want to compare different restrictions of RPO, you can do so by clicking the header of a column.

order		dedicated	SAT4J	MiniSAT2	PrecoSAT	glucose
LPO	proved	147	147	147	147	147	Details
	time-out	51	0	0	0	0
	runtime	4158.3	26.7	190.3	165.5	170.8
QLPO	proved	169	169	169	169	169	Details
	time-out	141	0	0	0	0
	runtime	9597.6	36.8	210.3	193.5	193.0
LPOS	proved	167	167	167	167	167	Details
	time-out	62	0	0	0	0
	runtime	5148.9	26.9	194.2	166.4	175.3
QLPOS	proved	193	194	194	194	194	Details
	time-out	212	0	0	0	0
	runtime	14528.6	62.3	225.3	221.9	204.1
MPO	proved	107	107	107	107	107	Details
	time-out	77	0	0	0	0
	runtime	6048.9	33.9	203.4	178.4	183.3
QMPO	proved	120	120	120	120	120	Details
	time-out	176	0	0	0	0
	runtime	12308.2	78.9	244.8	234.2	208.4
RPO	proved	172	172	172	172	172	Details
	time-out	82	0	0	0	0
	runtime	7160.6	38.3	205.8	187.9	187.2
QRPO	proved	200	202	202	202	202	Details
	time-out	234	1	1	1	1
	runtime	16632.4	157.5	308.4	347.4	287.3

The table shows that with our new SAT encoding, performance improves by orders of magnitude over the existing dedicated solver. Note that without a time-out, this effect would be aggravated. By using SAT, the number of time-outs reduces dramatically from up to 234 to at most 1. The only example that still produces a time-out has function symbols of high arity and AProVE can only show its termination by further sophisticated termination techniques in addition to RPO. Apart from this example, when using the SAT4J solver, no example takes longer than six seconds and only 21 TRSs take longer than one second each.

Comparing the variants of AProVE which use different SAT solvers, we see that their results are all similar with the exception of the SAT4J variant, which is on average 100 ms faster than the others. The reason is that SAT4J is called directly from AProVE since both tools are written in Java. For the other solvers, there is an overhead for creating and communicating with an external process.

It is easy to see that RPO is significantly more powerful than (Q)LPO. With RPO, one can prove termination of 202 instead of 169 examples. This clearly shows the impact of our Contributions (A) and (B), i.e., of the new encodings for those aspects of RPO which go beyond LPO (viz. lexicographic comparisons w.r.t. permutations and multiset comparisons).

The implementation of our SAT-based encoding uses sharing of subformulas to keep the size of the resulting propositional formulas polynomial. To demonstrate the effect of this sharing, we also performed experiments where we disabled sharing and therefore produced encodings of exponential size. The next table shows the results for RPO with non-strict precedence. The version "SAT" uses sharing of subformulas (i.e., it is identical to the best version from the first table). The version "no sharing" does not use sharing and as a reference, we also give the performance of the version with the dedicated solver in the column labeled "dedicated". Note that the dedicated solver avoids redundant computations by caching the relation between terms that have already been considered.

	dedicated	no sharing	SAT
proved	200	200	202	Details
time-out	234	29	1
runtime	16632.4	3265.4	157.5

The table shows that even in the rather easy setting of the first configuration, without sharing of subformulas we obtain 28 more time-outs while runtimes are increased by a factor of more than 20. Nevertheless, even without sharing, the SAT-based encoding is clearly favorable to the dedicated solver.

Still, using the first configuration does not lead to a powerful termination approach. The best instance just manages to show termination of 202 out of 1391 TRSs. Of course, one cannot expect to prove termination of all 1391 TRSs, as there are many non-terminating TRSs in the TPDB. But the full version of AProVE can show termination of 1008 TRSs. Thus, when using the first configuration which is restricted to rule removal with RPO, we only reach 20.0% of the power of the full system. Thus, next we consider the second configuration, i.e., the DP framework with the reduction pair processor. Here, we repeatedly use a reduction pair based on RPO with argument filters. In the following table we again see the numbers of proved TRSs and of time-outs as well as the total runtimes in seconds.

order		dedicated	SAT4J	MiniSAT2	PrecoSAT	glucose
LPO	proved	264	304	305	303	305	Details
	time-out	464	9	6	17	6
	runtime	29508.0	1269.1	1954.2	2621.5	1545.2
QLPO	proved	284	362	363	362	363	Details
	time-out	502	24	15	22	15
	runtime	31885.8	2811.2	3066.7	3520.8	2479.4
LPOS	proved	275	316	316	316	317	Details
	time-out	481	7	7	16	6
	runtime	30751.8	1194.2	1885.2	2662.9	1536.5
QLPOS	proved	298	412	413	412	413	Details
	time-out	524	22	27	24	15
	runtime	33717.4	2719.9	4303.3	3671.3	2507.0
MPO	proved	231	261	261	261	261	Details
	time-out	491	19	12	21	11
	runtime	31501.0	1876.6	1901.0	3116.8	1886.0
QMPO	proved	251	335	335	335	335	Details
	time-out	521	36	27	28	22
	runtime	33724.4	3768.8	3117.9	4193.9	3155.1
RPO	proved	280	324	325	324	325	Details
	time-out	503	17	11	23	10
	runtime	32714.6	1831.4	1914.8	3259.0	1914.4
QRPO	proved	297	429	430	429	432	Details
	time-out	562	36	28	38	22
	runtime	36418.0	4014.6	3468.9	5002.0	3305.2

The table shows that also when combining RPO with argument filters, our SAT-based approach is much more efficient than the dedicated solver. If we compare the results with those of the first table, we can make two observations.

First, the use of the DP framework results in a much more powerful approach. Now the best instance can prove termination of 432 out of 1391 TRSs. This means that we can now handle 42.9% of those examples that could be handled by the full version of AProVE. In other words, the best instance from the third table can prove 113.9% more examples than the best instance from the first table. This clearly shows the impact of our Contribution (C), i.e., of the new encoding for the combination of RPO with argument filters.

Second, the SAT-based approach now also leads to a number of time-outs. In general, the generated propositional formulas seem to be much harder to solve. In this setting, SAT4J's advantage of avoiding the overhead of external processes diminishes and using external solvers solvers like glucose is often more efficient.

Compared to termination proofs with the first configuration, by using the DP framework in the second configuration, power has been more than doubled. On the other hand, total runtime increases by a factor of more than 20 for the best instance. To reduce the size of the propositional formulas, next we use the third configuration, i.e., the reduction pair processor with usable rules. Here, one has to solve constraints based on RPO with argument filters. In the following table we once more see the numbers of proved TRSs and of time-outs as well as the total runtimes in seconds.

order		dedicated	SAT4J	MiniSAT2	PrecoSAT	glucose
LPO	proved	421	435	435	435	435	Details
	time-out	187	0	2	1	2
	runtime	12653.2	202.1	1769.4	1359.5	1544.2
QLPO	proved	453	489	489	489	489	Details
	time-out	241	9	9	7	7
	runtime	15898.7	1503.6	2661.2	1936.2	2205.1
LPOS	proved	427	440	440	440	440	Details
	time-out	192	0	2	1	2
	runtime	13354.0	196.3	1839.5	1252.8	1499.3
QLPOS	proved	467	530	532	532	532	Details
	time-out	275	13	10	9	6
	runtime	18307.1	1554.5	2515.3	2151.1	2251.0
MPO	proved	425	433	433	433	433	Details
	time-out	217	0	2	2	2
	runtime	14643.8	232.5	1752.4	1472.7	1568.6
QMPO	proved	467	502	502	502	502	Details
	time-out	259	15	13	13	11
	runtime	17186.4	1840.4	2683.0	2385.7	2444.4
RPO	proved	441	454	454	454	454	Details
	time-out	219	0	2	2	2
	runtime	14972.6	216.8	1755.0	1523.8	1612.6
QRPO	proved	485	552	554	554	555	Details
	time-out	292	17	14	13	9
	runtime	19489.4	1999.1	2718.3	2658.3	2413.8

The table shows that both the AProVE variant with the dedicated solver and the variants with SAT-based solvers are more efficient and significantly more powerful than the respective variants in the third table. The maximum number of time-outs is down to 9 for the most powerful variant. When comparing the most powerful variants from the third and the fourth table, the runtime decreases by 27.0%. When regarding the most powerful variants with strict precedences, the runtime even reduces by 88.2%. This indicates that the propositional formulas obtained in the third configuration are significantly easier to solve than the ones for the second configuration. The power of the termination approach resulting from the third configuration clearly shows the benefits of our Contribution (D) from the introduction, i.e., of the new encoding for usable rules. With respect to power, the best instance in the fourth table is able to show termination for 555 out of 1513 TRSs. This means that we now succeed on 55.1% of those TRSs that can be solved by the full version of AProVE. When using RPO with strict precedence, we can show termination of 45.0% of these TRSs within 217 seconds, i.e., with an average time of 143 ms for each example. When using RPO with non-strict precedence, more than 91% of the TRSs in the TPDB can be analyzed in less than one second for each example.

In order to assess the impact of our contributions in a full version of AProVE that is allowed to use all techniques available in the tool, we look at three AProVE versions which correspond to the fourth configuration. The version named "SAT" uses the contributions of our paper while "dedicated" uses RPO with the dedicated non-SAT-based solver and "no RPO" does not use any recursive path orders at all. The following table gives the numbers of proved TRSs (first row) and of time-outs (second row) as well as total runtimes in seconds (third row).

	no RPO	dedicated	SAT
proved	977	984	1008	Details
disproved	234	226	234
time-out	175	177	144
runtime	14626.4	15981.5	13410.2

The table clearly shows that adding a SAT-based implementation of RPO is beneficial for the efficiency and power of a full version of AProVE. Using the dedicated solver, one can show termination of 7 TRSs where the "no RPO" version failed with a time-out. On the other hand, the "dedicated" variant yields 9 new time-outs. Thus, the total runtime is 1355.1 seconds (9.3%) more than when using no RPO. In other words, without SAT solving the use of RPO increases runtime substantially whereas power is only increased mildly. Furthermore, the full version of AProVE can also disprove termination. But when using RPO with the dedicated solver, we lose 8 disproofs of termination due to the runtime consumed by the search for RPO. So without our contributions, it is not clear whether the use of RPO is really recommendable in a full-scale termination prover.

In contrast, using the SAT-based variant, 31 examples are gained compared to the variant without RPO while we obtain no additional time-outs and keep all disproofs. The total runtime of the version using SAT-based RPO is 1216.2 seconds (8.3%) less than when using no RPO. So due to our contributions, now it really pays off to add RPO to a full termination prover.