Empirical Evaluation of "Mechanizing and Improving Dependency Pairs"

In our experiments, we tested AProVE on the examples of the Termination Problem Data Base. This is the collection of problems used in the annual termination competition. It contains 773 TRSs from different sources as a benchmark for termination analysis of term rewriting. Our experiments are presented in two tables. In the first table, we tried to prove (full) termination and in the second, we tried to prove innermost termination of all examples. Although there are classes of TRSs where one should only prove innermost termination since here, innermost termination already implies full termination, we did not use this observation in the first table in order to obtain a clearer evaluation of our contributions for full termination. Similarly, we also did not use any additional recent refinements of the dependency pair technique in our experiments in order to clearly assess the impact of the contributions in this paper In our experiments, we used AProVE with the following techniques:

Thm. 12 applies the dependency graph processor with the estimation of Definition 9 and the reduction pair processor of Theorem 12. This is the basic dependency pair technique without new improvements.
Thm. 17 uses the dependency graph and the reduction pair processor of Theorem 17. Thus, usable rules are also applied for termination proofs. For innermost termination, this is the same asThm. 12.
Thm. 26 uses the dependency graph and the reduction pair processor of Theorem 26. Thus, now one applies the usable rules w.r.t. argument filterings for both termination and innermost termination.

Moreover, we use five different kinds of reduction pairs for the reduction pair processor:

EMB is the embedding order.
LPO is the lexicographic path order where we allow different symbols to be equal in the precedence.
POLO-filter searches for linear polynomial interpretations with coefficients from {0,1}. Here, we do not yet use the results of Section 7.1 which combine the search for a polynomial order with the search for an argument filtering. Instead, we first determine an argument filtering and search for a monotonic polynomial order afterwards. Moreover, we also do not use the improvement of Section 7.2. Instead, in the reduction pair processor, we try all possibilities in order to make one dependency pair in the DP problem strictly decreasing.
POLO-7.1 differs from POLO-filter by using the results of Section 7.1. Now we do not search for an argument filtering anymore but we look for a (not necessarily monotonic) polynomial order. So in particular, when applying usable rules w.r.t. an argument filtering in Theorem 26, we proceed as in Theorem 46.
POLO-7.1+7.2 is like POLO-7.1, but it also uses the refinements of Section 7.2 which avoid the search for a strictly decreasing dependency pair when applying the reduction pair processor.

For the first three reduction pairs, the reduction pair processor has to search for argument filterings. Here, we use the method of Section 6.

Finally, in each of the above settings, we experiment with different variants for the application of the transformation processors of Definition 28:

"no" means that we do not use any transformations at all.
"older" is is the heuristic of Definition 33 for safe transformations. The combination with the reduction pair processor is done as described at the end of Section 5. However, we do not use the new forward instantiation transformation and instead of Definition 28, we use the previous applicability conditions for the transformations from older papers.
"old" is like "older", but with the new more liberal applicability conditions from Definition 28 instead.
"safe" is the heuristic of Definition 33 with the transformations from Definition 28. So in contrast to "old", now we also use forward instantiation.
"(1)+(2)" differs from "safe" by only regarding those transformation steps as "safe" which satisfy condition (1) or (2) of Definition 33.
"lim" uses the transformations of Definition 28 with a different heuristic than Definition 33. Now at most five transformations are allowed for each dependency pair. To combine the transformations with the reduction pair processor, the strategy at the end of Section 5 is modified as follows: At most two transformation steps are allowed before applying the reduction pair processor, while the remaining transformation steps (up to five in total) are performed afterwards.

You can download a special version of AProVE with all settings described above. So this version of AProVE permits to re-run all our experiments:

AProVE 1.2 JAR06 Special Edition (download and start with: java -jar aprove06.jar, requires Java 5 or higher)

For each example we used a time limit of 60 seconds. This corresponds to the way that tools were evaluated in the annual competitions for termination tools. The computer used was an AMD Athlon 64 at 2.2 GHz running Sun's J2SE 1.5.0 under GNU/Linux 2.6.10 which is similar in speed to the computer used in the 2005 competition.

In the tables, we give the number of examples where termination could be proved ("Y"), where AProVE failed within the time limit of 60 seconds ("F"), and where it failed due to a time-out ("TO"). In square brackets, we give the average runtime (in seconds) needed for TRSs where AProVE could prove termination and where it failed within the time limit. Clicking on "[Details]" leads to a table with the detailed runtimes and results for all examples.

Termination Proofs

The full table containing all possible combinations of our settings can be found here.

Line	Algorithm	Order	Tr.	Y	F	TO

1 [Details]	Thm. 12	EMB	no	89 [0.8]	680 [1.6]	4
2 [Details]	Thm. 12	LPO	no	245 [3.0]	399 [3.6]	129
3 [Details]	Thm. 12	POLO-filter	no	234 [1.7]	426 [5.3]	113
4 [Details]	Thm. 12	POLO-7.1	no	251 [0.9]	512 [1.6]	10
5 [Details]	Thm. 12	POLO-7.1+7.2	no	251 [0.9]	520 [1.5]	2

6 [Details]	Thm. 17	EMB	no	174 [0.7]	588 [1.4]	11
7 [Details]	Thm. 17	LPO	no	277 [1.5]	387 [2.4]	109
8 [Details]	Thm. 17	POLO-filter	no	331 [1.8]	361 [2.0]	81
9 [Details]	Thm. 17	POLO-7.1	no	341 [1.0]	425 [1.4]	7
10 [Details]	Thm. 17	POLO-7.1+7.2	no	341 [1.0]	432 [1.4]	0

11 [Details]	Thm. 26	EMB	no	233 [0.8]	529 [1.6]	11
12 [Details]	Thm. 26	LPO	no	292 [1.6]	374 [2.1]	107
13 [Details]	Thm. 26	POLO-filter	no	361 [1.8]	334 [1.8]	78
14 [Details]	Thm. 26	POLO-7.1	no	368 [1.1]	388 [1.8]	17
15 [Details]	Thm. 26	POLO-7.1+7.2	no	369 [1.1]	399 [1.7]	5

16 [Details] 17 [Details]	Thm. 26	POLO-7.1+7.2	older old	390 [1.2]	349 [2.4]	34
18 [Details]	Thm. 26	POLO-7.1+7.2	(1)+(2)	374 [1.2]	394 [1.8]	5
19 [Details]	Thm. 26	POLO-7.1+7.2	lim	396 [2.1]	268 [1.9]	109
20 [Details]	Thm. 26	POLO-7.1+7.2	safe	406 [1.1]	330 [2.5]	37

Innermost Termination Proofs

The full table containing all possible combinations of our settings can be found here.

Line	Algorithm	Order	Tr.	Y	F	TO

1/6 [Details]	Thm. 12/17	EMB	no	246 [0.8]	516 [1.8]	11
2/7 [Details]	Thm. 12/17	LPO	no	330 [2.1]	350 [3.2]	93
3/8 [Details]	Thm. 12/17	POLO-filter	no	378 [2.1]	324 [2.7]	71
4/9 [Details]	Thm. 12/17	POLO-7.1	no	388 [1.0]	379 [1.6]	6
5/10 [Details]	Thm. 12/17	POLO-7.1+7.2	no	388 [1.2]	384 [1.7]	1

11 [Details]	Thm. 26	EMB	no	289 [0.9]	473 [1.6]	11
12 [Details]	Thm. 26	LPO	no	341 [1.6]	341 [2.4]	91
13 [Details]	Thm. 26	POLO-filter	no	402 [1.7]	305 [1.9]	66
14 [Details]	Thm. 26	POLO-7.1	no	408 [1.2]	350 [1.9]	15
15 [Details]	Thm. 26	POLO-7.1+7.2	no	408 [1.1]	360 [2.0]	5

16 [Details]	Thm. 26	POLO-7.1+7.2	older	465 [1.3]	253 [5.6]	55
17 [Details]	Thm. 26	POLO-7.1+7.2	old	467 [1.3]	250 [5.5]	56
18 [Details]	Thm. 26	POLO-7.1+7.2	(1)+(2)	441 [1.6]	314 [3.5]	18
19 [Details]	Thm. 26	POLO-7.1+7.2	lim	386 [3.7]	144 [2.4]	243
20 [Details]	Thm. 26	POLO-7.1+7.2	safe	480 [1.7]	217 [6.9]	76

Evaluating Section 3 and 4:

Comparing the results for Thm. 12 (lines 1-5), Thm. 17 (lines 6-10), and Thm. 26 (lines 11-15) shows the benefits of our contributions from Sections 3 and 4.

The following links lead to tables comparing Thm. 12 vs. Thm. 17 vs. Thm. 26 for termination using

EMB (lines 1,6,11)    [only differences]
LPO (lines 2,7,12)    [only differences]
POLO-filter (lines 3,8,13)    [only differences]
POLO-7.1 (lines 4,9,14)    [only differences]
POLO-7.1+7.2 (lines 5,10,15)    [only differences]
The following links lead to tables comparing Thm. 12/17 vs. Thm. 26 for innermost termination using

EMB (lines 1/6,11)    [only differences]
LPO (lines 2/7,12)    [only differences]
POLO-filter (lines 3/8,13)    [only differences]
POLO-7.1 (lines 4/9,14)    [only differences]
POLO-7.1+7.2 (lines 5/10,15)    [only differences]

Irrespective of the underlying reduction pair, Thm. 17 is always more powerful than Thm. 12 and increases the success rate by up to 95.5%. Thm. 26 improves upon Thm. 17 further and increases power by up to 33.9% (up to 17.5% for innermost termination).

Evaluating Section 7:

To measure the impact of our contributions in Section 7, we compare the naive use of monotonic polynomial orders (POLO-filter, line 13) with the more sophisticated approaches of Section 7.1 and 7.2 (lines 14-15):

The following links lead to tables comparing Thm. 26 using POLO-filter vs. POLO-7.1 for

termination (lines 13,14)    [only differences]
innermost termination (lines 13,14)    [only differences]
The following links lead to tables comparing Thm. 26 using POLO-7.1 vs. POLO-7.1+7.2 for

termination (lines 14,15)    [only differences]
innermost termination (lines 14,15)    [only differences]

As all three methods are equally powerful in principle, the difference is in efficiency. Indeed, the step from POLO-filter to POLO-7.1 and further to POLO-7.1+7.2 reduces the overall runtime dramatically. The reason is that now a failure can often be detected quickly for examples which led to a time-out before. Thus, while the number of examples where termination can be proved within the time limit only increases slightly, the number of time-outs reduces substantially (by at least 77.2% when going from POLO-filter to POLO-7.1 and by at least 66.7% when going from POLO-7.1 to POLO-7.1+7.2). This fast-failure behavior of POLO-7.1+7.2 permits the use of further techniques in order to attempt a termination proof within the time limit.

Evaluating Section 5:

Now we evaluate the dependency pair transformations from Section 5:

The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "no" vs. "older" transformations for

termination (lines 15,16)    [only differences]
innermost termination (lines 15,16)    [only differences]
The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "older" vs. "old" transformations for

termination (lines 16,17)    [only differences]
innermost termination (lines 16,17)    [only differences]
The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "old" vs. "safe" transformations for

termination (lines 17,18)    [only differences]
innermost termination (lines 17,18)    [only differences]

If one uses the existing "older" transformations with our new heuristic for their application (line 16), then power is increased by 5.7% (14.0% for innermost termination). While the use of dependency pair transformations increases power, it can of course also increase runtimes. This holds especially for innermost termination proofs, since here the transformations are much more often applicable. Definition 28 extends the dependency pair transformations in two ways: we presented more liberal applicability conditions for the transformations in the innermost case and we introduced a new forward instantiation transformation. While the new applicability conditions only lead to a minor improvement of 0.4% in power (cf. the "old" heuristic, line 17), the forward instantiation technique increases power again by 4.1% (2.8% for innermost termination), cf. the "safe heuristic in line 20.

Finally, we also evaluate our heuristic from Definition 33 which describes when to apply dependency pair transformations:

The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "safe" vs. "lim" transformations for

termination (lines 18,19)    [only differences]
innermost termination (lines 18,19)    [only differences]
The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "safe" vs. "(1)+(2)" transformations for

termination (lines 18,20)    [only differences]
innermost termination (lines 18,20)    [only differences]

We consider two possible alternatives. The first alternative is a restriction to those transformations which make the DP problem "smaller", i.e., which correspond to Definition 33 (1) or (2), cf. line 18. Our experiments show that allowing one additional narrowing, instantiation, and forward instantiation step (as in the "safe" heuristic, line 20) has considerable advantages since it increases power by 8.6% (8.8% for innermost termination). As a second alternative, we tried a "lim"-heuristic, which simply applies transformations up to a certain limit. The experiments demonstrate that our "safe"-heuristic is significantly better: it increases power (in particular for innermost termination, where the success rate is improved by 24.4%) and it reduces runtimes dramatically since the "lim"-heuristic leads to extremely many time-outs.

Experiments with AProVE 1.2:

In the end, we also ran the experiment with the automatic mode of the newest AProVE-version (AProVE 1.2) which features several additional techniques (e.g., semantic labelling and match-bounds) in addition to the results of this paper. In the following table, we give the number of examples where termination could be disproved in the "N" column:

AProVE 1.2 automatic mode Y N F TO
for termination 576 [2.3] 94 [2.7] 11 [26.7] 92

for innermost termination 617 [2.6] 61 [2.9] 14 [8.7] 81

AProVE 1.2 automatic mode	Y	N	F	TO
for termination	576 [2.3]	94 [2.7]	11 [26.7]	92
for innermost termination	617 [2.6]	61 [2.9]	14 [8.7]	81

Now AProVE could prove termination of 576 examples while it disproved termination for 94 TRSs. Innermost termination could be shown for 617 examples while it could be disproved for 61 TRSs. This corresponds exactly to the results of AProVE in the termination competition 2005. The average runtime for a successful proof was 2.3s (2.6s for innermost termination) and the average time for a successful disproof was 2.7s (2.9s for innermost termination). Finally, AProVE failed on 11 examples within the time limit (14 for innermost termination) and the average runtime for these failures was 26.7s for termination and 8.7s for innermost termination.

To summarize, our experiments indicate that the contributions of the paper are indeed very useful in practice. This also holds if our results are combined with other termination techniques.