Empirical Evaluation of "Mechanizing and Improving Dependency Pairs"

In our experiments, we tested AProVE on the examples of the Termination Problem Data Base. This is the collection of problems used in the annual termination competition. It contains 773 TRSs from different sources as a benchmark for termination analysis of term rewriting. Our experiments are presented in two tables. In the first table, we tried to prove (full) termination and in the second, we tried to prove innermost termination of all examples. Although there are classes of TRSs where one should only prove innermost termination since here, innermost termination already implies full termination, we did not use this observation in the first table in order to obtain a clearer evaluation of our contributions for full termination. Similarly, we also did not use any additional recent refinements of the dependency pair technique in our experiments in order to clearly assess the impact of the contributions in this paper In our experiments, we used AProVE with the following techniques:

Moreover, we use five different kinds of reduction pairs for the reduction pair processor:

For the first three reduction pairs, the reduction pair processor has to search for argument filterings. Here, we use the method of Section 6.

Finally, in each of the above settings, we experiment with different variants for the application of the transformation processors of Definition 28:

You can download a special version of AProVE with all settings described above. So this version of AProVE permits to re-run all our experiments:

AProVE 1.2 JAR06 Special Edition (download and start with: java -jar aprove06.jar, requires Java 5 or higher)

For each example we used a time limit of 60 seconds. This corresponds to the way that tools were evaluated in the annual competitions for termination tools. The computer used was an AMD Athlon 64 at 2.2 GHz running Sun's J2SE 1.5.0 under GNU/Linux 2.6.10 which is similar in speed to the computer used in the 2005 competition.

In the tables, we give the number of examples where termination could be proved ("Y"), where AProVE failed within the time limit of 60 seconds ("F"), and where it failed due to a time-out ("TO"). In square brackets, we give the average runtime (in seconds) needed for TRSs where AProVE could prove termination and where it failed within the time limit. Clicking on "[Details]" leads to a table with the detailed runtimes and results for all examples.



Termination Proofs

The full table containing all possible combinations of our settings can be found here.



Line Algorithm Order Tr. Y F TO
1 [Details]Thm. 12 EMB no 89 [0.8] 680 [1.6] 4
2 [Details]Thm. 12 LPO no 245 [3.0] 399 [3.6] 129
3 [Details]Thm. 12 POLO-filter no 234 [1.7] 426 [5.3] 113
4 [Details]Thm. 12 POLO-7.1 no 251 [0.9] 512 [1.6] 10
5 [Details]Thm. 12 POLO-7.1+7.2 no 251 [0.9] 520 [1.5] 2
6 [Details]Thm. 17 EMB no 174 [0.7] 588 [1.4] 11
7 [Details]Thm. 17 LPO no 277 [1.5] 387 [2.4] 109
8 [Details]Thm. 17 POLO-filter no 331 [1.8] 361 [2.0] 81
9 [Details]Thm. 17 POLO-7.1 no 341 [1.0] 425 [1.4] 7
10 [Details]Thm. 17 POLO-7.1+7.2 no 341 [1.0] 432 [1.4] 0
11 [Details]Thm. 26 EMB no 233 [0.8] 529 [1.6] 11
12 [Details]Thm. 26 LPO no 292 [1.6] 374 [2.1] 107
13 [Details]Thm. 26 POLO-filter no 361 [1.8] 334 [1.8] 78
14 [Details]Thm. 26 POLO-7.1 no 368 [1.1] 388 [1.8] 17
15 [Details]Thm. 26 POLO-7.1+7.2 no 369 [1.1] 399 [1.7] 5
16 [Details]
17 [Details]
Thm. 26 POLO-7.1+7.2 older
old
390 [1.2] 349 [2.4] 34
18 [Details]Thm. 26 POLO-7.1+7.2 (1)+(2) 374 [1.2] 394 [1.8] 5
19 [Details]Thm. 26 POLO-7.1+7.2 lim 396 [2.1] 268 [1.9] 109
20 [Details]Thm. 26 POLO-7.1+7.2 safe 406 [1.1] 330 [2.5] 37




Innermost Termination Proofs

The full table containing all possible combinations of our settings can be found here.



Line Algorithm Order Tr. Y F TO
1/6 [Details]Thm. 12/17 EMB no 246 [0.8] 516 [1.8] 11
2/7 [Details]Thm. 12/17 LPO no 330 [2.1] 350 [3.2] 93
3/8 [Details]Thm. 12/17 POLO-filter no 378 [2.1] 324 [2.7] 71
4/9 [Details]Thm. 12/17 POLO-7.1 no 388 [1.0] 379 [1.6] 6
5/10 [Details]Thm. 12/17 POLO-7.1+7.2 no 388 [1.2] 384 [1.7] 1
11 [Details]Thm. 26 EMB no 289 [0.9] 473 [1.6] 11
12 [Details]Thm. 26 LPO no 341 [1.6] 341 [2.4] 91
13 [Details]Thm. 26 POLO-filter no 402 [1.7] 305 [1.9] 66
14 [Details]Thm. 26 POLO-7.1 no 408 [1.2] 350 [1.9] 15
15 [Details]Thm. 26 POLO-7.1+7.2 no 408 [1.1] 360 [2.0] 5
16 [Details]Thm. 26 POLO-7.1+7.2 older 465 [1.3] 253 [5.6] 55
17 [Details]Thm. 26 POLO-7.1+7.2 old 467 [1.3] 250 [5.5] 56
18 [Details]Thm. 26 POLO-7.1+7.2 (1)+(2) 441 [1.6] 314 [3.5] 18
19 [Details]Thm. 26 POLO-7.1+7.2 lim 386 [3.7] 144 [2.4] 243
20 [Details]Thm. 26 POLO-7.1+7.2 safe 480 [1.7] 217 [6.9] 76




Evaluating Section 3 and 4:

Comparing the results for Thm. 12 (lines 1-5), Thm. 17 (lines 6-10), and Thm. 26 (lines 11-15) shows the benefits of our contributions from Sections 3 and 4.
The following links lead to tables comparing Thm. 12 vs. Thm. 17 vs. Thm. 26 for termination using The following links lead to tables comparing Thm. 12/17 vs. Thm. 26 for innermost termination using
Irrespective of the underlying reduction pair, Thm. 17 is always more powerful than Thm. 12 and increases the success rate by up to 95.5%. Thm. 26 improves upon Thm. 17 further and increases power by up to 33.9% (up to 17.5% for innermost termination).


Evaluating Section 7:

To measure the impact of our contributions in Section 7, we compare the naive use of monotonic polynomial orders (POLO-filter, line 13) with the more sophisticated approaches of Section 7.1 and 7.2 (lines 14-15):
The following links lead to tables comparing Thm. 26 using POLO-filter vs. POLO-7.1 for The following links lead to tables comparing Thm. 26 using POLO-7.1 vs. POLO-7.1+7.2 for
As all three methods are equally powerful in principle, the difference is in efficiency. Indeed, the step from POLO-filter to POLO-7.1 and further to POLO-7.1+7.2 reduces the overall runtime dramatically. The reason is that now a failure can often be detected quickly for examples which led to a time-out before. Thus, while the number of examples where termination can be proved within the time limit only increases slightly, the number of time-outs reduces substantially (by at least 77.2% when going from POLO-filter to POLO-7.1 and by at least 66.7% when going from POLO-7.1 to POLO-7.1+7.2). This fast-failure behavior of POLO-7.1+7.2 permits the use of further techniques in order to attempt a termination proof within the time limit.


Evaluating Section 5:

Now we evaluate the dependency pair transformations from Section 5:
The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "no" vs. "older" transformations for The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "older" vs. "old" transformations for The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "old" vs. "safe" transformations for
If one uses the existing "older" transformations with our new heuristic for their application (line 16), then power is increased by 5.7% (14.0% for innermost termination). While the use of dependency pair transformations increases power, it can of course also increase runtimes. This holds especially for innermost termination proofs, since here the transformations are much more often applicable. Definition 28 extends the dependency pair transformations in two ways: we presented more liberal applicability conditions for the transformations in the innermost case and we introduced a new forward instantiation transformation. While the new applicability conditions only lead to a minor improvement of 0.4% in power (cf. the "old" heuristic, line 17), the forward instantiation technique increases power again by 4.1% (2.8% for innermost termination), cf. the "safe heuristic in line 20.

Finally, we also evaluate our heuristic from Definition 33 which describes when to apply dependency pair transformations:

The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "safe" vs. "lim" transformations for The following links lead to tables comparing Thm. 26 using POLO-7.1+7.2 with "safe" vs. "(1)+(2)" transformations for
We consider two possible alternatives. The first alternative is a restriction to those transformations which make the DP problem "smaller", i.e., which correspond to Definition 33 (1) or (2), cf. line 18. Our experiments show that allowing one additional narrowing, instantiation, and forward instantiation step (as in the "safe" heuristic, line 20) has considerable advantages since it increases power by 8.6% (8.8% for innermost termination). As a second alternative, we tried a "lim"-heuristic, which simply applies transformations up to a certain limit. The experiments demonstrate that our "safe"-heuristic is significantly better: it increases power (in particular for innermost termination, where the success rate is improved by 24.4%) and it reduces runtimes dramatically since the "lim"-heuristic leads to extremely many time-outs.


Experiments with AProVE 1.2:

In the end, we also ran the experiment with the automatic mode of the newest AProVE-version (AProVE 1.2) which features several additional techniques (e.g., semantic labelling and match-bounds) in addition to the results of this paper. In the following table, we give the number of examples where termination could be disproved in the "N" column:

AProVE 1.2 automatic modeYNFTO
for termination576 [2.3]94 [2.7]11 [26.7]92
for innermost termination617 [2.6]61 [2.9]14 [8.7]81

Now AProVE could prove termination of 576 examples while it disproved termination for 94 TRSs. Innermost termination could be shown for 617 examples while it could be disproved for 61 TRSs. This corresponds exactly to the results of AProVE in the termination competition 2005. The average runtime for a successful proof was 2.3s (2.6s for innermost termination) and the average time for a successful disproof was 2.7s (2.9s for innermost termination). Finally, AProVE failed on 11 examples within the time limit (14 for innermost termination) and the average runtime for these failures was 26.7s for termination and 8.7s for innermost termination.

To summarize, our experiments indicate that the contributions of the paper are indeed very useful in practice. This also holds if our results are combined with other termination techniques.