A Detailed Analysis on the Effectiveness of Automatic Filtering

17 Jan 2024

Authors:

(1) Dominic Petrak, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany;

(2) Nafise Sadat Moosavi, Department of Computer Science, The University of Sheffield, United Kingdom;

(3) Ye Tian, Wluper, London, United Kingdom;

(4) Nikolai Rozanov, Wluper, London, United Kingdom;

(5) Iryna Gurevych, UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany.

Table of Links

Abstract & Introduction

Related Work

Datasets Examined

Manual Error Type Analysis and Taxonomies

Automatic Filtering for Potentially Relevant Dialogs

Statistical Analysis

Evaluation and Experiments

Discussion

Conclusion, Limitation, Acknowledgments, and References

A Integrated Error Taxonomy – Details

B Error-Indicating Sentences And Phrases

C Automatic Filtering – Implementation

D Automatic Filtering – Sentence-Level Analysis

E Task-Oriented Dialogs – Examples

F Effectiveness Of Automatic Filtering – A Detailed Analysis

G Inter-Annotator Agreement – Detailed Analysis

H Annotation Guidelines

I Hyperparameters and Baseline Experiments

J Human-Human Dialogs – Examples

F Effectiveness Of Automatic Filtering – A Detailed Analysis

For the statistical analysis in Section 6, we consider 20 dialogs from each similarity range, i.e., 50%–60%, 60%–70%, 70%–80%, 80%–90%, and 90%–100% (if available, see also Appendix D), for each dataset examined. Because data in the upper ranges (80%–100%) is scarce for WoW (Dinan et al., 2019), PC (Zhang et al., 2018), and BABI (Bordes et al., 2017), the filtered sample consists of only 555 dialogs (instead of 600, as for the randomly selected dialogs). Table 12 shows the errors annotated for the statistical analysis with respect to the similarity ranges identified by automatic filtering; a dialog belongs to a range if it contains at least one user response with a sentence that is similar, within that range, to at least one error-indicating sentence. Overall (O) is the number of dialogs randomly sampled from the respective similarity range, and Error (E) is the number of those dialogs that our manual analysis found to contain an error in a system utterance.
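The per-range sampling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the input format (one similarity score per dialog), the half-open range boundaries, and the names `range_of` and `sample_per_range` are assumptions made for illustration. In the paper, a dialog is assigned to a range through its individual sentences, so a dialog could in principle match more than one range.

```python
import random
from collections import defaultdict

# Similarity ranges named in the text, as (lower, upper) fractions.
RANGES = [(0.5, 0.6), (0.6, 0.7), (0.7, 0.8), (0.8, 0.9), (0.9, 1.0)]


def range_of(score):
    """Return the similarity range containing `score`, or None if below 50%."""
    for lower, upper in RANGES:
        # Assumption: ranges are half-open, except that 100% belongs to the last one.
        if lower <= score < upper or (upper == 1.0 and score == 1.0):
            return (lower, upper)
    return None


def sample_per_range(dialogs, per_range=20, seed=0):
    """Sample up to `per_range` dialog ids from each similarity range.

    `dialogs` is an iterable of (dialog_id, score) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for dialog_id, score in dialogs:
        bucket = range_of(score)
        if bucket is not None:
            buckets[bucket].append(dialog_id)
    # When a range holds fewer than `per_range` dialogs, take all of them.
    return {
        bucket: rng.sample(ids, min(per_range, len(ids)))
        for bucket, ids in buckets.items()
    }
```

Taking everything that is available in a sparse range is what makes the filtered sample fall short of the full quota, as with the 555 dialogs (instead of 600) reported above.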

Table 12: Identified errors in all datasets across similarity ranges.

Overall, only 58 of the randomly selected dialogs (9.6%) contain errors. Among the automatically filtered dialogs, we observe 130 such cases (about 23.4% of the 555 filtered dialogs). Automatic filtering therefore facilitates the identification of errors in system utterances. Although the number of identified errors is low overall, most errors are found in the 60%–100% range, which excludes the 50%–60% range, the densest one for MWoZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2020), PC, and WoW (see also Figure 2).
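As a quick check on these figures, the following Python sketch computes the share of dialogs with errors in both samples from the counts given in the text (58 of 600 and 130 of 555). The helper name `error_rate` is hypothetical. Note that 58/600 rounds to 9.7%; the reported 9.6% appears to be truncated rather than rounded.

```python
def error_rate(errors, total):
    """Fraction of dialogs that contain at least one annotated error."""
    return errors / total


# Counts taken from the text above.
random_rate = error_rate(58, 600)     # randomly selected dialogs
filtered_rate = error_rate(130, 555)  # automatically filtered dialogs
ratio = filtered_rate / random_rate

print(f"random:   {random_rate:.1%}")    # random:   9.7%
print(f"filtered: {filtered_rate:.1%}")  # filtered: 23.4%
print(f"ratio:    {ratio:.2f}")          # ratio:    2.42
```

Under these counts, a filtered dialog is roughly 2.4 times as likely to contain an error as a randomly selected one.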

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.