Publications

2025

  1. TactfulToM: Do LLMs have the Theory of Mind ability to understand White Lies?
    Yiwei Liu, Emma Jane Pretty, Jiahao Huang, and 1 more author
    In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov 2025
  2. Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
    Momoka Furuhashi, Kouta Nakayama, Takashi Kodama, and 1 more author
    In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov 2025
  3. Specification-Aware Machine Translation and Evaluation for Purpose Alignment
    Yoko Kayano and Saku Sugawara
    In Proceedings of the Tenth Conference on Machine Translation, Nov 2025
  4. MCQFormatBench: Robustness Tests for Multiple-Choice Questions
    Hiroo Takizawa, Saku Sugawara, and Akiko Aizawa
    In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), Jul 2025
  5. Development of Numerical Error Detection Tasks to Analyze the Numerical Capabilities of Language Models
    Taku Sakamoto, Saku Sugawara, and Akiko Aizawa
    In Proceedings of the 31st International Conference on Computational Linguistics, Jan 2025
  6. Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models
    Futa Waseda, Saku Sugawara, and Isao Echizen
    In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 2025

2024

  1. Rationale-Aware Answer Verification by Pairwise Self-Evaluation
    Akira Kawabata and Saku Sugawara
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
  2. Can Language Models Induce Grammatical Knowledge from Indirect Evidence?
    Miyu Oba, Yohei Oseki, Akiyo Fukatsu, and 4 more authors
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
  3. Modeling Overregularization in Children with Small Language Models
    Akari Haga, Saku Sugawara, Akiyo Fukatsu, and 4 more authors
    In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024
  4. What Makes Language Models Good-enough?
    Daiki Asami and Saku Sugawara
    In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024
  5. Automatic Feedback Generation for Short Answer Questions Using Answer Diagnostic Graphs
    M. Furuhashi, H. Funayama, Y. Iwase, and 5 more authors
    In EDULEARN24 Proceedings, Palma, Spain, 1-3 July 2024

2023

  1. PROPRES: Investigating the Projectivity of Presupposition with Various Triggers and Environments
    Daiki Asami and Saku Sugawara
    In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), Dec 2023
  2. Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension
    Akira Kawabata and Saku Sugawara
    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023
  3. On Degrees of Freedom in Defining and Testing Natural Language Understanding
    Saku Sugawara and Shun Tsugita
    In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023
  4. Probing Physical Reasoning with Counter-Commonsense Context
    Kazushi Kondo, Saku Sugawara, and Akiko Aizawa
    In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Jul 2023
  5. Which Shortcut Solution Do Question Answering Models Prefer to Learn?
    Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa
    Proceedings of the AAAI Conference on Artificial Intelligence, Jun 2023
  6. Analyzing the Effectiveness of the Underlying Reasoning Tasks in Multi-hop Question Answering
    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and 1 more author
    In Findings of the Association for Computational Linguistics: EACL 2023, May 2023
  7. A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension
    Xanh Ho, Johannes Mario Meissner, Saku Sugawara, and 1 more author
    2023

2022

  1. Cross-Modal Similarity-Based Curriculum Learning for Image Captioning
    Hongkuan Zhang, Saku Sugawara, Akiko Aizawa, and 3 more authors
    In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Dec 2022
  2. Debiasing Masks: A New Framework for Shortcut Mitigation in NLU
    Johannes Mario Meissner, Saku Sugawara, and Akiko Aizawa
    In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Dec 2022
  3. Look to the Right: Mitigating Relative Position Bias in Extractive Question Answering
    Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa
    In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Dec 2022
  4. How Well Do Multi-hop Reading Comprehension Models Understand Date Information?
    Xanh Ho, Saku Sugawara, and Akiko Aizawa
    In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Nov 2022
  5. Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios
    Mana Ashida and Saku Sugawara
    In Proceedings of the 29th International Conference on Computational Linguistics, Oct 2022
  6. What Makes Reading Comprehension Questions Difficult?
    Saku Sugawara, Nikita Nangia, Alex Warstadt, and 1 more author
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
  7. Penalizing Confident Predictions on Largely Perturbed Inputs Does Not Improve Out-of-Distribution Generalization in Question Answering
    Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa
    2022

2021

  1. Can Question Generation Debias Question Answering Models? A Case Study on Question–Context Lexical Overlap
    Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa
    In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, Nov 2021
  2. Embracing Ambiguity: Shifting the Training Target of NLI Models
    Johannes Mario Meissner, Napat Thumwanit, Saku Sugawara, and 1 more author
    In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Aug 2021
  3. What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
    Nikita Nangia, Saku Sugawara, Harsh Trivedi, and 3 more authors
    In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Aug 2021
  4. Improving the Robustness of QA Models to Challenge Sets with Variational Question-Answer Pair Generation
    Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa
    In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, Aug 2021
  5. Benchmarking Machine Reading Comprehension: A Psychological Perspective
    Saku Sugawara, Pontus Stenetorp, and Akiko Aizawa
    In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Apr 2021

2020

  1. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and 1 more author
    In Proceedings of the 28th International Conference on Computational Linguistics, Dec 2020
  2. Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets
    Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and 1 more author
    In Proceedings of the AAAI Conference on Artificial Intelligence, 2020

2018

  1. What Makes Reading Comprehension Questions Easier?
    Saku Sugawara, Kentaro Inui, Satoshi Sekine, and 1 more author
    In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Oct 2018

2017

  1. Evaluation Metrics for Machine Reading Comprehension: Prerequisite Skills and Readability
    Saku Sugawara, Yusuke Kido, Hikaru Yokono, and 1 more author
    In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2017
  2. Prerequisite Skills for Reading Comprehension: Multi-Perspective Analysis of MCTest Datasets and Systems
    Saku Sugawara, Hikaru Yokono, and Akiko Aizawa
    Proceedings of the AAAI Conference on Artificial Intelligence, Feb 2017

2016

  1. Annotation and Analysis of Discourse Relations, Temporal Relations and Multi-Layered Situational Relations in Japanese Texts
    Kimi Kaneko, Saku Sugawara, Koji Mineshima, and 1 more author
    In Proceedings of the 12th Workshop on Asian Language Resources (ALR12), Dec 2016
  2. An Analysis of Prerequisite Skills for Reading Comprehension
    Saku Sugawara and Akiko Aizawa
    In Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, Nov 2016