Workshop Paper | ACL 2025

Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro Romero, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou
Utrecht University, The Netherlands | Queen Mary University of London, UK
Workshop on Gender Bias in Natural Language Processing (GeBNLP), ACL 2025

Abstract

Large Language Models (LLMs) are increasingly used for data annotation in NLP tasks, particularly for subjective tasks like hate speech detection, where human annotation is expensive and potentially traumatic for annotators. However, the reliability of LLM-generated annotations, especially in the context of demographic biases and model explanations, remains understudied. This paper presents a comprehensive evaluation of LLM annotation reliability for sexism detection.

Using a Generalized Linear Mixed Model (GLMM) approach, we examine annotation variability in sexism detection tasks. Our findings reveal that demographic factors account for a minor fraction (8%) of the observed variance, with tweet content being the dominant factor. We also find that persona-based prompting approaches often fail to enhance, and sometimes degrade, performance compared to baseline models.

Through Explainable AI analysis, we demonstrate that model predictions rely heavily on content-specific tokens related to sexism rather than on correlates of demographic characteristics. This work recommends prioritizing content-driven explanations and robust annotation protocols over demographic persona simulation for achieving fairness in NLP systems.

Key Contributions

Reliability Assessment Framework

A comprehensive framework for evaluating LLM annotation reliability that considers both accuracy and consistency across demographic groups.

Mixed-Effects Analysis

Advanced statistical modeling to quantify the impact of demographic factors on LLM annotation quality and identify systematic biases.

Explanation Evaluation

Analysis of the quality and consistency of LLM-generated explanations, revealing discrepancies between stated reasoning and actual predictions.

Practical Guidelines

Concrete recommendations for using LLMs in annotation tasks, including strategies for bias mitigation and quality control.

Methodology

Data Selection

Established sexism detection datasets selected to examine annotation variability.

LLM Annotation

Evaluation of LLMs as annotators, with persona-based prompting approaches compared against baseline models.
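As a rough illustration, the sketch below contrasts a baseline annotation prompt with a persona-conditioned one, assuming an OpenAI-style chat-completions client. The prompt wording, demographic attributes, and model name are illustrative placeholders, not the paper's exact setup.

# Minimal sketch: baseline vs. persona-based LLM annotation prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BASELINE = "You are an annotator for a sexism detection task."
PERSONA = (
    "You are a {age}-year-old {gender} annotator from {country} "
    "labelling tweets for a sexism detection task."
)

def annotate(tweet: str, persona: dict | None = None) -> str:
    """Ask the LLM for a binary sexism label, optionally under a persona."""
    system = PERSONA.format(**persona) if persona else BASELINE
    user = (
        "Label the following tweet as 'sexist' or 'not sexist'. "
        f"Reply with the label only.\n\nTweet: {tweet}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

# Same tweet, labelled with and without a simulated demographic persona
label_base = annotate("example tweet text")
label_persona = annotate("example tweet text",
                         persona={"age": 34, "gender": "female", "country": "Spain"})

Collecting both labels for each tweet is what allows the persona-based and baseline conditions to be compared against human annotations.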

Statistical Modeling

Generalized Linear Mixed Model (GLMM) to examine annotation variability and quantify the impact of different factors.
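A minimal GLMM sketch is shown below, assuming a long-format table of binary labels with tweet and annotator identifiers plus annotator demographics. It uses statsmodels' Bayesian binomial mixed GLM as a stand-in for the paper's exact model specification; the file path and column names are illustrative.

# Sketch of a mixed-effects analysis of binary sexism labels.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("annotations_long.csv")  # placeholder path

# Fixed effects: annotator demographics.
# Variance components (random effects): tweet and annotator identity.
model = BinomialBayesMixedGLM.from_formula(
    "label ~ gender + age_group",
    vc_formulas={
        "tweet": "0 + C(tweet_id)",
        "annotator": "0 + C(annotator_id)",
    },
    data=df,
)
result = model.fit_vb()   # variational Bayes fit
print(result.summary())   # fixed effects and variance-component estimates

Comparing the fitted variance components indicates how much label variability is attributable to tweets versus annotators, while the fixed-effect coefficients capture demographic effects.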

Explainability Analysis

Explainable AI techniques to understand which features drive model predictions in sexism detection.

Variance Analysis: Decomposing annotation variance by factor
Persona Evaluation: Comparing persona-based vs. baseline prompts
Token Attribution: Content-specific token importance analysis
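To illustrate the token-attribution step, the rough sketch below computes SHAP values over a Hugging Face text-classification pipeline. The checkpoint named here is a generic hate-speech model used as a stand-in, not necessarily the classifier analysed in the paper.

# Illustrative token-attribution sketch with SHAP on a transformers pipeline.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-hate",  # stand-in checkpoint
    top_k=None,  # return scores for all labels
)

explainer = shap.Explainer(clf)
shap_values = explainer(["Example tweet to explain."])

# Per-token contributions toward each class; large positive values mark
# the content words driving the positive (hateful/sexist) prediction.
print(shap_values[0])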

Key Findings

Demographic Factors and Annotation Variance

Demographic factors account for only a minor fraction (8%) of the observed variance in annotations; tweet content is the dominant factor.

Persona-Based Prompting

Persona-based prompting approaches often fail to enhance, and sometimes degrade, performance compared to baseline models.

Explainable AI Insights

Model predictions rely heavily on content-specific tokens related to sexism rather than on correlates of demographic characteristics.

Implications and Recommendations

Content-Driven Explanations
Prioritize content-driven explanations over demographic persona simulation for fairness.
Robust Annotation Protocols
Develop annotation protocols that focus on content-specific features rather than demographic assumptions.
Baseline Comparison
Always compare persona-based approaches against baseline models to verify actual improvement.
Content Focus
Recognize that tweet content is the dominant factor; design systems that leverage this insight.

Citation

@inproceedings{mohammadi2025assessing,
  title={Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation},
  author={Mohammadi, Hadi and Shahedi, Tina and Mosteiro Romero, Pablo and Poesio, Massimo and Bagheri, Ayoub and Giachanou, Anastasia},
  booktitle={Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)},
  year={2025},
  organization={Association for Computational Linguistics}
}