Large Language Models (LLMs) are increasingly used for data annotation in NLP tasks, particularly for subjective tasks like hate speech detection where human annotation is expensive and potentially traumatic. However, the reliability of LLM-generated annotations, especially in the context of demographic biases and model explanations, remains understudied. This paper presents a comprehensive evaluation of LLM annotation reliability for sexism detection.
Using a Generalized Linear Mixed Model (GLMM) approach, we examine annotation variability in sexism detection tasks. Our findings reveal that demographic factors account for only a minor fraction (8%) of the observed variance, with tweet content being the dominant factor. We also find that persona-based prompting often fails to improve, and sometimes degrades, performance relative to baseline models.
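To make the variance decomposition concrete, the sketch below fits a logistic mixed model with crossed random effects for tweets and annotators, using statsmodels' Bayesian binomial mixed GLM as one possible implementation. The file name and column names (label, gender, age_group, tweet_id, annotator_id) are hypothetical placeholders, not the paper's actual schema or model specification.

```python
# Minimal sketch: variance decomposition for binary sexism labels with
# crossed random effects (tweets and annotators). Column names are
# hypothetical; the paper's exact model specification may differ.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# annotations.csv: one row per (tweet, annotator) pair
#   label              -> 0/1 sexism annotation
#   gender, age_group  -> annotator demographics (fixed effects)
#   tweet_id, annotator_id -> grouping factors (random effects)
df = pd.read_csv("annotations.csv")

model = BinomialBayesMixedGLM.from_formula(
    "label ~ C(gender) + C(age_group)",        # demographic fixed effects
    {"tweet": "0 + C(tweet_id)",               # random intercept per tweet
     "annotator": "0 + C(annotator_id)"},      # random intercept per annotator
    df,
)

# Variational Bayes fit; the summary reports posterior means of the fixed
# effects and of the variance component parameters, which indicate how much
# variability is attributable to tweet content versus individual annotators.
result = model.fit_vb()
print(result.summary())
```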
Through Explainable AI analysis, we demonstrate that model predictions rely heavily on content-specific tokens related to sexism rather than on correlates of demographic characteristics. This work recommends prioritizing content-driven explanations and robust annotation protocols over demographic persona simulation for achieving fairness in NLP systems.
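As one way to reproduce this kind of token-level evidence, the sketch below computes SHAP attributions for a Hugging Face text-classification pipeline. The checkpoint name and example tweet are stand-ins, not the models or data used in the paper.

```python
# Minimal sketch: token-level attributions for a sexism/hate classifier via SHAP.
# The checkpoint and example text are illustrative placeholders.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-hate",  # stand-in classifier
    top_k=None,                                    # return scores for all labels
)

explainer = shap.Explainer(clf)   # wraps the pipeline with a text masker
texts = ["Example tweet to explain."]
shap_values = explainer(texts)

# Inspect which tokens push the prediction toward the positive class;
# content-bearing tokens are expected to dominate over demographic correlates.
print(shap_values[0].data)     # the tokens
print(shap_values[0].values)   # per-token attribution scores (per label)
```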
A comprehensive framework for evaluating LLM annotation reliability that considers both accuracy and consistency across demographic groups.
Advanced statistical modeling to quantify the impact of demographic factors on LLM annotation quality and identify systematic biases.
Analysis of the quality and consistency of LLM-generated explanations, revealing discrepancies between stated reasoning and actual predictions.
Concrete recommendations for using LLMs in annotation tasks, including strategies for bias mitigation and quality control.
Analysis of sexism detection tasks using established datasets for examining annotation variability.
Testing generative AI models as annotators, including persona-based prompting compared against baseline models (a prompting sketch follows this list).
Generalized Linear Mixed Model (GLMM) to examine annotation variability and quantify the impact of different factors.
Explainable AI techniques to understand which features drive model predictions in sexism detection.
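To illustrate what persona-based prompting versus a plain baseline prompt can look like in practice, the sketch below uses an OpenAI-style chat-completions client. The model name, persona wording, and helper function are hypothetical choices for illustration, not the exact prompts or models evaluated in the paper.

```python
# Minimal sketch: baseline vs. persona-conditioned annotation prompts.
# Model name and prompt wording are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(tweet: str, persona: str | None = None) -> str:
    """Ask the model for a binary sexism label, optionally under a persona."""
    system = "You are an annotator. Answer only 'sexist' or 'not sexist'."
    if persona:
        system = f"You are {persona}. " + system
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Tweet: {tweet}\nLabel:"},
        ],
    )
    return response.choices[0].message.content.strip()

tweet = "Example tweet to annotate."
baseline_label = annotate(tweet)                                     # no persona
persona_label = annotate(tweet, persona="a 30-year-old woman from Spain")
print(baseline_label, persona_label)
```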
@inproceedings{mohammadi2025assessing,
title={Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation},
author={Mohammadi, Hadi and Shahedi, Tina and Mosteiro Romero, Pablo and Poesio, Massimo and Bagheri, Ayoub and Giachanou, Anastasia},
booktitle={Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)},
year={2025},
organization={Association for Computational Linguistics}
}