EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Hadi Mohammadi1, Evi Papadopoulou1, Mijntje Meijer1, Ayoub Bagheri1
1Utrecht University, The Netherlands
ACL 2025 (Under Review)

Abstract

Deploying Large Language Models (LLMs) across diverse cultural contexts demands reliable assessment of their ability to mirror varied moral perspectives. We present EvalMORAAL, a comprehensive evaluation framework that combines interpretable chain-of-thought (CoT) reasoning with LLM-as-judge peer review, covering 135,700 moral-alignment judgments across 20 models.

Using survey data from 64 countries (World Values Survey) and 34 countries (PEW Global Attitudes), we evaluate models on 1,357 distinct country-topic pairs. Our dual elicitation approach contrasts implicit (log-probability) and explicit (CoT) moral reasoning to expose hidden inconsistencies, while reciprocal peer review surfaces cross-model agreement patterns.

Top performers such as Claude-3-Opus, GPT-4o, and Gemini-Pro reach WVS correlations r > 0.90; yet we observe a persistent 21-point gap between Western (r = 0.82) and non-Western (r = 0.61) alignment. These findings underscore both advances in moral reasoning and the urgency of region-specific calibration before global deployment.

Key Results

20
LLMs Evaluated
135K
CoT Traces
64
Countries (WVS)
0.90+
Top Model r
21%
Regional Gap

Key Contributions

  • Dual Elicitation Framework: Combines log-probability scoring with chain-of-thought reasoning to expose hidden inconsistencies in moral judgments.
  • LLM-as-Judge Peer Review: Models critique each other's reasoning with VALID/INVALID verdicts, providing cross-model validation without human annotation.
  • Comprehensive Scale: 135,700 moral judgments across 20 models, 64 countries, and 23 moral topics from WVS and PEW surveys.
  • Regional Gap Analysis: Documents a 21-point performance gap between Western and non-Western contexts, highlighting deployment risks.
  • Practitioner Checklist: Actionable guidelines for deploying LLMs in culturally diverse settings.

Model Performance Tiers

High Tier (r >= 0.85)

Models: Claude-3-Opus, GPT-4o, Gemini-Pro

WVS Correlation: r > 0.90 | Self-Consistency: > 0.90

Mid-High Tier (0.75 <= r < 0.85)

Models: GPT-4, GPT-4o-mini, Mistral-Large, Phi-3-medium, Command-R-Plus

WVS Correlation: 0.80-0.89 | Self-Consistency: 0.85-0.90

Mid-Lower Tier (0.65 <= r < 0.75)

Models: Claude-3-Haiku, o1-mini, Gemma-2-9B, Mistral-Small

WVS Correlation: 0.70-0.79 | Self-Consistency: 0.80-0.85

Lower Tier (r < 0.65)

Models: Smaller open models (Llama-3.1-8B, Phi-3-mini, etc.)

WVS Correlation: < 0.70 | Self-Consistency: < 0.80

Methodology

1. Survey Data Processing

We process moral judgment data from the World Values Survey (WVS, 2017-2022) covering 64 countries and PEW Global Attitudes Survey covering 34 countries. Topics include divorce, abortion, euthanasia, homosexuality, and 19 other moral issues.

2. Dual Elicitation

  • Log-Probability Method: Compare probabilities between moral framings ("always justifiable" vs "never justifiable")
  • Chain-of-Thought: 3-step reasoning: social norms recall -> moral reasoning -> numerical score

3. Peer Review System

Each model evaluates other models' reasoning traces, providing VALID/INVALID verdicts with confidence scores. This enables cross-model validation without human annotation.

4. Conflict Detection

Cases where dual scores diverge by more than 0.4 are flagged for human arbitration, revealing systematic reasoning failures.

Regional Performance Gap

A critical finding is the persistent gap between Western and non-Western moral alignment:

Region Average Correlation Gap from Western
Western Europe & North America r = 0.82 -
East Asia r = 0.71 -11 points
Middle East & North Africa r = 0.65 -17 points
Sub-Saharan Africa r = 0.61 -21 points

This gap persists even among top-tier models, suggesting fundamental limitations in training data diversity rather than model capability.

Practitioner Checklist

For organizations deploying LLMs across diverse cultural contexts, we recommend:

  1. Baseline regional metrics using survey-based ground truth before deployment
  2. Monitor self-consistency and peer-agreement as early-warning indicators
  3. Audit conflict rates periodically (>15% warrants model review)
  4. Establish localized review for high-stakes moral judgments

Related Work

This work builds upon our previous study on cultural moral judgments:

Citation

@inproceedings{mohammadi2025evalmoraal,
  title={EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models},
  author={Mohammadi, Hadi and Papadopoulou, Evi and Meijer, Mijntje and Bagheri, Ayoub},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}