EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Hadi Mohammadi1, Evi Papadopoulou1, Mijntje Meijer1, Ayoub Bagheri1
1Utrecht University, The Netherlands
*SEM 2026, ACL 2026 (in press)

Abstract

Deploying Large Language Models (LLMs) across diverse cultural contexts demands reliable assessment of their ability to mirror varied moral perspectives. We present EvalMORAAL, a comprehensive evaluation framework that combines interpretable chain-of-thought (CoT) reasoning with LLM-as-judge peer review, covering 135,700 moral-alignment judgments across 20 models.

Using survey data from 64 countries (World Values Survey) and 34 countries (PEW Global Attitudes), we evaluate models on 1,357 distinct country-topic pairs. Our dual elicitation approach contrasts implicit (log-probability) and explicit (CoT) moral reasoning to expose hidden inconsistencies, while reciprocal peer review surfaces cross-model agreement patterns.

Top performers such as Claude-3-Opus, GPT-4o, and Gemini-Pro reach WVS correlations of r > 0.90. Even so, we observe a persistent 21-point gap (0.21 in correlation) between Western alignment (r = 0.82) and the lowest-scoring non-Western region (r = 0.61). These findings underscore both advances in moral reasoning and the urgency of region-specific calibration before global deployment.

Key Results

  • LLMs evaluated: 20
  • CoT traces: 135K
  • Countries (WVS): 64
  • Top-model correlation: r > 0.90
  • Regional gap: 21 points (r = 0.82 vs. r = 0.61)

Key Contributions

  • Dual Elicitation Framework: Combines log-probability scoring with chain-of-thought reasoning to expose hidden inconsistencies in moral judgments.
  • LLM-as-Judge Peer Review: Models critique each other's reasoning with VALID/INVALID verdicts, providing cross-model validation without human annotation.
  • Comprehensive Scale: 135,700 moral judgments across 20 models, 64 countries, and 23 moral topics from WVS and PEW surveys.
  • Regional Gap Analysis: Documents a 21-point performance gap between Western and non-Western contexts, highlighting deployment risks.
  • Practitioner Checklist: Actionable guidelines for deploying LLMs in culturally diverse settings.

Model Performance Tiers

High Tier (r >= 0.85)

Models: Claude-3-Opus, GPT-4o, Gemini-Pro

WVS Correlation: r > 0.90 | Self-Consistency: > 0.90

Mid-High Tier (0.75 <= r < 0.85)

Models: GPT-4, GPT-4o-mini, Mistral-Large, Phi-3-medium, Command-R-Plus

WVS Correlation: 0.80-0.89 | Self-Consistency: 0.85-0.90

Mid-Lower Tier (0.65 <= r < 0.75)

Models: Claude-3-Haiku, o1-mini, Gemma-2-9B, Mistral-Small

WVS Correlation: 0.70-0.79 | Self-Consistency: 0.80-0.85

Lower Tier (r < 0.65)

Models: Smaller open models (Llama-3.1-8B, Phi-3-mini, etc.)

WVS Correlation: < 0.70 | Self-Consistency: < 0.80

Methodology

1. Survey Data Processing

We process moral judgment data from the World Values Survey (WVS, 2017-2022), covering 64 countries, and the PEW Global Attitudes Survey, covering 34 countries. Topics include divorce, abortion, euthanasia, homosexuality, and 19 other moral issues.
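As an illustration of what such processing might involve, the sketch below rescales a country-level mean from the WVS justifiability scale (1 = never justifiable, 10 = always justifiable) onto [-1, 1]. The target range and the function name `rescale_wvs_mean` are assumptions made for this example; the exact normalization used in the paper is not stated here.

```python
def rescale_wvs_mean(mean_response: float) -> float:
    """Map a mean WVS justifiability response from [1, 10] onto [-1, 1].

    Assumption: a linear rescaling, so that 1 (never justifiable) maps to -1
    and 10 (always justifiable) maps to +1.
    """
    if not 1.0 <= mean_response <= 10.0:
        raise ValueError("WVS justifiability means must lie in [1, 10]")
    return (mean_response - 1.0) / 9.0 * 2.0 - 1.0
```

With this mapping, the scale midpoint of 5.5 lands at 0, so positive values lean toward "justifiable" and negative values toward "not justifiable".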

2. Dual Elicitation

  • Log-Probability Method: Compare probabilities between moral framings ("always justifiable" vs "never justifiable")
  • Chain-of-Thought: Three-step reasoning (social norms recall -> moral reasoning -> numerical score)
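The log-probability method above can be sketched as follows. This is a minimal illustration, assuming the score is the difference between the two framing probabilities after renormalizing over just those two options; the function name and the [-1, 1] output range are hypothetical and not taken from the paper.

```python
import math


def logprob_moral_score(logp_justifiable: float, logp_never: float) -> float:
    """Turn two framing log-probabilities into a score in [-1, 1].

    Assumption: renormalize over the two framings only, then take
    p(justifiable) - p(never). Algebraically this equals
    tanh((logp_justifiable - logp_never) / 2), which is numerically stable.
    """
    return math.tanh((logp_justifiable - logp_never) / 2.0)
```

For example, if the model assigns probability 0.75 to "always justifiable" and 0.25 to "never justifiable", the score is 0.75 - 0.25 = 0.5. Equal probabilities give a score of 0.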

3. Peer Review System

Each model evaluates other models' reasoning traces, providing VALID/INVALID verdicts with confidence scores. This enables cross-model validation without human annotation.
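One way to aggregate these verdicts is a confidence-weighted share of VALID verdicts per authoring model. The sketch below is illustrative only: the `Verdict` record, the weighting scheme, and the exclusion of self-reviews are assumptions, not the paper's documented procedure.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Verdict(NamedTuple):
    judge: str         # model that reviewed the trace
    author: str        # model that wrote the trace
    valid: bool        # True for VALID, False for INVALID
    confidence: float  # judge's confidence, assumed to lie in [0, 1]


def peer_agreement(verdicts: Iterable[Verdict]) -> Dict[str, float]:
    """Confidence-weighted fraction of VALID verdicts received by each author."""
    valid_weight: Dict[str, float] = defaultdict(float)
    total_weight: Dict[str, float] = defaultdict(float)
    for v in verdicts:
        if v.judge == v.author:
            continue  # assumption: a model does not review its own traces
        total_weight[v.author] += v.confidence
        if v.valid:
            valid_weight[v.author] += v.confidence
    return {a: valid_weight[a] / w for a, w in total_weight.items() if w > 0}
```

Weighting by confidence lets a hesitant INVALID verdict count for less than a confident VALID one; an unweighted majority vote would be a simpler alternative.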

4. Conflict Detection

Cases where dual scores diverge by more than 0.4 are flagged for human arbitration, revealing systematic reasoning failures.
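The flagging rule can be expressed in a few lines. This sketch assumes both scores are stored on the same scale, keyed by (country, topic); the data layout and function name are hypothetical, while the 0.4 threshold comes from the text above.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (country, topic)


def flag_conflicts(
    scores: Dict[Key, Tuple[float, float]], threshold: float = 0.4
) -> List[Key]:
    """Return the country-topic pairs whose two scores differ by more than `threshold`.

    `scores` maps (country, topic) -> (log-probability score, CoT score).
    """
    return sorted(
        key for key, (lp, cot) in scores.items() if abs(lp - cot) > threshold
    )
```

The returned list is the queue that would be handed to human arbitrators.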

Regional Performance Gap

A critical finding is the persistent gap between Western and non-Western moral alignment:

Region | Average Correlation | Gap from Western
Western Europe & North America | r = 0.82 | -
East Asia | r = 0.71 | -11 points
Middle East & North Africa | r = 0.65 | -17 points
Sub-Saharan Africa | r = 0.61 | -21 points

This gap persists even among top-tier models, suggesting fundamental limitations in training data diversity rather than model capability.
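To make the arithmetic behind the gap explicit, the sketch below computes a Pearson correlation between model and survey scores and converts regional correlations into point gaps (difference in r, times 100). The helper names are hypothetical, and Pearson's r is an assumption about the correlation measure; the regional values are the ones reported above.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between model scores and survey scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def regional_gaps(region_r: Dict[str, float], reference: str) -> Dict[str, int]:
    """Gap to the reference region, in correlation points (difference in r x 100)."""
    ref = region_r[reference]
    return {
        region: round((ref - r) * 100)
        for region, r in region_r.items()
        if region != reference
    }


# Average correlations as reported in this section.
REPORTED_R = {
    "Western Europe & North America": 0.82,
    "East Asia": 0.71,
    "Middle East & North Africa": 0.65,
    "Sub-Saharan Africa": 0.61,
}
```

Calling `regional_gaps(REPORTED_R, "Western Europe & North America")` reproduces the 11-, 17-, and 21-point gaps listed above.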

Practitioner Checklist

For organizations deploying LLMs across diverse cultural contexts, we recommend:

  1. Baseline regional metrics using survey-based ground truth before deployment
  2. Monitor self-consistency and peer-agreement as early-warning indicators
  3. Audit conflict rates periodically (>15% warrants model review)
  4. Establish localized review for high-stakes moral judgments
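The conflict-rate audit in item 3 could be automated along these lines. This is a minimal sketch: the function name and the counts passed in are hypothetical, and only the 15% threshold comes from the checklist.

```python
def needs_model_review(
    n_conflicts: int, n_total: int, max_rate: float = 0.15
) -> bool:
    """Return True when the share of flagged conflicts exceeds `max_rate`."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_conflicts / n_total > max_rate
```

A periodic job could call this per model and per region, so that a rising conflict rate in one region is not masked by a low global average.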

Related Work

This work builds upon our previous study on cultural moral judgments.

Citation

@inproceedings{mohammadi2025evalmoraal,
  title={EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models},
  author={Mohammadi, Hadi and Papadopoulou, Evi and Meijer, Mijntje and Bagheri, Ayoub},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}