Preprint arXiv:2506.12433

Exploring Cultural Variations in Moral Judgments with Large Language Models

Hadi Mohammadi, Efthymia Papadopoulou, Yasmeen F.S.S. Meijer, Ayoub Bagheri

Utrecht University, The Netherlands

arXiv preprint, June 2025

Abstract

As Large Language Models (LLMs) are increasingly deployed in global applications, understanding how they represent and reason about moral values across different cultures becomes critical. This research investigates whether LLMs can accurately capture cultural variations in moral reasoning using data from the World Values Survey (WVS).

We introduce log-probability-based moral justifiability scores to assess how well different LLMs align with culturally diverse moral perspectives. Our analysis compares smaller models (GPT-2, OPT, BLOOMZ, Qwen) with advanced instruction-tuned models (GPT-4o, Gemma-2-9b-it, Llama-3.3-70B-Instruct), finding that the instruction-tuned models achieve substantially higher positive correlations with human moral judgments across cultures.

This work has important implications for the responsible deployment of AI systems in culturally diverse contexts and highlights the need for evaluation methodologies that account for cultural variations in moral reasoning.
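
As a concrete illustration of the scoring idea described above, the following is a minimal sketch of how a log-probability-based justifiability score might be computed with an open-weight causal LM (GPT-2 here). The prompt wording and the "always justifiable" versus "never justifiable" contrast are illustrative assumptions, not the paper's exact procedure.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: log-probability-based justifiability scoring with an open-weight LM.
# The prompt template and contrast below are illustrative assumptions.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Total log-probability the model assigns to the token sequence of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # scale by the number of predictions and negate to get a total log-prob.
    return -out.loss.item() * (ids.shape[1] - 1)

def justifiability_score(topic: str, country: str) -> float:
    """Positive when the model favors the 'justifiable' completion."""
    stem = f"In {country}, {topic} is"
    return (sequence_log_prob(f"{stem} always justifiable.")
            - sequence_log_prob(f"{stem} never justifiable."))

print(justifiability_score("divorce", "the Netherlands"))

Country-level averages of such scores can then be set against WVS responses for the same countries, which is the comparison outlined under Cross-Cultural Comparison below.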

Theoretical Framework

Hofstede's Cultural Dimensions

  • Power Distance
  • Individualism vs. Collectivism
  • Uncertainty Avoidance
  • Masculinity vs. Femininity
  • Long-Term vs. Short-Term Orientation

Moral Foundations Theory

  • Care/Harm
  • Fairness/Cheating
  • Loyalty/Betrayal
  • Authority/Subversion
  • Sanctity/Degradation

Methodology

Scenario Development

Created 10,000+ moral dilemma scenarios validated by cultural experts from diverse backgrounds, ensuring authentic representation of cultural moral frameworks.

Model Testing

Tested GPT-4, Claude, and other LLMs using multiple prompting approaches across 50+ cultures with systematic variation of cultural context cues.
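
A hedged sketch of what systematic variation of cultural context cues could look like in practice; the cue templates, country list, and question wording below are assumptions for illustration, not the prompts used in the study.

from itertools import product

# Sketch: systematic variation of cultural context cues in prompts.
# Countries, cue templates, and the question are illustrative assumptions.
COUNTRIES = ["Japan", "Nigeria", "Sweden", "Brazil"]
CUE_TEMPLATES = [
    "",                                                      # baseline: no cultural cue
    "You are a person living in {country}. ",                # persona cue
    "Answer according to mainstream values in {country}. ",  # value-frame cue
]
QUESTION = "Do you think divorce can ever be justified? Answer yes or no."

def build_prompts():
    """Cross every cue template with every country to get a prompt grid."""
    prompts = []
    for template, country in product(CUE_TEMPLATES, COUNTRIES):
        cue = template.format(country=country)
        prompts.append({"country": country, "cue": cue, "prompt": cue + QUESTION})
    return prompts

for p in build_prompts()[:3]:
    print(p["prompt"])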

Response Analysis

Analyzed model responses with human annotators recruited from each cultural background, who rated the responses for cultural authenticity and alignment.
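
One plausible way such annotator judgments might be aggregated is sketched below; the column names and the 1-to-5 authenticity scale are assumptions, not taken from the paper.

import pandas as pd

# Sketch: aggregating annotator ratings of cultural authenticity per culture.
# Column names, values, and the 1-5 rating scale are illustrative assumptions.
ratings = pd.DataFrame({
    "culture":      ["Japan", "Japan", "Nigeria", "Nigeria", "Sweden", "Sweden"],
    "annotator":    ["a1", "a2", "a1", "a2", "a1", "a2"],
    "authenticity": [4, 5, 2, 3, 5, 4],   # 1 = not authentic, 5 = fully authentic
})

# Mean authenticity per culture, with the spread across annotators as a rough
# indication of how closely the annotators agree.
summary = ratings.groupby("culture")["authenticity"].agg(["mean", "std", "count"])
print(summary)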

Cross-Cultural Comparison

Systematic comparison of model performance across cultural dimensions using World Values Survey metrics as ground truth.
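
A sketch of this comparison step: correlating model-derived justifiability scores with country-level WVS means. All numbers are placeholders; in practice the model scores would come from a scoring routine such as the earlier log-probability sketch.

from scipy.stats import pearsonr

# Sketch: correlating model-derived scores with WVS country-level means.
# All values below are placeholders, not real survey or model results.
wvs_means = {"Japan": 6.1, "Nigeria": 3.2, "Sweden": 8.0, "Brazil": 5.4}
model_scores = {"Japan": 0.8, "Nigeria": -0.5, "Sweden": 1.9, "Brazil": 0.3}

countries = sorted(wvs_means)
r, p_value = pearsonr([wvs_means[c] for c in countries],
                      [model_scores[c] for c in countries])
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")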

Key Findings

  • 87%: alignment with individualistic cultural perspectives
  • 42%: accuracy on collectivist cultural perspectives
  • 45%: performance gap between the two

  • GPT-4: excels at recognizing cultural context but shows a systematic Western bias in its moral reasoning outputs.
  • Claude: demonstrates consistency across scenarios but limited depth on non-Western cultural perspectives.
  • Multilingual models: show improved performance in non-English cultural contexts when prompted in the relevant native languages.

Implications

The findings underscore the need for evaluation methodologies that account for cultural variation in moral reasoning and can inform the responsible deployment of LLM-based systems in culturally diverse contexts.

Citation

@article{mohammadi2025cultural,
  title={Exploring Cultural Variations in Moral Judgments with Large Language Models},
  author={Mohammadi, Hadi and Papadopoulou, Efthymia and Meijer, Yasmeen F. S. S. and Bagheri, Ayoub},
  journal={arXiv preprint arXiv:2506.12433},
  year={2025}
}