Abstract
As Large Language Models (LLMs) are increasingly deployed in global applications, understanding how they represent and reason about moral values across different cultures becomes critical. This research investigates whether LLMs can accurately capture cultural variations in moral reasoning using data from the World Values Survey (WVS).
We introduce log-probability-based moral justifiability scores to assess how well different LLMs align with culturally diverse moral perspectives. Our analysis compares smaller models (GPT-2, OPT, BLOOMZ, Qwen) with advanced instruction-tuned models (GPT-4o, Gemma-2-9b-it, Llama-3.3-70B-Instruct), finding that instruction-tuned models achieved substantially higher positive correlations with human moral judgments across cultures.
This work has important implications for the responsible deployment of AI systems in culturally diverse contexts and highlights the need for evaluation methodologies that account for cultural variations in moral reasoning.
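The log-probability scoring mentioned above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes we already have the model's log-probabilities for two contrasting completions ("justifiable" vs. "never justifiable") and collapses them into a single score in [-1, 1], where positive values lean toward "justifiable".

```python
import math

def justifiability_score(completion_logprobs: dict[str, float]) -> float:
    """Collapse log-probabilities for two contrasting completions into a
    score in [-1, 1]; positive means the model leans toward 'justifiable'.

    The completion labels and the normalization are illustrative
    assumptions, not the study's exact scoring rule."""
    p_yes = math.exp(completion_logprobs["justifiable"])
    p_no = math.exp(completion_logprobs["never justifiable"])
    return (p_yes - p_no) / (p_yes + p_no)

# Made-up log-probabilities, standing in for values a real run would
# read off a model's output distribution:
score = justifiability_score({"justifiable": -0.5, "never justifiable": -2.0})
```

In a real pipeline the two log-probabilities would come from the model's next-token distribution after a moral-judgment prompt; per-country scores can then be correlated with WVS justifiability ratings.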
Theoretical Framework
Hofstede's Cultural Dimensions
- Power Distance
- Individualism vs. Collectivism
- Uncertainty Avoidance
- Masculinity vs. Femininity
- Long-Term vs. Short-Term Orientation
Moral Foundations Theory
- Care/Harm
- Fairness/Cheating
- Loyalty/Betrayal
- Authority/Subversion
- Sanctity/Degradation
Methodology
Scenario Development
Created 10,000+ moral dilemma scenarios validated by cultural experts from diverse backgrounds, ensuring authentic representation of cultural moral frameworks.
Model Testing
Tested smaller models (GPT-2, OPT, BLOOMZ, Qwen) and instruction-tuned models (GPT-4o, Gemma-2-9b-it, Llama-3.3-70B-Instruct) using multiple prompting approaches across 50+ cultures with systematic variation of cultural context cues.
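Systematic variation of cultural context cues can be sketched as generating one prompt per country from a shared template. The template wording below is a hypothetical stand-in, not the study's actual prompt:

```python
def build_prompts(action: str, countries: list[str]) -> list[str]:
    """Produce one prompt per cultural context cue.

    The phrasing is an illustrative assumption; the study's real
    prompts are not reproduced here."""
    base = f"Is the following action morally justifiable: {action}?"
    return [
        f"You are answering as a typical respondent from {country}. {base}"
        for country in countries
    ]

prompts = build_prompts(
    "avoiding a fare on public transport",
    ["Japan", "Brazil", "Germany"],
)
```

Holding the action fixed while varying only the country cue isolates the cultural-context effect on the model's judgment.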
Response Analysis
Analyzed model responses with diverse human annotators from each cultural background to evaluate cultural authenticity and alignment.
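When multiple annotators rate cultural authenticity, their agreement is typically checked before the labels are used. A minimal Cohen's kappa, assuming two annotators and categorical labels (the section above does not specify which agreement statistic was used):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    """
    assert len(a) == len(b) and a, "annotators must rate the same items"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; values near 0 mean agreement no better than chance.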
Cross-Cultural Comparison
Systematic comparison of model performance across cultural dimensions using World Values Survey metrics as ground truth.
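Alignment against WVS ground truth reduces to correlating per-country model scores with per-country survey means. A self-contained Pearson correlation (toy numbers only; the real analysis uses the WVS data):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between model scores and WVS country means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy data: hypothetical per-country model scores vs. WVS mean
# justifiability ratings for the same countries.
r = pearson([0.1, 0.4, 0.8], [2.0, 4.5, 8.0])
```

A strongly positive r, as the abstract reports for the instruction-tuned models, indicates that countries the survey rates as more permissive on an item are also the ones the model scores as more justifiable.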
Key Findings
- Instruction-tuned models (GPT-4o, Gemma-2-9b-it, Llama-3.3-70B-Instruct) achieved substantially higher positive correlations with human moral judgments across cultures than smaller models (GPT-2, OPT, BLOOMZ, Qwen).
Implications
- Risk of Western moral frameworks dominating when LLMs are deployed in global applications and AI systems.
- Need for culturally diverse training data and evaluation metrics that go beyond English-centric benchmarks.
- Importance of including diverse moral frameworks in AI alignment research and safety considerations.
- Cautious deployment recommended for culturally sensitive applications such as education and healthcare.
Citation
@article{mohammadi2025cultural,
  title={Exploring Cultural Variations in Moral Judgments with Large Language Models},
  author={Mohammadi, Hadi and Papadopoulou, Evi and Meijer, Mijntje and Bagheri, Ayoub},
  journal={arXiv preprint arXiv:2506.12433},
  year={2025}
}