NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery

Dongguk University
EMNLP 2025 (Oral Presentation)
SAC Highlights Award

Abstract

Social norms govern culturally appropriate behavior in communication, enabling dialogue systems to produce responses that are not only coherent but also socially acceptable. We present NormGenesis, a multicultural framework for generating and annotating socially grounded dialogues across English, Chinese, and Korean. To model the dynamics of social interaction beyond static norm classification, we propose a novel dialogue type, Violation-to-Resolution (V2R), which models the progression of conversations following norm violations through recognition and socially appropriate repair.

To improve pragmatic consistency in underrepresented languages, we implement an exemplar-based iterative refinement early in the dialogue synthesis process. This design introduces alignment with linguistic, emotional, and sociocultural expectations before full dialogue generation begins. Using this framework, we construct a dataset of 10,800 multi-turn dialogues annotated at the turn level for norm adherence, speaker intent, and emotional response. Human and LLM-based evaluations demonstrate that NormGenesis significantly outperforms existing datasets in refinement quality, dialogue naturalness, and generalization performance.

Comparison of generation outputs in Korean. Prior methods produce pragmatically inconsistent responses (red), whereas our framework yields culturally and pragmatically coherent outputs (blue).

🔥 News

  • [Nov 2025] We received the SAC Highlights Award at EMNLP 2025! 🏆
  • [Oct 2025] Nominated for the Outstanding Paper, SAC Highlights, and Resource Paper Awards at EMNLP 2025. 🏆
  • [Sep 2025] Our paper was released on arXiv. 📘
  • [Aug 2025] Selected for Main Conference (Oral Presentation) at EMNLP 2025. 🎤

🚀 Methodology

NormGenesis Overview. Our framework consists of four stages: (1) norm and style design, (2) scenario–situation construction, (3) exemplar-based iterative refinement, and (4) multi-turn dialogue generation.

NormGenesis generates socially grounded dialogues in four stages:

1. Norm and Style Design

We constructed a taxonomy of 12 conversational social norm categories (e.g., Apology, Empathy, Respect) and defined 120 culturally grounded subnorms per language. We also defined pragmatic parameters such as tone, honorifics, and relational distance.

2. Scenario-Situation Constructor

We generate scenario-situation pairs labeled as Norm Adherence, Norm Violation, or Violation-to-Resolution (V2R). V2R models post-violation repair strategies, capturing core aspects of interactional competence.
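As a rough illustration of what a labeled scenario-situation pair could look like in code, the sketch below uses a small enum for the three dialogue types. The class names, field names, and example values are assumptions made for illustration and are not taken from the NormGenesis release.

```python
from dataclasses import dataclass
from enum import Enum


class DialogueType(Enum):
    """The three labels named in the text."""
    ADHERENCE = "Norm Adherence"
    VIOLATION = "Norm Violation"
    V2R = "Violation-to-Resolution"


@dataclass
class ScenarioSituation:
    # All field names are hypothetical; the real schema may differ.
    scenario: str            # broad setting
    situation: str           # concrete event within the setting
    norm_category: str       # e.g. "Apology", "Empathy", "Respect"
    language: str            # e.g. "en", "zh", "ko"
    dialogue_type: DialogueType

    def requires_repair(self) -> bool:
        """Only V2R dialogues must continue past the violation to a repair."""
        return self.dialogue_type is DialogueType.V2R


# Illustrative values only, not drawn from the dataset.
pair = ScenarioSituation(
    scenario="A team meeting at work",
    situation="A junior colleague interrupts a senior colleague mid-sentence",
    norm_category="Respect",
    language="en",
    dialogue_type=DialogueType.V2R,
)
print(pair.dialogue_type.value, pair.requires_repair())
```

Keeping the label as an enum rather than a free-form string makes it harder for a generation pipeline to emit a dialogue type outside the three defined ones.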

3. Exemplar-Based Iterative Refinement

To address pragmatic mismatches in low-resource languages, we use an iterative refinement loop. The model retrieves semantically and structurally similar exemplars to guide the revision of scenarios and situations, ensuring cultural alignment.
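The loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: bag-of-words cosine similarity stands in for whatever semantic and structural similarity the framework actually uses, and `revise` and `score` are hypothetical callbacks standing in for the LLM revision step and the quality check. The threshold and round limit are likewise placeholders.

```python
import math
from collections import Counter
from typing import Callable, List


def _bow(text: str) -> Counter:
    """Bag-of-words vector; a crude stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(draft: str, exemplars: List[str], k: int = 2) -> List[str]:
    """Return the k exemplars most similar to the current draft."""
    query = _bow(draft)
    return sorted(exemplars, key=lambda e: cosine(query, _bow(e)), reverse=True)[:k]


def refine(
    draft: str,
    exemplars: List[str],
    revise: Callable[[str, List[str]], str],   # hypothetical LLM revision step
    score: Callable[[str], float],             # hypothetical quality check
    threshold: float = 0.9,                    # assumed stopping criterion
    max_rounds: int = 3,                       # assumed iteration cap
) -> str:
    """Revise the draft with retrieved exemplars until it scores well enough."""
    for _ in range(max_rounds):
        if score(draft) >= threshold:
            break
        draft = revise(draft, retrieve(draft, exemplars))
    return draft


# Toy usage with stub callbacks in place of model calls.
pool = [
    "greet the elder politely before speaking",
    "apologize for arriving late to dinner",
    "thank the host after the meal",
]
result = refine(
    "apologize to the host for arriving late",
    pool,
    revise=lambda d, ex: d + " [revised using: " + ex[0] + "]",
    score=lambda d: 1.0 if "[revised" in d else 0.0,
)
print(result)
```

Retrieval runs again on every round, so the exemplars can change as the draft moves closer to the target register.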

4. Multi-Turn Dialogue Generator

Refined scenarios are expanded into multi-turn dialogues (5-15 turns). Each utterance is annotated with norm adherence, speaker intent, and emotional state, grounded in dialogue act theory.
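A turn-level record with the three annotation layers, together with a check on the 5 to 15 turn range, might look like the sketch below. The field names and the allowed `norm_adherence` values are assumptions for illustration; the released annotation schema may use different labels.

```python
from dataclasses import dataclass
from typing import List

MIN_TURNS, MAX_TURNS = 5, 15                            # range stated in the text
ADHERENCE_LABELS = {"adhered", "violated", "repaired"}  # hypothetical label set


@dataclass
class Turn:
    speaker: str
    utterance: str
    norm_adherence: str   # one of ADHERENCE_LABELS (assumed)
    intent: str           # speaker intent, e.g. "apologize" (free text here)
    emotion: str          # emotional state, e.g. "embarrassed" (free text here)


def validate_dialogue(turns: List[Turn]) -> None:
    """Raise ValueError if the dialogue breaks the length or label constraints."""
    if not MIN_TURNS <= len(turns) <= MAX_TURNS:
        raise ValueError(f"expected {MIN_TURNS}-{MAX_TURNS} turns, got {len(turns)}")
    for i, turn in enumerate(turns):
        if turn.norm_adherence not in ADHERENCE_LABELS:
            raise ValueError(f"turn {i}: unknown label {turn.norm_adherence!r}")


# Illustrative five-turn dialogue; the content is made up.
dialogue = [
    Turn("A", "You're late again.", "adhered", "complain", "annoyed"),
    Turn("B", "So what? It's just a meeting.", "violated", "dismiss", "defensive"),
    Turn("A", "That came across as rude.", "adhered", "point out", "hurt"),
    Turn("B", "You're right, I'm sorry.", "repaired", "apologize", "regretful"),
    Turn("A", "Thanks, I appreciate that.", "adhered", "accept", "relieved"),
]
validate_dialogue(dialogue)
```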

Turn-level annotated dialogue example showing Norm Adherence, Speaker Reaction, and Explanation.

📊 Evaluation & Results

We evaluated our framework along three axes: Refinement Quality (RQ), Dialogue Quality (DQ), and Generalization Quality (GQ).

Refinement Quality (RQ)

Our exemplar-based refinement consistently improved generation quality in Korean and Chinese. In Korean, linguistic quality improved substantially (3.59 → 4.91), accompanied by gains in norm alignment and semantic fidelity.

Dialogue Quality (DQ)

Evaluations across six dimensions (Consistency, Naturalness, Relevance, Emotion Appropriateness, Norm Appropriateness, Scenario Coherence) showed that dialogues from Adherence and V2R categories achieved high ratings (avg. > 4.9). Human evaluations strongly correlated with LLM-based assessments (r > 0.9).
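For readers who want to run the same kind of agreement check on their own ratings, Pearson's r between paired human and LLM scores can be computed as below. The ratings in the usage example are placeholder values, not data from the paper, and the function name is our own.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Placeholder ratings for five dialogues (made up for illustration).
human_scores = [4.8, 4.2, 3.9, 4.9, 4.5]
llm_scores = [4.7, 4.0, 4.1, 5.0, 4.4]
print(round(pearson_r(human_scores, llm_scores), 3))
```

When most ratings sit near the top of the scale, r is driven by the few lower-rated items, so it helps to report it alongside the mean scores.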

Generalization Quality (GQ)

Models trained on NormGenesis data significantly outperformed baselines (NormDial, SODA). In A/B preference testing, GPT-4o-mini trained on our data was preferred over NormDial in 65% (English) and 75% (Chinese) of cases.
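A preference rate of this kind is the share of pairwise judgments won by one system. The sketch below shows one way to compute it. It assumes ties are excluded from the denominator, which the text does not specify, and the vote labels and counts are invented for illustration.

```python
from typing import Iterable


def preference_rate(votes: Iterable[str], system: str = "ours", other: str = "baseline") -> float:
    """Fraction of decided A/B judgments that favored `system`.

    Votes other than `system` or `other` (e.g. "tie") are ignored; this
    tie-handling rule is an assumption, not something stated in the source.
    """
    decided = [v for v in votes if v in (system, other)]
    if not decided:
        raise ValueError("no decided votes")
    return sum(v == system for v in decided) / len(decided)


# Made-up judgments: 13 wins, 7 losses, 2 ties.
votes = ["ours"] * 13 + ["baseline"] * 7 + ["tie"] * 2
print(f"{preference_rate(votes):.0%}")
```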

Impact of V2R

Models trained with Violation-to-Resolution (V2R) data were preferred in 82% of cases in ethically sensitive scenarios, indicating stronger empathy and a better ability to model norm repair.

Evaluation Scope

Cultures Covered

  • American
  • Chinese
  • Korean

BibTeX

@inproceedings{hong2025normgenesis,
  title={NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery},
  author={Hong, Minki and Choi, Jangho and Kim, Jihie},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={33781--33819},
  year={2025}
}