We evaluated our framework along three axes: Refinement Quality (RQ), Dialogue Quality (DQ),
and Generalization Quality (GQ).
Refinement Quality (RQ)
Our exemplar-based refinement consistently improved generation quality in Korean and
Chinese.
In Korean, linguistic quality improved substantially (3.59 → 4.91), accompanied by
gains in norm alignment and semantic fidelity.
Dialogue Quality (DQ)
Evaluations across six dimensions (Consistency, Naturalness, Relevance, Emotion
Appropriateness, Norm Appropriateness, Scenario Coherence) showed that dialogues from
Adherence and V2R categories achieved high ratings (avg. > 4.9).
Human evaluations strongly correlated with LLM-based assessments (r > 0.9).
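The section does not name the correlation coefficient; assuming Pearson's r computed over per-dialogue scores, the following minimal sketch shows how such human-LLM agreement could be checked. All scores and variable names are illustrative placeholders, not values from our evaluation.

```python
# Minimal sketch of a human-vs-LLM agreement check (assumed Pearson's r).
# All scores below are illustrative placeholders, not results from the paper.
from scipy.stats import pearsonr

# One score per dialogue, e.g., averaged over the six dimensions
# (Consistency, Naturalness, Relevance, Emotion Appropriateness,
# Norm Appropriateness, Scenario Coherence) on a 1-5 scale.
human_scores = [4.8, 4.9, 4.5, 5.0, 4.7, 4.9, 4.6, 5.0]
llm_scores = [4.9, 4.8, 4.4, 5.0, 4.8, 4.9, 4.7, 4.9]

r, p_value = pearsonr(human_scores, llm_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.4f})")
```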
Generalization Quality (GQ)
Models trained on NormGenesis datasets significantly outperformed baselines (NORM DIAL,
SODA).
In A/B preference testing, GPT-4o-mini trained on our data was preferred over
NORM DIAL in 65% of cases for English and 75% for Chinese.
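As a sketch of how a preference rate like this can be checked against chance, one could treat each A/B comparison as a binary outcome and apply a one-sided binomial test against the 50% baseline. The counts below are illustrative, not the raw tallies from our study.

```python
# Hypothetical sketch: is an A/B preference rate significantly above chance?
# Counts are illustrative placeholders, not the raw tallies from the evaluation.
from scipy.stats import binomtest

wins, total = 65, 100  # e.g., our model preferred in 65 of 100 English pairs

# One-sided test against the 50% no-preference baseline.
result = binomtest(wins, total, p=0.5, alternative="greater")
print(f"preference rate = {wins / total:.0%}, p = {result.pvalue:.4f}")
```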
Impact of V2R
Models trained with Violation-to-Resolution (V2R) data were preferred in 82% of
ethically sensitive scenarios, demonstrating superior empathy and a stronger
ability to model norm repair.