Generating Korean Image Captions using OCR in CoT Prompting

Published in HCLT-KACL, 2024

Overview

This project addresses the challenge of generating high-quality image captions in Korean, a language with relatively limited large-scale captioning resources. We integrate Optical Character Recognition (OCR) with Chain-of-Thought (CoT) prompting to improve captioning performance on text-rich images such as signs, labels, and documents.

Motivation

Most vision-language models underperform on Korean due to scarce training data. Text-rich images are especially problematic, as models often ignore or mistranslate embedded Korean text.

Approach

  • OCR Integration: Extract Korean text from images using OCR.
  • CoT Prompting: Incorporate OCR outputs into a reasoning chain before generating captions.
  • Grounded Captions: Generate captions that combine both visual content and recognized Korean text (a minimal sketch of this pipeline follows below).
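
The sketch below illustrates the general shape of such a pipeline, under assumptions not specified in the paper: Tesseract (via pytesseract) as the OCR engine, and a generic chat-style vision-language model as the caption generator. The prompt wording and file names are illustrative placeholders, not the paper's exact prompt or setup.

```python
# Illustrative sketch of an OCR + Chain-of-Thought captioning pipeline.
# Assumed components (not from the paper): pytesseract for OCR, and an
# unspecified vision-language model that would receive the final prompt.

from PIL import Image
import pytesseract


def extract_korean_text(image_path: str) -> str:
    """Run OCR on the image and return any recognized Korean text."""
    image = Image.open(image_path)
    # Tesseract needs the Korean language pack ("kor") installed.
    return pytesseract.image_to_string(image, lang="kor").strip()


def build_cot_prompt(ocr_text: str) -> str:
    """Compose a CoT prompt that grounds the caption in the OCR output."""
    return (
        "You will write a Korean caption for the attached image.\n"
        "Reason step by step before answering:\n"
        "1. Describe the main visual content of the image.\n"
        f'2. Consider the text recognized by OCR: "{ocr_text}"\n'
        "3. Decide how the recognized text relates to the scene.\n"
        "4. Write one fluent Korean caption that combines the visual "
        "content with the recognized text.\n"
        "Caption:"
    )


if __name__ == "__main__":
    ocr_text = extract_korean_text("sign_photo.jpg")  # hypothetical input image
    prompt = build_cot_prompt(ocr_text)
    print(prompt)
    # The prompt (together with the image) would then be sent to a
    # vision-language model of choice; that call is omitted here because
    # the paper's exact model and API are not specified.
```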

Results

Our method produces captions that are more accurate, more context-aware, and more linguistically faithful to Korean than those produced by baseline captioning approaches.

Impact

This work demonstrates how reasoning-based prompting and OCR integration can improve generative AI for low-resource languages, with potential applications in accessibility, digital archiving, and human-centered AI systems.


Recommended citation: Hong, M., Yun, Y., Park, S., & Kim, J. (2024). Generating Korean Image Captions using OCR in CoT Prompting. In Annual Conference on Human and Language Technology (pp. 165-168). Human and Language Technology.