FrontierNews.ai

Bengali AI Models Show Stark Hallucination Gaps: New Benchmark Exposes Reliability Crisis

Researchers have found that large language models (LLMs) hallucinate heavily when processing Bengali, the sixth most spoken language globally, with scores on a new hallucination benchmark ranging from 7.72% to 55.42% depending on the model and task. The evaluation framework, called BenHalluEval, exposes how AI models fabricate information, invent details, or contradict provided context when working with Bengali text and with Bangla-English code-mixed input, in which speakers mix both languages in conversation.

Hallucination in AI refers to model outputs that contradict established facts, fabricate details, or conflict with the context the model was given. For Bengali speakers, this problem has remained largely invisible until now. Although Bengali is spoken by hundreds of millions of people worldwide, no systematic evaluation of hallucination in Bengali-capable AI models existed before this research.

Why Should We Care About Bengali AI Hallucinations?

The consequences of AI hallucination in Bengali are real and potentially harmful. A medical chatbot that invents drug dosages, a legal assistant that cites non-existent court cases, or a tutoring system that presents incorrect mathematics can cause serious damage to users who rely on these tools. The problem becomes even more acute in low-resource language settings like Bengali, where training data is scarcer and model performance tends to degrade.

Researchers evaluated seven different AI models across three categories: reasoning-oriented models, multilingual models, and Bengali-specific models. The study tested these systems on four distinct tasks to measure their reliability.

What Tasks Did Researchers Test?

  • Generative Question Answering: Models were asked to answer questions based on provided context, testing whether they would invent information or stick to facts.
  • Bangla-English Code-Mixed QA: This tested how models handle mixed-language input, a common real-world scenario for Bengali speakers who switch between languages.
  • Summarization: Models were evaluated on their ability to summarize text accurately without adding false details.
  • Mathematical Reasoning: The framework tested whether models could perform logical reasoning tasks without hallucinating intermediate steps or conclusions.

The researchers constructed 12,000 hallucinated test cases across twelve different types of hallucination errors. They then measured two critical failure modes: how often models incorrectly flagged correct information as false (false positives) and how often they failed to detect actual hallucinations (missed detections).
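The two failure modes can be illustrated with a minimal Python sketch. The record layout (`Judgment`), the function name `failure_rates`, and the six sample verdicts are all hypothetical, invented for illustration; they are not taken from the BenHalluEval dataset or code.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One model verdict on one sample (hypothetical record layout)."""
    is_hallucinated: bool  # ground truth: was the sample fabricated?
    flagged: bool          # model verdict: did it call the sample hallucinated?


def failure_rates(judgments):
    """Return (false_positive_rate, missed_detection_rate)."""
    correct = [j for j in judgments if not j.is_hallucinated]
    fabricated = [j for j in judgments if j.is_hallucinated]
    if not correct or not fabricated:
        raise ValueError("need at least one correct and one hallucinated sample")
    # False positive: a correct sample that the model flagged as hallucinated.
    false_positive_rate = sum(j.flagged for j in correct) / len(correct)
    # Missed detection: a hallucinated sample that the model let through.
    missed_detection_rate = sum(not j.flagged for j in fabricated) / len(fabricated)
    return false_positive_rate, missed_detection_rate


# Made-up verdicts: two correct samples, four hallucinated ones.
judgments = [
    Judgment(is_hallucinated=False, flagged=False),
    Judgment(is_hallucinated=False, flagged=True),   # false positive
    Judgment(is_hallucinated=True, flagged=True),
    Judgment(is_hallucinated=True, flagged=False),   # missed detection
    Judgment(is_hallucinated=True, flagged=False),   # missed detection
    Judgment(is_hallucinated=True, flagged=True),
]
fpr, miss = failure_rates(judgments)
print(f"false-positive rate: {fpr:.2f}, missed-detection rate: {miss:.2f}")
```

Keeping the two rates separate shows which way a model errs: one that flags everything has no missed detections but a very high false-positive rate, and the reverse holds for one that flags nothing.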

How Can Organizations Improve Bengali AI Reliability?

  • Implement Dual-Track Evaluation: Use BenHalluEval's approach of independently measuring false-positive rates on correct information and hallucination detection rates on fabricated content, rather than relying on single-metric assessments.
  • Test Chain-of-Thought Prompting Carefully: While chain-of-thought reasoning can help with multi-step inference, the research shows it does not consistently improve a model's ability to distinguish hallucinated from correct content in Bengali tasks.
  • Benchmark Across Model Categories: Evaluate reasoning-oriented, multilingual, and language-specific models separately, as they show different hallucination patterns on Bengali content.
  • Include Code-Mixed Scenarios: Test models on Bangla-English mixed input, which reflects how Bengali speakers actually use AI in real-world applications.

The study introduced a new metric, BenHalluScore, that jointly penalizes both types of failure so that a model cannot inflate its score by defaulting to one answer regardless of content. Under this metric, models varied substantially across tasks, with scores ranging from 7.72% to 55.42%.
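The article does not give the formula for BenHalluScore, so the Python sketch below does not reproduce it. It only illustrates the general idea of a joint penalty, using an assumed stand-in: the harmonic mean of the rate at which correct content is accepted and the rate at which hallucinations are detected. The function name `joint_score` and all input values are hypothetical.

```python
def joint_score(false_positive_rate, missed_detection_rate):
    """Harmonic mean of the two success rates.

    An assumed stand-in for a joint penalty, NOT the BenHalluScore formula.
    """
    accept_correct = 1.0 - false_positive_rate          # correct content left alone
    detect_hallucination = 1.0 - missed_detection_rate  # fabrications caught
    total = accept_correct + detect_hallucination
    if total == 0.0:
        return 0.0
    return 2.0 * accept_correct * detect_hallucination / total


# A model that calls everything hallucinated catches every fabrication
# but rejects every correct sample.
always_flags = joint_score(false_positive_rate=1.0, missed_detection_rate=0.0)

# A model that never flags anything has the opposite problem.
never_flags = joint_score(false_positive_rate=0.0, missed_detection_rate=1.0)

# A model that is moderately good at both tasks.
balanced = joint_score(false_positive_rate=0.3, missed_detection_rate=0.3)

# For comparison: a plain average of the two success rates would still
# give the always-flag model half marks.
naive_average = ((1.0 - 1.0) + (1.0 - 0.0)) / 2

print(always_flags, never_flags, round(balanced, 2), naive_average)
```

Under this stand-in, both degenerate models score zero, while a plain average would award them 0.5. That is the kind of inflation a joint metric is meant to prevent.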

What Did Chain-of-Thought Prompting Reveal?

Chain-of-thought prompting is a technique in which AI models are asked to show their reasoning step by step before providing an answer. Researchers tested it as a potential strategy for reducing hallucinations in Bengali tasks. The results were sobering: chain-of-thought prompting shifted how models distributed their responses without consistently improving their ability to distinguish hallucinated content from accurate information.

This finding challenges a common assumption in AI development: that asking models to reason through problems step by step automatically makes them more reliable. The research suggests that for Bengali and other low-resource languages, the benefit of chain-of-thought reasoning depends heavily on the specific task and model architecture.
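The difference between a direct prompt and a chain-of-thought prompt can be sketched in Python as follows. The prompt wording, the function name `build_prompt`, and the sample question are illustrative assumptions and are not the prompts used in the study.

```python
def build_prompt(context, question, answer, chain_of_thought=False):
    """Build a hallucination-detection prompt (illustrative wording only)."""
    lines = [
        f"Context: {context}",
        f"Question: {question}",
        f"Candidate answer: {answer}",
    ]
    if chain_of_thought:
        # Chain-of-thought variant: ask for the reasoning before the verdict.
        lines.append(
            "Reason step by step about whether the candidate answer "
            "is supported by the context."
        )
        lines.append("Then state whether it is hallucinated: Yes or No.")
    else:
        # Direct variant: ask for the verdict only.
        lines.append("Is the candidate answer hallucinated? Reply with only Yes or No.")
    return "\n".join(lines)


# Made-up sample: the context says Dhaka is the capital of Bangladesh,
# while the candidate answer names Chattogram instead.
sample = {
    "context": "ঢাকা বাংলাদেশের রাজধানী।",
    "question": "বাংলাদেশের রাজধানী কী?",
    "answer": "চট্টগ্রাম",
}
direct_prompt = build_prompt(**sample)
cot_prompt = build_prompt(**sample, chain_of_thought=True)
print(direct_prompt)
print("---")
print(cot_prompt)
```

Only the final instruction changes between the two variants, so any difference in a model's verdicts can be attributed to the request for step-by-step reasoning rather than to the content being judged.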

The BenHalluEval framework and its associated dataset are now publicly available, providing researchers and developers with the first dedicated hallucination benchmark for Bengali. This resource addresses a critical gap in AI evaluation for one of the world's most widely spoken languages, potentially spurring improvements in Bengali-capable AI systems across medical, legal, educational, and other high-stakes applications.