The authors introduce PunGPT2, the first open-source Punjabi generative model suite, trained on a 35GB corpus and optimized for Gurmukhi and Shahmukhi scripts. They propose Pun-RAG, a retrieval-augmented framework, and Pun-Instruct, an instruction-tuned variant, to enhance Punjabi language generation. Their key innovation, Quantum-RAG, fuses sparse, dense, and quantum kernel embeddings, achieving state-of-the-art results on FLORES-200, IndicGenBench, and a new PunjabiEval suite, demonstrating improved retrieval and generation performance compared to multilingual baselines.
Quantum-RAG achieves a 7.4% improvement in Recall@10 over standard FAISS by fusing sparse, dense, and quantum kernel embeddings, opening the door to efficient low-resource retrieval.
Despite rapid advances in large language models (LLMs), low-resource languages remain largely excluded from mainstream NLP, limiting digital access for millions. We present PunGPT2, the first fully open-source Punjabi generative model suite, trained on a 35GB corpus covering literature, religious texts, news, and social discourse. PunGPT2 captures Punjabi's syntactic and morphological richness through a tokenizer optimized for Gurmukhi and Shahmukhi scripts. We introduce Pun-RAG, a retrieval-augmented framework integrating PunGPT2 with a FAISS retriever over a curated Punjabi knowledge base, and Pun-Instruct, an instruction-tuned variant using QLoRA for robust zero-shot summarization, translation, and question answering. Our key innovation, Quantum-RAG, fuses sparse, dense, and quantum kernel embeddings for efficient, context-aware retrieval with low memory overhead, marking the first practical quantum-inspired retrieval in a low-resource LLM. Our models outperform multilingual baselines (mBERT, mT5, MuRIL, BLOOM) on FLORES-200, IndicGenBench, and a new PunjabiEval suite. Quantum-RAG yields +7.4 Recall@10 over FAISS and +3.5 BLEU over mT5 on PunjabiEval. We publicly release all training scripts, hyperparameters, evaluation pipelines, the 35GB Punjabi corpus, the PunjabiEval benchmark, and all model weights, establishing new state-of-the-art results for Punjabi language generation and retrieval.
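The abstract does not specify how Quantum-RAG combines its three signals. As a minimal, hedged sketch of what late fusion over sparse, dense, and quantum-kernel similarities could look like (the fusion weights, the RBF kernel standing in for a quantum fidelity kernel, and all function names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def quantum_kernel_sim(q, d, gamma=1.0):
    # RBF kernel used here as a classical stand-in for a quantum
    # fidelity kernel |<phi(q)|phi(d)>|^2 (illustrative proxy only).
    return float(np.exp(-gamma * np.sum((q - d) ** 2)))

def fused_ranking(query_sparse, query_dense, doc_sparse, doc_dense,
                  weights=(0.3, 0.4, 0.3)):
    """Rank documents by a weighted sum of sparse (lexical), dense
    (cosine), and quantum-kernel similarities. Weights are assumed."""
    w_s, w_d, w_q = weights
    scores = []
    for ds, dd in zip(doc_sparse, doc_dense):
        s_sparse = float(query_sparse @ ds)  # e.g. TF-IDF/BM25-style overlap
        s_dense = float(query_dense @ dd) / (
            np.linalg.norm(query_dense) * np.linalg.norm(dd) + 1e-9)
        s_quant = quantum_kernel_sim(query_dense, dd)
        scores.append(w_s * s_sparse + w_d * s_dense + w_q * s_quant)
    # Indices of documents, best-scoring first
    return list(np.argsort(scores)[::-1])

# Toy usage: two documents, the first nearly identical to the query
q_sparse = np.array([1.0, 0.0])
q_dense = np.array([1.0, 0.0])
docs_sparse = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
docs_dense = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ranking = fused_ranking(q_sparse, q_dense, docs_sparse, docs_dense)
```

In this toy example the first document matches the query under all three signals, so it ranks first; in practice the sparse and dense scores would come from separate indexes (e.g. BM25 and FAISS) and be normalized before fusion.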