Search papers, labs, and topics across Lattice.
100 papers published across 3 labs.
LLM-derived abstractions significantly boost analogical reasoning in narratives, outperforming end-to-end LLMs and revealing the critical role of appropriate abstraction levels.
Physiological synchrony in medical teams doesn't always signal success; it's the *context* of shared discovery versus shared uncertainty that determines whether it predicts effective collaboration.
Even Gemini can understand you if you speak its language: structured intent prompting slashes cross-language performance variance and boosts weaker models more than stronger ones.
Forget complex LLMs: a small, fine-tuned transformer surprisingly nails readability scoring for German ESG reports.
Automated medical coding finally gets explainable: Symphony's agentic approach provides span-level evidence, linking each predicted code to the supporting text.
Representing probability distributions with first-order logic formulas can drastically reduce their size, offering a path to more efficient probabilistic reasoning.
Stop guessing which layers to edit in your LLM – KEditVis reveals the inner workings of knowledge editing, letting you pinpoint the most effective interventions.
LLMs don't just make people confidently wrong; they create a dangerous illusion of competence by decoupling performance from actual understanding.
LLMs can steer narrative extraction to align with user-specified perspectives, achieving a 9.9% improvement in agenda alignment over keyword matching without sacrificing narrative coherence.
Interactive narrative maps with semantic interaction significantly boost insight generation compared to static maps and timelines, offering a more intuitive path to model refinement.
Human brains and neural networks may converge on similar "Platonic" representations for linguistic constructions, suggesting universal principles guide efficient language abstraction.
Bilingual language models can achieve performance comparable to monolingual models in both languages, challenging the assumption that bilingual input poses significant learning obstacles.
Training language models on individual children's language reveals that distributional and interactional linguistic features, not just dataset size, are key to efficient learning, mirroring factors that drive child language acquisition.
Enriching meaning representations with task demonstrators can significantly boost dialogue generation, especially in challenging scenarios, revealing a simple yet effective strategy for improving NLG performance.
Multilingual vision-language models can achieve surprisingly strong performance (36% on MMMU) simply by training on translated data and aligning with parallel text corpora.
Forget fine-tuning: this HTR model adapts to new handwriting styles in just a few shots, *without* any parameter updates.
News agencies reuse content across languages far more than simple lexical overlap reveals, with over half of articles drawing on foreign sources through paraphrase and compositional techniques.
LLMs can nail the clinical content of prior authorization letters, but consistently fumble the administrative details that actually get them approved.
AI benchmarks may be giving you a false sense of comprehensive evaluation: the six scores on the Open LLM Leaderboard effectively boil down to just two independent measurements.
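One way to sanity-check a claim like this is to run PCA over a models-by-benchmarks score matrix and see how much variance the first two components capture. The sketch below is purely illustrative: the six-benchmark setup, the synthetic scores, and the noise level are all assumptions, not the paper's actual data or analysis.

```python
# Illustrative only: how many principal components explain the variance in a
# models x benchmarks score matrix? The data here is synthetic by construction
# (two latent skills driving six benchmark scores), not real leaderboard data.
import numpy as np

rng = np.random.default_rng(0)

latent = rng.normal(size=(50, 2))                 # two underlying abilities per model
loadings = rng.normal(size=(2, 6))                # how each benchmark mixes them
scores = latent @ loadings + 0.05 * rng.normal(size=(50, 6))

centred = scores - scores.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centred, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
print("cumulative explained variance:", np.cumsum(explained).round(3))
# If the first two entries are close to 1.0, the six scores carry ~2 dimensions.
```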
Forget prompt engineering – Nomad autonomously uncovers insights you didn't even know to ask for.
LLMs used in matchmaking amplify existing caste hierarchies, rating same-caste matches significantly higher and perpetuating social biases in potentially harmful ways.
Accurately predict how customers will react to price changes, even without controlled experiments, using a new Monodense neural network that beats traditional methods.
NeuralUCB can slash LLM inference costs while maintaining quality, offering a practical alternative to always using the biggest, most expensive models.
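The routing idea can be sketched with a bare-bones bandit: treat "call the small model" and "call the large model" as arms and trade off estimated answer quality against cost. The sketch below uses plain UCB1 with simulated rewards, not the paper's NeuralUCB (which adds a neural reward model and a gradient-based confidence bonus); the model names, costs, and quality numbers are assumptions for illustration only.

```python
# Illustrative only: a UCB1 bandit choosing between a cheap and an expensive
# model per request. Rewards are simulated stand-ins for quality minus cost.
import math
import random

ARMS = ["small-model", "large-model"]
COST = {"small-model": 0.1, "large-model": 1.0}   # hypothetical per-call cost

counts = {a: 0 for a in ARMS}
totals = {a: 0.0 for a in ARMS}

def simulated_reward(arm: str) -> float:
    """Stand-in for (answer quality - cost penalty); not real measurements."""
    quality = 0.7 if arm == "small-model" else 0.9
    return quality + random.gauss(0, 0.05) - 0.2 * COST[arm]

for t in range(1, 1001):
    untried = [a for a in ARMS if counts[a] == 0]
    if untried:
        arm = untried[0]                           # play each arm once first
    else:
        arm = max(ARMS, key=lambda a: totals[a] / counts[a]
                  + math.sqrt(2 * math.log(t) / counts[a]))
    reward = simulated_reward(arm)
    counts[arm] += 1
    totals[arm] += reward

print({a: counts[a] for a in ARMS})                # how often each model was used
```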
Throw out your full images: focusing on pathology-relevant visual patches dramatically outperforms using the entire image for radiology report summarization.
Northern Kurdish finally gets its due with FLEURS-Kobani, a new benchmark dataset that exposes the challenges and opportunities for ASR and speech translation in this under-resourced language.
Global speech slowing, a common strategy for improving intelligibility, is outperformed by targeted, data-driven speech rate adjustments that listeners don't even consciously notice.
Knowing the context around a claim—gleaned from Wikipedia—can boost verifiable claim detection, but the benefit depends heavily on the domain and model used.
Training NER models on modern Italian won't cut it for historical texts: ENEIDE exposes the performance gap with a new multi-domain dataset spanning two centuries.
Forget expensive finetuning: DUME dynamically combines existing expert LLMs into a powerful MoE *without* additional training, unlocking multi-domain performance at minimal cost.
Forget SEO: optimizing content *structure* alone boosts citation rates in generative AI search engines by 17%.
You can shrink a privacy expert LLM by 4500x and still get human-level privacy judgments.
LLM-generated authorial impersonations, despite their sophistication, are surprisingly detectable by existing authorship verification methods, which in some cases perform even better on them than on genuine negative samples.
Forget fancy ensembling – simply asking an LLM how confident it is in its grading is the most reliable way to predict its accuracy, and it's far cheaper than self-consistency voting.
LLMs can classify dialects with surprising accuracy when given linguistic hints, suggesting a new way to leverage their knowledge for low-resource language tasks.
LLMs may ace English, but LLM Probe reveals surprising performance disparities in low-resource languages, with sequence-to-sequence models unexpectedly leading in morphosyntax.
Radiology report generation models can now verbalize calibrated confidence estimates, enabling targeted radiologist review of potentially hallucinated findings.
Mental-health support chatbots get a much-needed reality check with CounselReflect, a toolkit that exposes their strengths and weaknesses through transparent, multi-dimensional audits.
Forget finetuning or embeddings: better topic models are lurking in your corpus's own co-occurrence stats.
LLMs ace linguistic benchmarks, but a token-level perplexity analysis reveals they're often relying on the wrong cues.
Adapting Labovian narrative analysis to Japanese reveals the challenges and opportunities in cross-linguistic qualitative research, highlighting the need for language-specific guidelines.
LLMs struggle to handle common, challenging patient behaviors like contradictory statements and inaccurate medical information, revealing critical safety gaps in medical consultation applications.
Unlock knowledge equity for underserved languages: L-ReLF offers a reproducible recipe for creating high-quality lexical datasets where they're needed most.
Despite its simple grammar, Esperanto translation still poses challenges for LLMs, with NLLB models only preferred in about half of human evaluations.
Japanese entity linking gets a boost: CADEL offers a high-quality, Japan-specific corpus to tackle the unique challenges of linking entities in administrative web documents.
LLMs can achieve state-of-the-art multilingual speech recognition by smartly handling noisy phoneme inputs, even with severe data imbalance across languages.
Forget slow, bloated LLMs – this work shows you can get GPT-4o quality on long-document QA with a 3B model and a clever structure-first distillation approach.
Proprietary language models trounce open-source alternatives by 3-6x on a new, large-scale corpus of Sinhala and Pali Buddhist texts.
The first publicly available dataset for Syrian Arabic Sign Language (SyArSL) opens the door for machine translation research to improve accessibility for a historically underserved community.
GPT-4 can automatically generate FSMs from textual requirements, but expert-guided mutation and testing are crucial for repairing imperfections.
A human-in-the-loop AI assistant can provide scalable, high-quality coding education support in resource-constrained African contexts, even with limited infrastructure.
LLMs can better capture human semantic similarity by predicting sets of related concepts instead of single next tokens.
LLMs still struggle to accurately infer user interests from interaction histories, especially when dealing with diverse engagement signals – a critical gap for effective personalization.
Smart hospital research is converging towards integrated ecosystems where AI, trust, and infrastructure reinforce each other, but real-world implementation and governance are lagging.
The EU's Digital Services Act aims to empower Trusted Flaggers to combat harmful online content, yet they struggle with accreditation hurdles, resource scarcity, and conflicting platform priorities, raising serious questions about the DSA's practical effectiveness.
Instructors and students are often on different planets when it comes to understanding why cheating happens in CS courses.
Simply injecting GenAI into online learning discussions doesn't cut it; reciprocal exchange and human oversight are key to boosting social presence and higher-order cognition.
Bridging TradFi and DeFi asset tokenization requires more than just technology – it demands a standardized regulatory framework, and this paper delivers one.
LLMs can now reproduce Android app bugs with 87% accuracy, thanks to pre-assessing the visual effects of UI actions.
Stop optimizing LLM logs for human readability – runtime-guided, task-oriented logs dramatically improve downstream debugging performance.
Proving that erasing "erasable" function arguments preserves program behavior opens the door to more efficient and verifiable code optimization.
Surgical VQA gets a major upgrade: SurgTEMP's hierarchical visual memory and competency-based training leapfrog existing models in understanding complex, time-sensitive surgical procedures.
By injecting LLM-derived contextual cues into skeleton representations, SkeletonContext achieves state-of-the-art zero-shot action recognition, even distinguishing visually similar actions without explicit object interactions.
Gaze, often overlooked, reveals deepfake origins with surprising accuracy, enabling a new CLIP-based approach that significantly boosts deepfake attribution and detection.
Mitigating bias in deep learning models is now possible without needing sensitive protected attribute information, opening doors for fairer AI in privacy-conscious applications.
Negation, a known weakness in VLMs like CLIP, can be dramatically improved by strategically fine-tuning only the *front* layers of the text encoder with a modified contrastive loss.
The term XR owes its widespread use not to "Extended Reality" but to its neutrality as a symbolic container for VR, AR, and MR.
Current multimodal dialogue models struggle to capture the nuanced expressiveness of human interaction, but a new dataset and benchmark reveal exactly where they fall short.
An AI agent can now autonomously design functional antibodies with nanomolar affinities from text prompts, achieving a 67% success rate in lab validation and accelerating expert workflows by 56x.
Ditching mel-spectrograms unlocks surprisingly better text-to-speech, as LongCat-AudioDiT proves that waveform latent diffusion can beat the state-of-the-art in zero-shot voice cloning.
Arabic mispronunciation detection just got a whole lot better: F1-scores jumped by 0.28 thanks to novel architectures and a new dataset of authentic mispronunciations.
Generative recommendation's touted cold-start abilities often vanish under rigorous testing, revealing a sensitivity to design choices that current benchmarks fail to capture.
Generative recommendation models can adapt to evolving user behavior without catastrophic forgetting by selectively updating item tokens based on a novel drift-detection mechanism.
Single-vector embeddings' retrieval failures aren't just about dimensionality; they're fundamentally hobbled by domain shift, relevance misalignment, and a "drowning" effect that multi-vector models handle far better.
Stakeholder-agnostic requirements engineering in aged-care tech can lead to misalignment and missed priorities, as developers, caregivers, and older adults often disagree on what matters most.
Open-source projects are quietly integrating ML models in ways that may violate terms of service and regulations, raising concerns about unchecked ML automation.
Gumbel watermarks just got a whole lot harder to evade: a new detection method is provably near-optimal.
Bounded context windows in next-token prediction models can be fundamentally incompatible with low adversarial regret, even with long context lengths.
Escape the confines of linear literature reviews: this multi-agent system surfaces hidden connections and ruptures in research landscapes, revealing insights that traditional methods miss.
Spectral analysis of graph neighborhoods reveals a surprisingly effective and efficient way to boost anomaly detection, consistently outperforming existing GNN-based methods.
Transformers can now predict with an explicit internal structure of uncertainty, enabling stronger probabilistic evaluation and a more informative analysis of model behavior.
Transformers can now dynamically adapt expert weighting in online learning, achieving state-of-the-art dynamic regret in non-stationary environments.
Unconstrained bandit linear optimization can be surprisingly reduced to standard online linear optimization using a perturbation approach, unlocking new regret guarantees and high-probability bounds.
Unlock hidden predictive power: NLP on unstructured clinical notes beats traditional EHR data for early disease prediction.
Diffusion Maps alone fail to directly recover low-dimensional charts and require combining multiple modes, challenging their common perception as a drop-in dimensionality reduction technique.
Achieve near state-of-the-art OCR accuracy with 95% less compute by decoupling character detection from language correction and training the language model on synthetic noise alone.
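A rough sketch of the "train the language model on synthetic noise alone" half of that recipe: corrupt clean text with character-level substitutions, deletions, and insertions, then use the resulting (noisy, clean) pairs as supervision for a correction model. The confusion table, noise rates, and sample text below are placeholders, not the paper's actual settings.

```python
# Illustrative only: building (noisy, clean) training pairs for an OCR
# correction model from synthetic character noise, so the language model
# never needs real OCR output during training.
import random

CONFUSIONS = {"o": "0", "l": "1", "e": "c", "m": "rn", "i": "l", "0": "o"}

def corrupt(text: str, sub_p: float = 0.05, del_p: float = 0.02, ins_p: float = 0.02) -> str:
    out = []
    for ch in text:
        r = random.random()
        if r < del_p:
            continue                               # drop the character
        if r < del_p + sub_p:
            out.append(CONFUSIONS.get(ch, ch))     # visually similar swap
        else:
            out.append(ch)
        if random.random() < ins_p:
            out.append(random.choice("abcdefghijklmnopqrstuvwxyz "))
    return "".join(out)

clean = "the committee will meet on monday morning"
pairs = [(corrupt(clean), clean) for _ in range(3)]
for noisy, target in pairs:
    print(f"{noisy!r} -> {target!r}")
```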
LLMs can now construct high-fidelity, disease-specific knowledge graphs from full-text biomedical literature, unlocking evidence-aware reasoning and hypothesis generation.
Semantic disagreement between LLMs reveals crucial uncertainty that single-model metrics miss, and Collaborative Entropy (CoE) captures it.
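One plausible reading of a cross-model disagreement score, sketched below: pool sampled answers from several models, group semantically equivalent ones, and compute the entropy of the pooled cluster distribution, so disagreement between models shows up as extra entropy. This is not the paper's definition of Collaborative Entropy, and the string-normalization clustering stands in for a real semantic-equivalence check; the example answers are hypothetical.

```python
# Illustrative only: entropy over answer clusters pooled across two models'
# samples, as a stand-in for a cross-model uncertainty signal. NOT the paper's
# CoE definition; normalization-based grouping replaces semantic matching.
import math
from collections import Counter

def normalize(answer: str) -> str:
    return " ".join(answer.lower().strip().rstrip(".").split())

def pooled_entropy(answer_sets: list[list[str]]) -> float:
    pooled = Counter(normalize(a) for answers in answer_sets for a in answers)
    total = sum(pooled.values())
    return -sum((c / total) * math.log2(c / total) for c in pooled.values())

model_a = ["Paris", "paris", "Paris."]             # hypothetical samples
model_b = ["Paris", "Lyon", "Paris"]               # second model disagrees once

print(round(pooled_entropy([model_a]), 3))             # single-model entropy: 0.0
print(round(pooled_entropy([model_a, model_b]), 3))    # cross-model entropy is higher
```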
Data literacy isn't monolithic: K-12 learners navigate wildly different learning pathways depending on the context, challenging assumptions about a one-size-fits-all approach.
Gemini 3 Flash can answer introductory programming questions better than typical educators, suggesting a path to scalable, personalized feedback in CS1 courses.
LLMs can better adapt to diverse preferences by explicitly separating stable personal traits from situational factors, leading to significant performance gains, especially when preferences shift across episodes.
Retail AI's promise of intuitive, personalized experiences crumbles when confronted with the reality of differently abled users, exposing a systemic neglect of accessibility in design and deployment.
Open-source document parsing models are shockingly brittle, losing nearly 18% accuracy on real-world photos and 14% on non-Latin scripts compared to their closed-source counterparts.
VLMs can unlock insights from troves of historical documents previously inaccessible due to OCR limitations, achieving state-of-the-art transcription and speaker tagging of Italian parliamentary speeches.
Unlock richer, more realistic agent simulations by moving beyond individual personas to unified group representations that capture collective behavior.
Instead of forcing a single interpretation, this work embraces the inherent ambiguity of natural language to generate multiple plausible STL formulas from a single NL task description.
Even a small, targeted dataset can bridge the gap in cross-dialect transfer learning for low-resource languages, significantly boosting dependency parsing accuracy.
LLMs' struggles with non-standard languages aren't just a technical problem, but reflect and reinforce historical power imbalances embedded in linguistic standardization.
LLMs can now reliably transform messy app store reviews into well-formatted user stories, but still fall short of creating truly independent and unique requirements for agile development.
Atomic decomposition, a popular technique for LLM judges, may not be superior to holistic evaluation when prompts are carefully controlled, challenging the assumption that breaking down answers into claims is always beneficial.
You can now unmask LLM ghostwriters with a lightweight fingerprinting method that works even when they try to hide in new domains or use unseen models.