May 1 – May 8, 2026

Natural Language Processing - Weekly Roundup

100 papers published across 7 labs.

Selected Labs publishing this week

Tsinghua AI (1), MIT CSAIL (1), CMU ML (1), ETH (1), DAMO (1)

Top Papers

May 6, 2026

Department of Mathematics·2w ago·also Georgia Tech, Purdue, School of Mathematics

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

Transformers can be explicitly designed to perform nonlinear regression in-context by leveraging attention as a featurizer, offering a theoretical understanding of how these models learn complex relationships from prompts.

Alexander Hsu, Zhaiming Shen, Wenjing Liao +1

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Scaling Laws & Emergent Abilities

Independent Researcher·2w ago

PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation

Synthetic data augmentation and per-language threshold tuning can significantly boost the performance of LLMs on multilingual tasks, outperforming alternative architectures that showed promise on the development set.

Srikar Kashyap Pulipaka

Data Curation & Synthetic Data Natural Language Processing Open-Source Models & Weights

Freyaa Chawla +4·2w ago

Human-AI Co-Mentorship in Project-Based Learning: A Case Study in Financial Forecasting

AI co-mentorship lets high schoolers build real-world financial models, skipping the classroom grind and diving straight into problem-solving.

Freyaa Chawla, Ahan Chawla, Rishi Singh +2

Natural Language Processing Tool Use & Agents

University of Tennessee·2w ago

Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

Hallucination detection can be reframed as a dynamical systems problem, enabling a surprisingly effective and efficient black-box approach that avoids expensive sampling or external knowledge retrieval.

Dan Wilson, Mohamed Akrout

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

Computer Science Department·2w ago·also Department of Critical Care Medicine, Pitt

Conditional outlier detection for clinical alerting

Anomaly detection in EHR data can pinpoint potentially erroneous clinical decisions with surprisingly low false alarm rates, suggesting a practical pathway to improve patient safety.

Milos Hauskrecht, Michal Valko, Shyam Visweswaran +3

Natural Language Processing Scientific Discovery & Drug Design

All Papers (100)

May 6, 2026

Arthur Gretton +5·2w ago

On the Wasserstein Gradient Flow Interpretation of Drifting Models

GMD algorithms, previously seen as a novel generative framework, can be understood as directly targeting fixed points of Wasserstein Gradient Flows, offering a new perspective on their optimization process.

Arthur Gretton, Li Kevin Wenliang, Alexandre Galashov +3

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Training Efficiency & Optimization

Xiaoyu Jiang +4·2w ago

Transformed Latent Variable Multi-Output Gaussian Processes

Modeling 10,000+ correlated outputs is now tractable: T-LVMOGP offers a scalable alternative to restrictive low-rank MOGPs by learning a flexible deep kernel in a shared embedding space.

Xiaoyu Jiang, Xinxing Shi, Sokratia Georgaka +2

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Training Efficiency & Optimization

Olivia Jullian Parra +9·2w ago

Joint Treatment Effect Estimation from Incomplete Healthcare Data: Temporal Causal Normalizing Flows with LLM-driven Evolutionary MNAR Imputation

LLMs can now impute missing healthcare data well enough to improve causal treatment effect estimation from real-world EHRs, even with 80% missingness.

Olivia Jullian Parra, Sara Zoccheddu, David Catalan Cerezo +7

Natural Language Processing Scientific Discovery & Drug Design

Andreas Pattichis +1·2w ago

Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics

Forget rigid memory structures: Memini lets your LLM's external knowledge evolve organically, learning and forgetting like a brain.

Andreas Pattichis, Constantine Dovrolis

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Recommendation & Information Retrieval

2w ago

Scalable inference of spatial regions and temporal signatures from time series

Discovering spatial regions and their temporal signatures in massive time series data just got much faster and easier, thanks to a new method that scales log-linearly with the number of time series.

Jiayu Weng, Alec Kirkley

Computer Vision Natural Language Processing Scientific Discovery & Drug Design

Andrea Napoli +1·2w ago

Order Matters: Improving Domain Adaptation by Reordering Data

Training data order matters more than you think: reordering your data can significantly improve unsupervised domain adaptation by reducing variance in domain discrepancy estimates.

Andrea Napoli, Paul White

Data Curation & Synthetic Data Natural Language Processing Training Efficiency & Optimization

Antonin Berthon +2·2w ago

Skill Neologisms: Towards Skill-based Continual Learning

Forget fine-tuning: "skill neologisms"—new soft tokens—let you inject skills into LLMs without weight updates, composing them zero-shot for flexible knowledge expansion.

Antonin Berthon, Nicolas Astorga, Mihaela van der Schaar

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Training Efficiency & Optimization

2w ago·also UPenn

Conceptors for Semantic Steering

Steering LLMs with conceptors—soft projection matrices capturing the full semantic subspace—yields more robust control and enables Boolean logic for composing concepts, moving beyond the limitations of single-vector steering.

Ilias Triantafyllopoulos, Young-Min Cho, Ren Tao +6

Interpretability & Mechanistic Interp Natural Language Processing

2w ago

Delving into Non-Exchangeability for Conformal Prediction in Graph-Structured Multivariate Time Series

Conformal prediction for graph time series doesn't have to break down: by conditioning on low-frequency trends, you can restore exchangeability and get valid uncertainty estimates.

Ruichao Guo, Xingyao Han, Luo Wenshui +3

Natural Language Processing Scientific Discovery & Drug Design

Institute of Science Tokyo·2w ago

A Foundation Model for Zero-Shot Logical Rule Induction

Forget retraining: this model learns interpretable logical rules from data in a zero-shot manner by encoding literals with domain-agnostic statistical properties.

Yin Jun Phua

Interpretability & Mechanistic Interp Natural Language Processing Reasoning & Chain-of-Thought

Tsinghua AI·2w ago·also SEU, Siemens AI

Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning

Tabular data synthesis no longer needs to sacrifice privacy for quality: pretraining on diverse datasets lets models generalize from limited context, breaking the traditional tradeoff.

Xinyan Han, Yan Lu, Xiaoyu Lin +5

Data Curation & Synthetic Data Natural Language Processing

Dominik Dahlem +2·2w ago

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

Symmetric spectral analysis of attention is fundamentally blind to information flow direction, but a simple asymmetry coefficient can restore the signal.

Dominik Dahlem, Diego Maniloff, Mac Misiura

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Natural Language Processing

Yangchen Yu +7·2w ago

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Standard multimodal fusion can hurt performance in emotion recognition, but this new approach knows when to drop modalities, leading to state-of-the-art results.

Yangchen Yu, Qian Chen, Jia Li +5

Multimodal Models Natural Language Processing Speech & Audio

Soyoung park +2·2w ago

Quantile-Free Uncertainty Quantification in Graph Neural Networks

GNN uncertainty just got a whole lot easier: QpiGNN delivers better coverage and tighter intervals without the quantile gymnastics.

Soyoung park, Hwanjun Song, Sungsu Lim

Natural Language Processing Scientific Discovery & Drug Design

Wenjing Liu +2·2w ago

A Biased Nonnegative Block Term Tensor Decomposition Model for Dynamic QoS Prediction

Overcome limitations in capturing complex user-service dependencies with a novel tensor decomposition method that significantly boosts QoS prediction accuracy.

Wenjing Liu, Yujia Lei, Qu Wang

Natural Language Processing Recommendation & Information Retrieval

National Central University·2w ago·also National Dong Hwa University, Universitas Negeri Yogyakarta

Cognitive Twins: Investigating Personalized Thinking Model Building and Its Performance Enhancement with Human-in-the-Loop

LLMs can construct interpretable, multi-layered models of individual student cognition from journal entries, opening new possibilities for personalized education.

Wu-Yuin Hwang, Nur Alif Ilyasa, Muhammad Irfan Luthfi +1

Interpretability & Mechanistic Interp Natural Language Processing Tool Use & Agents

V. Srinivasan +3·2w ago

Gyan: An Explainable Neuro-Symbolic Language Model

Forget opaque transformers: Gyan offers SOTA language modeling with full interpretability, lower compute, and human-like compositional understanding.

V. Srinivasan, Vishaal Jatav, A. Chandrababu +1

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Natural Language Processing

Leon Witt +4·2w ago

Knowledge-Free Correlated Agreement for Incentivizing Federated Learning

Incentivizing honest participation in federated learning is now possible without ground truth labels, even when some participants are trying to game the system.

Leon Witt, T. Abbaslı, Kentaroh Toyoda +2

Distributed Systems & Hardware Natural Language Processing Training Efficiency & Optimization

Ziang Chen +4·2w ago

Almost-Orthogonality in Lp Spaces: A Case Study with Grok

Carbery's conjectured improvement to the triangle inequality in Lp spaces is false for p > 2, but a weaker version holds true with a sharp exponent.

Ziang Chen, Jaume de Dios Pont, Paata Ivanisvili +2

Natural Language Processing

2w ago

The First Token Knows: Single-Decode Confidence for Hallucination Detection

Hallucination detection can be nearly as effective with a single forward pass as with expensive multi-sample methods.

Mina Gabriel

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Oracle Corporation·2w ago

Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement

Forget relying on LLMs to judge themselves: this "Concept Field" approach uses vector math on text corpora to detect hallucinations and novelty, offering a fast, interpretable, and black-box alternative.

Nicholas S. Kersting, Vittorio Castelli, Chieh Ting Yeh +2

Natural Language Processing Recommendation & Information Retrieval

2w ago·also Georgia Institute of Technology, Princeton

Think-Aloud Reshapes Automated Cognitive Model Discovery Beyond Behavior

Think-Aloud data doesn't just improve cognitive model fit; it fundamentally reshapes the discovered model structure, revealing cognitive mechanisms undetectable from behavior alone.

Hanbo Xie, Akshay K. Jagadish, Lan Pan +1

Natural Language Processing Reasoning & Chain-of-Thought

2w ago

Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models

Interventions on LLMs, like knowledge editing or unlearning, can have surprising side effects that this automated pipeline can now surface and validate.

Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau +1

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

2w ago·also Shanghai Qizhi Institute, State Key Laboratory of Cryptology

On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

Shuffling activations, a popular defense in secure Transformer inference, crumbles under a new alignment attack that recovers model weights for just $1.

Zhengyi Li, Yakai Wang, Kang Yang +6

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Natural Language Processing

Universitat de Barcelona·2w ago·also Centro Ricerche Enrico Fermi (CREF), Complexity Science Hub (CSH), Konstanz

Anticipating Innovation Using Large Language Models

Forget expert intuition – language trends in patent filings can foresee technological breakthroughs years before they happen.

Enrico Maria Fenoaltea, Filippo Santoro, Giordano De Marzo +2

Natural Language Processing Scientific Discovery & Drug Design

Federal University of Rio Grande do·2w ago

Assessing Cognitive Effort in L2 Idiomatic Processing: An Eye-Tracking Dataset

L2 learners' struggles with idioms, captured in a new eye-tracking dataset, offer a cognitively-grounded benchmark for evaluating how well LLMs truly "understand" figurative language.

Eduardo Santos, Juliana Carvalho, César Rennó-Costa

Natural Language Processing

Haotian Xia +6·2w ago·also HKU, Northwestern

StoryAlign: Evaluating and Training Reward Models for Story Generation

Current reward models are surprisingly bad at judging story quality, achieving only 66% accuracy in selecting human-preferred narratives – a gap closed by a new, purpose-built reward model.

Haotian Xia, Hao Peng, Yunjia Qi +4

Eval Frameworks & Benchmarks Natural Language Processing RLHF & Preference Learning

Álvaro Becerra +2·2w ago·also School of Engineering

AICoFe: Implementation and Deployment of an AI-Based Collaborative Feedback System for Higher Education

Teachers can now scalably provide high-quality, personalized feedback to students by leveraging a multi-LLM system that synthesizes rubric data and qualitative observations, while retaining control through a teacher-in-the-loop workflow.

Álvaro Becerra, A. Palma, Ruth Cobos

Natural Language Processing Open-Source Models & Weights Tool Use & Agents

Miao Wang +7·2w ago

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Forget stilted, unconvincing VR characters: EBM-RL's novel reward decomposition finally makes video-grounded role-playing dialogue feel immersive.

Miao Wang, Yuling Shi, Yijiang Li +5

Natural Language Processing RLHF & Preference Learning Tool Use & Agents

Álvaro Becerra +2·2w ago·also School of Engineering

AISSA: Implementation and Deployment of an AI-based Student Slides Analysis tool for Academic Presentations

Automating rubric-based feedback on presentation slides is now feasible and perceived as useful, thanks to LLMs and learning analytics dashboards.

Álvaro Becerra, Diego Gómez, Ruth Cobos

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Yuanzhi Wang +9·2w ago

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.

Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng +7

Computer Vision Multimodal Models Natural Language Processing

Mingda Li +4·2w ago

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

LLM uncertainty can be efficiently estimated *without* sampling by measuring the stability of output distributions under semantically equivalent input perturbations.

Mingda Li, Rundong Lv, Xinyu Li +2

Eval Frameworks & Benchmarks Natural Language Processing

2w ago·also Georgia State University, Harvard, Vanderbilt

Guidelines for Designing AI Technologies to Support Adult Learning

AI-powered learning systems often fail adult learners because they're built for kids: here are 19 guidelines to fix that.

Jennifer M. Reddig, Glen R. Smith, Sanaz Ahmadzadeh Siyahrood +16

Constitutional AI & AI Ethics Natural Language Processing

Yukun Chen +4·2w ago

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

Unlock scalable, high-quality singing voice transcription by directly generating structured musical scores from audio, outperforming existing systems on multiple datasets.

Yukun Chen, Tianrui Wang, Zhaoxi Mu +2

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Speech & Audio

Xinyi Li +7·2w ago

HeterSEED: Semantics-Structure Decoupling for Heterogeneous Graph Learning under Heterophily

HeterSEED achieves state-of-the-art performance on heterophilic heterogeneous graphs by decoupling semantic and structural information, offering a more robust approach than relying on feature similarity alone.

Xinyi Li, Ming Li, Lu Bai +5

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Recommendation & Information Retrieval

2w ago·also ByteDance, SEU

From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

Ditching diffusion's noise-denoising, RLFSeg uses Rectified Flow to directly predict segmentation masks from text prompts, unlocking zero-shot performance gains.

Zishen Qu, Xuesong Li, Haijian Gu +4

Computer Vision Multimodal Models Natural Language Processing

2w ago

RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

LLMs can get up to 6x more logically consistent without human feedback, simply by fusing NLI scores into the DPO training loop.

Qiming Bao, Juho Leinonen, Paul Denny +1

Natural Language Processing Reasoning & Chain-of-Thought RLHF & Preference Learning

Ivan Bondarenko +5·2w ago

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

A judge-orchestrated ensemble of diverse LLMs trounces single models in multi-turn response generation, proving that strategic model selection beats brute force scaling.

Ivan Bondarenko, Roman Derunets, Oleg Sedukhin +3

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

MIT CSAIL·2w ago

Implicit Representations of Grammaticality in Language Models

LMs encode grammaticality as a distinct feature in their hidden representations, separable from raw string probability and generalizable across languages.

Yingshan Susan Wang, Linlu Qiu, Zhaofeng Wu +2

Eval Frameworks & Benchmarks Natural Language Processing

University of Calgary·2w ago·also Institute University of Calgary

MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

LLMs ace MRI multiple-choice tests, but can't actually recall basic facts about GE scanners, revealing a dangerous gap between perceived and actual competence.

Perry E. Radau

Eval Frameworks & Benchmarks Natural Language Processing Scientific Discovery & Drug Design

Yucheng Ruan +4·2w ago

Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction

Overconfident predictions plague mental health prediction models, but this new framework leverages evidential learning to provide more trustworthy uncertainty estimates and human-understandable reasoning signals.

Yucheng Ruan, Ling Huang, Qika Lin +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

IDEAS Research Institute·2w ago·also Warsaw

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

LLMs differ most not in personality, but in how they represent themselves as having (or not having) rich internal experience.

Hubert Plisiecki, Sabina Siudaj, Kacper Dudzic +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Utrecht University·2w ago

Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals

Attention heads hold the key to detecting LLM hallucinations, offering a lightweight, white-box alternative to expensive sampling or external models.

Gijs van Dijk

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

2w ago·also Ant Group, PolyU

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

TabEmbed leapfrogs existing text embedding models to achieve SOTA performance on tabular data by reformulating tasks as semantic matching problems and using contrastive learning.

Minjie Qiang, Mingming Zhang, Xiaoyi Bao +5

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Kazan Federal University·2w ago·also Automation and Information Technologies, Department of Automated Systems for Data, Department of Data Analysis and Programming, Dmukhtasibovich -Doctor of Physical and Mathematical +5

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

Forget full fine-tuning: QLoRA on 7B models can match the perplexity of fully fine-tuned smaller models for low-resource languages, while slashing the parameter count by 40x.

Mullosharaf K. Arabov, Svetlana S. Khaybullina

Inference & Quantization Natural Language Processing Training Efficiency & Optimization

Charles University·2w ago

UFAL-CUNI at SemEval-2026 Task 11: An Efficient Modular Neuro-symbolic Method for Syllogistic Reasoning

Small LLMs paired with symbolic solvers can outperform larger zero-shot LLMs on formal reasoning tasks, but still struggle with multilingual inputs.

Ivan Kartáč, Kristýna Onderková, Jan Bronec +3

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

2w ago·also Northwestern

Unintended Negative Impacts of Promotional Language in Patent Evaluation

Patents overselling their innovation actually face a *penalty* in evaluation, decreasing their chances of being granted, transferred, or successfully appealed.

Bingkun Zhao, Chenwei Zhang, Hao Peng

Natural Language Processing Scientific Discovery & Drug Design

Vita Anggraini +5·2w ago·also Department of Data Science Institut, Institut Teknologi Sumatera Lampung

A Comparative Analysis of Machine Learning and Deep Learning Models for Tweet Sentiment Classification: A Case Study on the Sentiment140 Dataset

Sometimes, simpler is better: Logistic Regression beats BiLSTMs at tweet sentiment classification on medium-sized datasets.

Vita Anggraini, Cintya Bella, Bastian +3

Natural Language Processing

2w ago·also CNRS, CREST, ENSAE, Grenoble INP +3

BenCSSmark: Making the Social Sciences Count in LLM Research

LLM benchmarks are missing a critical ingredient: social science data, which could significantly improve model generalization and robustness across a wide range of disciplines.

Arnault Chatelain, Étienne Ollion, Qianwen Guan +7

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Institut Teknologi Sumatera South·2w ago·also Department of Data Science Institut, Institut Teknologi Sumatera Lampung

Sentiment Analysis and Customer Satisfaction Prediction on E-Commerce Platforms Based on YouTube Comments Using the XGBoost Algorithm

E-commerce sentiment analysis is surprisingly influenced by socio-political terminology, impacting the accuracy of customer satisfaction prediction models.

Ridho Benedictus Togi Manik, Muhammad Aqil Ramadhan, Ihsan Maulana Yusuf +3

Natural Language Processing Recommendation & Information Retrieval

Department of Data Science Institut·2w ago·also Institut Teknologi Sumatera Lampung

A Comparative Study of PyCaret AutoML and CNN-BiLSTM for Binary Hate Speech Detection in Indonesian Twitter

CNN-BiLSTM beats AutoML for Indonesian hate speech detection, but the gains are modest, suggesting the dataset's limitations are a bigger bottleneck than model architecture.

Tanty Widiyastuti, Mayada, Adisty Syawalda Ariyanto +3

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing

Maria Luongo +2·2w ago

Measuring Psychological States Through Semantic Projection: A Theory-Driven Approach to Language-Based Assessment

Ditch the black box: This unsupervised semantic projection method rivals supervised models in psychological assessment, offering interpretability and generalizability that supervised methods lack.

Maria Luongo, Davide Marocco, Nicola Milano

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

2w ago

CHE-TKG: Collaborative Historical Evidence and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning

State-of-the-art temporal knowledge graph reasoning is now possible by jointly modeling historical evidence and evolutionary dynamics, unlocking complementary predictive signals.

Shuai Lei, Xiaobin Zhu, Jiarui Liang +3

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Ge Lei +1·2w ago

Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

LLM surrogates in low-data optimization are far more sensitive to prompt engineering and query protocols than previously appreciated, fundamentally altering their beliefs and downstream performance.

Ge Lei, Samuel J. Cooper

Eval Frameworks & Benchmarks Natural Language Processing

Aofan Liu +1·2w ago

Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

LLMs can be surprisingly brittle: simply rephrasing a prompt, even while preserving its meaning, can cause them to completely abandon the requested output format.

Aofan Liu, Jingxiang Meng

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

CMU ML·2w ago·also SKKU, UBC

Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties

Dissimilarity, not just similarity, unlocks better language generalization for low-resource varieties.

Jinju Kim, Haeji Jung, Youjeong Roh +2

Data Curation & Synthetic Data Natural Language Processing Open-Source Models & Weights

ETH·2w ago

Graph-Augmented LLMs for Swiss MP Ideology Prediction

Political ideology prediction gets a boost: injecting LLMs with knowledge graphs of MP relationships significantly improves accuracy.

Natural Language Processing Recommendation & Information Retrieval

M. Arabov·2w ago

TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)

Unlock Tajik NLP: a new open-source toolkit delivers a comprehensive pipeline for processing Cyrillic-script Tajik text, complete with datasets and pre-trained embeddings.

M. Arabov

Data Curation & Synthetic Data Natural Language Processing Open-Source Models & Weights

University of Potsdam·2w ago

The Newsworthiness of Brazilian Distress: A Peak Analysis on Time Series of International Media Attention to Disasters in Brazil

International media attention to Brazilian disasters doesn't always reflect the actual severity or frequency of events, revealing a disconnect between disaster databases and news cycles.

Brielen Madureira, Andreas Niekler, Marc Keuschnigg +1

Natural Language Processing

M. Arabov·2w ago

Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus

Even state-of-the-art multilingual models struggle to tag parts-of-speech in Tajik when trained on isolated words, highlighting the critical role of syntactic context.

M. Arabov

Architecture Design (Transformers, SSMs, MoE) Eval Frameworks & Benchmarks Natural Language Processing

Yepeng Weng +2·2w ago

UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

UniVer achieves state-of-the-art speculative decoding by jointly optimizing multi-step and multi-draft verification, outperforming existing methods by up to 8.5% in acceptance length.

Yepeng Weng, Qiao Hu, T. Yairi

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Natural Language Processing

Zongqi Cui +1·2w ago

Distilling Bayesian Belief States into Language Models for Auditable Negotiation

You can distill interpretable Bayesian reasoning about opponent preferences into an 8B language model, outperforming much larger models and enabling detailed auditability of negotiation strategies.

Zongqi Cui, Baihan Lin

Inference & Quantization Natural Language Processing Tool Use & Agents

Zhipeng Song +8·2w ago

CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

RAG systems can be significantly improved by reranking documents based on how much they increase the LLM's confidence, not just their relevance.

Zhipeng Song, Yizhi Zhou, Xiangyu Kong +6

Natural Language Processing Recommendation & Information Retrieval

2w ago

DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation

Stop hand-crafting QA datasets for evaluating RAG systems: DoGMaTiQ automates the process with surprisingly high correlation to human judgment, even across languages.

Bryan Li, W. Walden, Yu Hou +6

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Xinyu Wang +3·2w ago

Stabilizing LLM Supervised Fine-Tuning via Explicit Distributional Control

LLMs can retain 10x more of their original capabilities after fine-tuning, simply by using a dynamically adjusted "anchor" to constrain distributional drift during training.

Xinyu Wang, Changzhi Sun, Yuanbin Wu +1

Natural Language Processing Training Efficiency & Optimization

Ziqi Zhu +3·2w ago

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking

LLMs get schooled in dialogue state tracking by a mixture-of-experts architecture that uses a graph neural network and ReAct agents to achieve state-of-the-art results with a T5-Small backbone.

Ziqi Zhu, Adithya Suresh, Tomal Deb +1

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Tool Use & Agents

Mikhail L. Arbuzov +4·2w ago

Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting

Forget token deletion – Telegraph English rewrites prompts into a symbol-rich, structured dialect that compresses by 50% while actually *improving* accuracy on smaller models.

Mikhail L. Arbuzov, Sisong Bei, Ziwei Dong +2

Inference & Quantization Natural Language Processing

2w ago·also Brandenburg University of Technology

Conflict Essences for Transformation Rules with Nested Application Conditions -- Long Version

Pinpointing minimal "conflict essences" reveals precisely how graph transformation rules interfere, even with complex nested conditions.

Alexander Lauer, Jens Kosiol, Leen Lambers +1

Code Generation & Program Synthesis Natural Language Processing

2w ago·also Munich Center for Machine Learning (MCML)

A meta-analysis of the effect of generative AI on productivity and learning in programming

GenAI coding assistants boost developer productivity, but the gains shrink outside the lab and don't translate to better learning.

Sebastian Maier, Moritz Gunzenhauser, J. Schweisthal +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Wenzhuo Cheng +6·2w ago

CapsID: Soft-Routed Variable-Length Semantic IDs for Generative Recommendation

Generative recommendation gets a boost: CapsID's soft-routed semantic IDs outperform hard-quantized baselines and even rival sparse-dense hybrids, all while slashing inference latency by nearly half.

Wenzhuo Cheng, Menghang Gong, Qixin Guo +4

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Recommendation & Information Retrieval

2w ago·also ZJU

Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation

LLMs for recommendation can now surpass the limitations of static training signals, achieving sustained improvements in ranking accuracy, fairness, and diversity through a dynamically updated Bayesian distillation target.

Ruijun Chen, Chongming Gao, Jiawei Chen +2

Natural Language Processing Recommendation & Information Retrieval

DAMO·2w ago·also PolyU, SCU

RecGPT-Mobile: On-Device Large Language Models for User Intent Understanding in Taobao Feed Recommendation

On-device LLMs can now drive real-time recommendation improvements, unlocking faster adaptation to evolving user intent without cloud reliance.

Bin Zhang, Weipeng Huang, Dimin Wang +8

Inference & Quantization Natural Language Processing Recommendation & Information Retrieval

ZhiXin Sun·2w ago

Example-Based Object Detection

Stop retraining your object detector every time it makes a mistake: EBOD learns from failure examples to prevent recurring errors in open-vocabulary object detection.

ZhiXin Sun

Computer Vision Natural Language Processing

May 5, 2026

Qiyao Wang +13·2w ago

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

LLMs struggle to navigate the complex, multi-turn justification and response dynamics of real-world patent examination, revealing critical gaps in legal reasoning and technical novelty judgment.

Qiyao Wang, Qiyao Wang, Xinyi Chen +11

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Joseph Breda +32·2w ago

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

LLMs beat doctors at everyday symptom diagnosis, but only when they proactively interview patients instead of passively answering questions.

Joseph Breda, Fadi Yousif, Beszel Hawkins +30

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Zhi Xu +1·2w ago·also Northeastern

NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise

LLMs struggle with causal reasoning when noise is introduced, but explicitly modeling causal graphs can dramatically improve performance and generalization.

Zhi Xu, Yun Fu

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Amazon Science·2w ago

SWAN: Semantic Watermarking with Abstract Meaning Representation

Semantic watermarks, embedded via AMR, survive paraphrasing attacks that obliterate token-level watermarks.

Ziping Ye, Gourab Dey, Christos Christodoulopoulos +7

Natural Language Processing Red-Teaming & Adversarial Robustness

Stefano Bannò +2·2w ago

Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs

LLMs are surprisingly good at pinpointing what's *wrong* with student writing, even outperforming human graders in identifying relative weaknesses.

Stefano Bannò, Kate Knill, Mark Gales

Eval Frameworks & Benchmarks Natural Language Processing

2w ago

MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs

Existing hallucination detection methods miss subtle, word-level medical errors, but a new data-centric pipeline and detector close the gap by 15%.

Tung Sum Thomas Kwok, Qian Qian, Xiaofeng Lin +8

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Yao-Shun Chuang +8·2w ago

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

Forget massive models: small, locally deployable language models can achieve surprisingly strong performance on privacy-sensitive clinical information extraction tasks with self-prompting and preference-based optimization.

Yao-Shun Chuang, Tushti Mody, Uday Pratap Singh +6

Inference & Quantization Natural Language Processing Open-Source Models & Weights

Yaobo Zhang·2w ago

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

Forget boring rotary embeddings: Jordan-RoPE unlocks distance-modulated phase interactions in attention, letting your model learn relationships like "the further apart, the stronger the cosine similarity."

Yaobo Zhang

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing

Stephen E. Moore +15·2w ago

Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages

Despite impressive multilingual capabilities, today's LLMs still can't reliably translate between English and Ghanaian languages at scale.

Stephen E. Moore, M. Owusu, Akwasi Asare +13

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

Oona Itkonen +1·2w ago

The Impact of Vocabulary Overlaps on Knowledge Transfer in Multilingual Machine Translation

Domain match and language relatedness trump joint vocabularies for effective knowledge transfer in multilingual NMT.

Oona Itkonen, Jörg Tiedemann

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Training Efficiency & Optimization

Hoffmann Muki +1·2w ago

Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa

LLMs exhibit a surprising "False Illegitimation bias," systematically misclassifying legitimate battles as violence against civilians, highlighting a critical flaw for conflict monitoring applications.

Hoffmann Muki, Olukunle P. Owolabi

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

Humam Khan +4·2w ago

Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

LLMs may sound convincing when writing academic content, but they can still confidently fabricate facts and references at surprisingly high rates.

Humam Khan, Md. Tabrez Nafis, S. Sohail +2

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

Elitsa Yotkova +4·2w ago

FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals

Forget the heavy transformers: surprisingly effective LLM-generated code detection can be achieved with lightweight stylometric features and decision trees, offering near-instant inference.

Elitsa Yotkova, Violeta Kastreva, D. Dimitrov +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Richard J. Young +1·2w ago·also DeepNeuro AI

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

LLMs can exhibit gender bias in emergency triage even when well-calibrated, and interventions effective for one model may backfire on another.

Richard J. Young, Alice M. Matthews

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

2w ago

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

LLMs' own self-judgments, when logically linked to their response features, can significantly improve hallucination detection.

Hao Mi, Qiang Sheng, Shaofei Wang +7

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Geert Heyman +1·2w ago

Steer Like the LLM: Activation Steering that Mimics Prompting

Activation steering can finally match the nuanced control of prompt engineering: token-specific interventions learned from prompts let you steer LLMs more effectively.

Geert Heyman, Frederik Vandeputte

Interpretability & Mechanistic Interp Natural Language Processing

Mohamed F. Mady +2·2w ago

Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

Naive application of transformer-based AI-text detectors can be brittle under distribution shift, but attention-based fusion of readability and vocabulary features can significantly improve robustness.

Mohamed F. Mady, Johannes Reschke, Björn W. Schuller

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

Daniel Drucker +1·2w ago

The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models

Language models can play the counterexample game, but their philosophical reasoning hits diminishing returns fast, and they're far more lenient judges than humans.

Daniel Drucker, Kyle Mahowald

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Devon Jarvis +4·2w ago

Position: the Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities

Model collapse isn't just a technical problem; it's a threat to AI democratization that will widen the gap between high- and low-resource communities.

Devon Jarvis, Richard Klein, Benjamin Rosman +2

Constitutional AI & AI Ethics Data Curation & Synthetic Data Natural Language Processing

2w ago

MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

Even top LLM judges struggle to reliably detect violations of specific constraints in complex instructions, especially when violations are partial or absent, revealing critical blind spots in current evaluation methods.

Jaeyun Lee, Junyoung Koh, Z. Tok +2

Eval Frameworks & Benchmarks Natural Language Processing

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

Learn to build and evaluate your own NLP pipeline, from tokenization to RLHF, using open-weight models and reproducible research practices.

Mullosharaf K. Arabov

Natural Language Processing Recommendation & Information Retrieval RLHF & Preference Learning

2w ago·also Google Research, Harvard, Northeastern, Notre Dame +2

Deco: Extending Personal Physical Objects into Pervasive AI Companion through a Dual-Embodiment Framework

Instead of creating new AI companions from scratch, Deco shows how to breathe new life into cherished physical objects by giving them a digital voice and personality powered by LLMs.

Zhihan Jiang, Meng Wu, Ruishi Zou +14

Natural Language Processing Robotics & Embodied AI Tool Use & Agents
