Natural Language Processing
Applications: Text understanding, generation, summarization, translation, information extraction, and linguistic analysis.
Recent Papers
This paper introduces Hadamard Linear Attention (HLA), a novel linear attention mechanism designed to more accurately approximate softmax attention. HLA applies a nonlinearity after the computation of pairwise similarities, unlike existing linear attention methods that apply nonlinear kernel functions independently to queries and keys. The authors demonstrate that this approach results in a higher-degree rational function approximation of softmax and show its effectiveness in a large diffusion transformer model for video generation.
Introduces Hadamard Linear Attention (HLA), a linear attention variant that applies a nonlinearity after pairwise similarity computation to better approximate softmax.
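For context, here is a minimal NumPy sketch of the factorization that standard linear attention relies on and that HLA departs from: the kernel map is applied independently to queries and keys, which is exactly what lets the similarity computation be reassociated into linear cost. HLA's actual post-similarity nonlinearity is not reproduced here; the feature map `phi` and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Exact softmax attention: O(n^2) in sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Standard linear attention: a kernel map phi is applied independently
    # to queries and keys, so (phi(Q) @ phi(K).T) @ V can be reassociated
    # to phi(Q) @ (phi(K).T @ V), i.e. O(n) in sequence length.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                    # (d, d_v) summary, independent of n
    z = Qp @ Kp.sum(axis=0)          # per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 16, 8))
# How far this particular linear approximation is from exact softmax:
print(np.abs(softmax_attention(Q, K, V) - linear_attention(Q, K, V)).mean())
```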
The paper introduces SAGEO Arena, a realistic evaluation environment for Search-Augmented Generative Engine Optimization (SAGEO) that addresses limitations of existing benchmarks by incorporating a full generative search pipeline over a large-scale corpus of web documents with rich structural information. The authors demonstrate that existing optimization approaches are often impractical and can degrade performance in the retrieval and reranking stages under realistic conditions. The study highlights the importance of structural information and stage-specific optimization for effective SAGEO.
Introduces SAGEO Arena, a novel benchmark environment enabling realistic, stage-level evaluation of search-augmented generative engine optimization strategies.
This paper investigates the impact of different LLM-powered AI assistance modalities (Advisor, Coach, Delegate) on human performance in multi-party negotiation games. Participants played bargaining games with access to one of these modalities, all of which were powered by the same underlying LLM. The key finding is a preference-performance misalignment: participants preferred the Advisor but achieved higher individual gains with the Delegate, which acted as a "market maker" by injecting Pareto-improving proposals.
Demonstrates a preference-performance misalignment in AI-assisted negotiation, revealing that users do not always adopt the AI modality that maximizes their gains or overall group welfare.
This paper introduces a PAC learning framework for learning conditional averages, where the goal is to predict the average label within an instance-specific neighborhood rather than the label itself. The work provides a complete characterization of learnability in this setting, demonstrating that it depends on the joint finiteness of two novel combinatorial parameters related to the independence number of the neighborhood graph. The authors derive sample complexity bounds that are tight up to logarithmic factors, offering insights into the learnability of conditional averages.
Characterizes the PAC learnability of conditional averages by introducing and analyzing two novel combinatorial parameters related to the independence number of the neighborhood graph.
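In notation of our own choosing (the paper's symbols may differ), the learning target can be written as a neighborhood-conditioned average:

```latex
% Given a neighborhood map N(x), the learner must approximate the average
% label over that neighborhood rather than the label itself:
\[
  \bar{h}(x) \;=\; \mathbb{E}_{x' \sim \mathcal{D}}\big[\, y(x') \;\big|\; x' \in N(x) \big],
\]
% with PAC learnability governed by the joint finiteness of two combinatorial
% parameters tied to the independence number of the graph induced by N.
```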
This paper studies bandit learning in two-sided matching markets where agents and firms conduct interviews to learn preferences. The authors introduce strategic deferral, allowing firms to delay hiring decisions and recover from suboptimal matches, and model interviews as low-cost hints that reveal partial preference information. They develop novel algorithms for centralized and decentralized settings that achieve time-independent regret, improving upon logarithmic regret bounds for learning stable matchings without interviews.
Introduces strategic deferral for firms in matching markets, enabling decentralized learning and recovery from suboptimal hires.
The paper introduces RouterXBench, a comprehensive evaluation framework for LLM routers, addressing limitations of existing benchmarks by considering router ability, scenario alignment, and cross-domain robustness. The authors propose ProbeDirichlet, a novel router that leverages internal hidden states and learnable Dirichlet distributions for probabilistic training, capturing model uncertainty more effectively than methods relying on output probabilities or external embeddings. Empirical results demonstrate that ProbeDirichlet outperforms existing routers, achieving significant improvements in router ability and high-accuracy scenarios, while exhibiting robust generalization across diverse model families, scales, tasks, and workflows.
Introduces ProbeDirichlet, a router that aggregates cross-layer hidden states via learnable Dirichlet distributions for improved uncertainty estimation and routing decisions.
The paper introduces RELATE, a reinforcement learning framework for end-to-end advertising text generation that directly optimizes for conversion-oriented metrics and compliance constraints. RELATE integrates performance and compliance objectives into the text generation process via policy learning, moving beyond the traditional two-stage generation and alignment paradigm. Experiments on industrial datasets and online deployment show that RELATE significantly improves click-through conversion rate (CTCVR) while adhering to policy constraints.
Introduces an end-to-end reinforcement learning framework, RELATE, that unifies advertising text generation with conversion-oriented objective alignment and compliance constraints.
The paper introduces DHPLT, a large-scale multilingual diachronic corpus comprising web-crawled data from 41 languages across three time periods (2011-2015, 2020-2021, 2024-present). The authors leverage web crawl timestamps as a proxy for document creation time, providing 1 million documents per time period per language. They also provide pre-computed word embeddings and lexical substitutions to facilitate semantic change modeling research, addressing the scarcity of such resources for many languages.
Introduces DHPLT, a novel multilingual diachronic corpus with pre-computed embeddings and lexical substitutions, designed to facilitate research in semantic change modeling across 41 languages.
This paper introduces the concept of human-LLM archetypes, defined as recurring socio-technical interaction patterns that structure the roles of humans and LLMs in collaborative decision-making. Through a scoping literature review and thematic analysis of 113 papers, the authors identified 17 distinct human-LLM archetypes. They then evaluated these archetypes across clinical diagnostic cases, demonstrating that the choice of archetype influences LLM outputs and decision outcomes.
Defines and categorizes 17 human-LLM interaction archetypes to demonstrate how these archetypes impact LLM outputs and decisions in human-AI collaborative decision-making.
This paper introduces a subword embedding approach to detect lexical and orthographic variation in user-generated text, specifically addressing the challenges of "noisy" and low-resource settings without relying on normalization or predefined variant lists. The method trains subword embeddings on raw Luxembourgish user comments and clusters related forms using a combination of cosine similarity and n-gram similarity. The results demonstrate the effectiveness of distributional modeling in uncovering meaningful patterns of variation, aligning with existing dialectal and sociolinguistic research.
Introduces a novel subword embedding method that automatically discovers and clusters lexical variations in user-generated text, even in low-resource languages, without requiring prior normalization or predefined variant lists.
This paper investigates the sensicality of sentences in existing semantically deviant datasets by comparing human and LLM judgments, both with and without provided contexts. The study reveals that humans generally perceive sentences as anomalous rather than nonsensical, suggesting existing datasets may not be as nonsensical as assumed. Furthermore, the research demonstrates LLMs' ability to generate plausible contexts that render anomalous sentences more sensible.
Empirically demonstrates that existing "nonsensical" datasets are largely composed of anomalous sentences interpretable with context, and that LLMs can generate such contexts.
This paper addresses temporal domain generalization (TDG) for LLMs by reformulating it geometrically under parameter-efficient fine-tuning. It posits that the low-dimensional temporal structure of model evolution can be preserved under parameter-efficient reparameterization. The authors introduce Manifold-aware Temporal LoRA (MaT-LoRA), which constrains temporal updates to a shared low-dimensional manifold within a low-rank adaptation subspace, modeling its evolution through a structured temporal core, and achieving superior temporal generalization performance with practical scalability.
Introduces MaT-LoRA, a parameter-efficient fine-tuning method that constrains temporal updates to a low-dimensional manifold within a LoRA subspace and models its evolution with a structured temporal core for improved temporal domain generalization in LLMs.
The paper introduces SiamXBERT, a Siamese meta-learning framework leveraging a transformer-based language model, to address the challenge of detecting unknown (zero-day) attacks in IoT networks under data scarcity and encrypted traffic conditions. SiamXBERT constructs a dual-modality feature representation from flow and packet-level information and uses meta-learning for rapid adaptation to new attack types with limited labeled data. Experiments on IoT intrusion datasets demonstrate that SiamXBERT outperforms state-of-the-art baselines, achieving up to a 78.8% improvement in F1-score on unknown attacks and showcasing its robustness and data efficiency.
Introduces SiamXBERT, a novel Siamese meta-learning framework empowered by a transformer-based language model, for robust and data-efficient unknown attack detection in IoT networks.
The paper introduces PosterOmni, a framework for generalized artistic poster creation that tackles both local image editing and global design creation aspects of the task. It achieves this by constructing a multi-task dataset, distilling knowledge from local and global expert models, and applying a unified reward feedback mechanism to align visual fidelity and aesthetic preferences. Experiments on the new PosterOmni-Bench demonstrate that PosterOmni outperforms existing open-source and proprietary systems in reference adherence, composition, and aesthetics.
Introduces a novel data-distillation-reward pipeline to unify local image editing and global design creation for generalized artistic poster generation.
The paper introduces ULTRA, a transformer-based recommendation architecture for Urdu, a low-resource language, to improve personalized news retrieval. ULTRA employs a dual-embedding architecture with a query-length-aware routing mechanism to handle varying query lengths, directing queries to either title/headline-level or full-content pipelines. Experiments on a large Urdu news corpus demonstrate that ULTRA achieves over 90% precision, outperforming single-pipeline baselines in recommendation relevance.
Introduces a query-adaptive dual-embedding architecture for semantic content recommendation in low-resource languages, dynamically routing queries based on length to optimize retrieval relevance.
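A toy sketch of the routing idea, assuming a simple token-count threshold and TF-IDF stand-ins for the two embedding pipelines; ULTRA's actual encoders, indexes, and threshold are not specified here and are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ToyIndex:
    """Toy stand-in for one retrieval pipeline (hypothetical)."""
    def __init__(self, docs):
        self.docs = docs
        self.vec = TfidfVectorizer().fit(docs)
        self.mat = self.vec.transform(docs)

    def search(self, query, k=3):
        sims = cosine_similarity(self.vec.transform([query]), self.mat)[0]
        return [self.docs[i] for i in sims.argsort()[::-1][:k]]

def route(query, title_index, content_index, max_short_tokens=5):
    # Query-length-aware routing: short queries match headline-level
    # representations; longer queries go to the full-content pipeline.
    idx = title_index if len(query.split()) <= max_short_tokens else content_index
    return idx.search(query)
```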
This paper addresses the limitations of current copyright law in the age of generative AI, where style imitation without content copying complicates infringement detection. The authors propose a new criterion for infringement based on whether an AI output could have been generated without a specific work in its training corpus. Modeling generative systems as closure operators, they demonstrate a dichotomy: AI generation is either asymptotically unconstrained, with light-tailed organic creations, or persistently constrained, with heavy-tailed creations.
Introduces a novel criterion for copyright infringement in the context of generative AI, focusing on whether an output could have been generated without a specific work in the training corpus.
The paper introduces HABIT, a data-driven framework for imputing missing segments in vessel trajectories using historical Automatic Identification System (AIS) data. HABIT leverages H3 geospatial indexing to aggregate and analyze vessel motion patterns, enabling the imputation of missing trajectory segments based on learned historical behaviors. Empirical evaluation demonstrates that HABIT achieves comparable accuracy to existing methods while offering improved latency and better accounting for vessel characteristics.
Introduces HABIT, a novel H3 Aggregation-Based Imputation framework, to impute missing vessel trajectories by learning and leveraging historical vessel motion patterns.
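A minimal sketch of the H3-aggregation idea, assuming the `h3` Python bindings and a fixed resolution; HABIT's actual features (e.g., vessel characteristics) and imputation model are not reproduced here.

```python
import h3  # h3-py >= 4; earlier versions use h3.geo_to_h3 instead
from collections import Counter, defaultdict

# Aggregate historical AIS fixes into H3 cells and count observed
# cell-to-cell transitions; resolution and data layout are assumptions.
RES = 7
transitions = defaultdict(Counter)

def index_track(track):
    """track: list of (lat, lon) fixes in time order."""
    cells = [h3.latlng_to_cell(lat, lon, RES) for lat, lon in track]
    for a, b in zip(cells, cells[1:]):
        if a != b:
            transitions[a][b] += 1

def impute_next(cell):
    """Most frequent historically observed successor cell, if any."""
    nxt = transitions.get(cell)
    return nxt.most_common(1)[0][0] if nxt else None
```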
The paper investigates how speech recognition models fail at transcribing U.S. street names, finding a 44% error rate across 15 models from major vendors and disproportionately large routing-distance errors for speakers whose primary language is not English. It highlights the gap between benchmark performance and real-world reliability, particularly for high-stakes tasks involving named entities. The authors then demonstrate that fine-tuning with a small, synthetically generated dataset of diverse pronunciations improves street-name transcription accuracy by nearly 60% for these speakers.
Demonstrates that speech recognition models exhibit significant transcription errors on street names, particularly impacting non-English speakers, and mitigates this issue through synthetic data augmentation.
The paper introduces Temperature Adaptive Meta Policy Optimization (TAMPO), a novel framework that learns to control the temperature hyperparameter of an LLM during reinforcement learning. TAMPO uses a hierarchical two-loop process where an inner loop updates the LLM policy using trajectories sampled at temperatures selected by a meta-policy, and an outer loop updates the meta-policy to favor temperatures that maximize the likelihood of high-advantage trajectories. Experiments on mathematical reasoning benchmarks demonstrate that TAMPO outperforms baselines with fixed or heuristic temperature schedules, showing the effectiveness of learned temperature control for adaptive exploration.
Introduces a hierarchical reinforcement learning framework, TAMPO, that learns a meta-policy to dynamically adjust the temperature parameter of an LLM, optimizing exploration during policy learning.
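A toy two-loop illustration of the idea, with a bandit stand-in for the inner LLM update; the temperature grid, reward shape, and learning rate are assumptions, not TAMPO's actual objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
temps = np.array([0.3, 0.7, 1.0, 1.3])
logits = np.zeros_like(temps)          # meta-policy over temperature bins

def sample_advantage(t):
    # Stand-in for the advantage of trajectories sampled at temperature t:
    # a noisy function peaking near t = 0.7 (purely illustrative).
    return -(t - 0.7) ** 2 + 0.1 * rng.normal()

for _ in range(2000):
    p = np.exp(logits - logits.max()); p /= p.sum()
    i = rng.choice(len(temps), p=p)
    adv = sample_advantage(temps[i])   # the inner loop would update the LLM here
    grad = -p.copy(); grad[i] += 1.0   # gradient of log p_i w.r.t. logits
    logits += 0.1 * adv * grad         # outer loop: favor high-advantage temps

p = np.exp(logits - logits.max()); p /= p.sum()
print(dict(zip(temps.tolist(), p.round(3).tolist())))
```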
This paper introduces Hierarchical Sparse Autoencoders (HSAEs) to explicitly model the hierarchical relationships between features extracted from LLMs, addressing the limitation of standard SAEs that treat features in isolation. HSAEs incorporate a structural constraint loss and random feature perturbation to encourage alignment between parent and child features in the learned hierarchy. Experiments across various LLMs and layers demonstrate that HSAEs recover semantically meaningful hierarchies while preserving reconstruction fidelity and interpretability.
Introduces Hierarchical Sparse Autoencoders (HSAEs) to learn and represent the hierarchical relationships between features extracted from LLMs.
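One way such a structural constraint could look, sketched in PyTorch: each child feature is penalized for activating more strongly than its assigned parent. The paper's exact loss and random-perturbation scheme are not reproduced here; the grouping and margin are assumptions.

```python
import torch
import torch.nn.functional as F

def hierarchy_loss(parent_acts, child_acts, parent_of, margin=0.0):
    # parent_acts: (batch, P) nonnegative codes; child_acts: (batch, C);
    # parent_of: LongTensor (C,) mapping each child feature to its parent.
    gating = parent_acts[:, parent_of]              # (batch, C)
    # Penalize children that fire without their parent firing at least as much.
    return F.relu(child_acts - gating - margin).mean()

batch, P, C = 4, 8, 32
parent = torch.rand(batch, P)
child = torch.rand(batch, C)
parent_of = torch.randint(0, P, (C,))
print(hierarchy_loss(parent, child, parent_of))
```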
This paper introduces Talk2DM, a plug-and-play module designed to enhance vehicle-road-cloud dynamic map (VRC-DM) systems with natural language querying and commonsense reasoning capabilities. To facilitate this, the authors created VRCsim, a VRC cooperative perception simulation framework, and VRC-QA, a question-answering dataset focused on spatial reasoning in mixed-traffic scenarios. Talk2DM leverages a novel chain-of-prompt (CoP) mechanism to integrate human-defined rules with LLM knowledge, achieving high accuracy and reasonable response times with models like Qwen3:8B, Gemma3:27B, and GPT-oss.
Introduces a chain-of-prompting method (CoP) that enables LLMs to effectively query and reason about dynamic maps by combining human-defined rules with the LLM's inherent commonsense knowledge.
The paper introduces MEME, a novel framework that models financial markets as an evolving ecosystem of investment narratives ("Modes of Thought") to improve portfolio construction. MEME uses a multi-agent extraction module to convert noisy data into Investment Arguments, then employs Gaussian Mixture Modeling to identify consensus within a semantic space and a temporal evaluation mechanism to track the lifecycle of these modes. Experiments on Chinese stock pools from 2023-2025 show MEME outperforms seven state-of-the-art baselines, demonstrating its ability to adapt to evolving market consensus.
Introduces a logic-oriented framework, MEME, that models financial markets as a dynamic ecosystem of evolving investment narratives to guide portfolio construction.
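A toy sketch of the consensus-finding step, using random vectors as stand-ins for Investment Argument embeddings; MEME's multi-agent extraction and temporal lifecycle tracking are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in embeddings: a dominant narrative cluster and a fringe one.
emb = np.vstack([rng.normal(0, 0.3, (40, 8)),
                 rng.normal(2, 0.3, (10, 8))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(emb)
consensus = int(np.argmax(gmm.weights_))   # heaviest component = consensus mode
print("consensus component:", consensus,
      "weight:", gmm.weights_[consensus].round(2))
```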
This paper introduces Trajectory Self-Distillation (T3D), a novel framework for improving the generation quality of few-step Diffusion Language Models (DLLMs) by distilling the model's own generative trajectories. T3D incorporates Direct Discriminative Optimization (DDO), a reverse-KL objective, to encourage mode-seeking behavior during distillation, focusing the student model on high-probability regions of the teacher model's output space. Experiments across various benchmarks demonstrate that T3D significantly outperforms existing few-step DLLM baselines, substantially reducing the performance gap with full-step decoding.
Introduces a trajectory self-distillation framework, T3D, that leverages direct discriminative optimization to improve the generation quality of few-step diffusion language models.
This paper introduces Distribution Discriminant Theory (DDT) to quantify the alignment between training data and the model-induced distribution in supervised fine-tuning (SFT) of LLMs. Based on DDT, the authors propose In-Distribution Finetuning (IDFT), a loss-level method, and Hinted Decoding, a data-level technique, to improve generalization by aligning the training data distribution with the model's. Experiments show that the proposed framework achieves generalization performance comparable to offline RL methods like DPO and SimPO, while retaining the efficiency of SFT.
Introduces Distribution Discriminant Theory (DDT) to quantify and improve the alignment between training data and model-induced distributions in LLM supervised fine-tuning.
This paper investigates in-context learning in LLMs by framing it as Gaussian Process (GP) regression, using controlled experiments with function samples drawn from known GP priors. They compare LLM prediction error against empirical GP-regression (lower bound) and 1-NN (upper bound) baselines, finding that LLM learning curves approach the GP lower bound with increasing demonstrations. The authors also analyze LLM inductive biases via likelihood analysis, revealing a preference for less smooth GP kernels, and demonstrate that post-training can shift these biases to improve sample efficiency on smoother kernels.
Quantifies the extent to which LLMs behave like GP learners and provides methods for steering their inductive biases for continuous function learning tasks.
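A self-contained sketch of the two reference baselines named above, on a function sampled from a known RBF GP prior; the kernel and hyperparameters are our choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

xs = np.linspace(0, 1, 64)
f = rng.multivariate_normal(np.zeros(64), rbf(xs, xs) + 1e-8 * np.eye(64))

idx = rng.choice(64, size=10, replace=False)        # "demonstrations"
x_tr, y_tr = xs[idx], f[idx]

K = rbf(x_tr, x_tr) + 1e-6 * np.eye(10)
gp_mean = rbf(xs, x_tr) @ np.linalg.solve(K, y_tr)  # GP posterior mean (lower bound)
nn_pred = y_tr[np.abs(xs[:, None] - x_tr[None, :]).argmin(axis=1)]  # 1-NN (upper bound)

print("GP  MSE:", np.mean((gp_mean - f) ** 2))
print("1NN MSE:", np.mean((nn_pred - f) ** 2))
```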
The paper introduces LRBTC, a modular LLM and VLM-driven architecture for quality control in pharmaceutical content, addressing the need for scalable and verifiable validation in regulated domains. LRBTC employs a Student-Teacher dual model architecture combined with a human-in-the-loop workflow and waterfall rule filtering. The approach achieves significant improvements on AIReg-Bench (83.0% F1, 97.5% recall) and CSpelling (26.7% accuracy improvement), demonstrating its effectiveness in reducing missed violations and improving content quality.
Introduces LRBTC, a novel LLM and VLM-driven quality control architecture that leverages a Student-Teacher dual model and HITL workflow for pharmaceutical content optimization.
This paper introduces a deep learning approach to enhance social robot gaze behavior by incorporating both human and non-human stimuli, using LSTM and Transformer models trained on human gaze data collected via VR in simulated and real-world scenarios. The models predict human gaze direction with accuracies of up to 72% (LSTM) and 71.6% (Transformer) in real-world settings, outperforming existing methods, which consider only human stimuli. The system was deployed on a NAO robot and evaluated with 275 participants, demonstrating high user satisfaction.
Demonstrates a novel approach to predicting human gaze in social settings by integrating non-human stimuli and achieving state-of-the-art accuracy using LSTM and Transformer models.
The paper introduces CitiLink-Minutes, a novel multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities, designed to address the lack of annotated datasets for NLP and IR research in this domain. The dataset features over one million tokens with de-identified personal information and includes manual annotations across metadata, subjects of discussion, and voting outcomes. Experiments demonstrate the dataset's utility for downstream tasks like metadata extraction, topic classification, and vote labeling, facilitating transparent access to municipal decisions.
Contributes CitiLink-Minutes, a unique multilayer annotated dataset of municipal meeting minutes, enabling NLP and IR research on local governance.
This paper investigates the ability of Large Language Models (LLMs) to adapt to language variations across different socioeconomic status (SES) communities by comparing LLM-generated text completions with original text from a novel Reddit and YouTube dataset stratified by SES. The study analyzes 94 sociolinguistic features to assess the degree of stylistic adaptation exhibited by four LLMs. Results indicate that LLMs show limited stylistic modulation with respect to SES, often producing approximations or caricatures, and demonstrate a bias towards emulating upper SES styles, highlighting the risk of amplifying linguistic hierarchies.
Reveals that LLMs exhibit limited stylistic adaptation across socioeconomic strata and tend to favor upper SES linguistic styles, raising concerns about perpetuating linguistic biases.
This paper investigates the impact of underspecified questions on QA performance, finding that a significant portion of questions in standard QA benchmarks are underspecified. They introduce an LLM-based classifier to identify these questions and demonstrate that LLMs perform worse on them. Through a controlled rewriting experiment, they show that rewriting underspecified questions into fully specified variants, while keeping the gold answers fixed, consistently improves QA performance.
Demonstrates that question underspecification is a significant confound in QA evaluation by showing that rewriting underspecified questions improves QA performance.
The paper identifies a limitation in watermark ensembles for LLMs where strong single-layer watermarks reduce token distribution entropy, hindering subsequent layers' effectiveness. The authors theoretically and empirically demonstrate that detectability is bounded by entropy and that watermark ensembles monotonically decrease entropy and the expected green-list ratio across layers. To address this, they propose a framework using weaker single-layer watermarks to preserve entropy, achieving improved detectability and robustness compared to strong watermark baselines.
Demonstrates that weaker single-layer watermarks in ensembles can outperform stronger ones by preserving token distribution entropy, leading to improved detectability and robustness.
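A minimal green-list watermark in the Kirchenbauer et al. style illustrates the entropy effect: the larger the logit boost delta (a "stronger" watermark), the lower the entropy of the next-token distribution, leaving less room for later ensemble layers. The vocabulary size and green-list fraction here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000
logits = rng.normal(size=V)
green = rng.random(V) < 0.5      # green list (seeded by context in practice)

def entropy(logits):
    p = np.exp(logits - logits.max()); p /= p.sum()
    return -(p * np.log(p + 1e-12)).sum()

for delta in [0.0, 1.0, 2.0, 4.0]:
    wm = logits + delta * green  # boost green-token logits by delta
    print(f"delta={delta}: entropy={entropy(wm):.3f}")
```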
The paper introduces "analytical search" as a new search paradigm tailored for complex analytical information needs, addressing the limitations of relevance-based ranking and retrieval-augmented generation (RAG) in tasks requiring trend analysis, causal inference, and verifiable conclusions. It proposes a system framework that integrates query understanding, recall-oriented retrieval, reasoning-aware fusion, and adaptive verification to support structured, multi-step inference. The authors argue that analytical search offers improved control over reasoning, evidence usage, and verifiability, leading to more accountable and utility-driven results compared to existing search paradigms.
Introduces and formalizes the concept of "analytical search" as a distinct search paradigm designed to address complex analytical information needs by emphasizing evidence-governed, process-oriented workflows.
The authors introduce ADRD-Bench, a new benchmark dataset for evaluating LLMs on Alzheimer's Disease and Related Dementias (ADRD), comprising a unified QA set from existing medical benchmarks and a novel QA set derived from the Aging Brain Care (ABC) program. They aim to address the lack of ADRD-specific evaluation resources and practical caregiving context in existing benchmarks. Evaluating 33 state-of-the-art LLMs, they found that while some models achieve high accuracy, inconsistencies in reasoning quality and stability remain a significant limitation.
Introduces ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs, incorporating both unified clinical knowledge and practical caregiving questions.
This paper introduces a spectrum framework for polycentric digital ecosystems, conceptualizing them as nested socio-technical systems across personal, organizational, inter-organizational, and global layers. It addresses the increasing need for resilient digital collaboration amidst geopolitical and technological fragmentation. The framework highlights how AI and automation, blockchain trust, federated data spaces, and immersive technologies can orchestrate digital integration in these ecosystems.
Introduces a multi-layered framework for polycentric digital ecosystems to facilitate collaboration in fragmented environments.
The paper introduces U-Former ODE (UFO), a novel architecture for probabilistic forecasting of irregular time series data that combines U-Nets, Transformers, and Neural CDEs. UFO enables parallelizable computation and global receptive fields, addressing the scalability limitations of existing Neural CDE approaches. Experiments on five benchmarks demonstrate that UFO outperforms ten state-of-the-art baselines in predictive accuracy and achieves up to 15x faster inference, particularly on long and multivariate sequences.
Introduces a fully causal, parallelizable architecture, U-Former ODE (UFO), that integrates U-Nets, Transformers, and Neural CDEs for efficient and accurate probabilistic forecasting of irregular time series.
This paper introduces a technical curriculum designed to enhance AI literacy within the language and translation (L&T) industry, covering vector embeddings, neural networks, tokenization, and transformer networks. The curriculum aims to cultivate computational thinking, algorithmic awareness, and agency among L&T professionals to improve their digital resilience. Evaluation in an MA course at TH Koeln suggests the curriculum's effectiveness, while also highlighting the need for additional lecturer support to maximize learning outcomes.
Proposes and evaluates a technical curriculum focused on language-oriented AI to improve AI literacy and digital resilience in the language and translation industry.
The paper introduces the Prototype Transformer (ProtoT), an autoregressive language model architecture that uses prototypes (parameter vectors) instead of self-attention to improve interpretability. ProtoT establishes two-way communication between the input sequence and the prototypes, causing the prototypes to capture nameable concepts during training and creating interpretable communication channels. Experiments demonstrate that ProtoT scales linearly with sequence length, performs well on text generation and downstream tasks (GLUE), and exhibits robustness to input perturbations while providing interpretable pathways for understanding robustness and sensitivity.
Introduces the Prototype Transformer, a novel autoregressive language model architecture designed for interpretability by using prototypes to capture nameable concepts and create interpretable communication channels.
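A generic sketch of such two-way communication, using plain cross-attention between tokens and a small prototype bank (causal masking omitted for brevity); ProtoT's actual layer design is not reproduced, but the O(n·m) cost pattern is the point.

```python
import numpy as np

def attend(queries, keys, values):
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ values

rng = np.random.default_rng(0)
n, m, d = 128, 8, 16                   # n tokens, m prototypes
tokens = rng.normal(size=(n, d))
protos = rng.normal(size=(m, d))       # learned parameters in a real model

protos = protos + attend(protos, tokens, tokens)  # read: tokens -> prototypes
tokens = tokens + attend(tokens, protos, protos)  # write: prototypes -> tokens
print(tokens.shape, protos.shape)      # cost is O(n*m), not O(n^2)
```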
This paper introduces Differentiable Modal Logic (DML) implemented via Modal Logical Neural Networks (MLNNs) to enable multi-agent systems to learn relationships like trust networks and causal chains from behavioral data. DML addresses the limitations of traditional modal logic, which requires manual specification of relationship structures. The authors demonstrate a neurosymbolic debugging framework across epistemic, temporal, deontic, and doxastic modalities, showing how logical contradictions can be formulated as learnable optimization objectives in scenarios ranging from diplomacy games to LLM hallucination detection.
Introduces Differentiable Modal Logic (DML) and Modal Logical Neural Networks (MLNNs) to learn interpretable relationship structures in multi-agent systems directly from data, replacing manual specification.
The paper analyzes the availability of AI resources across 6003 languages to assess systemic inequalities in language AI, finding that a small number of languages dominate, exacerbating disparities. It contrasts the diffusion of AI with earlier IT technologies, revealing a hype-driven pattern. Finally, the authors introduce the Language AI Readiness Index (EQUATE) to map technological, socio-economic, and infrastructural prerequisites for AI deployment across languages, aiming to guide prioritization efforts for more equitable diffusion.
Introduces the Language AI Readiness Index (EQUATE) to map the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages.
The paper introduces a rule-based computational model for Gaidhlig morphology, addressing the challenge of limited data availability for low-resource languages that hinders the application of neural models. The model leverages data from Wiktionary and uses SQL queries to identify lexical patterns, constructing a declarative rule-base for generating inflected word forms via Python utilities. This approach demonstrates that rule-based systems can effectively utilize limited data while providing interpretability and supporting the development of educational tools.
Presents a functional rule-based system for Gaidhlig morphology using Wiktionary data and SQL queries to generate inflected word forms.
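A hypothetical example of the kind of declarative rule such a system might encode, using the well-known lenition pattern (an "h" inserted after a lenitable initial consonant); this simplification ignores cluster exceptions and is not taken from the paper's rule-base.

```python
LENITABLE = set("bcdfgmpst")

def lenite(word: str) -> str:
    """Apply orthographic lenition, e.g. "beag" -> "bheag" (simplified)."""
    if word and word[0].lower() in LENITABLE and (len(word) < 2 or word[1] != "h"):
        return word[0] + "h" + word[1:]
    return word

print(lenite("beag"), lenite("mor"))   # bheag mhor
```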
This paper investigates the impact of model and data scaling on multilingual machine translation (MT) performance using open large language models (LLMs). The authors adapt Gemma3 models via continual pretraining and instruction finetuning, creating MiLMMT-46, a model covering 46 languages. Results demonstrate that MiLMMT-46 surpasses existing open-source SOTA models and rivals proprietary systems like Google Translate and Gemini 3 Pro in multilingual translation quality.
Demonstrates that scaling model size and training data via continual pretraining and instruction finetuning significantly improves the multilingual translation capabilities of open LLMs, achieving performance competitive with proprietary systems.
The paper introduces PatientHub, a unified framework to standardize the creation, composition, and deployment of simulated patients for training counselors and scaling therapeutic assessment using Large Language Models. PatientHub addresses the fragmentation in existing patient simulation approaches by providing standardized data formats, prompts, and evaluation metrics, thus improving reproducibility and enabling fair comparisons. The authors demonstrate PatientHub's utility through case studies, showcasing standardized cross-method evaluation, seamless integration of custom evaluation metrics, and the prototyping of new simulator variants.
Introduces PatientHub, a modular framework that unifies patient simulation by standardizing data formats, prompts, and evaluation metrics to facilitate reproducibility and fair comparison of different methods.
The paper introduces DEL, a framework for differentially private and communication-efficient split inference of large language models (LLMs). DEL uses an embedding projection module and differentially private stochastic quantization to reduce communication overhead while preserving privacy. It then employs soft prompts on the server side to mitigate utility degradation caused by the privacy mechanisms, eliminating the need for local models.
Introduces a novel framework, DEL, that leverages soft prompts to improve the privacy-utility trade-off in LLM split inference, achieving differential privacy and communication efficiency.
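One common recipe for private, low-bandwidth transmission of embeddings (clip, add Gaussian noise, then stochastically round to a coarse grid), sketched below for illustration; DEL's actual projection module, quantizer, and privacy accounting are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize(x, clip=1.0, sigma=0.5, levels=16):
    x = np.clip(x, -clip, clip)                     # bound sensitivity
    x = x + rng.normal(scale=sigma, size=x.shape)   # Gaussian-mechanism noise
    grid = np.linspace(-clip - 3 * sigma, clip + 3 * sigma, levels)
    lo = np.clip(np.searchsorted(grid, x) - 1, 0, levels - 2)
    frac = (x - grid[lo]) / (grid[lo + 1] - grid[lo])
    up = rng.random(x.shape) < frac                 # unbiased stochastic rounding
    return grid[lo + up]                            # ~log2(levels) bits per value

emb = rng.normal(size=(4, 8))
print(privatize(emb))
```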
The paper introduces EmoSpace, a framework for emotion-aware content generation that learns dynamic emotion prototypes via vision-language alignment to enable fine-grained emotional control in VR content creation. EmoSpace uses a hierarchical emotion representation with learnable prototypes that evolve during training, allowing for control without explicit emotion labels. Experiments demonstrate EmoSpace's superior performance in emotional image outpainting, stylized generation, and emotional panorama generation, further validated by a user study comparing emotional perception in VR versus desktop environments.
Introduces a novel emotion-aware content generation framework, EmoSpace, that learns dynamic, interpretable emotion prototypes through vision-language alignment.
The paper introduces Meta-Sel, a supervised meta-learning approach for efficient demonstration selection in in-context learning, which addresses the challenge of selecting optimal few-shot examples under a limited prompt budget. Meta-Sel learns a scoring function based on TF-IDF cosine similarity and length-compatibility ratio between candidate demonstrations and queries, trained on a meta-dataset constructed from training data using class agreement as supervision. Empirical evaluation across four intent datasets and five LLMs demonstrates that Meta-Sel achieves competitive accuracy and selection-time overhead compared to 12 other demonstration selection methods, especially benefiting smaller models.
Introduces Meta-Sel, a lightweight supervised meta-learning approach that learns a fast, interpretable scoring function for selecting demonstrations for in-context learning.
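A sketch of the two features named above, TF-IDF cosine similarity and a length-compatibility ratio, combined by a weighted sum; the weights and candidate pool are placeholders, and Meta-Sel's trained scorer is not reproduced.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_demos(query, demos, w_sim=0.8, w_len=0.2, k=4):
    vec = TfidfVectorizer().fit(demos + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(demos))[0]
    q_len = len(query.split())
    len_ratio = np.array([min(q_len, len(d.split())) / max(q_len, len(d.split()))
                          for d in demos])
    scores = w_sim * sims + w_len * len_ratio       # weighted combination
    return [demos[i] for i in np.argsort(scores)[::-1][:k]]

demos = ["book a flight to paris", "cancel my hotel reservation",
         "what is the weather tomorrow", "reserve a table for two"]
print(score_demos("book me a flight to rome", demos, k=2))
```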
This paper investigates the overlap between code review comments generated by human reviewers and those produced by ChatGPT-4, focusing on the types of quality improvements recommended. The authors manually classified 739 human-generated comments from 240 pull requests and compared them to ChatGPT-4's recommendations on the same PRs. Results indicate that ChatGPT-4 suggests more changes overall but identifies only 10% of the issues flagged by humans; conversely, 40% of ChatGPT-4's additional suggestions are valuable, highlighting the complementary nature of the two approaches.
Quantifies the overlap and differences in quality improvement recommendations between human code reviewers and ChatGPT-4, revealing the strengths and weaknesses of each approach.
This paper addresses the challenge of achieving fairness in classification without relying on demographic information by proposing a novel minimax-fair method called SPECTRE. SPECTRE adjusts the spectrum of a Fourier feature mapping and constrains the deviation of the worst-case distribution from the empirical distribution, mitigating the over-pessimism of existing robust optimization techniques. Empirical results on American Community Survey datasets across 20 states demonstrate that SPECTRE achieves superior fairness guarantees and robustness compared to state-of-the-art methods, even those with access to demographic data.
Introduces SPECTRE, a minimax-fair classification method that enhances fairness without demographic information by adjusting the spectrum of a Fourier feature mapping and constraining the worst-case distribution's deviation from the empirical distribution.
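For background, a minimal random Fourier feature mapping: the distribution the frequency matrix W is drawn from is the "spectrum" SPECTRE adjusts. The adjustment and the minimax procedure themselves are not shown; the Gaussian spectrum below corresponds to a plain RBF kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(X, n_features=256, lengthscale=1.0):
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))  # the spectrum
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(5, 3))
Z = fourier_features(X)
print((Z @ Z.T).round(2))   # approximates an RBF kernel Gram matrix
```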
This paper investigates the application of large language models (LLMs) to automatic text simplification (ATS) of Common Vulnerability and Exposure (CVE) descriptions, a task previously unexplored in the cybersecurity domain. The authors created a baseline for cybersecurity ATS and a test dataset of 40 CVE descriptions, which were evaluated by cybersecurity experts. Results indicate that while LLMs can superficially simplify text, they often fail to preserve the original meaning.
Establishes a baseline and dataset for automatic text simplification of cybersecurity vulnerability descriptions using large language models.
This paper introduces LLM-DRS, a novel Large Language Model (LLM)-based framework for disaster reconnaissance summarization in structural health monitoring. The framework integrates vision data and metadata from on-site investigations, using deep convolutional neural networks to extract key attributes like damage state and material type. The extracted data, along with carefully designed prompts, are then fed into an LLM to generate summary reports for individual structures or affected regions.
Introduces a novel LLM-based framework, LLM-DRS, that automates the generation of structural reconnaissance reports by integrating vision data, metadata, and deep learning-extracted attributes.
The authors introduce ExtractBench, a new benchmark and evaluation framework for end-to-end PDF-to-JSON structured extraction, designed to address the lack of comprehensive benchmarks and principled evaluation methodologies for complex, nested extraction tasks. ExtractBench comprises 35 PDF documents paired with JSON Schemas and human-annotated gold labels across diverse domains, resulting in 12,867 evaluatable fields with varying schema complexities. Evaluations using ExtractBench reveal that state-of-the-art LLMs struggle with realistic schemas, particularly as schema breadth increases, with some models achieving 0% valid output on a 369-field schema.
Introduces ExtractBench, a novel benchmark and evaluation framework, to address the limitations of existing methods in evaluating complex structured extraction from PDFs using LLMs.
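A sketch of field-level scoring for nested JSON extraction of the kind such a benchmark needs; ExtractBench's exact matching and aggregation rules are not reproduced here.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"a.b.0.c": value} field paths."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    out = {}
    for k, v in items:
        out.update(flatten(v, f"{prefix}.{k}" if prefix else str(k)))
    return out

def field_accuracy(pred, gold):
    g, p = flatten(gold), flatten(pred)
    return sum(p.get(k) == v for k, v in g.items()) / len(g)

gold = {"invoice": {"id": "A-17", "lines": [{"qty": 2, "sku": "X"}]}}
pred = {"invoice": {"id": "A-17", "lines": [{"qty": 3, "sku": "X"}]}}
print(field_accuracy(pred, gold))  # 2 of 3 gold fields match -> 0.666...
```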

