100 papers published across 6 labs.
Building AI tutors in the real world is hard: outdated tech, spotty internet, and curriculum gaps can derail even the best-designed systems.
Quantum education gets a boost: specialized LLM agents in a classroom setting not only improve tutoring reliability but also reveal hidden curriculum gaps.
AI harms disproportionately impact specific intersections of identity, with adolescent girls, lower-class people of color, and upper-class political elites experiencing up to 3x greater harm, revealing critical blind spots in current AI risk assessments.
Evaluating LLM-powered software engineering tools is fundamentally broken, as traditional metrics fail to capture the nuanced, non-deterministic nature of their outputs.
Finally, a practical implementation for globally-optimal repair-based semantics allows for querying inconsistent prioritized data with theoretical guarantees.
Unlock more diverse and effective LLM outputs by explicitly rewarding semantic novelty during decoding with Exploratory Sampling.
LLMs are revolutionizing conversational AI research, and this survey offers a structured guide to navigating the rapidly evolving landscape of LLM-powered user simulation.
LLMs can parrot CAN bus data, but CAN-QA reveals they fail at the temporal reasoning and multi-condition inference needed for real-world vehicle security forensics.
See where your citations are coming from with a single command, thanks to CiteRadar's open-source platform that automatically generates interactive maps and detailed researcher profiles from your Google Scholar ID.
Machine translation alone ruins agent benchmark validity across languages, but careful functional and cultural alignment can close the performance gap by up to 30%.
LLMs fail to reliably track source trustworthiness in Turkish evidential marking, unlike humans, highlighting a critical gap in their ability to perform nuanced reasoning based on source reliability.
Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.
Multi-anchor word embeddings, previously impractical for LLMs, can now outperform standard embeddings with 98% fewer parameters and a 40x smaller embedding layer.
Denoising fMRI data with independent component analysis reveals interpretable, subject-invariant cognitive networks that correlate with large language model representations of stories.
LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.
A BiLSTM with a custom slang dictionary rivals AutoML in classifying the sentiment and emotion of messy, real-world Indonesian e-commerce reviews.
LLMs can evaluate clinical AI as well as human experts at 1/1000th the cost, unlocking scalable and continuous monitoring.
Your sign language translation model's performance could be bottlenecked by your choice of pose estimator: switching from MediaPipe to SDPose or Sapiens could boost BLEU score by 1.5 points.
LLMs that nail individual personas can still fail spectacularly at generating diverse populations, instead defaulting to coarse stereotypes.
Forget fixed steering strengths: CLAS dynamically adapts steering based on context, unlocking more consistent and powerful control over LLM behavior.
On-device SLMs in mobile apps demand a radical shift: the less the model does, the more reliable it becomes.
LLMs re-rank documents better when you learn to route each query to the specific attention heads that matter, instead of relying on static subsets or everything at once.
LLMs can now generate driving rules from traffic laws with significantly improved accuracy by grounding their reasoning in structured traffic scenarios.
Students spend only 40% of math classwork time on actual math practice, suggesting a massive, untapped opportunity for improved learning outcomes.
LLMs can learn to generate better compromises by iteratively incorporating feedback on how empathically similar a compromise is to each viewpoint, opening the door to more socially intelligent AI.
Forget painstakingly curating datasets – STELLAR-E auto-generates high-quality, domain-specific LLM benchmarks, rivaling real-world data in evaluation quality.
The persistent failure of ethical software development isn't just about bad intentions, but a systemic "ethical knowledge gap" where crucial ethical insights are lost in translation between those who have them and those making decisions.
Prediction markets don't just predict the future, they shape it, and the most visible market isn't always the most accurate.
Early childhood educators' online discourse reveals a stark imbalance: discussions of workplace demands outweigh resources by nearly 2:1, painting a picture of a profession grappling with systemic strain.
Split learning offers a surprisingly viable path to fine-tuning LLMs on sensitive data without breaking the bank or sacrificing privacy.
Even with cross-campaign aggregation of telemetry data, distinguishing sophisticated cyber adversaries remains fundamentally limited by shared operational practices, revealing a structural ceiling on attribution accuracy.
LLMs, when orchestrated as collaborative agents, can dramatically improve vulnerability-inducing commit identification, outperforming existing SZZ algorithms by a large margin.
Backdoor attacks in LLMs can be defused at inference time, without retraining or external data, by geometrically smoothing attention patterns to disrupt adversarial routing.
LLM stability under uncertainty isn't just about accuracy – a new information-geometric framework reveals how internal model structure non-linearly attenuates the impact of disorder.
Transformer-based vulnerability detection is booming, but this review reveals critical gaps in data balance, interpretability, and cross-language generalization that could be holding back truly robust systems.
Turns out, a tiny fine-tuned model can spot flaws in coding instructions that trip up even the biggest LLMs, suggesting we're over-relying on brute force for code generation.
LLMs can achieve near-perfect structural fidelity when generating multi-file DSL code at repository scale, but only with fine-tuning.
LLMs can both spark and stifle creativity in collaborative software design, so designers must wield them intentionally.
Students are already using GenAI extensively in real-world software projects, but without guardrails, learning, collaboration, and software quality may suffer.
OSS developers who saw automatically generated user personas responded to issues with more empathy and tailored explanations, suggesting a simple UI intervention can bridge the user-developer gap.
Developers aren't surgically extracting information from migration guides; they're largely linking to the whole document, suggesting opportunities for improved guide structure and searchability.
Open-source library vulnerabilities are easier to spot when you connect the dots between bug reports, code changes, and commit messages.
Text-guided 3D medical image segmentation just got a whole lot more practical: ESICA achieves state-of-the-art accuracy with a "Lite" variant that slashes parameter count without sacrificing performance.
Interactive feedback slashes error rates in episodic memory retrieval, outperforming even large vision-language models while remaining efficient.
Test-time adaptation of vision-language models can actually *hurt* performance when modalities shift asymmetrically; MG-MTTA fixes this by explicitly modeling modality reliability.
Robots can strengthen family bonds, but only if designers carefully consider the robot's initiative and communication timing, as families experience tensions around privacy and control.
A social robot can successfully integrate into family life to support family-school partnerships, but parental facilitation styles significantly impact its effectiveness.
Segmenting music into meaningful chunks and predicting chords sequence-to-sequence boosts recognition accuracy, especially for those pesky, rare non-triad chords that plague existing systems.
ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.
Classifying temporal relations is easier when you break it down: predicting relationships between endpoints first unlocks state-of-the-art performance on a challenging benchmark.
Semantic grounding, not token probability, is the key to better multimodal RAG.
User-driven privacy ratings of mobile apps reveal significant discrepancies with expert assessments, suggesting a need for more inclusive and user-centric privacy evaluation mechanisms.
Stop relying on LLMs to "hallucinate" reasoning paths – SEARCH-R uses a fine-tuned Llama3.1-8B model and dependency tree-based retrieval to navigate multi-hop question answering more reliably.
Speculative design can effectively catalyze critical reflection and generate actionable insights for fostering designer inclusion within the often developer-centric world of Open Source Software.
Achieve surgical 3D edits without training: Prox-E lets you reshape objects with language by manipulating a compact set of geometric primitives.
By reconstructing extractions and comparing them to the original document, RaV-IDP offers a grounded, label-free quality signal that dramatically improves the fidelity of intelligent document processing pipelines.
LLMs' gender biases aren't fixed; they warp and intensify based on the *personality* you give them, especially when those personalities lean toward the "Dark Triad."
Neurosymbolic grounding of LLMs in telemetry and knowledge graphs slashes expert-rated overclaims in industrial maintenance explanations by 93%, making AI assistants far more trustworthy in safety-critical settings.
LLMs can't handle the truth: SLIDERS beats GPT-4.1 on long-context QA by sidestepping the context window entirely.
Highlighting pivotal evidence can boost LLM performance without altering the original context, leading to substantial improvements in reasoning tasks.
Finally, a TTS system that lets you control the *exact* timing and pauses of individual words, opening the door to applications like perfectly paced guided reading and accessible code narration.
LLMs, when combined with efficient indexing, can extract actionable incidents from just a handful of noisy user descriptions in real-time, enabling rapid anomaly detection in large-scale cloud services.
Transforming human motion into structured language allows LLMs to achieve unprecedented accuracy in motion understanding without the constraints of traditional encoding methods.
LLMs can be made 20% more accurate by jointly attributing claims to sources and verifying them, rather than just verifying.
Signal processing offers a surprisingly effective lens for understanding and improving LoRA, the reigning champ of parameter-efficient fine-tuning.
Forget polling every user on every idea – this algorithm learns to find common ground by strategically asking for feedback on a few key statements.
Persistent homology, when applied to eye-tracking data via novel filtration techniques, unlocks dyslexia detection performance exceeding traditional statistical methods.
N-gram models can rival neural networks in event log prediction, but the secret sauce is a smart ensemble method that dynamically promotes the best model during inference.
IoT intrusion detection gets a boost: A-THENA's time-aware encoding and network-specific augmentation beats state-of-the-art methods by up to 6.88% in accuracy, all while running on a Raspberry Pi Zero 2 W.
Conformal prediction regions can be drastically shrunk, especially in high-dimensional settings, by using a novel kernel score that adapts to the geometry of the residual distribution.
Achieve LLM personalization with the guarantee that deleting a small user-specific proxy deterministically erases all traces of their data, sidestepping the need for computationally expensive retraining.
LLMs generate better features when you make them think harder: CoFEE enforces cognitive behaviors like backward chaining and subgoal decomposition, boosting feature quality by 15% while slashing costs.
Forget memorizing table headers: TaNOS unlocks surprisingly robust numerical reasoning by pre-training on operation sketches and correctness-guaranteed programs.
Fixing miscalibrated black-box predictions with a simple post-hoc calibration step can significantly boost the accuracy and efficiency of semisupervised mean estimation.
Forget about fine-tuning: this new prompting method lets you selectively erase knowledge from LLMs on demand, even without access to model weights.
Ignoring why clinical data is missing can lead to suboptimal treatment policies; this work shows how explicitly modeling informative missingness in multimodal time series data significantly improves both offline treatment policy learning and outcome prediction.
Even GPT-5 only achieves 63% accuracy on time series anomaly questions from real software incidents, but a model-expert combination reaches 87%, highlighting the potential for hybrid intelligence in incident response.
LLMs are surprisingly susceptible to multi-turn attacks that evade content filters by distributing malicious intent across multiple, seemingly benign turns.
LLMs can extract events more effectively when combined with graph-based document representations that overcome their "lost-in-the-middle" limitations.
Mimicking how clinicians review capsule endoscopy videos—first screening, then weaving context, and finally converging evidence—yields surprisingly effective summarization of these ultra-long videos.
Forget flat numerical compression – GS-Quant unlocks better knowledge graph completion by generating discrete codes that mirror the hierarchical nature of human reasoning.
Ditch the fixed interface: DiffMAS unlocks surprisingly large gains in multi-agent reasoning by jointly optimizing latent communication, outperforming text-based and prior latent methods by a wide margin.
LLM leaderboard rankings are more a reflection of benchmark designer priorities than actual user needs, but a new interactive visualization tool lets you reshape those rankings based on your specific prompt types and goals.
Students' willingness to disclose AI use in academic work hinges on a delicate balance: psychological safety encourages transparency, while evaluation apprehension drives strategic concealment.
LLMs can be backdoored with nearly imperceptible style changes, turning them into sleeper agents that reliably deliver attacker-specified payloads even after deployment and against common defenses.
Modeling annotator-specific explanations substantially boosts NLI prediction accuracy and provides a richer understanding of disagreement compared to simply conditioning on annotator identity.
Get LLM-boosted recommendations without the LLM latency: this distillation method lets you bake rich user profiles into efficient sequential recommenders.
AI governance risks becoming performative box-ticking unless practitioners understand how compliance directly improves system quality and user protection.
A surprisingly simple, linear-time algorithm, MinCov, nearly matches the performance of much slower metaheuristics in identifying critical nodes in bipartite dependency networks.
Chatbots can subtly and persistently reshape our moral compass, even when we don't realize it's happening.
Existing translation quality estimation models exhibit systematic gender bias, but FairQE shows you can fix this without hurting overall accuracy.
GPT-4.1-mini wins on accuracy for meeting summarization, but GPT-5.1 crushes it on completeness and coverage, revealing that the best model depends on the specific metric you care about.
Forget party lines: in Brazilian politics, regional and gender identities often dictate discursive alignment more strongly than party affiliation.
LLMs' factual knowledge is surprisingly brittle: simply changing an entity's surface form in a question (e.g., using an abbreviation instead of the full name) can drastically alter the answer.
LLMs may fail in real-world moral decisions because they rigidly adhere to fairness norms, even when their own internal models predict humans would prioritize loyalty.
Pinpointing exactly *when* misinformation occurs in videos is now possible, thanks to two new datasets and a strong baseline for misinformation span detection.
Sentence embeddings can be objectively evaluated for conceptual stability without relying on downstream classifiers, revealing their true capacity to capture meaning.
Surprisingly, how speech degrades due to diseases like Parkinson's and ALS follows consistent patterns across languages, offering a universal fingerprint for these conditions.
Deploying language models in the Global South requires bridging the gap between multilingual NLP and edge computing, two fields that have largely evolved independently despite their shared goals.
Mid-sized LLMs can actually be *more* fair in news summarization than their larger counterparts, challenging the common wisdom of "bigger is better."