Open-weight model releases, reproducibility, model licensing, and community-driven AI development.
Synthetic data augmentation and per-language threshold tuning can significantly boost the performance of LLMs on multilingual tasks, outperforming alternative architectures that showed promise on the development set.
Graph models can now generalize to entirely new datasets with different input features, thanks to a simple projection into a shared random space.
Teachers can now scalably provide high-quality, personalized feedback to students by leveraging a multi-LLM system that synthesizes rubric data and qualitative observations, while retaining control through a teacher-in-the-loop workflow.
A judge-orchestrated ensemble of diverse LLMs trounces single models in multi-turn response generation, proving that strategic model selection beats brute-force scaling.
Open-source image editing models can match or beat fine-tuned models on visual understanding tasks *without any task-specific training*.
Dissimilarity, not just similarity, unlocks better language generalization for low-resource varieties.
Unlock Tajik NLP: a new open-source toolkit delivers a comprehensive pipeline for processing Cyrillic-script Tajik text, complete with datasets and pre-trained embeddings.
Forget retraining: NeWTral instantly restores safety to your LLM after adding a risky LoRA, slashing attack success rates from 70% to 13% without sacrificing expertise.
Exponent bits are the Achilles' heel of floating-point arithmetic, as corrupting them in RISC-V vector processors leads to the most severe silent data corruption.
Forget complex assembly: this 3D printing technique lets you pop out functional, self-folding robots with integrated sensors and actuators directly from a flat sheet.
Forget full fine-tuning: LoRA lets you adapt Geospatial Foundation Models for wildfire mapping with comparable accuracy while only tweaking 1% of the parameters.
Forget resource-intensive pipelines: a purely academic team achieves SOTA search agent performance with just 10.6k SFT data points, outperforming models trained with CPT+SFT+RL.
Forget massive models: small, locally deployable language models can achieve surprisingly strong performance on privacy-sensitive clinical information extraction tasks with self-prompting and preference-based optimization.
Despite impressive multilingual capabilities, today's LLMs still can't reliably translate between English and Ghanaian languages at scale.
LLMs exhibit a surprising "False Illegitimation bias," systematically misclassifying legitimate battles as violence against civilians, highlighting a critical flaw for conflict monitoring applications.
LLM benchmarks are increasingly measuring the capabilities of yesterday's models, not today's frontier, creating a widening gap that misrepresents the state of AI.
Finally, a zero-knowledge data valuation system that scales: ZK-Value proves Shapley values in seconds to minutes, beating specialized ZK baselines by over an order of magnitude.
Current package managers are surprisingly vulnerable: a single misconfiguration can silently allow attackers to inject malicious dependencies, a problem solved by this paper's cryptographically enforced provenance system.
An open-source alternative to expensive, proprietary digital human modeling software could democratize ergonomic analysis and workplace design.
Public antiviral drug discovery datasets are riddled with errors that can be fixed with careful polyprotein splitting, unlocking significant performance gains in binding affinity prediction.
Open-sourcing a 0.1B-scale speech-native omni model lets you directly inspect the complete interaction loop and reveals critical design choices for building effective small multimodal models.
Sustainable scientific software isn't just about the code; it's about consistent testing and clear links between code quality and tests, a pattern often missing in unsustainable projects.
Open-sourcing a VLA model that beats closed-source giants on embodied reasoning tasks could finally make real-world robot deployment practical.
Autonomous agents can produce plausible-sounding research that's subtly wrong, so ARIS uses adversarial collaboration between different LLMs to catch these errors.
Synthetic data closes the Indic ASR gap where commercial and open-source systems fail, boosting entity recognition by up to 22x.
Meta's risk assessment of its Code World Model (CWM) gives it a clean bill of health, concluding it poses no *new* catastrophic risks beyond those already present in the AI landscape.
Unlock advanced robotic manipulation with FlexiTac, a tactile sensing solution so cheap and easy to integrate, you'll wonder why you were using anything else.
Hyperbolic embeddings are powerful, but a fragmented ecosystem makes them hard to use; this framework finally puts them all in one place.
Forget computationally verifying stability: VibroML automatically *fixes* dynamically unstable crystal structures, opening the door to exploring previously inaccessible materials.
Even with emotion-aware prompting, today's best small language models still struggle to preserve subtle emotional nuances when translating between languages.
Forget turn-based interactions: MiniCPM-o 4.5 lets you build AI that sees, hears, speaks, and *reacts* in real time, all on a device with only 12GB of RAM.
Newcomers beware: the odds of your "good first issue" pull request getting merged have plummeted nearly 20% in the last year.
Forget Shakespeare: LLMs can now sling verses in Arabic dialects, thanks to a new dataset for instruction-guided poetry generation.
LLMs exhibit surprisingly human-like biases and overconfidence in math, revealed by a new dataset mapping their mathematical reasoning across diverse personas.
Thai voice cloning just leapfrogged human performance on short-duration speech, thanks to a new model that directly handles code-switching and numerals.
LLM-powered query reformulation, a hot topic in IR, often fails to translate gains from lexical to neural retrieval, and bigger models don't always help.
LLM upgrades are a chaotic mix of progress and decay: despite overall gains, up to 47% of questions get *worse* after an update, and single-shot evals miss almost half of these critical regressions.
LLMs trained on raw code text learn surface-level cues that trigger false positives when detecting vulnerabilities in other languages, but simply feeding them ASTs at inference time can dramatically reduce these errors.
You can steal secrets from locally fine-tuned LLMs by backdooring their model code, even bypassing common defenses like differential privacy and code audits.
"Utility" code, intended to be broadly useful and reusable, is actually 2.75x more likely to be involved in a vulnerability than other code.
Defining "hero developers" in open-source projects is more nuanced than previously thought: technical prowess doesn't guarantee social engagement, and vice versa, impacting bug-fixing success in surprising ways.
Reproducibility issues plague over 20% of Defects4J, a widely used benchmark for automated program repair, casting doubt on the validity of many APR evaluations.
Replaying CI failures in embedded systems is now possible at scale: PhantomRun reconstructs over 90% of failing builds, opening the door to systematic debugging and failure analysis.
You can slash false positives in PyPI malware detection by 82% while simultaneously reducing feature dimensionality by 50% using a carefully tuned deep learning approach.
AI agents and humans exhibit over 10 distinct repair behaviors when performing urgent hot fixes, suggesting opportunities for targeted human-automation collaboration.
NVIDIA's closed-source driver secrets are out: researchers can now see the exact hardware commands triggered by CUDA code.
Complex, multi-step instructions can cause LLMs to completely ignore question content and instead rely on positional shortcuts when asked to underperform, revealing a critical vulnerability in adversarial evaluation.
Forget giant LLMs: fine-tuned small language models can actually *beat* GPT-4o on critical clinical tasks like emergency triage.
Non-linear scoring with Hypencoders boosts retrieval performance, but don't expect it to fix your speed or adversarial robustness problems.
Sanctions and censorship breed a shadow economy: Iranian third-party iOS app stores are rife with cracked apps, unauthorized monetization, and privacy-invading trackers.