Open-Source Models & Weights
Open-weight model releases, reproducibility, model licensing, and community-driven AI development.
Recent Papers
This paper presents an empirical study of AI coding agent contributions in open-source Android and iOS mobile app development by analyzing 2,901 AI-authored pull requests (PRs) from 193 GitHub repositories. The study reveals that Android projects receive more AI-authored PRs and exhibit higher acceptance rates compared to iOS, with routine tasks showing higher acceptance rates than structural changes. The analysis also indicates an initial improvement followed by a decline in PR resolution time on Android, providing insights into the evolving impact of AI agents on OSS mobile projects.
Empirically characterizes the effects of AI coding agents on open-source Android and iOS mobile app projects by analyzing PR acceptance behaviors across platforms, agents, and task categories.
The paper introduces DHPLT, a large-scale multilingual diachronic corpus comprising web-crawled data from 41 languages across three time periods (2011-2015, 2020-2021, 2024-present). The authors leverage web crawl timestamps as a proxy for document creation time, providing 1 million documents per time period per language. They also provide pre-computed word embeddings and lexical substitutions to facilitate semantic change modeling research, addressing the scarcity of such resources for many languages.
Introduces DHPLT, a novel multilingual diachronic corpus with pre-computed embeddings and lexical substitutions, designed to facilitate research in semantic change modeling across 41 languages.
This paper extends crosscoder model diffing to cross-architecture comparisons, enabling the unsupervised discovery of behavioral differences between LLMs with different architectures. They introduce Dedicated Feature Crosscoders (DFCs), an architectural modification to improve the isolation of unique features in one model compared to another. Applying this technique, they identify features such as CCP alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B.
Introduces Dedicated Feature Crosscoders (DFCs), an architectural modification to enhance crosscoder model diffing for isolating features unique to individual models in cross-architecture comparisons.
The authors extend the Puzzle post-training neural architecture search framework to optimize the gpt-oss-120B model, creating gpt-oss-puzzle-88B, by combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning. This optimized model achieves significant per-token throughput speedups (up to 2.82X on a single H100 GPU) while maintaining or slightly exceeding the parent model's accuracy across various benchmarks. The paper advocates for request-level efficiency metrics to account for varying token counts and demonstrates that gpt-oss-puzzle-88B improves request-level efficiency by up to 1.29X.
Introduces a pipeline combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning within the Puzzle framework to optimize large language models for inference.
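The request-level efficiency metric the paper advocates can be sketched with simple arithmetic: per-token throughput alone overstates the gain if the optimized model emits more tokens per request. A minimal illustration (all numbers below are hypothetical, not the paper's measurements):

```python
# Illustrative sketch: request-level efficiency weights per-token throughput
# by how many tokens each model emits for the same request.

def request_level_speedup(base_tokens, base_tok_per_s, opt_tokens, opt_tok_per_s):
    """Ratio of end-to-end request latencies: base_time / optimized_time."""
    base_time = base_tokens / base_tok_per_s
    opt_time = opt_tokens / opt_tok_per_s
    return base_time / opt_time

# A 2x per-token speedup shrinks to 1.5x at the request level when the
# optimized model emits 4/3 as many tokens for the same request.
print(request_level_speedup(300, 50.0, 400, 100.0))  # -> 1.5
```

This is why a model with 2.82X per-token throughput can report a smaller 1.29X request-level gain.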
This paper investigates the impact of model and data scaling on multilingual machine translation (MT) performance using open large language models (LLMs). The authors adapt Gemma3 models via continual pretraining and instruction finetuning, creating MiLMMT-46, a model covering 46 languages. Results demonstrate that MiLMMT-46 surpasses existing open-source SOTA models and rivals proprietary systems like Google Translate and Gemini 3 Pro in multilingual translation quality.
Demonstrates that scaling model size and training data via continual pretraining and instruction finetuning significantly improves the multilingual translation capabilities of open LLMs, achieving performance competitive with proprietary systems.
The paper introduces PatientHub, a unified framework to standardize the creation, composition, and deployment of simulated patients for training counselors and scaling therapeutic assessment using Large Language Models. PatientHub addresses the fragmentation in existing patient simulation approaches by providing standardized data formats, prompts, and evaluation metrics, thus improving reproducibility and enabling fair comparisons. The authors demonstrate PatientHub's utility through case studies, showcasing standardized cross-method evaluation, seamless integration of custom evaluation metrics, and the prototyping of new simulator variants.
Introduces PatientHub, a modular framework that unifies patient simulation by standardizing data formats, prompts, and evaluation metrics to facilitate reproducibility and fair comparison of different methods.
This paper investigates the effectiveness of using small language models (SLMs) as judges to improve code generation, particularly in scenarios where large language models (LLMs) may underperform. The authors train and evaluate several state-of-the-art SLMs to discriminate between correct and incorrect code implementations, focusing on classification accuracy. Results demonstrate that modern SLMs, even without execution-based information, outperform previous approaches and achieve comparable performance to much larger LLMs when used as code rankers, offering a cost-effective alternative for code generation.
Demonstrates that modern small language models can effectively serve as code correctness judges and rankers, achieving performance competitive with much larger language models at a significantly reduced cost.
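The judge-as-ranker setup described above reduces to best-of-n selection: score each candidate implementation with the judge and keep the top-scoring one. A minimal sketch with a stub judge (the scoring rule here is invented for illustration, not the paper's SLM classifier):

```python
# Minimal sketch of judge-based code reranking: a small model scores each
# candidate's probability of being correct; the top-scoring candidate wins.

def rank_candidates(candidates, judge):
    """Return candidates sorted by the judge's correctness score, best first."""
    scored = [(judge(code), code) for code in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [code for _, code in scored]

# Stub judge standing in for an SLM classifier; here it simply prefers
# candidates that contain a return statement.
toy_judge = lambda code: 1.0 if "return" in code else 0.0

best = rank_candidates(["x = 1", "def f(n): return n + 1"], toy_judge)[0]
print(best)  # prints: def f(n): return n + 1
```

In the paper's setting the judge is a trained SLM rather than a heuristic, but the selection loop is the same.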
The paper introduces VIRENA, a virtual platform designed for controlled experimentation within realistic social media environments, addressing limitations in data access and ethical constraints in studying online dynamics. VIRENA allows researchers to simulate feed-based platforms and messaging apps, enabling interactions between human participants and LLM-powered AI agents with configurable personas. The platform's no-code interface facilitates manipulation of content moderation, scheduling of stimuli, and execution of experiments, making it accessible for studying human-AI interaction, moderation interventions, and group deliberation.
Introduces VIRENA, a novel virtual platform enabling controlled social media experiments with human and AI participants, featuring a no-code interface and realistic platform simulations.
This paper investigates the influence of team dynamics on OSS project selection by surveying 198 OSS practitioners. The study reveals that communication-related team dynamics like responsiveness and clarity are consistently prioritized, but the relative importance varies based on contributor motivations such as gaining reputation or networking. The findings demonstrate that aligning team dynamics with contributor motivations is crucial for understanding project selection behavior and designing better project recommendation systems.
Empirically demonstrates that team dynamics, particularly communication-related aspects, significantly influence OSS project selection, with the relative importance of specific dynamics varying based on contributor motivations.
The paper introduces a RAG pipeline and a two-layer prompting strategy to extract actionable recommendations (ReACTs) for improving OSS sustainability from software engineering literature. The authors systematically explore open LLMs and prompting techniques to derive candidate ReACTs from ICSE and FSE papers, followed by a filtering and refinement stage to ensure quality and extract supporting evidence. The pipeline generates 1,922 ReACTs, of which 1,312 meet strict quality criteria, providing a structured and scalable approach to translating research findings into practical guidance for OSS projects.
Introduces a novel RAG pipeline leveraging LLMs to extract and structure evidence-based, actionable recommendations (ReACTs) from software engineering literature for improving OSS project sustainability.
This paper introduces zk-compilation, a novel approach to verifiable software provenance by executing a compiler within a zero-knowledge virtual machine (zkVM). This method generates both the compiled output and a cryptographic proof that the compilation was performed on the claimed source code with the specified compiler. The authors demonstrate the feasibility of zk-compilation using the RISC Zero zkVM and the ChibiCC C compiler, evaluating it on synthetic programs, OpenSSL, and libsodium source files, showing strong security guarantees against various attacks.
Introduces and demonstrates zk-compilation, a novel method for verifiable software provenance using zero-knowledge virtual machines.
The paper introduces GeoFormer, a Swin Transformer-based framework for jointly estimating building height (BH) and footprint (BF) from Sentinel-1/2 imagery and open DEM data. By using a geo-blocked splitting strategy for training and evaluation across 54 diverse cities, the authors address the challenge of cross-city generalization in urban data estimation. GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, demonstrating significant improvements over CNN baselines and strong cross-continent transferability.
Introduces GeoFormer, a novel Swin Transformer-based architecture, for joint building height and footprint estimation from multi-source satellite imagery, achieving state-of-the-art accuracy and generalization across diverse urban environments.
The paper introduces StealthRL, a reinforcement learning framework that generates adversarial paraphrases to evade AI-text detectors. StealthRL trains a paraphrase policy using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen-3B, optimizing for both detector evasion and semantic similarity. Experiments across six attack settings and three detector families demonstrate StealthRL's ability to achieve near-zero detection rates (0.001 TPR@1%FPR) and high attack success rates (99.9%), even transferring to unseen detector families.
Demonstrates a reinforcement learning approach, StealthRL, for generating adversarial paraphrases that effectively evade multiple AI-text detectors, revealing shared vulnerabilities across detector architectures.
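The dual objective described above — detector evasion plus semantic preservation — can be captured by a scalar reward of the kind GRPO optimizes. A hedged sketch (the linear weighting and values below are assumptions for illustration, not StealthRL's exact reward):

```python
# Hypothetical combined RL reward in the spirit of StealthRL: reward rises as
# the detector score falls and as the paraphrase stays close to the source.

def reward(detector_score, similarity, alpha=0.5):
    """detector_score and similarity are in [0, 1]; higher reward is better."""
    evasion = 1.0 - detector_score
    return alpha * evasion + (1.0 - alpha) * similarity

# A paraphrase the detector barely flags (0.1) while keeping similarity 0.9
# earns more reward than one that evades fully but drifts in meaning (0.4).
print(reward(0.1, 0.9))  # -> 0.9
print(reward(0.0, 0.4))  # -> 0.7
```

Tuning alpha trades off evasion strength against meaning preservation; the paper's policy optimizes both jointly rather than via a fixed linear mix.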
The paper introduces Private Mask Pre-Training (PMP), a pre-training framework designed to create foundation models that are broadly usable but resistant to unauthorized fine-tuning. PMP concentrates representation learning into a sparse, privately masked subnetwork, releasing only the final dense weights. This induces a mismatch between the fine-tuning objective and the pre-training geometry for those without the mask, thereby limiting adaptation gains.
Introduces Private Mask Pre-Training (PMP) to build foundation models that are robust against unauthorized fine-tuning by concentrating representation learning in a private, sparse subnetwork.
The paper introduces Soft-Verified Efficient Repository Agents (SERA), a supervised finetuning method for efficiently training coding agents specialized to private codebases. SERA leverages Soft Verified Generation (SVG) to create thousands of synthetic trajectories from a single repository, enabling rapid and cost-effective specialization. The resulting SERA models achieve state-of-the-art performance among fully open-source models, matching the performance of models like Devstral-Small-2 at a fraction of the cost compared to reinforcement learning or previous synthetic data methods.
Introduces Soft Verified Generation (SVG), a novel method for generating synthetic code trajectories that enables efficient supervised finetuning of coding agents specialized to private codebases.
The authors introduce Muse, an open-source system for long-form song generation with fine-grained style conditioning, addressing the lack of reproducibility in academic research due to unavailable training data. They release a dataset of 116k fully licensed synthetic songs with lyrics and style descriptions paired with SunoV5-synthesized audio. Muse, a Qwen-based language model finetuned with discrete audio tokens, achieves competitive performance in phoneme error rate, text-music style similarity, and audio aesthetic quality, demonstrating controllable segment-level generation.
Releases Muse, a fully open-source system for long-form song generation, along with a licensed synthetic dataset and training/evaluation pipelines, to enable reproducible research.
This paper replicates Anthropic's mechanistic interpretability work using sparse autoencoders (SAEs) on Llama 3.1 to extract and steer human-interpretable features, stress-testing the generalizability of these methods. The authors successfully reproduce basic feature extraction and steering, but find significant fragility in feature steering, sensitivity to various parameters, and difficulty in distinguishing thematically similar features. The study concludes that current SAE-based interpretability methods lack the systematic reliability needed for safety-critical applications, suggesting a shift towards prioritizing reliable model output prediction and control.
Demonstrates the fragility and limitations of current SAE-based mechanistic interpretability techniques for Llama 3.1, particularly regarding feature steering and thematic feature differentiation.
This paper introduces the Agentic Learning Ecosystem (ALE), an open-source infrastructure comprising ROLL (a post-training framework), ROCK (a sandbox environment manager), and iFlow CLI (an agent framework), designed to streamline agentic model development. They release ROME, an agent trained within ALE on over a million trajectories, utilizing data composition protocols for complex behavior synthesis and a novel Interaction-Perceptive Agentic Policy Optimization (IPA) algorithm for improved long-horizon training. Empirical evaluations on benchmarks like SWE-bench Verified and Terminal Bench Pro demonstrate ROME's strong performance, validating the effectiveness of the ALE ecosystem.
Introduces the Agentic Learning Ecosystem (ALE) and the ROME agent, demonstrating a complete open-source pipeline for training and evaluating agentic models with improved long-horizon stability through Interaction-Perceptive Agentic Policy Optimization (IPA).
The paper introduces Moxin 7B, a fully open-source LLM developed with complete transparency in training, datasets, and implementation details. To extend Moxin's capabilities, the authors developed three variants: Moxin-VLM (vision-language), Moxin-VLA (vision-language-action), and Moxin-Chinese. Experiments demonstrate that these models achieve strong performance in their respective domains, leveraging open-source frameworks and data.
Introduces Moxin, a fully transparent and open-source LLM, along with its multimodal and multilingual variants, promoting a collaborative research environment.
The paper introduces CUVIRIS, a new dataset of ISO/IEC 29794-6 compliant visible light iris images captured via a custom Android application with real-time quality assessment, and benchmarks two iris recognition systems on this dataset. They also present LightIrisNet, a MobileNetV3-based segmentation model for on-device deployment, and adapt IrisFormer, a transformer-based matcher, to the visible light domain. Experiments demonstrate that the open-source OSIRIS system achieves a TAR of 97.9% at FAR = 0.01 on CUVIRIS, and IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057%, indicating the feasibility of accurate smartphone-based iris recognition under controlled conditions.
Provides an open-source framework including a quality-assured VIS iris image dataset, a lightweight segmentation model, and a VIS-adapted transformer-based matcher to advance smartphone-based iris recognition.
The paper introduces MiniLingua, a 1-billion parameter multilingual LLM trained from scratch on 13 European languages, addressing the limitations of larger, English-centric models. MiniLingua aims to balance language coverage with instruction-following capabilities in a smaller, more efficient model. The instruction-tuned version of MiniLingua outperforms EuroLLM on summarization, classification, and question answering tasks, while remaining competitive on open-ended generation.
Demonstrates that a small, multilingual LLM trained from scratch can outperform larger models with similar training approaches on instruction-following tasks and remain competitive on open-ended generation.
The paper introduces a weighted transparency framework based on the EU AI Act and Stanford Transparency Index to evaluate AI model documentation, addressing the current fragmentation and inconsistency. They developed an automated multi-agent pipeline leveraging LLMs to extract documentation and score completeness across 50 models, revealing significant gaps, especially in safety-critical categories. The evaluation shows frontier labs achieve higher compliance (around 80%) compared to other providers (below 60%), highlighting areas for improvement in AI transparency.
Introduces a novel weighted transparency framework and automated evaluation pipeline to systematically assess and score the completeness of AI model documentation.
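A weighted transparency score of the kind the framework computes reduces to a weighted average of per-category completeness, with heavier weights on safety-critical categories. A sketch with invented categories and weights (not the paper's actual rubric):

```python
# Illustrative weighted transparency score: each documentation category gets a
# completeness fraction; safety-critical categories carry more weight.

def transparency_score(completeness, weights):
    """Weighted average of per-category completeness, in [0, 1]."""
    total = sum(weights.values())
    return sum(completeness[c] * w for c, w in weights.items()) / total

weights = {"data": 1.0, "compute": 1.0, "safety": 2.0}        # hypothetical
completeness = {"data": 0.9, "compute": 0.8, "safety": 0.5}   # hypothetical
print(round(transparency_score(completeness, weights), 3))  # -> 0.675
```

Note how the low safety score drags the total well below the unweighted mean of 0.733, which is the point of weighting safety-critical categories more heavily.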
The paper introduces MixtureKit, an open-source framework designed to facilitate the construction, training, and analysis of Mixture-of-Experts (MoE) models using pre-trained or fine-tuned models. MixtureKit implements three MoE methods: Traditional MoE, BTX (fine-grained token routing), and BTS (trainable stitch layers for information exchange). Experiments on multilingual code-switched data demonstrate that BTX-based models built with MixtureKit outperform dense baselines, showcasing the framework's utility.
Introduces MixtureKit, a modular open-source framework that simplifies the creation, training, and visualization of Mixture-of-Experts models with multiple routing strategies.
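The "Traditional MoE" routing that MixtureKit implements can be illustrated with a toy token-level top-k gate: each token's gate scores select its top-k experts, whose outputs are mixed by normalized gate weight. A pure-Python stand-in (not MixtureKit's API):

```python
# Toy top-k expert routing for a single token and scalar "hidden state".

def route_token(gate_scores, expert_fns, x, k=2):
    """Run the k best-scoring experts on x and blend their outputs."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    norm = sum(gate_scores[i] for i in top)
    return sum(gate_scores[i] / norm * expert_fns[i](x) for i in top)

experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]  # toy experts
# Gates favor experts 0 and 1; expert 2 is never executed for this token,
# which is where MoE sparsity saves compute.
print(round(route_token([0.6, 0.4, 0.0], experts, 10.0), 2))  # -> 14.6
```

BTX-style fine-grained token routing applies the same idea per token across experts built from separately trained models, while BTS instead exchanges information through trainable stitch layers.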
The paper introduces PCMind-2.1-Kaiyuan-2B, a 2B parameter open-source LLM, designed to improve training efficiency under resource constraints. They employ a Quantile Data Benchmarking method for data mixing, Strategic Selective Repetition for high-quality data leverage, and a Multi-Domain Curriculum Training policy for sample ordering. Kaiyuan-2B achieves competitive performance with state-of-the-art open-source models while using optimized data preprocessing and architectural modifications for FP16 stability.
Introduces a novel training methodology for resource-constrained LLMs, combining quantile data benchmarking, strategic selective repetition, and multi-domain curriculum training.
This paper investigates the application of Large Language Models (LLMs) to mutual fund portfolio optimization and risk-adjusted asset allocation, aiming to enhance traditional financial decision-making. The authors employed a Retrieval-Augmented Generation (RAG) pipeline, integrating real-time economic data with standard financial optimization techniques, to guide LLMs in generating investment strategies. The study found that the Zypher 7B model outperformed Microsoft Phi 2 and Mistral 7B, consistently producing strategies that maximized investment returns while delivering superior risk-adjusted results.
Demonstrates the efficacy of using LLMs, particularly Zypher 7B, within a RAG framework to generate superior risk-adjusted mutual fund portfolio allocations compared to other LLMs.
The paper introduces K2-V2, a fully open large language model (LLM) designed with a focus on reasoning adaptation, conversation, and knowledge retrieval. K2-V2 is claimed to outperform Qwen2.5-72B and approach the performance of Qwen3-235B, positioning it as a leading open-weight model in its size class. The model is trained with explicit infusion of domain knowledge, reasoning skills, long-context understanding, and tool use, and the authors release the full training history and data composition to facilitate continuous training.
Presents K2-V2, a high-performing, fully open LLM specifically engineered for enhanced reasoning capabilities through targeted training data and methodology.
The authors developed COPE, a Chain-of-Thought (CoT) Outcome Prediction Engine based on sequential open-source LLaMA-3-8B models, to predict 90-day functional outcomes after acute ischemic stroke (AIS) from unstructured clinical notes. COPE first generates clinical reasoning and then outputs a modified Rankin Scale (mRS) prediction. COPE achieved comparable performance to GPT-4.1 and outperformed ClinicalBERT, Clinical ML, and a single-step LLM, demonstrating its potential as a lightweight, interpretable, and privacy-preserving solution for outcome prediction.
Introduces COPE, a novel two-step Chain-of-Thought framework leveraging open-source LLaMA-3-8B models for predicting stroke outcomes from clinical notes.
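COPE's two-step flow — elicit reasoning first, then condition the prediction on it — can be sketched schematically. The stub "model" and prompt wording below are placeholders to show the data flow, not the authors' LLaMA-3-8B setup:

```python
# Schematic two-step Chain-of-Thought prediction: step 1 elicits clinical
# reasoning from the note, step 2 conditions the mRS prediction on it.

def predict_mrs(note, model):
    reasoning = model(f"Summarize the clinical reasoning for: {note}")
    answer = model(f"Given the reasoning: {reasoning}\nPredict the 90-day mRS (0-6):")
    return reasoning, answer

# Stub model: returns a canned response per step to demonstrate the flow.
def stub_model(prompt):
    return "severe deficit noted" if prompt.startswith("Summarize") else "4"

reasoning, mrs = predict_mrs("Patient with left MCA occlusion...", stub_model)
print(mrs)  # prints 4
```

The key design choice is that the second call sees the generated reasoning, making the final mRS prediction both conditioned on and explainable by it.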
This paper fine-tunes the open-weight Mistral 7B LLM on the Araneum Slovacum VII Maximum corpus (5.3B tokens) to create Mistral-SK-7b, a specialized Slovak language model. The motivation is to address the lack of high-quality, open-source LLMs for low-resource languages like Slovak, where commercial models are proprietary. The resulting Mistral-SK-7b exhibits significantly improved grammatical correctness and contextual coherence in Slovak, eliminating issues like code-switching and repetition loops present in the original Mistral 7B.
Demonstrates the effective adaptation of a state-of-the-art LLM for a low-resource language through fine-tuning on a large, relevant corpus, resulting in a publicly available model with improved performance.
The authors introduce Z-Image, a 6B-parameter image generation foundation model based on a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture, designed to be efficient and accessible. They optimize the model lifecycle through data curation and training curriculum, achieving full training in 314K H800 GPU hours and developing Z-Image-Turbo with sub-second inference latency and consumer-grade hardware compatibility via few-step distillation and reward post-training. Z-Image demonstrates comparable or superior performance to larger models in photorealistic image generation and bilingual text rendering, while significantly reducing computational costs.
Introduces an efficient 6B-parameter image generation model, Z-Image, that rivals the performance of much larger proprietary models, demonstrating state-of-the-art results with significantly reduced computational overhead.
This paper analyzes the Hugging Face Model Hub download history from June 2020 to August 2025, encompassing 851,000 models and 2.2B downloads, to understand concentration dynamics in the open model economy. The study reveals a shift away from US industry dominance by Google, Meta, and OpenAI towards unaffiliated developers, community organizations, and Chinese industry players like DeepSeek and Qwen. The analysis also identifies trends in model properties, including increased model size, multimodal generation, quantization, and MoE architectures, alongside decreased data transparency.
Provides a comprehensive longitudinal analysis of the open-weight AI model ecosystem, revealing shifts in economic power and model characteristics.
The paper introduces HunyuanVideo 1.5, an 8.3B parameter open-source video generation model achieving state-of-the-art visual quality and motion coherence. This is accomplished through data curation, a DiT architecture with selective and sliding tile attention (SSTA), glyph-aware text encoding for improved bilingual understanding, progressive pre-training and post-training, and an efficient video super-resolution network. The model supports both text-to-video and image-to-video generation across various durations and resolutions, demonstrating superior performance compared to existing open-source alternatives.
Introduces a highly efficient video generation model that achieves state-of-the-art performance with a relatively small parameter count, making it accessible for use on consumer-grade hardware.
The paper introduces HuggingR$^4$, a novel framework for selecting optimal AI models from large repositories like Hugging Face by framing model selection as an iterative reasoning process. HuggingR$^4$ integrates Reasoning, Retrieval, Refinement, and Reflection to decompose user intent, retrieve candidates, refine selections, and validate results. Experiments on a new benchmark of 14,399 user requests demonstrate that HuggingR$^4$ significantly outperforms existing methods in workability and reasonability while reducing token consumption.
Introduces a progressive reasoning framework, HuggingR$^4$, that iteratively selects AI models from large repositories by synergistically integrating reasoning, retrieval, refinement, and reflection.
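The iterative select-and-validate loop described above can be sketched as follows. All components are toy stubs standing in for LLM-backed modules, not HuggingR$^4$'s implementation:

```python
# Sketch of an iterative reason-retrieve-refine-reflect loop: if reflection
# rejects the pick, the loop retries with the feedback folded into the query.

def select_model(request, retrieve, refine, reflect, max_rounds=3):
    feedback = ""
    for _ in range(max_rounds):
        candidates = retrieve(request + feedback)
        pick = refine(candidates)
        ok, feedback = reflect(request, pick)
        if ok:
            return pick
    return pick  # best effort after max_rounds

# Toy stubs: retrieval misses until reflection's feedback mentions "vision".
retrieve = lambda q: ["clip-vit"] if "vision" in q else ["bert-base"]
refine = lambda cands: cands[0]
reflect = lambda req, pick: (pick == "clip-vit", " vision")

print(select_model("classify images", retrieve, refine, reflect))  # prints clip-vit
```

Reflection feeding back into retrieval is what distinguishes this loop from one-shot retrieval-then-rank pipelines, and is where the reported gains in workability plausibly come from.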
The paper introduces a cost-efficient pipeline for training domain-specific small language models (SLMs) by combining guided synthetic data generation from a seed corpus with bottom-up domain data curation. This pipeline leverages Domain-Adaptive Pretraining (DAPT), Domain-Specific Fine-tuning (DSFT), and Direct Preference Optimization (DPO). The authors demonstrate the effectiveness of their approach by training DiagnosticSLM, a 3B-parameter model for fault diagnosis, which achieves up to 25% accuracy improvement over larger open-source models on a newly introduced DiagnosticMCQ benchmark and performs competitively on other diagnostic tasks.
Introduces a guided data generation and training pipeline for creating domain-specific small language models that outperforms larger general-purpose models in specialized tasks.
The AlphaFold Protein Structure Database (AFDB) has been updated to align with the UniProt 2025_03 release, expanding its structural coverage to include isoforms and underlying multiple sequence alignments. A redesigned entry page enhances usability by integrating annotations with an interactive 3D viewer and introducing dedicated domains and summary tabs. This update reinforces AFDB as a key resource for exploring protein sequence-structure relationships.
Enhances the AlphaFold Protein Structure Database by updating its structural coverage, redesigning the user interface for improved accessibility, and integrating annotations with an interactive 3D viewer.
The authors introduce OpenPyRo-A1, a low-cost (approximately $14K) bimanual humanoid robot with 0.2mm repeatability and 5kg payload per arm, designed to address the scarcity of affordable dual-arm platforms for embodied AI research. They also present a Python-first distributed control framework, installable via pip, to facilitate teleoperation, data collection, and policy deployment. Imitation learning experiments, integrating the robot with perception models, motion planning, and a large language model, demonstrate the platform's stability, user-friendliness, and high precision.
Introduces a complete open-source, low-cost bimanual robot platform, OpenPyRo-A1, along with a Python-based control framework to democratize research in dual-arm manipulation and embodied AI.
The paper introduces Instella, a family of fully open 3B parameter language models trained on publicly available data, addressing the lack of transparency in high-performing LLMs. Instella achieves state-of-the-art performance among fully open models of comparable size, despite using fewer pre-training tokens. The authors also release Instella-Long (128K context) and Instella-Math (reasoning-focused) variants, demonstrating the versatility of the base model.
Introduces Instella, a family of fully open 3B parameter language models, achieving state-of-the-art performance among fully open models and demonstrating competitive results with leading open-weight models of comparable size.
The authors introduce Llama-Embed-Nemotron-8B, a new open-weights text embedding model achieving state-of-the-art results on the MMTEB benchmark. The model is trained on a novel data mix of 16.1 million query-document pairs, combining public datasets with synthetically generated data from open-weight LLMs. Key findings include the effectiveness of their data mix, the impact of different contrastive loss implementations, and the benefits of instruction-aware training for various embedding tasks, especially in multilingual scenarios.
Presents a high-performing, fully open-source text embedding model, Llama-Embed-Nemotron-8B, along with comprehensive ablation studies on data mixing, loss functions, and synthetic data generation strategies.
This paper investigates the applicability of open-source LLM frameworks, including both large-scale and lightweight models, for automating penetration testing tasks relevant to commercial security assessments. The study identifies both the potential and limitations of these frameworks in addressing fundamental challenges in penetration testing. The authors propose a practical approach to overcome key limitations and demonstrate the potential of LLM-based frameworks in real-world penetration testing scenarios.
Demonstrates the practical application of open-source LLM frameworks for penetration testing, highlighting their capabilities and limitations, and proposes solutions to address identified challenges.
The paper introduces OpenMENA, an open-source memristor interfacing system designed for energy-efficient edge AI applications, featuring a reproducible hardware interface, a firmware-software stack with high-level APIs, and a Voltage-Incremental Proportional-Integral (VIPI) programming method. OpenMENA enables weight transfer and on-device adaptation by mitigating device non-idealities through chip-in-the-loop fine-tuning. The system's efficacy is demonstrated through digit recognition and a real-world robot obstacle-avoidance task, showcasing its ability to map localization inputs to motor commands.
Introduces OpenMENA, the first fully open-source memristor interfacing system with integrated hardware, firmware, and software components for edge AI applications.
The authors developed LOGICAL, a PII removal system for clinical notes, by fine-tuning a Generalist and Lightweight Named Entity Recognition (GLiNER) model on a dataset of psychiatric hospital EHRs. This approach addresses the limitations of LLMs, such as high computational costs and data privacy risks, especially in low-resource settings. The fine-tuned GLiNER model achieved a micro-average F1-score of 0.980, outperforming other methods like Gemini-Pro-2.5, while operating efficiently on a standard laptop.
Demonstrates that a fine-tuned, specialized transformer model (GLiNER) provides a more accurate, computationally efficient, and secure solution for PII removal from clinical notes compared to larger LLMs and cloud-based services.
This paper analyzes the framing of AI openness in 223 news articles from the U.S., France, and China, revealing inconsistencies and oversimplifications in media portrayals. The study finds that inaccurate terminology, misleading information, and a binary "open vs. closed" framing impede effective communication about AI openness. The authors highlight the media's focus on a limited number of models and call for the AI community to contribute to a more nuanced and accurate public discourse.
Reveals how media coverage of AI openness is often inaccurate, oversimplified, and heterogeneous across news sources, hindering effective communication and potentially misinforming public opinion.
The authors introduce Honey-Data-15M, a high-quality SFT dataset of 15M QA pairs enhanced with dual-level CoT, and HoneyPipe, a data curation pipeline built on the DataStudio framework. They trained Bee-8B on Honey-Data-15M, achieving state-of-the-art performance among fully open MLLMs, rivaling semi-open models like InternVL3.5-8B. This work demonstrates the importance of high-quality data for developing competitive fully open MLLMs.
Introduces a comprehensive suite of resources, including a high-quality SFT dataset (Honey-Data-15M), a data curation pipeline (HoneyPipe), and a competitive 8B MLLM (Bee-8B), to advance fully open MLLMs.
The paper introduces AndesVL, a suite of mobile-side Multimodal Large Language Models (MLLMs) ranging from 0.6B to 4B parameters, built on Qwen3 LLMs and various visual encoders to address the limitations of deploying large cloud-based MLLMs on edge devices. AndesVL achieves competitive performance against similar-scale models on diverse benchmarks, including text-rich image understanding and VQA. The authors also present a 1+N LoRA architecture and a Quantization-Aware LoRA Fine-Tuning (QALFT) framework, along with deployment optimizations such as a cache eviction algorithm (OKV), speculative decoding, and compression, demonstrating significant speedups and memory reduction on mobile devices.
Introduces AndesVL, a suite of mobile-optimized MLLMs, along with a novel quantization-aware LoRA fine-tuning framework and memory optimization techniques, enabling efficient deployment and inference on edge devices.
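The OKV algorithm's details are not given above, but KV cache eviction schemes generally bound memory by dropping low-importance cached tokens. A generic, score-based sketch (the ranking criterion here is an assumption, not the paper's method):

```python
# Illustrative sketch of score-based KV cache eviction. OKV's actual
# criterion is not specified in the summary; this generic variant keeps
# the tokens that have accumulated the highest attention scores.

def evict(cache, scores, budget):
    """cache: list of (key, value) per token position; scores:
    accumulated attention each token has received; budget: max tokens
    to keep. Preserves the original order of surviving tokens."""
    if len(cache) <= budget:
        return cache
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep]

cache = [("k0", "v0"), ("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
scores = [0.9, 0.1, 0.5, 0.7]
print(evict(cache, scores, budget=2))  # keeps positions 0 and 3
```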
This paper investigates the evolution of vocabulary embedding geometry in LLMs during training by correlating input and output embeddings of Pythia 12B and OLMo 7B with semantic, syntactic, and frequency-based metrics using representational similarity analysis. The study reveals that vocabulary embedding geometry rapidly aligns with semantic and syntactic features early in training. Furthermore, high-frequency and function words converge faster than low-frequency words, which retain initial bias.
Demonstrates that linguistic structure emerges rapidly in vocabulary embeddings during LLM training, with distinct convergence rates based on word frequency and function.
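Representational similarity analysis (RSA), as used above, compares two representations of the same items by correlating their pairwise-distance matrices rather than the raw vectors. A self-contained sketch with toy embeddings (the vectors are illustrative; the paper's distance and correlation choices may differ):

```python
# Minimal sketch of representational similarity analysis (RSA): build a
# pairwise-distance matrix for each representation, then correlate the
# two matrices' upper triangles. Toy 2-D "embeddings" are illustrative.
import math

def dist_upper(vectors):
    """Upper triangle of the pairwise Euclidean distance matrix."""
    n = len(vectors)
    return [math.dist(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two toy "embedding spaces" for the same four words.
space_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
space_b = [(0.0, 0.1), (1.1, 0.0), (0.1, 1.0), (2.1, 2.0)]
rsa_score = pearson(dist_upper(space_a), dist_upper(space_b))
print(round(rsa_score, 2))  # near 1.0: the two geometries agree
```

Because RSA operates on distances, it is invariant to rotations of either space, which is what makes it suitable for tracking how embedding geometry aligns with external metrics over training.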
Apriel-1.5-15B-Thinker, a 15B parameter multimodal model, achieves competitive performance through a three-stage training methodology involving depth upscaling, staged continual pre-training with synthetic data for enhanced visual reasoning, and high-quality text-only supervised fine-tuning with reasoning traces. The model attains a score of 52 on the Artificial Analysis Intelligence Index, matching DeepSeek-R1-0528, and performs comparably to Gemini-2.5-Flash and Claude 3.7 Sonnet on image benchmarks, demonstrating that targeted training can bridge capability gaps without relying on massive scale or reinforcement learning. This work highlights the effectiveness of data-centric continual pre-training for multimodal reasoning, particularly for organizations with limited computational resources.
Demonstrates that a carefully designed, data-centric continual pre-training approach, including depth upscaling and targeted synthetic data generation, can enable a 15B parameter model to achieve frontier-level multimodal reasoning performance competitive with much larger models.
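Depth upscaling, mentioned above, typically grows a pretrained model by duplicating a contiguous span of its transformer layers rather than training a deeper model from scratch. A structural sketch (the span chosen and any re-initialization scheme are assumptions; the paper's exact recipe is not reproduced here):

```python
# Illustrative sketch of depth upscaling: grow a transformer by
# duplicating a contiguous span of its layers. Which span to copy and
# how to re-initialize it are design choices not specified here.

def depth_upscale(layers, start, end):
    """Insert a copy of layers[start:end] directly after the original
    span, preserving overall layer order."""
    return layers[:end] + layers[start:end] + layers[end:]

base = [f"layer_{i}" for i in range(8)]          # an 8-layer toy model
upscaled = depth_upscale(base, start=2, end=6)   # duplicate layers 2-5
print(len(base), "->", len(upscaled))            # 8 -> 12
```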
This paper investigates the collaborative practices in open large language model (LLM) development by conducting semi-structured interviews with developers from 14 open LLM projects. It identifies that collaboration extends beyond the models themselves to include datasets, benchmarks, and compute partnerships, and that developers are driven by diverse motivations, including democratizing AI and promoting open science. The study also reveals five distinct organizational models employed by open LLM projects, varying in centralization and community engagement.
Systematizes the landscape of open LLM development by characterizing collaboration types, developer motivations, and organizational models across a diverse set of open LLM projects.
The paper investigates the data requirements for reasoning in sub-billion parameter language models, challenging the assumption that massive datasets (>10T tokens) are necessary. The authors demonstrate that by carefully curating and resampling open-source datasets to ~2T tokens, strong reasoning abilities can emerge with significantly less data. The resulting MobileLLM-R1 models achieve state-of-the-art performance among open-source sub-billion parameter models, even surpassing larger models trained on much larger datasets.
Demonstrates that strong reasoning capabilities can emerge in sub-billion parameter language models with significantly less data than previously believed by carefully curating and resampling open-source datasets.
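"Resampling" a data mix usually means assigning each source a weight and repeating (or subsampling) it so the weighted mix fits a fixed token budget. A hypothetical sketch of that bookkeeping (source names, token counts, and weights below are all made up for illustration; they are not the paper's mix):

```python
# Hypothetical sketch of quality-weighted dataset resampling: each
# source gets a sampling weight, and the epochs per source are scaled
# so the weighted mix fits a fixed token budget. All numbers made up.

def resample(sources, budget):
    """sources: list of (name, tokens, weight). Returns epochs per
    source so weighted token counts sum to roughly `budget`."""
    weighted = sum(tokens * weight for _, tokens, weight in sources)
    scale = budget / weighted
    return {name: round(weight * scale, 2) for name, _, weight in sources}

sources = [
    ("web_filtered", 1_500, 1.0),   # token counts in billions, made up
    ("code", 300, 2.0),             # upweight code for reasoning
    ("math", 100, 3.0),             # upweight math even more
]
print(resample(sources, budget=2_000))
```

An epoch count above 1.0 means a source is seen multiple times during training, which is how a small high-quality corpus can contribute disproportionately to a ~2T-token budget.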
The paper red-teams OpenAI's GPT-OSS-20B model in Hausa, a low-resource language, to evaluate its safety alignment. It demonstrates that minimal prompting can induce the model to generate harmful, culturally insensitive, and factually inaccurate content, particularly when using polite language that exploits reward hacking. The study reveals critical vulnerabilities, including the model's false assumptions about the safety of common toxins and its inability to distinguish between raw and processed foods, highlighting the need for improved safety tuning in low-resource languages.
Demonstrates that OpenAI's GPT-OSS-20B model exhibits significant safety alignment failures and biases when used in Hausa, a low-resource language, due to insufficient safety tuning.
This paper investigates the impact of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach on the safety and helpfulness of the OPT-350M language model using the Anthropic Helpful-Harmless RLHF dataset. The study introduces three reward-model-derived metrics—Harmlessness Rate (HmR), Helpfulness Rate (HpR), and Combined Alignment Score (CAS)—to evaluate the models. Results indicate that the combined SFT+DPO model achieves the best performance across all alignment metrics, surpassing individual SFT and DPO models.
Demonstrates that combining SFT and DPO yields superior safety and helpfulness alignment compared to using either technique alone for the OPT-350M model.
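The summary above names the three metrics but not their formulas. A plausible reading, sketched below, is that HmR and HpR are the fractions of responses whose reward-model score clears a threshold, with CAS combining the two; the threshold and the combination rule are assumptions, not the paper's definitions:

```python
# Sketch of reward-model-derived alignment metrics. The paper's exact
# definitions are not reproduced here: this assumes HmR/HpR are the
# fractions of responses whose reward clears a threshold, and CAS is
# their mean. Scores and the threshold of 0.0 are illustrative.

def rate(scores, threshold=0.0):
    return sum(s > threshold for s in scores) / len(scores)

harmless = [0.8, -0.2, 0.5, 0.9]   # harmlessness rewards per response
helpful = [0.3, 0.7, -0.1, 0.6]    # helpfulness rewards per response

hmr = rate(harmless)               # Harmlessness Rate
hpr = rate(helpful)                # Helpfulness Rate
cas = (hmr + hpr) / 2              # Combined Alignment Score (assumed mean)
print(hmr, hpr, cas)               # 0.75 0.75 0.75
```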
The paper introduces Dream-Coder 7B, a discrete diffusion language model for code generation capable of any-order generation, adapting its decoding strategy to the coding task. The authors convert a pretrained autoregressive model into a diffusion model using a continuous-time weighted cross-entropy objective, and address padding issues with random truncation and a padding penalty during supervised fine-tuning. The model is further refined using reinforcement learning with verifiable rewards on a curated prompt set, achieving 21.4% pass@1 on LiveCodeBench.
Introduces a novel approach to code generation by adapting a pretrained autoregressive model into a discrete diffusion model capable of any-order generation.
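Any-order generation means the model need not decode left to right; one common scheme for diffusion LMs is to repeatedly fill the masked position the model is most confident about. A toy sketch of that decoding loop (fixed "confidences" stand in for model predictions; this is a generic illustration, not Dream-Coder's actual decoder):

```python
# Toy sketch of any-order (confidence-ordered) decoding, the kind of
# flexible generation a discrete diffusion LM allows. A real model
# rescores all masked positions each step; fixed predictions stand in.

MASK = "<mask>"

def decode_any_order(tokens, predictions):
    """predictions: position -> (token, confidence). Repeatedly fill
    the most confident masked position until none remain; returns the
    order in which positions were filled."""
    order = []
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        best = max(masked, key=lambda i: predictions[i][1])
        tokens[best] = predictions[best][0]
        order.append(best)
    return order

tokens = [MASK, MASK, MASK]
predictions = {0: ("def", 0.6), 1: ("f", 0.9), 2: ("():", 0.8)}
print(decode_any_order(tokens, predictions), tokens)
```

Here position 1 is filled first despite being mid-sequence, which is exactly the freedom an autoregressive decoder lacks.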