The paper evaluates OpenAI's recently released open-weight GPT-OSS models (20B and 120B parameter mixture-of-experts architectures) against six other open-source LLMs on ten benchmarks spanning general knowledge, reasoning, coding, multilingualism, and conversation. Surprisingly, the smaller GPT-OSS-20B model often outperformed the larger GPT-OSS-120B model, suggesting diminishing returns from scaling in sparse architectures. The GPT-OSS models achieved mid-tier performance overall, showing strengths in code generation but weaknesses in multilingual tasks compared to the broader open-source landscape.
OpenAI's 120B parameter GPT-OSS model gets beat by its smaller 20B sibling on key benchmarks, challenging the assumption that bigger sparse models are always better.
In August 2025, OpenAI released the GPT-OSS models, its first open-weight large language models since GPT-2 in 2019, comprising two mixture-of-experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open-source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open-source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open-source deployments. More details and evaluation scripts are available at https://ai-agent-lab.github.io/gpt-oss (Project Webpage).
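The statistical validation the abstract mentions, McNemar's test, compares two models on the same benchmark items by counting the questions where exactly one model is correct. A minimal sketch of the exact (binomial) form of the test, on hypothetical paired results rather than the paper's actual data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    b: items model A answered correctly and model B did not
    c: items model B answered correctly and model A did not
    Under the null hypothesis that the models are equally accurate,
    the discordant pairs split 50/50, so b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    # one tail of the binomial, doubled for a two-sided test, capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-item correctness for two models on ten questions
# (illustrative only, not results from the paper).
a_correct = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
b_correct = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
b = sum(1 for x, y in zip(a_correct, b_correct) if x == 1 and y == 0)
c = sum(1 for x, y in zip(a_correct, b_correct) if x == 0 and y == 1)
p_value = mcnemar_exact(b, c)
```

Because the test conditions on the discordant pairs only, it is well suited to paired benchmark comparisons where both models see identical questions, which is the evaluation setup the abstract describes.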