Existing zero-shot multimodal information extraction models struggle with real-world scenarios containing both seen and unseen categories; this work addresses the gap by modeling hierarchical semantic relationships in hyperbolic space and aligning semantic similarity distributions.
A lightweight VLA with deep state space models lets robots outperform larger models at language-guided manipulation while running 3x faster.
GPT-5-Mini can be made 10% more robust to jailbreaks and prompt injections simply by RL fine-tuning on a new instruction hierarchy dataset, IH-Challenge.
By pinpointing the causal origins of tool use, AttriGuard neutralizes indirect prompt injection attacks that can hijack LLM agents, even when faced with adversarial optimization.
Nail design retrieval gets a major upgrade: NaiLIA leverages dense intent descriptions and palette queries to outperform standard methods, opening the door to more nuanced and personalized image search.
Forget difficulty-based heuristics: InSight leverages weighted mutual information to select RL training data, boosting LLM reasoning and alignment with up to 2.2x speedup.
LLMs harbor surprisingly consistent hidden beliefs on sensitive topics like mass surveillance and torture, even when direct questioning suggests otherwise.
LLMs struggle to understand nuanced values across languages, with accuracy dropping below 77% and varying by over 20% between languages, as revealed by the new X-Value benchmark.
Cutting LLMs' reasoning token budget can backfire spectacularly, tanking performance even below that of models with *no* reasoning at all.
Achieve competitive image-text fact checking at just $0.013 per check by combining RAG with reverse image search, using a surprisingly simple and reproducible architecture.
Fine-tuning LLMs on datasets filtered at the token level, rather than the sentence level, can boost performance by up to 13.7%.
Speech recognition models stumble badly on real-world street names, especially for non-English speakers, but a simple synthetic data boost can dramatically improve accuracy.
Finally, a single 3D medical vision-language model that nails both high-level reasoning (report generation, VQA) and fine-grained segmentation from language, point, or box prompts.
Claude 2 can match the performance of top medical specialists on pulmonary thromboembolism knowledge assessments, suggesting AI's potential for clinical decision support.
Despite their promise, even the best multimodal LLM (GPT-4o) achieves only 26% accuracy in grading knee osteoarthritis from radiographs, revealing a significant gap in clinical reliability.
LLMs still struggle to reliably produce accurate Islamic content and citations, despite relatively strong performance, revealing a critical gap in faith-sensitive AI writing.
AI-generated feedback on student portfolios from GPT-4o and Claude-Sonnet-4 shows promise for high-stakes clinical assessments, but careful evaluation is needed to ensure accuracy and educational value.
Open-weight reasoning models now rival proprietary systems in agentic capabilities and benchmark performance, thanks to gpt-oss-120b and gpt-oss-20b.
Current LLMs fall far short of supporting holistic human well-being: even the best models score no higher than 72/100 on a new Flourishing AI Benchmark, performing worst in areas like Faith and Spirituality.
An LLM-powered smart tutor isn't just another homework helper; it's a real-time feedback loop for instructors, revealing student struggles and enabling more effective teaching.
ChatGPT-4 slashes data extraction time in scoping reviews by 66%, but don't ditch the human reviewers just yet.
LLMs can generate plain language summaries of scientific research that are as good as human-written ones, but easier to read.