Recent Papers
This paper investigates GPT-5's ability to learn Idris, a functional programming language, through iterative prompting strategies. The authors found that zero-shot performance on Idris programming exercises was significantly lower than performance on Python and Erlang. By incorporating local compilation errors into the prompts, the authors achieved a substantial performance increase, solving 54 out of 56 problems.
Demonstrates that compiler-guided, error-driven iterative prompting significantly improves GPT-5's performance in a low-resource programming language.
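The paper's exact prompting protocol is not reproduced in this summary, but the compiler-in-the-loop idea it describes follows a familiar pattern: generate a candidate program, type-check it locally, and feed any compiler errors back into the next prompt. The sketch below assumes `idris2 --check` as the local type-checker and uses a placeholder `generate` function standing in for whatever LLM API is used.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_ROUNDS = 5

def generate(prompt: str) -> str:
    """Placeholder for an LLM call that returns Idris source code."""
    raise NotImplementedError

def compile_errors(source: str) -> str:
    """Type-check the candidate with the Idris 2 compiler; return error output, or "" on success."""
    path = Path(tempfile.mkdtemp()) / "Candidate.idr"
    path.write_text(source)
    result = subprocess.run(["idris2", "--check", str(path)],
                            capture_output=True, text=True)
    return "" if result.returncode == 0 else result.stdout + result.stderr

def solve(task: str) -> str | None:
    """Error-driven iterative prompting: retry until the candidate compiles or rounds run out."""
    prompt = f"Write an Idris program for the following task:\n{task}"
    for _ in range(MAX_ROUNDS):
        source = generate(prompt)
        errors = compile_errors(source)
        if not errors:
            return source  # compiles cleanly: accept the candidate
        # Feed the local compiler errors back into the next prompt.
        prompt = (f"{prompt}\n\nYour previous attempt failed to compile:\n{source}\n"
                  f"Compiler errors:\n{errors}\nPlease fix the program.")
    return None
```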
The paper investigates speech recognition models' failures in transcribing U.S. street names, finding a 44% error rate across 15 models from major vendors and disproportionately larger routing-distance errors for non-English primary speakers. It highlights the gap between benchmark performance and real-world reliability, particularly for high-stakes tasks involving named entities. The authors then demonstrate that fine-tuning with a small, synthetically generated dataset of diverse pronunciations improves street name transcription accuracy by nearly 60% for non-English primary speakers.
Demonstrates that speech recognition models exhibit significant transcription errors on street names, particularly impacting non-English speakers, and mitigates this issue through synthetic data augmentation.
The paper re-examines single-minus tree-level n-gluon scattering amplitudes, demonstrating that they do not vanish for specific "half-collinear" configurations in Klein space or with complexified momenta, contrary to common assumptions. The authors derive a closed-form, piecewise-constant expression for the decay of a single minus-helicity gluon into n-1 plus-helicity gluons as a function of their momenta. This derived formula is shown to satisfy Weinberg's soft theorem, confirming its consistency.
Discovers and formulates a non-zero solution for single-minus gluon tree amplitudes under specific kinematic conditions.
The paper introduces GPT-5, a unified system comprising a fast, general-purpose model and a deeper reasoning model, managed by a real-time router trained on user feedback and performance metrics. GPT-5 demonstrates improved performance on benchmarks, faster response times, and enhanced utility for real-world queries, with significant reductions in hallucinations, improved instruction following, and minimized sycophancy. The system incorporates "safe-completions" for safety and is treated as High capability in the Biological and Chemical domain under OpenAI's Preparedness Framework, triggering associated safeguards.
Introduces a unified GPT-5 system with a real-time router that dynamically selects between a fast, general-purpose model and a deeper reasoning model based on query characteristics, optimizing for speed and accuracy.
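OpenAI does not publish the router's internals, so the following is only an illustrative sketch of the dispatch pattern described above: a learned scorer (here a stand-in callable) estimates how much a query would benefit from deeper reasoning, and the request is routed to the fast or the reasoning backend accordingly.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    """Illustrative two-model router; the scorer stands in for a model
    trained on user feedback and measured performance (not OpenAI's actual design)."""
    score_needs_reasoning: Callable[[str], float]  # returns a value in [0, 1]
    fast_model: Callable[[str], str]
    reasoning_model: Callable[[str], str]
    threshold: float = 0.5

    def answer(self, query: str) -> str:
        # Cheap queries go to the fast model; hard ones to the deeper reasoning model.
        if self.score_needs_reasoning(query) >= self.threshold:
            return self.reasoning_model(query)
        return self.fast_model(query)
```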
This study compared the performance of three large language models (LLMs) – ChatGPT-4, Claude 2, and Google Med-PaLM – against 17 physicians across different specialties on a 25-question multiple-choice exam focused on pulmonary thromboembolism (PTE). The goal was to assess the AI systems' clinical reasoning capabilities in a complex medical domain. Claude 2 matched the performance of internal medicine and pulmonary specialists (80% accuracy) and significantly outperformed emergency medicine physicians, while ChatGPT-4 and Med-PaLM demonstrated non-inferiority to the specialists.
Demonstrates that advanced LLMs can achieve specialist-level performance on structured medical knowledge assessments related to PTE, suggesting their potential for medical education and clinical decision support.
This paper evaluates the performance of GPT-4o, Claude 4, and MedGEMMA on the task of automated Kellgren-Lawrence (KL) grading of knee osteoarthritis from radiographic images. The models were assessed using exact match accuracy, ±1 tolerance accuracy, macro-averaged precision, and recall against a dataset of 100 expert-annotated knee radiographs. GPT-4o achieved the highest performance with 26% exact match accuracy and 63% ±1 tolerance accuracy, but all models exhibited limitations, particularly in accurately classifying moderate to severe OA.
Benchmarks GPT-4o, Claude 4, and MedGEMMA on the fine-grained ordinal classification task of Kellgren-Lawrence grading of knee osteoarthritis from radiographic images, revealing limitations in their current diagnostic utility.
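The evaluation metrics named above are standard; a minimal implementation for integer KL grades 0-4, assuming scikit-learn for the macro-averaged scores, might look like this:

```python
from sklearn.metrics import precision_score, recall_score

def kl_grading_metrics(y_true, y_pred):
    """Exact-match accuracy, +/-1 tolerance accuracy, and macro precision/recall for KL grades 0-4."""
    n = len(y_true)
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / n
    within_one = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n
    macro_p = precision_score(y_true, y_pred, average="macro", zero_division=0)
    macro_r = recall_score(y_true, y_pred, average="macro", zero_division=0)
    return {"exact": exact, "within_one": within_one,
            "macro_precision": macro_p, "macro_recall": macro_r}

# Example: five radiographs, expert grades vs. model predictions.
print(kl_grading_metrics([0, 2, 3, 4, 1], [0, 1, 3, 2, 1]))  # exact 0.6, within_one 0.8
```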
The study compares the quality and accuracy of portfolio feedback generated by GPT-4o and Claude-sonnet-4 (via Amazon Bedrock) in the context of Qpercom's digital assessment tools for high-stakes clinical assessments. It analyzes both preview feedback (for examiners) and direct student feedback, evaluating how well each model identifies different levels of student performance. The findings assess the safety, constructiveness, and educational value of the AI-generated feedback.
Empirically compares the performance of GPT-4o and Claude-sonnet-4 in generating portfolio feedback for high-stakes clinical assessments, evaluating their ability to accurately reflect student performance levels.
This paper introduces gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models built using a mixture-of-experts transformer architecture and trained via large-scale distillation and reinforcement learning. These models are optimized for agentic capabilities, including research browsing and tool use, and utilize a chat format for instruction following. The authors demonstrate strong performance on mathematics, coding, and safety benchmarks and release the model weights and related resources under an Apache 2.0 license.
Introduces and releases the weights for gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models with strong agentic capabilities and performance across diverse benchmarks.
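The model card's architectural details are not reproduced in this summary; as a generic illustration of the mixture-of-experts layer family these models belong to (not gpt-oss's actual configuration), a top-k routed feed-forward block can be sketched as:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts feed-forward layer, shown for illustration only."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)       # normalize over the k selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e        # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```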
The paper introduces EVAL, a framework for evaluating and improving the safety of large language models (LLMs) in the context of upper gastrointestinal bleeding (UGIB) diagnosis and management. EVAL combines similarity-based ranking using Fine-Tuned ColBERT with a reward model trained on human-graded responses to enable rejection sampling, thereby improving accuracy. The framework demonstrates that Fine-Tuned ColBERT achieves high alignment with human performance (ρ = 0.81–0.91) and that the reward model improves accuracy by 8.36% through rejection sampling.
Introduces EVAL, a novel framework that combines similarity-based ranking and reward modeling with rejection sampling to improve the safety and accuracy of LLMs in high-stakes medical decision-making.
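EVAL's exact reward-model setup isn't detailed here, but the rejection-sampling step it relies on follows the familiar best-of-n pattern: draw several candidate answers, score each with the reward model, and keep the highest-scoring one. The `generate` and `reward` callables below are placeholders.

```python
from typing import Callable

def rejection_sample(
    question: str,
    generate: Callable[[str], str],       # placeholder LLM sampler
    reward: Callable[[str, str], float],  # placeholder reward-model score
    n: int = 8,
) -> str:
    """Best-of-n rejection sampling: return the candidate the reward model scores highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda c: reward(question, c))
```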
The paper investigates biases in the Chatbot Arena leaderboard, a popular platform for ranking AI systems, revealing that undisclosed private testing practices and data access asymmetries distort the evaluation playing field. It demonstrates that selective disclosure of performance results by certain providers, like Meta, Google, and OpenAI, leads to biased Arena scores and overfitting to Arena-specific dynamics. The study quantifies the data access disparities, showing that closed models receive disproportionately more data compared to open-weight models, and estimates the performance gains achievable through access to Arena data.
Demonstrates that private testing practices and data access asymmetries in the Chatbot Arena leaderboard lead to biased scores and overfitting, undermining its reliability as a benchmark for general model quality.
This paper investigates the use of three AI models (ChatGPT 3.5, ChatGPT 4, and Microsoft Copilot) to assist with data extraction in a scoping review, comparing their performance against human extraction. The study found that ChatGPT-4 was the most effective model, offering faster extraction times (20 minutes per source) compared to human extraction (1 hour). However, human extraction provided more specific verbatim information, highlighting the need for human oversight to ensure accuracy and address potential biases in AI-assisted extraction.
Demonstrates the feasibility and efficiency gains of using ChatGPT-4 for data extraction in scoping reviews, while also underscoring the continued importance of human oversight for accuracy and nuanced understanding.
This study compared the readability, understandability, and overall quality of plain language summaries (PLSs) generated by six LLM chatbots (ChatGPT, Claude, Copilot, Gemini, Meta AI, and Perplexity) against 30 human-written PLSs. Using Flesch reading ease scores, Flesch-Kincaid grade levels, and seven predefined criteria rated by three authors, the research found that the chatbots produced PLSs with lower grade levels and quality comparable to that of human-written summaries. The results suggest that LLM chatbots can effectively assist in generating accessible summaries of scientific research, particularly benefiting researchers in developing countries, although accuracy should be verified.
Demonstrates that LLM chatbots can generate plain language summaries with comparable quality and improved readability compared to human-written summaries.
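For reference, the two readability measures used in the study are simple closed-form formulas over word, sentence, and syllable counts; syllable counting itself is the only approximate step and is left out of this sketch.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher is easier to read; typical English prose scores around 60-70."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate U.S. school grade level needed to understand the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: a 120-word summary split into 8 sentences with 180 syllables.
print(flesch_reading_ease(120, 8, 180))   # ~64.7
print(flesch_kincaid_grade(120, 8, 180))  # ~8.0
```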

