Cohere
Enterprise AI company building foundation models for text understanding, generation, and search.
cohere.com
Recent Papers
The paper investigates biases in the Chatbot Arena leaderboard, a popular platform for ranking AI systems, revealing that undisclosed private testing practices and data access asymmetries tilt the evaluation playing field. It demonstrates that selective disclosure of performance results by certain providers, such as Meta, Google, and OpenAI, leads to biased Arena scores and overfitting to Arena-specific dynamics: a provider can test many model variants privately and publish the score of only the best one. The study quantifies the data access disparities, showing that closed models receive disproportionately more Arena data than open-weight models, and estimates the performance gains achievable through access to that data.
Demonstrates that private testing practices and data access asymmetries in the Chatbot Arena leaderboard lead to biased scores and overfitting, undermining its reliability as a benchmark for general model quality.
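The selective-disclosure effect can be illustrated with a small simulation: if a provider privately tests N variants whose measured scores are true skill plus noise and reports only the best, the reported score systematically overstates the true skill. The sketch below is a hypothetical Monte Carlo illustration, not the paper's analysis code; the skill value, noise level, and variant counts are made-up parameters.

```python
# Hypothetical sketch (not the paper's code): how reporting only the best
# of N privately tested variants inflates a leaderboard score, assuming each
# variant's measured score is the true skill plus Gaussian evaluation noise.
import random
import statistics

def best_of_n_score(true_skill, n_variants, noise_sd, rng):
    """Report the maximum observed score across n privately tested variants."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))

rng = random.Random(0)
TRUE_SKILL = 1200.0   # made-up underlying Elo-style rating
NOISE_SD = 25.0       # made-up per-variant measurement noise
TRIALS = 10_000

for n in (1, 5, 10, 30):
    scores = [best_of_n_score(TRUE_SKILL, n, NOISE_SD, rng) for _ in range(TRIALS)]
    mean_score = statistics.mean(scores)
    print(f"N={n:>2}: mean reported score = {mean_score:7.1f} "
          f"(inflation = {mean_score - TRUE_SKILL:+5.1f})")
```

With a single variant the reported score is unbiased; as N grows, the expected maximum drifts upward even though the underlying skill never changes.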
ZeroSumEval introduces a competition-based evaluation protocol for LLMs built on zero-sum games spanning diverse tasks such as security, classic games, knowledge tests, and persuasion, aiming to create dynamic benchmarks resistant to saturation. The protocol pits LLMs against each other in these games and analyzes their performance in strategic reasoning, planning, knowledge application, and creativity. Experiments across 7 games and 13 models reveal that while frontier models perform well in common games and at answering questions, they struggle with tasks that require creativity and with generating novel challenges.
Introduces ZeroSumEval, a novel and extensible framework for evaluating LLMs through zero-sum games, providing a dynamic and standardized approach to assess capabilities like strategic reasoning and creativity.
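To make the protocol concrete, the sketch below shows the general shape of a competition-based evaluation loop: models play pairwise zero-sum matches and a standard Elo update turns match outcomes into a dynamic ranking. This is an illustrative sketch, not ZeroSumEval's actual API; play_match is a hypothetical stand-in for running one game, and the model names are placeholders.

```python
# Illustrative competition-based evaluation loop in the spirit of
# ZeroSumEval (not the framework's actual interface).
import itertools
import random

def play_match(model_a, model_b, rng):
    """Hypothetical match runner; returns 1.0 if model_a wins, else 0.0.
    A real harness would run the two models through a zero-sum game here."""
    return 1.0 if rng.random() < 0.5 else 0.0  # placeholder outcome

def update_elo(ratings, a, b, score_a, k=32.0):
    """Standard Elo update from one match result (score_a is 1.0 or 0.0)."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
    ratings[a] += k * (score_a - expected_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))

rng = random.Random(42)
models = ["model_x", "model_y", "model_z"]  # placeholder entrants
ratings = {m: 1000.0 for m in models}

# Round-robin tournament: every pair plays repeated zero-sum games.
for _ in range(100):
    for a, b in itertools.combinations(models, 2):
        update_elo(ratings, a, b, play_match(a, b, rng))

for m, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {r:.0f}")
```

Because the ranking is relative and the games generate fresh positions each match, a protocol like this resists the benchmark saturation that static test sets suffer from.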
The paper introduces Command A, a large language model designed for enterprise applications, featuring agent optimization, multilingual support (23 languages), and a hybrid architecture. The model is built with a decentralized training approach that combines self-refinement and model merging, yielding strong retrieval-augmented generation (RAG), grounding, and tool-use capabilities for automating business processes. Evaluations across enterprise-relevant tasks and public benchmarks show strong performance and efficiency, and the model weights have been released for research use.
Introduces Command A, an enterprise-focused LLM, and details its training pipeline and evaluations.
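One common form of the model merging mentioned above is weighted averaging of parameter tensors across expert checkpoints. The sketch below illustrates that general technique with tiny made-up checkpoints; it is not Cohere's recipe or code, and merge_state_dicts is a hypothetical helper.

```python
# Minimal sketch of checkpoint merging by weighted parameter averaging,
# one common form of "model merging"; an illustration under assumed inputs,
# not Cohere's actual method.
import torch

def merge_state_dicts(state_dicts, weights):
    """Linearly combine matching parameter tensors from several checkpoints."""
    total = sum(weights)
    return {
        name: sum((w / total) * sd[name].float()
                  for sd, w in zip(state_dicts, weights))
        for name in state_dicts[0]
    }

# Tiny demo with hypothetical two-parameter "checkpoints" standing in for
# expert models (e.g., a RAG expert and a multilingual expert).
expert_a = {"w": torch.tensor([1.0, 0.0]), "b": torch.tensor([0.5])}
expert_b = {"w": torch.tensor([0.0, 1.0]), "b": torch.tensor([1.5])}
merged = merge_state_dicts([expert_a, expert_b], weights=[0.5, 0.5])
print(merged)  # {'w': tensor([0.5000, 0.5000]), 'b': tensor([1.])}
```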

