Inria, ENS MVA
Re-training LLMs on their own generated content can fundamentally limit what they can learn, but only under specific, theoretically defined conditions related to generation quality.
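As a toy illustration of why generation quality matters (this is a deliberately degenerate sketch, not the paper's theoretical construction), consider a "model" that simply memorises and resamples its training set. Pure self-training then becomes a Wright-Fisher-style resampling chain whose diversity can only shrink, while mixing in a fraction of fresh real data each round keeps it from collapsing; the `real_fraction` knob below is an illustrative stand-in for generation quality.

```python
# Toy sketch, not the paper's construction: a model that memorises and
# resamples its training data. Retraining purely on its own output loses
# diversity every round and eventually collapses to a single point;
# injecting a small fraction of fresh real data keeps diversity alive.
import numpy as np

rng = np.random.default_rng(0)
N = 200                                       # training-set size per round

def diversity_after(rounds, real_fraction):
    data = rng.normal(size=N)                 # initial real data
    for _ in range(rounds):
        k = int(real_fraction * N)
        fresh = rng.normal(size=k)            # new real samples, if any
        synth = rng.choice(data, size=N - k)  # the model's own generations
        data = np.concatenate([fresh, synth])
    return len(np.unique(data))               # distinct values remaining

print("pure self-training:", diversity_after(2000, 0.0))   # collapses toward 1
print("5% fresh real data:", diversity_after(2000, 0.05))  # stays well above 1
```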
Ditch reward models: Nash Mirror Prox achieves fast, stable convergence to a Nash equilibrium directly from human preferences, sidestepping the limitations of traditional RLHF.
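As a rough sketch of the underlying mechanics (illustrative only, not the paper's algorithm), mirror prox, Nemirovski's extragradient method with an entropic mirror map, can be run directly on the two-player zero-sum game induced by a pairwise preference matrix, with no reward model in between. The matrix `P` and all hyperparameters below are toy assumptions.

```python
# Illustrative sketch: mirror prox (extragradient + multiplicative weights)
# on the symmetric game defined by pairwise preferences. P[i, j] is the
# probability that response i is preferred to response j, so P + P.T == 1.
import numpy as np

rng = np.random.default_rng(0)
n = 5
raw = rng.random((n, n))
P = raw / (raw + raw.T)          # toy preference matrix: P[i,j] + P[j,i] = 1
A = P - 0.5                      # antisymmetric payoff of the zero-sum game

def md_step(p, grad, eta):
    """Entropic mirror-descent step on the simplex (multiplicative weights)."""
    q = p * np.exp(eta * grad)
    return q / q.sum()

eta, T = 0.5, 500
pi = np.full(n, 1.0 / n)         # max player: current policy
mu = np.full(n, 1.0 / n)         # min player: opponent policy
pi_avg = np.zeros(n)
for _ in range(T):
    # Extrapolation: look-ahead step using gradients at the current point.
    pi_half = md_step(pi, A @ mu, eta)
    mu_half = md_step(mu, -(A.T @ pi), eta)
    # Update: step again from the current point with look-ahead gradients.
    pi = md_step(pi, A @ mu_half, eta)
    mu = md_step(mu, -(A.T @ pi_half), eta)
    pi_avg += pi_half            # averaged look-ahead iterates carry the
pi_avg /= T                      # classical O(1/T) mirror prox guarantee

# At the symmetric Nash equilibrium no single response beats the policy
# more than half the time; exploitability measures the gap to that point.
print("pi*:", np.round(pi_avg, 3))
print("exploitability:", (P @ pi_avg).max() - 0.5)
```

The look-ahead step is what buys the stability: plain multiplicative weights tends to cycle around the equilibrium of such games, while the extragradient correction damps the rotation.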