Inria, ENS Paris-Saclay MVA
Unlock face recognition with just one labeled example and a flood of unlabeled data, achieving state-of-the-art accuracy in a practical authentication scenario.
Forget reinforcement learning; this algorithm learns in real time without any feedback at all.
By modeling policy gradients as Gaussian processes, this work dramatically reduces the sample complexity in reinforcement learning, offering faster convergence and uncertainty estimates at little extra cost.
Learn user preferences across thousands of items from just tens of node evaluations by exploiting graph smoothness in a new spectral bandit framework.
Learning in multi-armed bandits gets a boost: even with only probabilistic side observations of other arms' losses, near-optimal regret is achievable without knowing the observation probability.
DPP-based Monte Carlo integration can offer variance reduction, but the choice of DPP (fixed vs. tailored to the integrand) determines whether you get a biased but faster-converging estimator or an unbiased but standard-rate estimator.
Forget picking influencers by headcount; this new framework lets you maximize influence based on your actual ad budget, and it even sharpens the math for the old way of doing things.
Entropy regularization makes planning provably easy: SmoothCruiser achieves polynomial sample complexity in MDPs where standard methods fail.
Platypoos adapts seamlessly to unknown reward scales, achieving optimal sample complexity in planning under uncertainty.
Learning user preferences for thousands of items can be achieved with just a handful of evaluations, thanks to a novel approach that leverages effective dimension in graph-based bandit problems.
TrailBlazer offers a computationally efficient Monte-Carlo planning algorithm that drastically reduces sample complexity by focusing exploration on near-optimal state trajectories within an MDP.
Log-barrier regularization unlocks optimal O-tilde(t^{-1/4}) last-iterate convergence in uncoupled matrix games with bandit feedback, finally closing the gap to the theoretical limit.
Forget sub-Gaussian assumptions: this semi-bandit algorithm adapts to the true covariance structure of outcomes, leading to tighter regret bounds and better performance.
Spectral Thompson Sampling offers a computationally tractable alternative for bandit problems on graphs, achieving comparable regret bounds to existing methods while scaling efficiently to large action spaces.
Learning from noisy feedback doesn't have to be a guessing game: this new algorithm achieves near-optimal regret in online learning without needing to estimate the quality of the feedback.
Retraining LLMs on their own generated content can fundamentally limit what they can learn, but only under specific, theoretically defined conditions related to generation quality.
Ditch reward models: Nash Mirror Prox achieves fast, stable convergence to a Nash equilibrium directly from human preferences, sidestepping the limitations of traditional RLHF.
A single algorithm now solves both rested and restless rotting bandits, problems previously thought to require fundamentally different approaches.
You can have your cake and eat it too: this new algorithm nearly matches the optimal performance for stochastic best-arm identification while remaining robust to adversarial attacks, despite the theoretical impossibility of a universally optimal learner.
Stochastic action availability doesn't have to hamstring online learning: "Counting Asleep Times" unlocks improved regret bounds in combinatorial settings.
Spotting unusual labels in your data just got easier with a new method that avoids the pitfalls of flagging isolated or boundary cases as anomalies.