Optimal Exploration-Exploitation in Multi-Armed-Bandit Problems with Non-stationary Rewards
Coauthor(s): O. Besbes, Y. Gur, and A. Zeevi
In a multi-armed bandit (MAB) problem a gambler needs to choose, at each round of play, one of K arms, each characterized by an unknown reward distribution. Reward realizations are observed only when an arm is selected, and the gambler's objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about the arms (exploration) while simultaneously optimizing immediate rewards (exploitation); the price paid due to this trade-off is often referred to as the regret, and the main question is how small this price can be as a function of the horizon length T. This problem has been studied extensively when the reward distributions do not change over time; an assumption that is constraining from the standpoint of realism, yet one that supports a sharp characterization of the regret. In this paper, we focus on a MAB formulation that allows for a broad range of temporal uncertainties in the rewards, lending itself to many application settings while remaining mathematically tractable. We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of reward variation and the minimal achievable regret. Our analysis also draws concrete connections between two rather disparate strands of literature: the adversarial and the stochastic MAB frameworks.
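To make the exploration-exploitation trade-off under non-stationary rewards concrete, the following is a minimal simulation sketch. It pairs a simple drifting Bernoulli bandit with an Exp3-style adversarial-bandit policy that restarts periodically so as to forget stale information; measuring performance against a dynamic oracle that plays the best arm at every round. The drift model, the function names (`drifting_means`, `restarting_exp3`), and the batch and learning-rate parameters are illustrative assumptions for exposition, not values or algorithms quoted from the paper.

```python
# Sketch: non-stationary K-armed Bernoulli bandit with a restarting
# Exp3-style policy. All modeling choices here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def drifting_means(K, T, scale=0.02):
    """Arm means follow a clipped random walk: a simple non-stationary model."""
    mu = np.empty((T, K))
    mu[0] = rng.uniform(0.2, 0.8, size=K)
    for t in range(1, T):
        mu[t] = np.clip(mu[t - 1] + scale * rng.standard_normal(K), 0.0, 1.0)
    return mu

def restarting_exp3(mu, batch, gamma):
    """Exp3 with weights reset every `batch` rounds; returns realized rewards."""
    T, K = mu.shape
    rewards = np.empty(T)
    w = np.ones(K)
    for t in range(T):
        if t % batch == 0:
            w = np.ones(K)                      # restart: discard stale weights
        p = (1 - gamma) * w / w.sum() + gamma / K
        arm = rng.choice(K, p=p)
        r = float(rng.random() < mu[t, arm])    # only the chosen arm is observed
        w[arm] *= np.exp(gamma * (r / p[arm]) / K)  # importance-weighted update
        w /= w.max()                            # rescale for numerical stability
        rewards[t] = r
    return rewards

K, T = 5, 20_000
mu = drifting_means(K, T)
rewards = restarting_exp3(mu, batch=2_000, gamma=0.1)
oracle = mu.max(axis=1).sum()                   # dynamic oracle: best arm each round
print(f"policy: {rewards.sum():.0f}, oracle: {oracle:.0f}, "
      f"regret: {oracle - rewards.sum():.0f}")
```

The restart schedule is the knob that trades exploration against non-stationarity: shorter batches track faster reward variation but pay more for re-exploration, which is one intuitive reading of the link between reward variation and achievable regret described above.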
O. Besbes, Y. Gur, and A. Zeevi, "Optimal Exploration-Exploitation in Multi-Armed-Bandit Problems with Non-stationary Rewards," Columbia Business School, 2013.