Summary
BubbleRank is a bandit algorithm for safe online learning to re-rank that bridges the offline and online settings: it starts from an initial base list, for example one produced by an offline-trained ranker, and gradually improves it by exchanging adjacent items, bubbling up more attractive items while exploring safely. The theoretical analysis bounds the n-step regret of BubbleRank and proves its safety. The algorithm targets the re-ranking stage of production ranking systems, adaptively re-ranking a short list of candidate items. The paper reviews click models, casts the problem in a stochastic click bandit framework, and presents BubbleRank with detailed explanations and theoretical guarantees.
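The mechanism described above can be illustrated with a small sketch. This is not the paper's exact pseudocode or confidence bounds; the pairing scheme, the Hoeffding-style test, the `delta` parameter, and the simulated click model are all simplifying assumptions made here for illustration. The key ideas it shows are: only adjacent, not-yet-resolved pairs are randomly exchanged (safe exploration), click differences on exchanged pairs accumulate as evidence, and an item is permanently bubbled up once the evidence is conclusive.

```python
import math
import random

def bubblerank(base_list, click_fn, n_steps, delta=0.05):
    """Sketch of a BubbleRank-style safe re-ranker (illustrative, not the
    paper's exact algorithm). base_list is the initial ranked list of item
    ids; click_fn maps a displayed list to a 0/1 click vector."""
    R = list(base_list)
    K = len(R)
    s = {}  # net click difference per unordered item pair (keyed by sorted pair)
    n = {}  # number of informative observations per unordered item pair

    def key(i, j):
        return (i, j) if i < j else (j, i)

    def resolved(i, j):
        # Hoeffding-style test (an assumption here): the pair's order is
        # considered settled once the net click difference is large relative
        # to the number of observations.
        k = key(i, j)
        cnt = n.get(k, 0)
        return cnt > 0 and abs(s[k]) > math.sqrt(2 * cnt * math.log(1 / delta))

    for _ in range(n_steps):
        h = random.randint(0, 1)       # random offset for pairing positions
        D = list(R)
        randomized = []                # positions of pairs randomized this round
        for p in range(h, K - 1, 2):
            # Safe exploration: exchange only pairs whose order is unresolved.
            if not resolved(D[p], D[p + 1]):
                if random.random() < 0.5:
                    D[p], D[p + 1] = D[p + 1], D[p]
                randomized.append(p)
        clicks = click_fn(D)
        # Update statistics when exactly one item of a randomized pair is clicked.
        for p in randomized:
            i, j = D[p], D[p + 1]
            if clicks[p] != clicks[p + 1]:
                k = key(i, j)
                winner = i if clicks[p] == 1 else j
                n[k] = n.get(k, 0) + 1
                s[k] = s.get(k, 0) + (1 if winner == k[0] else -1)
        # Bubble up: permanently reorder adjacent pairs that are now resolved.
        for p in range(K - 1):
            i, j = R[p], R[p + 1]
            k = key(i, j)
            if resolved(i, j):
                better = k[0] if s[k] > 0 else k[1]
                if better == j:
                    R[p], R[p + 1] = R[p + 1], R[p]
    return R

# Usage with a simulated position-based click model (attraction probabilities
# and the 0.9 position-examination decay are assumptions of this demo).
random.seed(0)
attract = [0.2, 0.9, 0.4, 0.05]

def sim_clicks(displayed):
    return [1 if random.random() < attract[item] * 0.9 ** pos else 0
            for pos, item in enumerate(displayed)]

reranked = bubblerank([0, 1, 2, 3], sim_clicks, n_steps=500)
```

Because exchanges are restricted to adjacent pairs and stop once a pair is resolved, every displayed list stays close to the current base list, which is the intuition behind the safety guarantee.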