The main topic for today is the theoretical analysis of Bandit algorithms. Consider the same setting as last lecture. We have $K$ coins with unknown success probabilities $p_1, \dots, p_K$. We play for $T$ rounds. In each round, we pick one of the coins, which is then tossed. If the result is heads, we win $\$1$; if the result is tails, we win $\$0$. The objective is to maximize total earnings.
Here is some terminology that we shall use today. Given a specific algorithm (such as ETC, UCB, TS) for solving this problem, we can associate a reward function $\text{Reward}(t)$ for $t = 1, 2, 3, \dots, T$. This function is simply the number of dollars earned by the algorithm after $t$ rounds: \begin{align*} \text{Reward}(t) = \text{dollars earned after } t \text{ rounds}. \end{align*} We also associate a closely related function called Regret$(t)$ which is defined as: \begin{align*} \text{Regret}(t) := t \max_{1 \leq a \leq K} p_a - \text{Reward}(t). \end{align*} In other words, Regret$(t)$ compares the Reward of the algorithm after $t$ rounds with the Expected Reward of the best possible algorithm which always selects the coin with the highest success probability.
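To make these definitions concrete, here is a small simulation sketch (in Python with NumPy; the uniformly random coin-picking policy and the specific probabilities are illustrative placeholders, not an algorithm from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Success probabilities of the K = 9 coins (illustrative values).
p = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
T = 1000

# Placeholder policy: pick a coin uniformly at random in every round.
# Any bandit algorithm would supply its own choice rule here.
choices = rng.integers(0, len(p), size=T)
outcomes = rng.random(T) < p[choices]              # results of the coin tosses
reward = np.cumsum(outcomes)                       # Reward(t) for t = 1, ..., T
regret = np.arange(1, T + 1) * p.max() - reward    # Regret(t) = t * max_a p_a - Reward(t)
```

Both `reward` and `regret` are random: re-running with a different seed gives different trajectories, which is why we average over hypothetical re-runs below.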
Because of the randomness in the outcomes of the coin tosses (which further influences the algorithm's selection of coins in subsequent rounds), Reward$(t)$ and Regret$(t)$ are random functions. For example, if we focus on a specific algorithm (such as TS) and play the game for $T$ rounds multiple times, then we would obtain a different realization of Regret$(t)$ every time. One can obtain a deterministic measure of the performance of a specific Bandit algorithm by averaging Regret$(t)$ over hypothetical re-runs of the whole game. This leads to the Average Regret given by: \begin{align*} \text{Average-Regret}(t) = t \max_{1 \leq a \leq K} p_a - \mathbb{E} \left(\text{Reward}(t) \right). \end{align*} We would like to work with algorithms for which the Average Regret is small.
The Average Regret, as a function of $t$, always increases with $t$. This is because: \begin{align*} & \text{Average-Regret}(t+1) - \text{Average-Regret}(t) \\ &= (t+1) \max_a p_a - \mathbb{E} \left(\text{Reward}(t+1) \right) - \left[t \max_a p_a - \mathbb{E} \left(\text{Reward}(t) \right) \right] \\ &= \max_a p_a - \mathbb{E} \left[\text{Reward}(t+1) - \text{Reward}(t) \right] \\ &= \max_a p_a - \mathbb{E} \left[ \text{payoff in Round} ~ t+1 \right] \geq 0 \end{align*} The last term is nonnegative because, in any single round, the expected payoff can never exceed $\max_a p_a$.
Next, we shall look at two very simple (and somewhat unrealistic) examples of Bandit algorithms for which it is easy to explicitly evaluate the Average-Regret.
Consider the setting from last lecture where we are dealing with $K = 9$ coins with success probabilities $p_1 = 0.1, p_2 = 0.2, \dots, p_8 = 0.8, p_9 = 0.9$. The best coin is clearly coin 9 with success probability $0.9$.
Consider the algorithm which, at each round, picks one of coin 8 or coin 9 randomly with equal probabilities. This might be considered a reasonable algorithm as it is picking between the two best coins. However, it is not attempting to identify the best coin between 8 and 9 using data from previous rounds. Even after playing the game for a very large number of rounds, this algorithm will appear to be clueless as to which coin is the best by continuing to randomly select between 8 and 9.
Let us evaluate the Average-Regret$(t)$ function for this algorithm: \begin{align*} \text{Average-Regret}(t) &= t \max_a p_a - \mathbb{E} \left(\text{Reward}(t)\right) \\ &= t p_9 - \mathbb{E} \sum_{s=1}^t I\{\text{success in round}~s\} \\ &= t p_9 - \sum_{s=1}^t \mathbb{E} I\{\text{success in round}~s\} \\ &= t p_9 - \sum_{s=1}^t \mathbb{P} \{\text{success in round}~s\} \\ &= t p_9 - \sum_{s=1}^t \left[\mathbb{P} \{\text{success in round}~s \mid \text{coin 8 picked} \} \mathbb{P} \{\text{coin 8 picked}\} + \mathbb{P} \{\text{success in round}~s \mid \text{coin 9 picked} \} \mathbb{P} \{\text{coin 9 picked}\} \right] \\ &= t p_9 - \sum_{s=1}^t \left[p_8 \times \frac{1}{2} + p_9 \times \frac{1}{2} \right] \\ &= \frac{t}{2} \left(p_9 - p_8 \right). \end{align*} With $p_8 = 0.8$ and $p_9 = 0.9$, the $ \text{Average-Regret}(t) $ equals $0.05 \times t$. Note that this average regret increases linearly with $t$. The linear regret means that even after playing the game for a large number of rounds, the algorithm will continue to lose some constant fraction of the best payout in each round. Linear Average Regret is a sign of a poor Bandit algorithm. The Average Regret of properly designed bandit algorithms (such as UCB and TS) will grow much more slowly (logarithmically) with $t$.
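A quick simulation (Python/NumPy; parameters chosen to match the example above) confirms the linear growth of the average regret by averaging the reward over many re-runs of the game:

```python
import numpy as np

rng = np.random.default_rng(0)
p8, p9 = 0.8, 0.9
T, n_runs = 200, 5000

# In every round, each run picks coin 8 or coin 9 with probability 1/2 each.
picks_9 = rng.random((n_runs, T)) < 0.5
probs = np.where(picks_9, p9, p8)                  # success probability of the picked coin
wins = rng.random((n_runs, T)) < probs             # outcomes of the coin tosses

# Average-Regret(t), estimated by averaging Reward(t) over the n_runs re-runs.
avg_regret = np.arange(1, T + 1) * p9 - wins.cumsum(axis=1).mean(axis=0)
# Theory: Average-Regret(t) = (p9 - p8) * t / 2 = 0.05 * t, i.e. linear in t.
```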
A good bandit algorithm will utilize available information effectively. One would expect that after playing the game for a large number of rounds, the algorithm should figure out which is the best coin. As an example of such an algorithm, consider the following variant of the previous algorithm (call it Algorithm Two). This algorithm also picks one of coin 8 and coin 9 in each round. However, instead of selecting between them with equal probabilities, it picks coin 8 with probability $\rho_t$ and coin 9 with probability $1-\rho_t$ in round $t$. Let us assume that $\rho_t$ decreases with $t$ (this is supposed to mimic the idea that the algorithm becomes more sure that coin 9 is the best coin with each passing round). For concrete examples, take $\rho_t = 1/t$ or $\rho_t = 1/t^2$.
Here is the average regret of this algorithm. \begin{align*} \text{Average-Regret}(t) &= t \max_a p_a - \mathbb{E} \left(\text{Reward}(t)\right) \\ &= t p_9 - \mathbb{E} \sum_{s=1}^t I\{\text{success in round}~s\} \\ &= t p_9 - \sum_{s=1}^t \mathbb{E} I\{\text{success in round}~s\} \\ &= t p_9 - \sum_{s=1}^t \mathbb{P} \{\text{success in round}~s\} \\ &= t p_9 - \sum_{s=1}^t \left[\mathbb{P} \{\text{success in round}~s \mid \text{coin 8 picked} \} \mathbb{P} \{\text{coin 8 picked}\} + \mathbb{P} \{\text{success in round}~s \mid \text{coin 9 picked} \} \mathbb{P} \{\text{coin 9 picked}\} \right] \\ &= t p_9 - \sum_{s=1}^t \left[p_8 \times \rho_s + p_9 \times (1-\rho_s) \right] \\ &= \left(p_9 - p_8 \right) \sum_{s=1}^t \rho_s. \end{align*} As a check, note that when $\rho_s = 1/2$ (as in the previous algorithm), we get back the linear average regret $(p_9 - p_8)t/2$. Now with $\rho_s = 1/s$, the Average Regret will equal \begin{align*} \text{Average-Regret}(t) = \left(p_9 - p_8 \right) \sum_{s=1}^t \rho_s = \left(p_9 - p_8 \right) \left(1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{t} \right) \approx \left(p_9 - p_8 \right) \left(\log t + \gamma \right) \end{align*} for a constant $\gamma$ (see https://en.wikipedia.org/wiki/Harmonic_series_(mathematics)). Thus in this case, the average regret only grows logarithmically with $t$. We shall take this to be a characteristic of a good Bandit algorithm. The implication is that if the probability of picking a suboptimal coin decreases with increasing rounds as $1/t$, then the algorithm will achieve logarithmic regret.
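The same kind of simulation, now with $\rho_s = 1/s$, matches the harmonic-sum formula above (a Python sketch with illustrative parameters):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
p8, p9 = 0.8, 0.9
T, n_runs = 1000, 4000

rho = 1.0 / np.arange(1, T + 1)                    # probability of picking coin 8 in round s

picks_8 = rng.random((n_runs, T)) < rho            # rho broadcasts across the runs
probs = np.where(picks_8, p8, p9)
wins = rng.random((n_runs, T)) < probs
avg_regret = np.arange(1, T + 1) * p9 - wins.cumsum(axis=1).mean(axis=0)

# Exact formula: (p9 - p8) * (1 + 1/2 + ... + 1/t), roughly (p9 - p8) * (log t + gamma).
predicted = (p9 - p8) * np.cumsum(rho)
approx = (p9 - p8) * (math.log(T) + 0.5772)        # gamma = Euler-Mascheroni constant
```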
If $\rho_t$ decreases even more rapidly with $t$, such as $\rho_t = 1/t^2$, then the average regret grows even more slowly than logarithmically; in fact, since $\sum_{s=1}^{\infty} 1/s^2 = \pi^2/6 < \infty$, it remains bounded by a constant for all $t$. It turns out that it is generally impossible to achieve such small $\rho_t$ with realistic algorithms (note that in practice $\rho_t$ will be determined by the data from previous rounds).
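Since, for $\rho_t = 1/t^2$, the average regret is exactly $(p_9 - p_8)\sum_{s \le t} 1/s^2$, we can verify the constant bound numerically (a Python sketch):

```python
import math
import numpy as np

p8, p9 = 0.8, 0.9
t = np.arange(1, 100_001)

# With rho_t = 1/t^2 the average regret is exactly (p9 - p8) * sum_{s<=t} 1/s^2.
avg_regret = (p9 - p8) * np.cumsum(1.0 / t**2)

# The partial sums converge to pi^2/6, so the regret stays below a constant.
bound = (p9 - p8) * math.pi**2 / 6    # about 0.1645
```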
It turns out that the UCB and TS algorithms from last lecture both have logarithmic regret. We shall sketch the argument for proving this in the case of the UCB algorithm.
Let us first revisit the UCB algorithm from last time. We shall use the following notation and terminology: $T_a(t)$ denotes the number of times coin $a$ is picked in the first $t$ rounds, and $X_a(t)$ denotes the number of heads obtained from coin $a$ in the first $t$ rounds. The upper confidence bound for coin $a$ after $t$ rounds, at confidence level $\delta$, is \begin{align*} \text{UCB}_a(t, \delta) := \frac{X_a(t)}{T_a(t)} + \sqrt{\frac{\log(1/\delta)}{2 T_a(t)}}. \end{align*}
In the $t^{th}$ round, the UCB algorithm picks the coin $a$ for which $\text{UCB}_a(t-1, \delta_t)$ is the largest. Here $\delta_t$ is the confidence level which needs to be appropriately chosen. Theoretical analysis recommends choices such as $\delta_t = 1/t^2$ or $\delta_t = 1/t^3$; we shall work with $\delta_t = 1/t^2$ below. Before proceeding with the proof that the UCB algorithm achieves logarithmic regret, let us first provide some intuition for the UCB algorithm. For this, we need to revisit the Hoeffding inequality.
Let $X \sim \text{Bin}(n, p)$. We can write $X = X_1 + \dots + X_n$ for $X_i \overset{\text{i.i.d}}{\sim} \text{Bernoulli}(p)$. Then Hoeffding's inequality is given by \begin{align*} \mathbb{P}\{X_1 + \dots + X_n \geq x\} \leq \exp \left(-2 \frac{(x - np)^2}{n} \right) ~~ \text{ for } ~ x \geq np ~~~~~ \text{ and } ~~~~~ \mathbb{P}\{X_1 + \dots + X_n \leq x\} \leq \exp \left(-2 \frac{(x - np)^2}{n} \right) ~~ \text{ for } ~ x \leq np \end{align*} Equating the probability bounds (right hand sides) in each inequality above to $\delta$, and solving for $x$, we obtain: \begin{align*} \mathbb{P} \left\{p > \frac{X_1 + \dots + X_n}{n} - \sqrt{\frac{\log(1/\delta)}{2n}} \right\} \geq 1- \delta ~~~~ \text{ and } ~~~~ \mathbb{P} \left\{p < \frac{X_1 + \dots + X_n}{n} + \sqrt{\frac{\log(1/\delta)}{2n}} \right\} \geq 1- \delta \tag{1} \end{align*} The above probability bounds hold for a fixed value of $n$. By using the union bound, we can also guarantee them simultaneously for all sample sizes $m = 1, \dots, n$, as explained below: \begin{align*} & \mathbb{P}\left\{p > \frac{X_1 + \dots + X_m}{m} - \sqrt{\frac{\log(1/\delta)}{2m}} \text{ for every } m = 1, \dots, n \right\} \\ &= 1 - \mathbb{P}\left\{p \leq \frac{X_1 + \dots + X_m}{m} - \sqrt{\frac{\log(1/\delta)}{2m}} \text{ for some } m = 1, \dots, n \right\} \\ &= 1 - \mathbb{P}\left[ \bigcup_{m=1}^n \left\{p \leq \frac{X_1 + \dots + X_m}{m} - \sqrt{\frac{\log(1/\delta)}{2m}} \right\} \right] \\ &\geq 1 - \sum_{m=1}^n \mathbb{P} \left\{p \leq \frac{X_1 + \dots + X_m}{m} - \sqrt{\frac{\log(1/\delta)}{2m}} \right\} \\ &= 1 - \sum_{m=1}^n \left(1 - \mathbb{P} \left\{p > \frac{X_1 + \dots + X_m}{m} - \sqrt{\frac{\log(1/\delta)}{2m}} \right\} \right) \geq 1 - n \delta. \end{align*} Here we used the inequality $\mathbb{P} \left(\cup_{i=1}^n B_i \right) \leq \sum_{i=1}^n \mathbb{P}(B_i)$, which is known as the union bound. A similar argument also holds for the upper bound.
We thus have the following uniform versions of the inequalities in (1): \begin{align*} & \mathbb{P} \left\{p > \frac{X_1 + \dots + X_m}{m} - \sqrt{\frac{\log(1/\delta)}{2m}} \text{ for every } m = 1, \dots, n \right\} \geq 1- n \delta, ~~~~ \text{ and } \\ & \mathbb{P} \left\{p < \frac{X_1 + \dots + X_m}{m} + \sqrt{\frac{\log(1/\delta)}{2m}} \text{ for every } m = 1, \dots, n \right\} \geq 1- n \delta \tag{2} \end{align*} For example, with $\delta = 1/n^2$, the above two inequalities become \begin{align*} & \mathbb{P} \left\{p > \frac{X_1 + \dots + X_m}{m} - \sqrt{\frac{\log n}{m}} \text{ for every } m = 1, \dots, n \right\} \geq 1- \frac{1}{n}, ~~~~ \text{ and } \\ & \mathbb{P} \left\{p < \frac{X_1 + \dots + X_m}{m} + \sqrt{\frac{\log n}{m}} \text{ for every } m = 1, \dots, n \right\} \geq 1- \frac{1}{n}. \end{align*} These inequalities motivate the following definitions of confidence bounds: \begin{align*} \text{LCB}(m, \delta) := \frac{X_1 + \dots + X_m}{m} - \sqrt{\frac{\log(1/\delta)}{2m}} ~~~~ \text{ and } ~~~~ \text{UCB}(m, \delta) := \frac{X_1 + \dots + X_m}{m} + \sqrt{\frac{\log(1/\delta)}{2m}}. \end{align*} With $\delta = 1/n^2$, the bounds $\text{LCB}(m, 1/n^2)$ and $\text{UCB}(m, 1/n^2)$ provide bounds for $p$ that hold uniformly over $m = 1, \dots, n$ with probability at least $1 - 1/n$: we can write \begin{align*} &\mathbb{P} \left\{p > \text{LCB}(m, 1/n^2)~ \text{for every } 1 \leq m \leq n \right\} \geq 1 - (1/n) \\ &\mathbb{P} \left\{p < \text{UCB}(m, 1/n^2) ~ \text{for every } 1 \leq m \leq n \right\} \geq 1 - (1/n)\tag{3} \end{align*}
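As a sanity check on inequality (3), one can simulate Bernoulli data and estimate how often the upper confidence bounds cover $p$ for every $m \leq n$ (a Python sketch; the values of $p$, $n$ and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, n_trials = 0.6, 200, 2000       # arbitrary illustrative choices
delta = 1.0 / n**2

covered = 0
for _ in range(n_trials):
    x = (rng.random(n) < p).astype(float)          # X_1, ..., X_n ~ Bernoulli(p)
    m = np.arange(1, n + 1)
    ucb = np.cumsum(x) / m + np.sqrt(np.log(1 / delta) / (2 * m))   # UCB(m, 1/n^2)
    covered += bool((p < ucb).all())               # p below UCB(m) for every m = 1..n

coverage = covered / n_trials
# Inequality (3) guarantees coverage at least 1 - 1/n = 0.995; empirically it
# is typically much higher, since the union bound is loose.
```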
We shall now argue that UCB (applied with confidence levels $\delta_t = 1/t^2$) has logarithmic regret. The rigorous proof of this fact is somewhat complicated, but we shall only provide the main idea here. For a precise statement of the result and its proof, we refer you to Theorem 7.1 and its proof in the book titled 'Bandit Algorithms' by Lattimore and Szepesvári (the book is available here https://tor-lattimore.com/downloads/book/book.pdf).
In Round $t$, the UCB algorithm picks the coin $a$ for which $\text{UCB}_a(t-1, \delta_t)$ is the largest. Let us again consider our example where the two best coins are the ones with probabilities $p_9 = 0.9$ and $p_8 = 0.8$. What is the probability $\rho_t$ of picking coin 8 over coin 9 in Round $t$? We have already seen (in the context of Algorithm Two above) that if $\rho_t$ is small enough (e.g., if $\rho_t$ decays like $1/t$), then the Average Regret will be logarithmic in $t$. The probability of picking coin 8 over coin 9 (with $\delta_t = 1/t^2$) is \begin{align*} \rho_t &:= \mathbb{P} \left\{\text{UCB}_8(t-1, 1/t^2) > \text{UCB}_9(t-1, 1/t^2) \right\} \\ &\leq \mathbb{P} \left\{\text{UCB}_8(t-1, 1/t^2) > \text{UCB}_9(t-1, 1/t^2), p_9 < \text{UCB}_9(t-1, 1/t^2) \right\} \\ & + \mathbb{P} \left\{\text{UCB}_8(t-1, 1/t^2) > \text{UCB}_9(t-1, 1/t^2), p_9 \geq \text{UCB}_9(t-1, 1/t^2) \right\} \\ &\leq \mathbb{P} \left\{\text{UCB}_8(t-1, 1/t^2) > p_9 \right\} + \mathbb{P} \left\{p_9 \geq \text{UCB}_9(t-1, 1/t^2) \right\}. \end{align*} By inequality (3) above, the second probability is bounded from above by $1/t$. Thus \begin{align*} \rho_t \leq \mathbb{P} \left\{\text{UCB}_8(t-1, 1/t^2) > p_9 \right\} + \frac{1}{t} \end{align*} The key then is to bound $\mathbb{P} \left\{\text{UCB}_8(t-1, 1/t^2) > p_9 \right\}$: \begin{align*} \mathbb{P} \left\{\text{UCB}_8(t-1, 1/t^2) > p_9 \right\} &= \mathbb{P} \left\{\frac{X_8(t-1)}{T_8(t-1)} + \sqrt{\frac{\log t}{T_8(t-1)}} > p_9 \right\} \\ &= \mathbb{P} \left\{\frac{X_8(t-1)}{T_8(t-1)} - p_8 > p_9 - p_8 - \sqrt{\frac{\log t}{T_8(t-1)}} \right\} \end{align*} Suppose now that \begin{align*} p_9 - p_8 \geq 2\sqrt{\frac{\log t}{T_8(t-1)}} ~~ \text{ or, equivalently } ~ T_8(t-1) \geq \frac{4 \log t}{(p_9 - p_8)^2} \tag{4}. 
\end{align*} Then \begin{align*} \mathbb{P} \left\{\frac{X_8(t-1)}{T_8(t-1)} - p_8 > p_9 - p_8 - \sqrt{\frac{\log t}{T_8(t-1)}} \right\} &\leq \mathbb{P} \left\{\frac{X_8(t-1)}{T_8(t-1)} - p_8 > \sqrt{\frac{\log t}{T_8(t-1)}} \right\} \\ &=\mathbb{P} \left\{ p_8 < \frac{X_8(t-1)}{T_8(t-1)} - \sqrt{\frac{\log t}{T_8(t-1)}} \right\} \leq 1/t \end{align*} where the last inequality follows from the LCB confidence guarantee. This gives $\rho_t \leq 2/t$ in the case where condition (4) is satisfied. Recall, from our analysis of Algorithm Two, that $\rho_t \leq 2/t$ leads to logarithmic regret. Also note that if condition (4) is violated, then in the first $t-1$ rounds, coin 8 was selected in only logarithmically many rounds (at most $4 \log t/(p_9 - p_8)^2$ of them), which also leads to logarithmic regret. All these pieces can be put together to yield a rigorous proof of the logarithmic regret for UCB (as in the proof of Theorem 7.1 in the Lattimore-Szepesvári book).
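The argument above can be complemented by a direct simulation of the UCB algorithm with $\delta_t = 1/t^2$ on the nine-coin example (a Python sketch, not the rigorous proof; note that with $\delta_t = 1/t^2$ the confidence width simplifies to $\sqrt{\log t / T_a}$):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
K, T = len(p), 20000

wins = np.zeros(K)    # X_a(t): number of heads observed for coin a
pulls = np.zeros(K)   # T_a(t): number of times coin a has been picked
reward = 0.0

for t in range(1, T + 1):
    if t <= K:                        # pick every coin once to initialize
        a = t - 1
    else:                             # delta_t = 1/t^2 gives width sqrt(log t / T_a)
        ucb = wins / pulls + np.sqrt(np.log(t) / pulls)
        a = int(np.argmax(ucb))
    outcome = rng.random() < p[a]
    wins[a] += outcome
    pulls[a] += 1
    reward += outcome

regret = T * p.max() - reward         # realized Regret(T) for this single run
```

On a typical run, the realized regret is far below the $0.05 \times T = 1000$ that the fifty-fifty algorithm would accrue on average, and the vast majority of the pulls go to coin 9.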
We have focussed so far on the Bandit problem where there are $K$ coins. In more general bandit setups, instead of coins, there are $K$ arms with each arm giving a random reward every time it is picked (instead of 'picking' an arm, the usual terminology speaks of 'pulling' an arm). In each round, the player decides to pull one of the $K$ arms. The pulled arm gives a random reward. We assume that the $a^{th}$ arm gives rewards that are i.i.d according to a probability distribution $P_a$ with mean $\mu_a$.
In each round $t \in \{1,\ldots,n\}$, the player selects an arm $A_t$. The arm then samples a reward $X_t$ from the distribution $P_{A_t}$, and reveals $X_t$ to the player. Let us assume that the rewards always take values in a bounded interval $[\alpha, \beta]$.
Let $T_a(t)$ denote the number of times arm $a$ is selected, up to and including time $t$: $$T_a(t) = \sum_{s=1}^t \mathbf{1}\{A_s = a\}.$$ Also let $\hat{\mu}_a(t)$ denote the average reward observed for arm $a$ up to and including time $t$: $$\hat \mu_a(t) = \frac{1}{T_a(t)}\sum_{s=1}^t X_s \mathbf{1}\{A_s = a\}.$$
Each of the three Bandit algorithms (ETC, UCB, and TS) can be described in this more general setup.
By the general form of Hoeffding's inequality for random variables bounded in $[\alpha, \beta]$, the upper confidence bound for arm $a$ takes the form \begin{align*} \text{UCB}_a(t, \delta) := \hat{\mu}_a(t) + (\beta - \alpha) \sqrt{\frac{\log(1/\delta)}{2 T_a(t)}}. \end{align*} The UCB algorithm is defined by iteratively making the arm selection: $$A_{t}=\begin{cases} t, & \text{if } t\leq K\\ \operatorname{argmax}_{a} \text{UCB}_a(t-1,\delta_t), & \text{otherwise.}\end{cases}$$ In other words, we first pull every arm once, and thereafter keep pulling the arm with the highest upper confidence bound.
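A minimal sketch of this general UCB rule (in Python; the three arms with uniform reward noise are hypothetical, and we take $[\alpha, \beta] = [0, 1]$ so that $\beta - \alpha = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical arms: arm a returns mu[a] + Uniform(-0.1, 0.1), so all rewards
# lie in [alpha, beta] = [0, 1] and the Hoeffding width uses beta - alpha = 1.
mu = np.array([0.5, 0.6, 0.7])
K, n = len(mu), 5000

totals = np.zeros(K)   # running sum of rewards per arm
pulls = np.zeros(K)    # T_a(t)

for t in range(1, n + 1):
    if t <= K:                                     # pull every arm once first
        a = t - 1
    else:
        delta = 1.0 / t**2
        ucb = totals / pulls + np.sqrt(np.log(1 / delta) / (2 * pulls))
        a = int(np.argmax(ucb))
    totals[a] += mu[a] + rng.uniform(-0.1, 0.1)    # observe reward X_t ~ P_{A_t}
    pulls[a] += 1
```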
The Thompson Sampling (TS) algorithm can now be written as follows. At each round $t$: (a) Draw a posterior sample for each arm: $\mu_{a,t} \sim P_{a,t}$ for $a \in \{1,\dots,K\}$, where $P_{a,t}$ denotes the posterior distribution of the mean of arm $a$ given the data observed in the previous rounds. (b) Choose the arm with the largest sample: $A_t=\underset{1 \leq a \leq K}{\operatorname{argmax}} \mu_{a,t}$. The probability that arm $a$ is chosen in round $t$ is then the posterior probability, given all of the data observed so far, that $\mu_a$ is the largest of the $K$ means.
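For Bernoulli rewards with Beta priors, the posterior updates are conjugate, and TS can be sketched as follows (Python; the Beta(1,1) priors and the coin probabilities are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
K, T = len(p), 10000

# Beta(1, 1) (uniform) priors; with Bernoulli rewards the posterior of each
# mean stays Beta(1 + heads, 1 + tails).
alpha = np.ones(K)
beta = np.ones(K)

for _ in range(T):
    sample = rng.beta(alpha, beta)     # step (a): one posterior draw per arm
    a = int(np.argmax(sample))         # step (b): arm with the largest draw
    if rng.random() < p[a]:            # pull arm a, observe Bernoulli reward
        alpha[a] += 1
    else:
        beta[a] += 1

pulls = alpha + beta - 2               # T_a(T) for each arm
```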
It turns out that both UCB and TS achieve logarithmic regret. ETC also achieves logarithmic regret if the parameter $m$ is chosen appropriately (but the optimal choice of $m$ depends on the unknown means $\mu_a, a = 1, \dots, K$ so it might be difficult to choose correctly in practice). For more details on bandits, please refer to the book 'Bandit Algorithms' by Lattimore and Szepesvári (https://tor-lattimore.com/downloads/book/book.pdf).