Adversarial Bandits
The stochastic framework assumes arm rewards are drawn from fixed, unknown distributions. But what if the environment can adapt or change the reward structure over time? This leads to adversarial settings, where we face evolving reward sequences that may respond to our actions.
In adversarial bandits, we have loss sequences $\ell_{i,t}$ for each arm $i$ and time $t$, determined by an adversary who may respond to our strategy but cannot anticipate our specific randomized decisions. Our goal is to minimize regret against the best fixed arm in hindsight:
$$R_T = \sum_{t=1}^T \ell_{I_t,t} - \min_{i \in [K]} \sum_{t=1}^T \ell_{i,t}$$
This framework captures scenarios where the environment adapts to our choices. Unlike stochastic settings, where $O(\log T)$ regret is achievable, adversarial settings carry a fundamental $\Omega(\sqrt{KT})$ lower bound on worst-case regret, reflecting the inherent difficulty of competing against an adaptive adversary.
The key challenge is the information bottleneck: we only observe losses for selected arms, not the counterfactual losses we would have experienced from other arms. This partial feedback constraint makes learning significantly harder than full-information settings.
Adversarial Lower Bound
For any algorithm and adversarial sequence, there exists a loss sequence such that:
$$\mathbb{E}[R_T] \geq \frac{1}{2}\sqrt{\frac{K-1}{2}T}$$
The proof constructs a randomized adversary: one arm, chosen uniformly at random, is given slightly smaller expected losses than the others. With only bandit feedback, the learner cannot identify this arm without sampling every arm often enough, which forces sustained exploration of all available arms.
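A common version of the construction (a sketch; constants vary across references): the adversary draws $i^*$ uniformly from $[K]$ and sets independent Bernoulli losses
$$\ell_{i,t} \sim \mathrm{Bernoulli}\!\left(\tfrac{1}{2} - \varepsilon\,\mathbb{1}_{i = i^*}\right), \qquad \varepsilon \asymp \sqrt{K/T}.$$
With only $T$ bandit observations the learner cannot reliably identify $i^*$, and every round spent on another arm costs $\varepsilon$ in expectation, so the regret is of order $\varepsilon T \asymp \sqrt{KT}$.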
The EXP3 Algorithm
EXP3 (Exponential-weight algorithm for Exploration and Exploitation) elegantly handles adversarial bandits using importance-weighted estimation and exponential weights. The algorithm maintains weights for each arm and selects arms according to these weights.
EXP3 Algorithm
Parameters: Learning rate $\gamma \in (0,1]$
Initialize: $w_i(1) = 1$ for all $i \in [K]$
For $t = 1, 2, \ldots, T$:
- Set probabilities: $p_i(t) = (1-\gamma)\frac{w_i(t)}{\sum_j w_j(t)} + \frac{\gamma}{K}$
- Sample arm: $I_t \sim p(t)$
- Observe loss: $\ell_{I_t,t}$
- Estimate full loss vector: $\hat{\ell}_i(t) = \frac{\ell_{i,t} \mathbb{1}_{I_t = i}}{p_i(t)}$
- Update weights: $w_i(t+1) = w_i(t) \exp(-\gamma \hat{\ell}_i(t))$
The importance-weighted estimator $\hat{\ell}_i(t)$ is unbiased: $\mathbb{E}[\hat{\ell}_i(t)] = \ell_{i,t}$, but has high variance $\text{Var}[\hat{\ell}_i(t)] = \ell_{i,t}^2 \frac{1-p_i(t)}{p_i(t)}$. The exploration probability $\frac{\gamma}{K}$ ensures all arms have minimum selection probability, controlling this variance.
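A quick numerical check of unbiasedness and the variance formula (a minimal sketch; the probabilities `p` and losses `ell` below are made up for illustration):
import numpy as np
np.random.seed(0)
K = 3
p = np.array([0.5, 0.3, 0.2])      # selection probabilities p_i(t)
ell = np.array([0.9, 0.4, 0.1])    # true losses ell_{i,t}
# Simulate many rounds with the same p and ell and build the
# importance-weighted estimate hat_ell_i = ell_i * 1{I_t = i} / p_i.
n = 200_000
arms = np.random.choice(K, size=n, p=p)
hat_ell = np.zeros((n, K))
hat_ell[np.arange(n), arms] = ell[arms] / p[arms]
print("empirical mean :", hat_ell.mean(axis=0))   # ~ ell       (unbiased)
print("empirical var  :", hat_ell.var(axis=0))    # ~ ell^2 (1-p)/p
print("theoretical var:", ell**2 * (1 - p) / p)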
EXP3's regret analysis combines the potential function method with careful variance control:
EXP3 Regret Bound
With $\gamma = \sqrt{\frac{2\log K}{KT}}$, EXP3 achieves:
$$\mathbb{E}[R_T] \leq 2\sqrt{2KT\log K}$$
The proof tracks the potential $\Phi_t = \sum_i w_i(t)$ and shows that its expected decrease relates to regret. The key insight is that the exponential weights algorithm approximately minimizes regret if the loss estimates were perfect, and importance weighting provides unbiased estimates at the cost of increased variance.
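A sketch of the potential step, under the loss-based update above (assuming $\hat{\ell}_i(t) \ge 0$ and using $e^{-x} \le 1 - x + x^2/2$ for $x \ge 0$):
$$\frac{\Phi_{t+1}}{\Phi_t} = \sum_i \frac{w_i(t)}{\Phi_t}\, e^{-\gamma \hat{\ell}_i(t)} \;\le\; 1 - \gamma \sum_i q_i(t)\,\hat{\ell}_i(t) + \frac{\gamma^2}{2} \sum_i q_i(t)\,\hat{\ell}_i(t)^2, \qquad q_i(t) = \frac{w_i(t)}{\Phi_t}.$$
Taking logarithms, summing over $t$, and comparing with $\log \Phi_{T+1} \ge \log w_{i^*}(T+1) = -\gamma \sum_t \hat{\ell}_{i^*}(t)$ for the best arm $i^*$ yields the regret bound after taking expectations over the importance-weighted estimates.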
import numpy as np
def exp3(losses, gamma=None):
    """
    EXP3 algorithm for adversarial bandits.

    Args:
        losses: T x K matrix of losses in [0, 1] (revealed online, one entry per round)
        gamma: exploration / learning-rate parameter (default: sqrt(2 log K / (K T)))

    Returns:
        cumulative_regret: array with (approximate) regret after each round
    """
    T, K = losses.shape
    if gamma is None:
        gamma = np.sqrt(2 * np.log(K) / (K * T))

    # Initialize weights
    weights = np.ones(K)
    cumulative_regret = np.zeros(T)
    best_total_loss = np.min(np.sum(losses, axis=0))  # loss of best fixed arm in hindsight

    for t in range(T):
        # Mix the exponential weights with uniform exploration
        probs = (1 - gamma) * weights / np.sum(weights) + gamma / K
        # Sample arm and observe only its loss
        arm = np.random.choice(K, p=probs)
        loss = losses[t, arm]
        # Track regret (per-round share of the best arm's total loss)
        regret = loss - best_total_loss / T
        cumulative_regret[t] = cumulative_regret[t - 1] + regret if t > 0 else regret
        # Update weights using the importance-weighted loss estimate
        loss_estimate = loss / probs[arm]
        weights[arm] *= np.exp(-gamma * loss_estimate)

    return cumulative_regret
# Alternative: Full-information EXP3 for comparison
def exp3_full_info(losses, gamma=None):
    """Exponential weights (Hedge) with full information, for comparison with EXP3."""
    T, K = losses.shape
    if gamma is None:
        gamma = np.sqrt(2 * np.log(K) / T)

    weights = np.ones(K)
    cumulative_regret = np.zeros(T)
    best_total_loss = np.min(np.sum(losses, axis=0))

    for t in range(T):
        # Compute probabilities (no forced exploration needed)
        probs = weights / np.sum(weights)
        arm = np.random.choice(K, p=probs)
        loss = losses[t, arm]
        regret = loss - best_total_loss / T
        cumulative_regret[t] = cumulative_regret[t - 1] + regret if t > 0 else regret
        # Full information: update every arm with its true loss
        weights *= np.exp(-gamma * losses[t])

    return cumulative_regret
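A small driver for the two functions above (a sketch; the synthetic loss sequence is arbitrary):
# Synthetic losses: arm 0 is slightly better on average
np.random.seed(0)
T, K = 5000, 5
losses = np.random.rand(T, K)
losses[:, 0] -= 0.1
losses = np.clip(losses, 0.0, 1.0)

regret_bandit = exp3(losses)
regret_full = exp3_full_info(losses)
print("final regret (bandit feedback):", regret_bandit[-1])
print("final regret (full information):", regret_full[-1])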
High-Probability Bounds
Beyond expected regret, we often need high-probability guarantees. For adversarial bandits, concentration becomes more challenging due to the adaptive nature of the algorithm and the adversarial rewards.
The high-probability analysis of EXP3 requires careful treatment of the martingale structure. Define the martingale:
$$M_t = \sum_{s=1}^t \left( \hat{\ell}_i(s) - \ell_{i,s} \right) \quad \text{for a fixed arm } i$$
This captures the gap between the importance-weighted estimates and the true losses of arm $i$. Since $\mathbb{E}[\hat{\ell}_i(s) \mid \mathcal{F}_{s-1}] = \ell_{i,s}$, where $\mathcal{F}_{s-1}$ is the history up to time $s-1$, the increments are martingale differences and $\mathbb{E}[M_t] = 0$.
Using Azuma's inequality for bounded martingale differences:
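In the form used here: if $(M_s)_{s \ge 0}$ is a martingale with $M_0 = 0$ and increments bounded by $|M_s - M_{s-1}| \le c_s$, then for any $\varepsilon > 0$,
$$\mathbb{P}\left(M_t \geq \varepsilon\right) \leq \exp\!\left(-\frac{\varepsilon^2}{2\sum_{s=1}^t c_s^2}\right).$$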
High-Probability Bound for EXP3
With probability at least $1-\delta$, a suitably modified EXP3 (the EXP3.P variant, which adds a small bias to the loss estimates to control their variance) with appropriately tuned $\gamma$ achieves a bound of the form:
$$R_T \leq 2\sqrt{2KT\log K} + \sqrt{\frac{KT\log(1/\delta)}{2}}$$
The proof bounds the martingale $M_t$ using the fact that the importance-weighted losses are bounded by $K/\gamma$ (since $p_i(t) \geq \gamma/K$). The second term comes from the concentration inequality and grows as $\sqrt{T}$, matching the leading term's dependence.
For stochastic bandits, high-probability bounds are tighter because we can use concentration of sums of independent random variables. The adversarial setting's adaptive nature makes sharp concentration more difficult to achieve.
Contextual Bandits, Prediction with Expert Advice and EXP4
Contextual bandits extend the basic framework by incorporating side information. Before making each decision, we observe a context $x_t$ (features, covariates, side information). Arm rewards depend on both the context and the chosen arm: $r_t = r(x_t, I_t)$.
This setting interpolates between full-information learning (where we could test all arms on identical contexts) and standard multi-armed bandits (ignoring all context information). Applications include personalized recommendations, adaptive systems, and structured decision making.
The challenge is generalization: we must learn effective arm selection for new contexts based on limited data from previous contexts and arms. This requires leveraging structure in how rewards depend on contexts and arm characteristics.
Prediction with Expert Advice
A key stepping stone is prediction with expert advice. We have $N$ experts, each providing predictions $f_i(x_t)$ for context $x_t$. Our goal is to perform nearly as well as the best expert in hindsight.
The exponentially weighted average forecaster (Hedge, a soft variant of Weighted Majority) maintains weights on experts and predicts with a weighted average of their advice. For bounded convex losses it achieves regret $O(\sqrt{T \log N})$, optimal up to constants.
The connection to bandits is that each expert corresponds to a policy mapping contexts to arms. The difficulty is that we cannot evaluate the loss of every expert's recommendation - we only observe the loss of the arm we actually pulled.
Linear Contextual Bandits
A tractable special case assumes linear rewards: the expected reward for arm $i$ in context $x_t$ is $\langle \theta_i, x_t \rangle$ for unknown parameters $\theta_i \in \mathbb{R}^d$.
This setting enables efficient learning because:
- The problem has finite-dimensional parametrization
- Confidence sets for $\theta_i$ can be computed using least-squares
- Optimism over confidence sets yields effective algorithms
The key insight is that linear structure enables generalization across contexts - observing reward for context $x$ and arm $i$ provides information about rewards for other contexts through the shared parameter $\theta_i$.
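A toy illustration of this point (a minimal sketch; the helper `fit_arm_ridge`, the data, and the regularization value are illustrative, not part of any algorithm above):
def fit_arm_ridge(X, r, lam=1.0):
    """Ridge estimate of theta_i from contexts X (n x d) and rewards r (n,)."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ r)

# Rewards observed for arm i on a handful of contexts...
np.random.seed(1)
theta_i = np.array([1.0, -0.5, 0.2])
X_seen = np.random.randn(20, 3)
r_seen = X_seen @ theta_i + 0.1 * np.random.randn(20)
theta_hat = fit_arm_ridge(X_seen, r_seen)

# ...immediately give a prediction for a context never seen before.
x_new = np.array([0.3, 0.7, -1.0])
print("predicted reward:", x_new @ theta_hat, " true:", x_new @ theta_i)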
EXP4 Algorithm
EXP4 (EXP3 with expert advice) combines the robustness of EXP3 with the ability to leverage expert predictions. It maintains a weight for each expert and uses importance weighting to handle partial feedback.
EXP4 Algorithm
Input: $N$ experts, exploration parameter $\gamma \in (0,1]$
Initialize: $w_i(1) = 1$ for each expert $i \in [N]$
For $t = 1, 2, \ldots, T$:
- Observe context $x_t$ and expert advice $f_i(x_t) \in \Delta_K$ (a probability distribution over arms) for each $i \in [N]$
- Compute arm probabilities: $p_j(t) = (1-\gamma)\sum_i \frac{w_i(t)}{\sum_k w_k(t)} f_{i,j}(x_t) + \frac{\gamma}{K}$
- Select arm $I_t \sim p(t)$ and observe reward $r_t \in [0,1]$
- Form importance-weighted reward estimates $\hat{r}_j(t) = \frac{r_t \mathbb{1}_{I_t = j}}{p_j(t)}$ and each expert's estimated reward $\hat{y}_i(t) = \sum_j f_{i,j}(x_t)\, \hat{r}_j(t)$
- Update weights: $w_i(t+1) = w_i(t) \exp\left(\frac{\gamma}{K} \hat{y}_i(t)\right)$
The resulting guarantee scales only logarithmically with the number of experts:
EXP4 Regret Bound
EXP4 achieves regret against the best expert:
$$\mathbb{E}[R_T] \leq O\left(\sqrt{KT\log N}\right)$$
Because the dependence on $N$ is only logarithmic, EXP4 can compete with very large classes of policies mapping contexts to arms - far beyond what running EXP3 directly over policies would allow. The expert advice guides exploration while importance weighting preserves worst-case robustness.
The weight update rewards experts whose recommended distributions put mass on arms that receive high estimated reward. Because $\hat{r}(t)$ is importance weighted, each expert's estimated reward $\hat{y}_i(t)$ is an unbiased estimate of the reward that expert's recommendation would have earned, even though only the chosen arm's reward is observed.
def exp4(contexts, experts, rewards, gamma=None):
    """
    EXP4 algorithm for contextual bandits with expert advice.

    Args:
        contexts: T x d matrix of contexts (not used directly; the experts already encode them)
        experts: N x T x K tensor of expert advice; experts[i, t] is a probability
                 distribution over the K arms recommended by expert i at round t
        rewards: T x K matrix of rewards in [0, 1] (only the chosen arm's entry is used online)
        gamma: exploration / learning-rate parameter

    Returns:
        cumulative_regret: regret against the best expert at each time step
    """
    T, _ = contexts.shape
    N, _, K = experts.shape
    if gamma is None:
        gamma = min(1.0, np.sqrt(K * np.log(N) / T))

    # One weight per expert (not per expert-arm pair)
    weights = np.ones(N)
    cumulative_regret = np.zeros(T)

    # Best expert's total expected reward under its recommended distributions
    expert_rewards = np.array([np.sum(experts[i] * rewards) for i in range(N)])
    best_expert_reward = np.max(expert_rewards)

    for t in range(T):
        expert_preds = experts[:, t, :]  # N x K advice matrix for this round

        # Mixture of expert recommendations, weighted by current expert weights
        q = weights / np.sum(weights)
        arm_probs = (1 - gamma) * q @ expert_preds + gamma / K

        # Sample arm and observe only its reward
        arm = np.random.choice(K, p=arm_probs)
        reward = rewards[t, arm]

        # Track regret against the best expert (per-round share of its total reward)
        regret = best_expert_reward / T - reward
        cumulative_regret[t] = cumulative_regret[t - 1] + regret if t > 0 else regret

        # Importance-weighted reward estimate and each expert's estimated reward
        r_hat = np.zeros(K)
        r_hat[arm] = reward / arm_probs[arm]
        y_hat = expert_preds @ r_hat

        # Exponential-weights update over experts
        weights *= np.exp(gamma * y_hat / K)
        weights /= np.sum(weights)  # normalize for numerical stability (distribution unchanged)

    return cumulative_regret
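A synthetic driver for the function above (a sketch; the expert construction is arbitrary - each expert recommends a fixed distribution over arms at every round):
np.random.seed(2)
T, K, N, d = 2000, 4, 10, 3
contexts = np.random.randn(T, d)
rewards = np.random.rand(T, K)
rewards[:, 1] += 0.2                      # arm 1 is better on average
rewards = np.clip(rewards, 0.0, 1.0)
experts = np.random.dirichlet(np.ones(K), size=N)      # N x K fixed recommendations
experts = np.repeat(experts[:, None, :], T, axis=1)    # N x T x K

regret = exp4(contexts, experts, rewards)
print("final regret vs best expert:", regret[-1])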
Stochastic Linear Bandits and UCB
Linear bandits assume arm rewards follow $r_t = \langle \theta^*, x_{I_t} \rangle + \eta_t$, where $x_i \in \mathbb{R}^d$ represents the feature vector of arm $i$, $\theta^* \in \mathbb{R}^d$ is an unknown parameter vector, and $\eta_t$ represents noise.
This setting is much richer than finite arms - we can consider infinitely many arms as long as their feature vectors are known. The algorithm must learn the parameter vector $\theta^*$ while optimizing arm selection.
LinUCB extends the optimism principle to linear settings using confidence ellipsoids around least-squares estimates of the parameter vector $\theta^*$.
LinUCB Algorithm
Initialize: Design matrix $V_0 = \lambda I_d$, data vector $b_0 = 0 \in \mathbb{R}^d$
For round $t = 1, 2, \ldots, T$:
- Estimate parameters: $\hat{\theta}_t = V_{t-1}^{-1} b_{t-1}$
- Select arm: $I_t = \arg\max_i \langle \hat{\theta}_t, x_i \rangle + \beta_t \|x_i\|_{V_{t-1}^{-1}}$
- Observe reward $r_t$ and update:
- $V_t = V_{t-1} + x_{I_t} x_{I_t}^T$ (update design matrix)
- $b_t = b_{t-1} + r_t x_{I_t}$ (update data vector)
The confidence width $\beta_t \|x_i\|_{V_{t-1}^{-1}}$ captures uncertainty in arm $i$'s feature direction. Arms with features in poorly-explored directions receive larger confidence bonuses, encouraging exploration of uncertain regions.
The regret analysis leverages the elliptic potential lemma, which bounds the sum of confidence widths:
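In one common form (a sketch; constants differ across references), with $\|x_{I_t}\| \le L$ and $V_0 = \lambda I$:
$$\sum_{t=1}^{T} \min\!\left(1,\; \|x_{I_t}\|^2_{V_{t-1}^{-1}}\right) \;\le\; 2\log\frac{\det V_T}{\det V_0} \;\le\; 2d \log\!\left(1 + \frac{T L^2}{\lambda d}\right).$$
Combined with the Cauchy-Schwarz inequality, $\sum_{t}\|x_{I_t}\|_{V_{t-1}^{-1}} \le \sqrt{T \sum_t \|x_{I_t}\|^2_{V_{t-1}^{-1}}}$, this controls the total confidence width accumulated over $T$ rounds.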
LinUCB Regret Bound
With appropriate choice of $\beta_t$, LinUCB achieves:
$$\mathbb{E}[R_T] = O(d\sqrt{T \log T})$$
This $\sqrt{T}$ dependence reflects the difficulty of learning in high-dimensional feature space, but the linear dependence on dimension $d$ enables practical application to structured problems - much better than treating all arms independently.
The key insight is that the confidence ellipsoid shrinks in directions where we've observed many similar feature vectors, enabling generalization across the arm space. The algorithm automatically balances exploration of uncertain directions with exploitation of well-understood regions.
Ellipsoidal Confidence Sets for Least-Squares Estimators
The foundation of LinUCB is the construction of confidence sets for the least-squares estimator. This requires careful analysis of the regularized least-squares problem.
Given observations $(x_1, r_1), \ldots, (x_t, r_t)$ with $r_s = \langle \theta^*, x_s \rangle + \eta_s$, the regularized least-squares estimator is:
$$\hat{\theta}_t = \arg\min_{\theta} \left\{ \sum_{s=1}^t (r_s - \langle \theta, x_s \rangle)^2 + \lambda \|\theta\|^2 \right\}$$
This has closed form $\hat{\theta}_t = V_t^{-1} b_t$ where:
- $V_t = \lambda I + \sum_{s=1}^t x_s x_s^T$ (design matrix)
- $b_t = \sum_{s=1}^t r_s x_s$ (data vector)
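The closed form follows by setting the gradient of the regularized objective to zero:
$$-2\sum_{s=1}^{t} x_s\left(r_s - \langle\theta, x_s\rangle\right) + 2\lambda\theta = 0 \;\;\Longrightarrow\;\; \left(\lambda I + \sum_{s=1}^{t} x_s x_s^T\right)\theta = \sum_{s=1}^{t} r_s x_s \;\;\Longrightarrow\;\; \hat{\theta}_t = V_t^{-1} b_t.$$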
The confidence ellipsoid is based on the concentration of $\hat{\theta}_t$ around $\theta^*$. For Gaussian noise, we can show:
$$\|\hat{\theta}_t - \theta^*\|_{V_t} \leq \beta_t$$
with high probability, where $\|\theta\|_{V_t} = \sqrt{\theta^T V_t \theta}$ and $\beta_t = O(\sqrt{d \log t})$.
Confidence Ellipsoid
With probability at least $1-\delta$:
$$\theta^* \in \{\theta : \|\theta - \hat{\theta}_t\|_{V_t} \leq \beta_t(\delta)\}$$
where $\beta_t(\delta) = \sqrt{d \log\big((1+t/\lambda)/\delta\big)} + \lambda^{1/2} S$ and $S$ bounds $\|\theta^*\|$ (constants here assume unit-norm features and unit-scale sub-Gaussian noise).
This ellipsoid has several important properties:
- Adaptive shape: Narrow in directions with many observations, wide in unexplored directions
- Shrinking volume: $\text{Vol}(\text{ellipsoid}) \propto \det(V_t)^{-1/2}$ decreases over time
- Optimal exploration: Maximizing $\|x\|_{V_t^{-1}}$ minimizes ellipsoid volume most quickly
The UCB selection rule $\max_i \langle \hat{\theta}_t, x_i \rangle + \beta_t \|x_i\|_{V_t^{-1}}$ implements optimism over this confidence set. It selects the arm with highest upper confidence bound on expected reward.
import numpy as np
from numpy.linalg import inv, norm
def linucb(features, rewards, lam=1.0, delta=0.1, S=1.0):
    """
    Linear UCB (LinUCB) algorithm.

    Args:
        features: T x K x d tensor of arm feature vectors
        rewards: T x K matrix of rewards (only the chosen arm's entry is observed online)
        lam: ridge regularization parameter
        delta: confidence parameter
        S: assumed bound on ||theta*||

    Returns:
        cumulative_regret: regret at each time step
    """
    T, K, d = features.shape

    # Initialize design matrix and data vector
    V = lam * np.eye(d)
    b = np.zeros(d)
    cumulative_regret = np.zeros(T)

    # Best reward per round, used only to measure regret
    # (a noisy proxy, since the true mean rewards are never revealed)
    best_rewards = np.max(rewards, axis=1)

    for t in range(T):
        # Regularized least-squares estimate
        V_inv = inv(V)
        theta_hat = V_inv @ b

        # Confidence radius: beta_t(delta) = sqrt(lam) * S + sqrt(d log((1 + t/lam) / delta))
        beta = np.sqrt(lam) * S + np.sqrt(d * np.log((1 + t / lam) / delta))

        # Upper confidence bound for each arm
        ucb_values = np.zeros(K)
        for k in range(K):
            x = features[t, k]
            mean_reward = theta_hat @ x
            confidence_width = beta * np.sqrt(x @ V_inv @ x)
            ucb_values[k] = mean_reward + confidence_width

        # Play the optimistic arm
        arm = np.argmax(ucb_values)
        reward = rewards[t, arm]

        # Update design matrix and data vector
        x_chosen = features[t, arm]
        V += np.outer(x_chosen, x_chosen)
        b += reward * x_chosen

        # Track regret
        regret = best_rewards[t] - reward
        cumulative_regret[t] = cumulative_regret[t - 1] + regret if t > 0 else regret

    return cumulative_regret
# Ellipsoidal confidence set visualization (2D case)
def plot_confidence_ellipsoid(V, theta_hat, beta, n_points=100):
    """Return boundary points of the 2D confidence ellipsoid {theta : ||theta - theta_hat||_V <= beta}."""
    ellipsoid_points = []
    for angle in np.linspace(0, 2 * np.pi, n_points):
        direction = np.array([np.cos(angle), np.sin(angle)])
        # Radius along this direction, solving ||r * direction||_V = beta
        radius = beta / np.sqrt(direction @ V @ direction)
        ellipsoid_points.append(theta_hat + radius * direction)
    return np.array(ellipsoid_points)
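A synthetic driver for `linucb` (a sketch; the hidden parameter, noise level, and horizon are arbitrary, and the same hidden linear model generates both features' rewards):
np.random.seed(3)
T, K, d = 1000, 10, 5
theta_star = np.random.randn(d)
theta_star /= norm(theta_star)                 # ensure ||theta*|| <= 1, matching S=1.0
features = np.random.randn(T, K, d)
rewards = features @ theta_star + 0.1 * np.random.randn(T, K)

regret = linucb(features, rewards)
print("final cumulative regret:", regret[-1])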
Integration and Outlook
We've explored how bandit algorithms handle increasingly complex environments: adversarial settings, contextual information, and structured relationships. Key insights include:
- Robustness vs. Efficiency: Adversarial algorithms like EXP3 sacrifice statistical efficiency for worst-case protection against adaptive environments
- Context Utilization: Contextual bandits enable leveraging side information, requiring balanced exploration across context-arm combinations
- Structural Understanding: Linear models achieve scalable learning by exploiting feature structure
- Confidence-Based Selection: Ellipsoidal confidence regions provide principled optimism in high-dimensional feature spaces
The progression from simple arm comparisons to linear feature models illustrates a fundamental theme: how structural understanding enables efficient learning. Without structure, we face the curse of dimensionality; with appropriate assumptions, we can achieve learning rates that scale with intrinsic dimensions rather than all possible combinations.
Looking ahead to Part 3, we'll explore the frontiers of bandit theory: sparse linear models that exploit sparsity structure, refined statistical bounds, and the elegant duality between Bayesian and frequentist approaches that provides deep insights into optimal learning.
The mathematical tools developed here - importance weighting for unbiased estimation, ellipsoidal confidence sets for optimism, potential function analysis for regret bounds - form the foundation for modern bandit algorithms, extending from simple comparisons to complex structured problems.
References
- Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285-294.
- Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 48-77.
- Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. WWW, 661-670.
- Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. COLT, 355-366.
- Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. NIPS, 2312-2320.
- Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., & Schapire, R. (2014). Taming the monster: A fast and simple algorithm for contextual bandits. ICML, 1638-1646.
- Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
- Villar, S. S., Bowden, J., & Wason, J. (2015). Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2), 199-215.
- Aziz, M., Kaufmann, E., & Riviere, M. K. (2021). On multi-armed bandit designs for dose-finding clinical trials. Journal of Machine Learning Research, 22(1), 685-723.