Adversarial Bandits
The stochastic framework assumes arm rewards are drawn from fixed, unknown distributions. But what if the environment can adapt or change the reward structure over time? This leads to adversarial settings, where we face evolving reward sequences that may respond to our actions.
In adversarial bandits, we have loss sequences $\ell_{i,t}$ for each arm $i$ and time $t$, determined by an adversary who may respond to our strategy but cannot anticipate our specific randomized decisions. Our goal is to minimize regret against the best fixed arm in hindsight:
$$R_T = \sum_{t=1}^T \ell_{I_t,t} - \min_{i \in [K]} \sum_{t=1}^T \ell_{i,t}$$
This framework captures scenarios where the environment adapts to our choices. Unlike stochastic settings, where $O(\log T)$ regret is achievable, adversarial settings carry a fundamental $\Omega(\sqrt{KT})$ lower bound on worst-case regret, reflecting the inherent difficulty of competing against an adaptive adversary.
The key challenge is the information bottleneck: we only observe losses for selected arms, not the counterfactual losses we would have experienced from other arms. This partial feedback constraint makes learning significantly harder than full-information settings.
Adversarial Lower Bound
For any algorithm and adversarial sequence, there exists a loss sequence such that:
$$\mathbb{E}[R_T] \geq \frac{1}{2}\sqrt{\frac{K-1}{2}T}$$
The proof constructs a randomized adversary: one arm, chosen uniformly at random, is given slightly smaller expected losses than the others. With only bandit feedback, the learner cannot identify this arm without sampling every arm often enough, which forces sustained exploration of all available arms.
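A common version of the construction (a sketch; constants vary across references): the adversary draws $i^*$ uniformly from $[K]$ and sets independent Bernoulli losses
$$\ell_{i,t} \sim \mathrm{Bernoulli}\!\left(\tfrac{1}{2} - \varepsilon\,\mathbb{1}_{i = i^*}\right), \qquad \varepsilon \asymp \sqrt{K/T}.$$
With only $T$ bandit observations the learner cannot reliably identify $i^*$, and every round spent on another arm costs $\varepsilon$ in expectation, so the regret is of order $\varepsilon T \asymp \sqrt{KT}$.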
The EXP3 Algorithm
EXP3 (Exponential-weight algorithm for Exploration and Exploitation) elegantly handles adversarial bandits using importance-weighted estimation and exponential weights. The algorithm maintains weights for each arm and selects arms according to these weights.
EXP3 Algorithm
Parameters: Learning rate $\gamma \in (0,1]$
Initialize: $w_i(1) = 1$ for all $i \in [K]$
For $t = 1, 2, \ldots, T$:
- Set probabilities: $p_i(t) = (1-\gamma)\frac{w_i(t)}{\sum_j w_j(t)} + \frac{\gamma}{K}$
- Sample arm: $I_t \sim p(t)$
- Observe loss: $\ell_{I_t,t}$
- Estimate full loss vector: $\hat{\ell}_i(t) = \frac{\ell_{i,t} \mathbb{1}_{I_t = i}}{p_i(t)}$
- Update weights: $w_i(t+1) = w_i(t) \exp(-\gamma \hat{\ell}_i(t))$
The importance-weighted estimator $\hat{\ell}_i(t)$ is unbiased: $\mathbb{E}[\hat{\ell}_i(t)] = \ell_{i,t}$, but has high variance $\text{Var}[\hat{\ell}_i(t)] = \ell_{i,t}^2 \frac{1-p_i(t)}{p_i(t)}$. The exploration probability $\frac{\gamma}{K}$ ensures all arms have minimum selection probability, controlling this variance.
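A quick numerical check of unbiasedness and the variance formula (a minimal sketch; the probabilities `p` and losses `ell` below are made up for illustration):
import numpy as np
np.random.seed(0)
K = 3
p = np.array([0.5, 0.3, 0.2])      # selection probabilities p_i(t)
ell = np.array([0.9, 0.4, 0.1])    # true losses ell_{i,t}
# Simulate many rounds with the same p and ell and build the
# importance-weighted estimate hat_ell_i = ell_i * 1{I_t = i} / p_i.
n = 200_000
arms = np.random.choice(K, size=n, p=p)
hat_ell = np.zeros((n, K))
hat_ell[np.arange(n), arms] = ell[arms] / p[arms]
print("empirical mean :", hat_ell.mean(axis=0))   # ~ ell       (unbiased)
print("empirical var  :", hat_ell.var(axis=0))    # ~ ell^2 (1-p)/p
print("theoretical var:", ell**2 * (1 - p) / p)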
EXP3's regret analysis combines the potential function method with careful variance control:
EXP3 Regret Bound
With $\gamma = \sqrt{\frac{2\log K}{KT}}$, EXP3 achieves:
$$\mathbb{E}[R_T] \leq 2\sqrt{2KT\log K}$$
The proof tracks the potential $\Phi_t = \sum_i w_i(t)$ and shows that its expected decrease relates to regret. The key insight is that the exponential weights algorithm approximately minimizes regret if the loss estimates were perfect, and importance weighting provides unbiased estimates at the cost of increased variance.
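A sketch of the potential step, under the loss-based update above (assuming $\hat{\ell}_i(t) \ge 0$ and using $e^{-x} \le 1 - x + x^2/2$ for $x \ge 0$):
$$\frac{\Phi_{t+1}}{\Phi_t} = \sum_i \frac{w_i(t)}{\Phi_t}\, e^{-\gamma \hat{\ell}_i(t)} \;\le\; 1 - \gamma \sum_i q_i(t)\,\hat{\ell}_i(t) + \frac{\gamma^2}{2} \sum_i q_i(t)\,\hat{\ell}_i(t)^2, \qquad q_i(t) = \frac{w_i(t)}{\Phi_t}.$$
Taking logarithms, summing over $t$, and comparing with $\log \Phi_{T+1} \ge \log w_{i^*}(T+1) = -\gamma \sum_t \hat{\ell}_{i^*}(t)$ for the best arm $i^*$ yields the regret bound after taking expectations over the importance-weighted estimates.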
import numpy as np
def exp3(losses, gamma=None):
    """
    EXP3 algorithm for adversarial bandits.

    Args:
        losses: T x K matrix of losses in [0, 1] (revealed online, one entry per round)
        gamma: exploration / learning-rate parameter (default: sqrt(2 log K / (K T)))

    Returns:
        cumulative_regret: array with (approximate) regret after each round
    """
    T, K = losses.shape
    if gamma is None:
        gamma = np.sqrt(2 * np.log(K) / (K * T))

    # Initialize weights
    weights = np.ones(K)
    cumulative_regret = np.zeros(T)
    best_total_loss = np.min(np.sum(losses, axis=0))  # loss of best fixed arm in hindsight

    for t in range(T):
        # Mix the exponential weights with uniform exploration
        probs = (1 - gamma) * weights / np.sum(weights) + gamma / K
        # Sample arm and observe only its loss
        arm = np.random.choice(K, p=probs)
        loss = losses[t, arm]
        # Track regret (per-round share of the best arm's total loss)
        regret = loss - best_total_loss / T
        cumulative_regret[t] = cumulative_regret[t - 1] + regret if t > 0 else regret
        # Update weights using the importance-weighted loss estimate
        loss_estimate = loss / probs[arm]
        weights[arm] *= np.exp(-gamma * loss_estimate)

    return cumulative_regret
# Alternative: Full-information EXP3 for comparison
def exp3_full_info(losses, gamma=None):
    """Exponential weights (Hedge) with full information, for comparison with EXP3."""
    T, K = losses.shape
    if gamma is None:
        gamma = np.sqrt(2 * np.log(K) / T)

    weights = np.ones(K)
    cumulative_regret = np.zeros(T)
    best_total_loss = np.min(np.sum(losses, axis=0))

    for t in range(T):
        # Compute probabilities (no forced exploration needed)
        probs = weights / np.sum(weights)
        arm = np.random.choice(K, p=probs)
        loss = losses[t, arm]
        regret = loss - best_total_loss / T
        cumulative_regret[t] = cumulative_regret[t - 1] + regret if t > 0 else regret
        # Full information: update every arm with its true loss
        weights *= np.exp(-gamma * losses[t])

    return cumulative_regret
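A small driver for the two functions above (a sketch; the synthetic loss sequence is arbitrary):
# Synthetic losses: arm 0 is slightly better on average
np.random.seed(0)
T, K = 5000, 5
losses = np.random.rand(T, K)
losses[:, 0] -= 0.1
losses = np.clip(losses, 0.0, 1.0)

regret_bandit = exp3(losses)
regret_full = exp3_full_info(losses)
print("final regret (bandit feedback):", regret_bandit[-1])
print("final regret (full information):", regret_full[-1])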
High-Probability Bounds
Beyond expected regret, we often need high-probability guarantees. For adversarial bandits, concentration becomes more challenging due to the adaptive nature of the algorithm and the adversarial rewards.
The high-probability analysis of EXP3 requires careful treatment of the martingale structure. Define the martingale:
$$M_t = \sum_{s=1}^t \left( \hat{\ell}_i(s) - \ell_{i,s} \right) \quad \text{for a fixed arm } i$$
This captures the gap between the importance-weighted estimates and the true losses of arm $i$. Since $\mathbb{E}[\hat{\ell}_i(s) \mid \mathcal{F}_{s-1}] = \ell_{i,s}$, where $\mathcal{F}_{s-1}$ is the history up to time $s-1$, the increments are martingale differences and $\mathbb{E}[M_t] = 0$.
Using Azuma's inequality for bounded martingale differences:
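In the form used here: if $(M_s)_{s \ge 0}$ is a martingale with $M_0 = 0$ and increments bounded by $|M_s - M_{s-1}| \le c_s$, then for any $\varepsilon > 0$,
$$\mathbb{P}\left(M_t \geq \varepsilon\right) \leq \exp\!\left(-\frac{\varepsilon^2}{2\sum_{s=1}^t c_s^2}\right).$$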
High-Probability Bound for EXP3
With probability at least $1-\delta$, a suitably modified EXP3 (the EXP3.P variant, which adds a small bias to the loss estimates to control their variance) with appropriately tuned $\gamma$ achieves a bound of the form:
$$R_T \leq 2\sqrt{2KT\log K} + \sqrt{\frac{KT\log(1/\delta)}{2}}$$
The proof bounds the martingale $M_t$ using the fact that the importance-weighted losses are bounded by $K/\gamma$ (since $p_i(t) \geq \gamma/K$). The second term comes from the concentration inequality and grows as $\sqrt{T}$, matching the leading term's dependence.
For stochastic bandits, high-probability bounds are tighter because we can use concentration of sums of independent random variables. The adversarial setting's adaptive nature makes sharp concentration more difficult to achieve.
Contextual Bandits, Prediction with Expert Advice and EXP4
Contextual bandits extend the basic framework by incorporating side information. Before making each decision, we observe a context $x_t$ (features, covariates, side information). Arm rewards depend on both the context and the chosen arm: $r_t = r(x_t, I_t)$.
This setting interpolates between full-information learning (where we could test all arms on identical contexts) and standard multi-armed bandits (ignoring all context information). Applications include personalized recommendations, adaptive systems, and structured decision making.
The challenge is generalization: we must learn effective arm selection for new contexts based on limited data from previous contexts and arms. This requires leveraging structure in how rewards depend on contexts and arm characteristics.
Prediction with Expert Advice
A key stepping stone is prediction with expert advice. We have $N$ experts, each providing predictions $f_i(x_t)$ for context $x_t$. Our goal is to perform nearly as well as the best expert in hindsight.
The exponentially weighted average forecaster (Hedge, a soft variant of Weighted Majority) maintains weights on experts and predicts with a weighted average of their advice. For bounded convex losses it achieves regret $O(\sqrt{T \log N})$, optimal up to constants.
The connection to bandits is that each expert corresponds to a policy mapping contexts to arms. The difficulty is that we cannot evaluate the loss of every expert's recommendation - we only observe the loss of the arm we actually pulled.
Linear Contextual Bandits
A tractable special case assumes linear rewards: the expected reward for arm $i$ in context $x_t$ is $\langle \theta_i, x_t \rangle$ for unknown parameters $\theta_i \in \mathbb{R}^d$.
This setting enables efficient learning because:
- The problem has finite-dimensional parametrization
- Confidence sets for $\theta_i$ can be computed using least-squares
- Optimism over confidence sets yields effective algorithms
The key insight is that linear structure enables generalization across contexts - observing reward for context $x$ and arm $i$ provides information about rewards for other contexts through the shared parameter $\theta_i$.
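A toy illustration of this point (a minimal sketch; the helper `fit_arm_ridge`, the data, and the regularization value are illustrative, not part of any algorithm above):
def fit_arm_ridge(X, r, lam=1.0):
    """Ridge estimate of theta_i from contexts X (n x d) and rewards r (n,)."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ r)

# Rewards observed for arm i on a handful of contexts...
np.random.seed(1)
theta_i = np.array([1.0, -0.5, 0.2])
X_seen = np.random.randn(20, 3)
r_seen = X_seen @ theta_i + 0.1 * np.random.randn(20)
theta_hat = fit_arm_ridge(X_seen, r_seen)

# ...immediately give a prediction for a context never seen before.
x_new = np.array([0.3, 0.7, -1.0])
print("predicted reward:", x_new @ theta_hat, " true:", x_new @ theta_i)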
EXP4 Algorithm
EXP4 (EXP3 with expert advice) combines the robustness of EXP3 with the ability to leverage expert predictions. It maintains a weight for each expert and uses importance weighting to handle partial feedback.
EXP4 Algorithm
Input: $N$ experts, exploration parameter $\gamma \in (0,1]$
Initialize: $w_i(1) = 1$ for each expert $i \in [N]$
For $t = 1, 2, \ldots, T$:
- Observe context $x_t$ and expert advice $f_i(x_t) \in \Delta_K$ (a probability distribution over arms) for each $i \in [N]$
- Compute arm probabilities: $p_j(t) = (1-\gamma)\sum_i \frac{w_i(t)}{\sum_k w_k(t)} f_{i,j}(x_t) + \frac{\gamma}{K}$
- Select arm $I_t \sim p(t)$ and observe reward $r_t \in [0,1]$
- Form importance-weighted reward estimates $\hat{r}_j(t) = \frac{r_t \mathbb{1}_{I_t = j}}{p_j(t)}$ and each expert's estimated reward $\hat{y}_i(t) = \sum_j f_{i,j}(x_t)\, \hat{r}_j(t)$
- Update weights: $w_i(t+1) = w_i(t) \exp\left(\frac{\gamma}{K} \hat{y}_i(t)\right)$
The resulting guarantee scales only logarithmically with the number of experts:
EXP4 Regret Bound
EXP4 achieves regret against the best expert:
$$\mathbb{E}[R_T] \leq O\left(\sqrt{KT\log N}\right)$$
Because the dependence on $N$ is only logarithmic, EXP4 can compete with very large classes of policies mapping contexts to arms - far beyond what running EXP3 directly over policies would allow. The expert advice guides exploration while importance weighting preserves worst-case robustness.
The weight update rewards experts whose recommended distributions put mass on arms that receive high estimated reward. Because $\hat{r}(t)$ is importance weighted, each expert's estimated reward $\hat{y}_i(t)$ is an unbiased estimate of the reward that expert's recommendation would have earned, even though only the chosen arm's reward is observed.
def exp4(contexts, experts, rewards, gamma=None):
    """
    EXP4 algorithm for contextual bandits with expert advice.

    Args:
        contexts: T x d matrix of contexts (not used directly; the experts already encode them)
        experts: N x T x K tensor of expert advice; experts[i, t] is a probability
                 distribution over the K arms recommended by expert i at round t
        rewards: T x K matrix of rewards in [0, 1] (only the chosen arm's entry is used online)
        gamma: exploration / learning-rate parameter

    Returns:
        cumulative_regret: regret against the best expert at each time step
    """
    T, _ = contexts.shape
    N, _, K = experts.shape
    if gamma is None:
        gamma = min(1.0, np.sqrt(K * np.log(N) / T))

    # One weight per expert (not per expert-arm pair)
    weights = np.ones(N)
    cumulative_regret = np.zeros(T)

    # Best expert's total expected reward under its recommended distributions
    expert_rewards = np.array([np.sum(experts[i] * rewards) for i in range(N)])
    best_expert_reward = np.max(expert_rewards)

    for t in range(T):
        expert_preds = experts[:, t, :]  # N x K advice matrix for this round

        # Mixture of expert recommendations, weighted by current expert weights
        q = weights / np.sum(weights)
        arm_probs = (1 - gamma) * q @ expert_preds + gamma / K

        # Sample arm and observe only its reward
        arm = np.random.choice(K, p=arm_probs)
        reward = rewards[t, arm]

        # Track regret against the best expert (per-round share of its total reward)
        regret = best_expert_reward / T - reward
        cumulative_regret[t] = cumulative_regret[t - 1] + regret if t > 0 else regret

        # Importance-weighted reward estimate and each expert's estimated reward
        r_hat = np.zeros(K)
        r_hat[arm] = reward / arm_probs[arm]
        y_hat = expert_preds @ r_hat

        # Exponential-weights update over experts
        weights *= np.exp(gamma * y_hat / K)
        weights /= np.sum(weights)  # normalize for numerical stability (distribution unchanged)

    return cumulative_regret
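A synthetic driver for the function above (a sketch; the expert construction is arbitrary - each expert recommends a fixed distribution over arms at every round):
np.random.seed(2)
T, K, N, d = 2000, 4, 10, 3
contexts = np.random.randn(T, d)
rewards = np.random.rand(T, K)
rewards[:, 1] += 0.2                      # arm 1 is better on average
rewards = np.clip(rewards, 0.0, 1.0)
experts = np.random.dirichlet(np.ones(K), size=N)      # N x K fixed recommendations
experts = np.repeat(experts[:, None, :], T, axis=1)    # N x T x K

regret = exp4(contexts, experts, rewards)
print("final regret vs best expert:", regret[-1])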
Stochastic Linear Bandits and UCB
Linear bandits assume arm rewards follow $r_t = \langle \theta^*, x_{I_t} \rangle + \eta_t$, where $x_i \in \mathbb{R}^d$ represents the feature vector of arm $i$, $\theta^* \in \mathbb{R}^d$ is an unknown parameter vector, and $\eta_t$ represents noise.
This setting is much richer than finite arms - we can consider infinitely many arms as long as their feature vectors are known. The algorithm must learn the parameter vector $\theta^*$ while optimizing arm selection.
LinUCB extends the optimism principle to linear settings using confidence ellipsoids around least-squares estimates of the parameter vector $\theta^*$.
LinUCB Algorithm
Initialize: Design matrix $V_0 = \lambda I_d$, data vector $b_0 = 0 \in \mathbb{R}^d$
For round $t = 1, 2, \ldots, T$:
- Estimate parameters: $\hat{\theta}_t = V_{t-1}^{-1} b_{t-1}$
- Select arm: $I_t = \arg\max_i \langle \hat{\theta}_t, x_i \rangle + \beta_t \|x_i\|_{V_{t-1}^{-1}}$
- Observe reward $r_t$ and update:
- $V_t = V_{t-1} + x_{I_t} x_{I_t}^T$ (update design matrix)
- $b_t = b_{t-1} + r_t x_{I_t}$ (update data vector)
The confidence width $\beta_t \|x_i\|_{V_{t-1}^{-1}}$ captures uncertainty in arm $i$'s feature direction. Arms with features in poorly-explored directions receive larger confidence bonuses, encouraging exploration of uncertain regions.
The regret analysis leverages the elliptic potential lemma, which bounds the sum of confidence widths:
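In one common form (a sketch; constants differ across references), with $\|x_{I_t}\| \le L$ and $V_0 = \lambda I$:
$$\sum_{t=1}^{T} \min\!\left(1,\; \|x_{I_t}\|^2_{V_{t-1}^{-1}}\right) \;\le\; 2\log\frac{\det V_T}{\det V_0} \;\le\; 2d \log\!\left(1 + \frac{T L^2}{\lambda d}\right).$$
Combined with the Cauchy-Schwarz inequality, $\sum_{t}\|x_{I_t}\|_{V_{t-1}^{-1}} \le \sqrt{T \sum_t \|x_{I_t}\|^2_{V_{t-1}^{-1}}}$, this controls the total confidence width accumulated over $T$ rounds.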
LinUCB Regret Bound
With appropriate choice of $\beta_t$, LinUCB achieves:
$$\mathbb{E}[R_T] = O(d\sqrt{T \log T})$$
This $\sqrt{T}$ dependence reflects the difficulty of learning in high-dimensional feature space, but the linear dependence on dimension $d$ enables practical application to structured problems - much better than treating all arms independently.
The key insight is that the confidence ellipsoid shrinks in directions where we've observed many similar feature vectors, enabling generalization across the arm space. The algorithm automatically balances exploration of uncertain directions with exploitation of well-understood regions.
Ellipsoidal Confidence Sets for Least-Squares Estimators
The foundation of LinUCB is the construction of confidence sets for the least-squares estimator. This requires careful analysis of the regularized least-squares problem.
Given observations $(x_1, r_1), \ldots, (x_t, r_t)$ with $r_s = \langle \theta^*, x_s \rangle + \eta_s$, the regularized least-squares estimator is:
$$\hat{\theta}_t = \arg\min_{\theta} \left\{ \sum_{s=1}^t (r_s - \langle \theta, x_s \rangle)^2 + \lambda \|\theta\|^2 \right\}$$
This has closed form $\hat{\theta}_t = V_t^{-1} b_t$ where:
- $V_t = \lambda I + \sum_{s=1}^t x_s x_s^T$ (design matrix)
- $b_t = \sum_{s=1}^t r_s x_s$ (data vector)
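The closed form follows by setting the gradient of the regularized objective to zero:
$$-2\sum_{s=1}^{t} x_s\left(r_s - \langle\theta, x_s\rangle\right) + 2\lambda\theta = 0 \;\;\Longrightarrow\;\; \left(\lambda I + \sum_{s=1}^{t} x_s x_s^T\right)\theta = \sum_{s=1}^{t} r_s x_s \;\;\Longrightarrow\;\; \hat{\theta}_t = V_t^{-1} b_t.$$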
The confidence ellipsoid is based on the concentration of $\hat{\theta}_t$ around $\theta^*$. For Gaussian noise, we can show:
$$\|\hat{\theta}_t - \theta^*\|_{V_t} \leq \beta_t$$
with high probability, where $\|\theta\|_{V_t} = \sqrt{\theta^T V_t \theta}$ and $\beta_t = O(\sqrt{d \log t})$.
Confidence Ellipsoid
With probability at least $1-\delta$:
$$\theta^* \in \{\theta : \|\theta - \hat{\theta}_t\|_{V_t} \leq \beta_t(\delta)\}$$
where $\beta_t(\delta) = \sqrt{d \log\big((1+t/\lambda)/\delta\big)} + \lambda^{1/2} S$ and $S$ bounds $\|\theta^*\|$ (constants here assume unit-norm features and unit-scale sub-Gaussian noise).
This ellipsoid has several important properties:
- Adaptive shape: Narrow in directions with many observations, wide in unexplored directions
- Shrinking volume: $\text{Vol}(\text{ellipsoid}) \propto \det(V_t)^{-1/2}$ decreases over time
- Optimal exploration: Maximizing $\|x\|_{V_t^{-1}}$ minimizes ellipsoid volume most quickly
The UCB selection rule $\max_i \langle \hat{\theta}_t, x_i \rangle + \beta_t \|x_i\|_{V_t^{-1}}$ implements optimism over this confidence set. It selects the arm with highest upper confidence bound on expected reward.
import numpy as np
from numpy.linalg import inv, norm
def linucb(features, rewards, lam=1.0, delta=0.1, S=1.0):
    """
    Linear UCB (LinUCB) algorithm.

    Args:
        features: T x K x d tensor of arm feature vectors
        rewards: T x K matrix of rewards (only the chosen arm's entry is observed online)
        lam: ridge regularization parameter
        delta: confidence parameter
        S: assumed bound on ||theta*||

    Returns:
        cumulative_regret: regret at each time step
    """
    T, K, d = features.shape

    # Initialize design matrix and data vector
    V = lam * np.eye(d)
    b = np.zeros(d)
    cumulative_regret = np.zeros(T)

    # Best reward per round, used only to measure regret
    # (a noisy proxy, since the true mean rewards are never revealed)
    best_rewards = np.max(rewards, axis=1)

    for t in range(T):
        # Regularized least-squares estimate
        V_inv = inv(V)
        theta_hat = V_inv @ b

        # Confidence radius: beta_t(delta) = sqrt(lam) * S + sqrt(d log((1 + t/lam) / delta))
        beta = np.sqrt(lam) * S + np.sqrt(d * np.log((1 + t / lam) / delta))

        # Upper confidence bound for each arm
        ucb_values = np.zeros(K)
        for k in range(K):
            x = features[t, k]
            mean_reward = theta_hat @ x
            confidence_width = beta * np.sqrt(x @ V_inv @ x)
            ucb_values[k] = mean_reward + confidence_width

        # Play the optimistic arm
        arm = np.argmax(ucb_values)
        reward = rewards[t, arm]

        # Update design matrix and data vector
        x_chosen = features[t, arm]
        V += np.outer(x_chosen, x_chosen)
        b += reward * x_chosen

        # Track regret
        regret = best_rewards[t] - reward
        cumulative_regret[t] = cumulative_regret[t - 1] + regret if t > 0 else regret

    return cumulative_regret
# Ellipsoidal confidence set visualization (2D case)
def plot_confidence_ellipsoid(V, theta_hat, beta, n_points=100):
    """Return boundary points of the 2D confidence ellipsoid {theta : ||theta - theta_hat||_V <= beta}."""
    ellipsoid_points = []
    for angle in np.linspace(0, 2 * np.pi, n_points):
        direction = np.array([np.cos(angle), np.sin(angle)])
        # Radius along this direction, solving ||r * direction||_V = beta
        radius = beta / np.sqrt(direction @ V @ direction)
        ellipsoid_points.append(theta_hat + radius * direction)
    return np.array(ellipsoid_points)
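A synthetic driver for `linucb` (a sketch; the hidden parameter, noise level, and horizon are arbitrary, and the same hidden linear model generates both features' rewards):
np.random.seed(3)
T, K, d = 1000, 10, 5
theta_star = np.random.randn(d)
theta_star /= norm(theta_star)                 # ensure ||theta*|| <= 1, matching S=1.0
features = np.random.randn(T, K, d)
rewards = features @ theta_star + 0.1 * np.random.randn(T, K)

regret = linucb(features, rewards)
print("final cumulative regret:", regret[-1])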
Integration and Outlook
We've explored how bandit algorithms handle increasingly complex environments: adversarial settings, contextual information, and structured relationships. Key insights include:
- Robustness vs. Efficiency: Adversarial algorithms like EXP3 sacrifice statistical efficiency for worst-case protection against adaptive environments
- Context Utilization: Contextual bandits enable leveraging side information, requiring balanced exploration across context-arm combinations
- Structural Understanding: Linear models achieve scalable learning by exploiting feature structure
- Confidence-Based Selection: Ellipsoidal confidence regions provide principled optimism in high-dimensional feature spaces
The progression from simple arm comparisons to linear feature models illustrates a fundamental theme: how structural understanding enables efficient learning. Without structure, we face the curse of dimensionality; with appropriate assumptions, we can achieve learning rates that scale with intrinsic dimensions rather than all possible combinations.
Looking ahead to Part 3, we'll explore the frontiers of bandit theory: sparse linear models that exploit sparsity structure, refined statistical bounds, and the elegant duality between Bayesian and frequentist approaches that provides deep insights into optimal learning.
The mathematical tools developed here - importance weighting for unbiased estimation, ellipsoidal confidence sets for optimism, potential function analysis for regret bounds - form the foundation for modern bandit algorithms, extending from simple comparisons to complex structured problems.
References
- Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285-294.
- Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 48-77.
- Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. WWW, 661-670.
- Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. COLT, 355-366.
- Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. NIPS, 2312-2320.
- Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., & Schapire, R. (2014). Taming the monster: A fast and simple algorithm for contextual bandits. ICML, 1638-1646.
- Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
- Villar, S. S., Bowden, J., & Wason, J. (2015). Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2), 199-215.
- Aziz, M., Kaufmann, E., & Riviere, M. K. (2021). On multi-armed bandit designs for dose-finding clinical trials. Journal of Machine Learning Research, 22(1), 685-723.