Exercises Notebook
Converted from exercises.ipynb for web reading.
Entropy — Exercises
Ten exercises covering Shannon entropy, MaxEnt, Huffman coding, entropy rate, differential entropy, Rényi entropy, and MaxEnt RL: eight core problems plus two short follow-up checks.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1–3 | Core entropy mechanics |
| ★★ | 4–6 | Deeper theory: entropy rate, Huffman, differential |
| ★★★ | 7–8 | AI/ML applications |
Topic Map
| Topic | Exercises |
|---|---|
| Shannon entropy, binary entropy function | 1 |
| Maximum entropy / Lagrange multipliers | 2 |
| Joint, conditional entropy, chain rule | 3 |
| Entropy rate of Markov chains | 4 |
| Huffman coding, source coding theorem | 5 |
| Differential entropy, Gaussian MaxEnt | 6 |
| Rényi entropy family | 7 |
| MaxEnt RL entropy bonus | 8 |
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
from scipy.special import entr
import heapq
try:
import matplotlib.pyplot as plt
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-6):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(f' expected: {expected}')
print(f' got: {got}')
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
def softmax(logits, T=1.0):
logits = np.asarray(logits, dtype=float) / T
logits = logits - logits.max()
e = np.exp(logits)
return e / e.sum()
def H(p):
"""Shannon entropy in nats."""
return np.sum(entr(np.asarray(p, dtype=float)))
def H_bits(p):
"""Shannon entropy in bits."""
return H(p) / np.log(2)
print('Setup complete. Helper functions ready.')
Exercise 1 ★ — Binary Entropy Function and Token Distributions
Part (a). Compute $H(X)$ in bits for:
- (i) Fair coin: $p = (0.5, 0.5)$
- (ii) Biased coin: $p = (0.1, 0.9)$
- (iii) Fair 6-sided die: $p_i = 1/6$ for $i = 1, \dots, 6$
- (iv) Distribution $p = (0.5, 0.25, 0.125, 0.125)$
Part (b). Implement the binary entropy function $h(p) = -p \log_2 p - (1-p) \log_2(1-p)$ and verify $h(0) = h(1) = 0$, $h(0.5) = 1$ bit.
Part (c). A language model outputs a distribution over $32{,}000$ tokens. Compute the maximum possible entropy $\log(32000)$ in nats. Then compute the entropy when the top token has probability $0.95$ and the remaining $31{,}999$ tokens share the remaining probability uniformly.
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
# Exercise 1: Solution
def compute_entropy_bits_sol(p):
return H_bits(p)
def binary_entropy_sol(p):
if p <= 0 or p >= 1:
return 0.0
return -p*np.log2(p) - (1-p)*np.log2(1-p)
def lm_entropy_sol(n_vocab, top_prob):
rest_prob = (1 - top_prob) / (n_vocab - 1)
p = np.concatenate([[top_prob], np.full(n_vocab-1, rest_prob)])
return H(p) # in nats
header('Exercise 1: Binary Entropy and Token Distributions')
distributions = {
'fair_coin': (np.array([0.5, 0.5]), 1.0),
'biased_coin': (np.array([0.1, 0.9]), None),
'fair_die': (np.ones(6)/6, np.log2(6)),
'categorical': (np.array([0.5,0.25,0.125,0.125]), 1.75),
}
for name, (p, expected) in distributions.items():
h = compute_entropy_bits_sol(p)
print(f' {name}: H = {h:.4f} bits' + (f' (expected {expected:.4f})' if expected else ''))
print()
check_close('h(0) = 0', binary_entropy_sol(1e-10), 0.0, tol=0.01)
check_close('h(1) = 0', binary_entropy_sol(1-1e-10), 0.0, tol=0.01)
check_close('h(0.5) = 1 bit', binary_entropy_sol(0.5), 1.0)
print()
H_max = np.log(32000)
H_top95 = lm_entropy_sol(32000, 0.95)
print(f' Max entropy (uniform 32k): {H_max:.4f} nats')
print(f' H (top token p=0.95): {H_top95:.4f} nats')
check_true('LM entropy with top=0.95 << log(32000)', H_top95 < H_max/2)
print('\nTakeaway: entropy decreases rapidly when one token dominates. '
'A model placing 95% probability on one token has very low entropy — '
'it is essentially deterministic.')
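The part (c) number can be cross-checked with the grouping property of entropy: splitting into "top token vs. rest" first, and noting the remainder is uniform, gives $H = h(0.95) + 0.05 \log(31999)$ in nats, where $h$ is the binary entropy in nats. A minimal sketch:

```python
import numpy as np

n_vocab, top = 32000, 0.95
rest = (1 - top) / (n_vocab - 1)
p = np.concatenate([[top], np.full(n_vocab - 1, rest)])
H_direct = -np.sum(p * np.log(p))

# Grouping property: binary split (top vs rest), then a uniform remainder
h_b = -(top * np.log(top) + (1 - top) * np.log(1 - top))
H_grouped = h_b + (1 - top) * np.log(n_vocab - 1)
print(f'direct: {H_direct:.6f} nats, grouped: {H_grouped:.6f} nats')
assert np.isclose(H_direct, H_grouped)
```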
Exercise 2 ★ — Maximum Entropy via Lagrange Multipliers
Part (a). Prove numerically that the uniform distribution maximizes $H(p)$ subject to $\sum_i p_i = 1$, $p_i \ge 0$. Run constrained optimization from 10 different random starting points and verify the optimizer always finds $p_i = 1/n$.
Part (b). Find the maximum entropy distribution over $\{0, 1, \dots, 9\}$ subject to $\mathbb{E}[X] = 3$. Implement the Lagrange multiplier solution: $p_k \propto e^{-\lambda k}$ and find $\lambda$ such that $\mathbb{E}[X] = 3$.
Part (c). Verify the MaxEnt inequality: compare your solution's entropy to several other distributions satisfying the same mean constraint and confirm yours has the highest entropy.
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
# Exercise 2: Solution
from scipy.optimize import minimize, brentq
def maxent_uniform_verify_sol(n, n_starts=10):
results = []
for seed in range(n_starts):
np.random.seed(seed)
x0 = np.random.dirichlet(np.ones(n))
res = minimize(lambda p: -H(p), x0,
method='SLSQP',
bounds=[(1e-10,1)]*n,
constraints={'type':'eq','fun': lambda p: p.sum()-1})
results.append(res.x)
return results
def find_lambda_sol(mu, support):
def mean_diff(lam):
logits = -lam * support
p = np.exp(logits - logits.max())
p /= p.sum()
return np.dot(p, support) - mu
return brentq(mean_diff, -10, 10)
header('Exercise 2: Maximum Entropy via Lagrange Multipliers')
# Part (a)
n = 5
optimal_dists = maxent_uniform_verify_sol(n)
uniform_target = np.ones(n)/n
all_uniform = all(np.allclose(d, uniform_target, atol=1e-4) for d in optimal_dists)
check_true('All 10 starts converge to uniform distribution', all_uniform)
print(f' Optimal entropy: {H_bits(optimal_dists[0]):.4f} bits = log2({n}) = {np.log2(n):.4f}')
# Part (b)
mu_target = 3.0
support = np.arange(10)
lam = find_lambda_sol(mu_target, support)
logits = -lam * support
p_maxent = np.exp(logits - logits.max())
p_maxent /= p_maxent.sum()
mean_check = np.dot(p_maxent, support)
check_close('E[X] = 3 for MaxEnt distribution', mean_check, mu_target)
print(f' lambda = {lam:.4f}')
print(f' p_maxent = {p_maxent.round(4)}')
print(f' H(p_maxent) = {H(p_maxent):.4f} nats')
# Part (c): compare to alternatives with the same mean E[X] = 3
alt_dists = [
    ('Uniform{1..5}', np.array([0, 0.2, 0.2, 0.2, 0.2, 0.2, 0, 0, 0, 0])),
    ('Two-point{0,9}', np.array([2/3, 0, 0, 0, 0, 0, 0, 0, 0, 1/3])),
    ('Peaked{2,3,4}', np.array([0, 0, 0.25, 0.5, 0.25, 0, 0, 0, 0, 0])),
]
print('\nComparison with same E[X]=3 constraint:')
print(f'  MaxEnt: H = {H(p_maxent):.4f} nats')
all_less = True
for name, p_alt in alt_dists:
    mean_alt = np.dot(p_alt, support)
    h_alt = H(p_alt)
    is_less = h_alt <= H(p_maxent) + 1e-9
    if not is_less: all_less = False
    print(f'  {name}: mean={mean_alt:.2f}, H={h_alt:.4f} (MaxEnt wins: {is_less})')
check_true('MaxEnt distribution has highest entropy', all_less)
print('\nTakeaway: among all distributions with fixed mean E[X]=3, the exponential-family '
'solution p ∝ exp(-lambda*k) has maximum entropy — the MaxEnt principle.')
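The same optimum can be reached by direct constrained optimization, without Lagrange multipliers. A sketch using SLSQP with explicit normalization and mean constraints, which then checks that $\log p_k$ is affine in $k$ (i.e., $p_k \propto e^{-\lambda k}$):

```python
import numpy as np
from scipy.optimize import minimize

support = np.arange(10)
mu_target = 3.0

def neg_entropy(p):
    p = np.maximum(p, 1e-12)  # guard the log at the bounds
    return np.sum(p * np.log(p))

res = minimize(
    neg_entropy,
    x0=np.ones(10) / 10,
    method='SLSQP',
    bounds=[(1e-12, 1.0)] * 10,
    constraints=[
        {'type': 'eq', 'fun': lambda p: p.sum() - 1.0},
        {'type': 'eq', 'fun': lambda p: p @ support - mu_target},
    ],
)
p_opt = res.x
print('p_opt =', p_opt.round(4))
# The optimum follows p_k ∝ exp(-lambda*k): successive log-ratios are constant
log_ratios = np.diff(np.log(p_opt))
assert np.allclose(log_ratios, log_ratios[0], atol=0.05)
```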
Exercise 3 ★ — Joint Entropy, Conditional Entropy, Chain Rule
Part (a). Given the joint distribution:

| $p(x,y)$ | $y=0$ | $y=1$ |
|---|---|---|
| $x=0$ | 0.3 | 0.1 |
| $x=1$ | 0.2 | 0.4 |

Compute $H(X)$, $H(Y)$, $H(X,Y)$, $H(X\mid Y)$, $H(Y\mid X)$, and $I(X;Y)$.
Part (b). Verify the chain rule: $H(X,Y) = H(X) + H(Y\mid X) = H(Y) + H(X\mid Y)$.
Part (c). Modify the joint distribution so that $p(x,y) = p(x)p(y)$ (independent). Verify that $I(X;Y) = 0$ and $H(X,Y) = H(X) + H(Y)$.
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
# Exercise 3: Solution
def entropy_quantities_sol(joint):
joint = np.asarray(joint, dtype=float)
pX = joint.sum(axis=1)
pY = joint.sum(axis=0)
HX = H(pX)
HY = H(pY)
HXY = H(joint.ravel())
HXGY = HXY - HY # H(X|Y)
HYGX = HXY - HX # H(Y|X)
IXY = HX + HY - HXY
return HX, HY, HXY, HXGY, HYGX, IXY
header('Exercise 3: Joint and Conditional Entropy')
# Part (a)
joint = np.array([[0.3, 0.1], [0.2, 0.4]])
HX, HY, HXY, HXGY, HYGX, IXY = entropy_quantities_sol(joint)
print(f' H(X) = {HX:.4f} nats')
print(f' H(Y) = {HY:.4f} nats')
print(f' H(X,Y) = {HXY:.4f} nats')
print(f' H(X|Y) = {HXGY:.4f} nats')
print(f' H(Y|X) = {HYGX:.4f} nats')
print(f' I(X;Y) = {IXY:.4f} nats')
# Part (b): Chain rule
check_close('H(X,Y) = H(X) + H(Y|X)', HXY, HX + HYGX)
check_close('H(X,Y) = H(Y) + H(X|Y)', HXY, HY + HXGY)
check_true('Subadditivity: H(X,Y) <= H(X)+H(Y)', HXY <= HX + HY + 1e-10)
check_true('I(X;Y) >= 0', IXY >= -1e-10)
# Part (c): Independent version
pX_marg = joint.sum(axis=1)
pY_marg = joint.sum(axis=0)
joint_indep = np.outer(pX_marg, pY_marg)
HX_i, HY_i, HXY_i, HXGY_i, HYGX_i, IXY_i = entropy_quantities_sol(joint_indep)
print()
check_close('I(X;Y) = 0 for independent', IXY_i, 0.0, tol=1e-8)
check_close('H(X,Y) = H(X)+H(Y) for independent', HXY_i, HX_i + HY_i)
print('\nTakeaway: H(X|Y) = H(X) - I(X;Y). Conditioning reduces entropy by the '
'mutual information. When X and Y are independent, conditioning helps nothing.')
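An equivalent route to $I(X;Y)$, using the same joint table: mutual information is the KL divergence between the joint and the product of its marginals, so both computations must agree. A minimal cross-check:

```python
import numpy as np
from scipy.special import rel_entr

# Same 2x2 joint table as above (rows = x, columns = y)
joint = np.array([[0.3, 0.1], [0.2, 0.4]])
pX = joint.sum(axis=1)
pY = joint.sum(axis=0)

# Route 1: I(X;Y) = KL( p(x,y) || p(x)p(y) )
I_kl = np.sum(rel_entr(joint, np.outer(pX, pY)))

# Route 2: I(X;Y) = H(X) + H(Y) - H(X,Y)
def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

I_h = H(pX) + H(pY) - H(joint)
print(f'I via KL: {I_kl:.6f}, I via entropies: {I_h:.6f}')
assert np.isclose(I_kl, I_h)
```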
Exercise 4 ★★ — Entropy Rate of a Markov Chain
Consider a 3-state Markov chain with transition matrix:

$$P = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.1 & 0.3 & 0.6 \end{pmatrix}$$

Part (a). Find the stationary distribution $\mu$ satisfying $\mu P = \mu$ and $\sum_i \mu_i = 1$.
Part (b). Compute the entropy rate $H = \sum_i \mu_i \, H(P_{i\cdot})$ in nats.
Part (c). Compare to the i.i.d. rate $H(\mu)$. Explain why the Markov chain rate is lower.
Part (d). Simulate 10,000 steps and verify that the rolling perplexity converges to the true perplexity.
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
# Exercise 4: Solution
from scipy.special import entr
P = np.array([[0.7, 0.2, 0.1],
[0.3, 0.4, 0.3],
[0.1, 0.3, 0.6]])
def stationary_distribution_sol(P):
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
mu = np.real(eigvecs[:, idx])
return mu / mu.sum()
def markov_entropy_rate_sol(P, mu):
H_per_state = np.array([np.sum(entr(P[i])) for i in range(len(P))])
return np.dot(mu, H_per_state)
def header(t): print('\n'+'='*len(t)+'\n'+t+'\n'+'='*len(t))
def check_close(n,g,e,tol=1e-6):
ok=np.allclose(g,e,atol=tol,rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {n}")
return ok
def check_true(n,c): print(f"{'PASS' if c else 'FAIL'} — {n}"); return c
header('Exercise 4: Entropy Rate of a 3-State Markov Chain')
# Part (a)
mu = stationary_distribution_sol(P)
print(f'Stationary distribution: {mu.round(4)}')
check_close('mu @ P = mu', mu @ P, mu)
check_close('sum(mu) = 1', mu.sum(), 1.0)
# Part (b)
H_rate = markov_entropy_rate_sol(P, mu)
print(f'\nEntropy rate: H = {H_rate:.4f} nats')
print(f'Perplexity: PPL = {np.exp(H_rate):.4f}')
# Part (c)
H_iid = np.sum(entr(mu))
print(f'i.i.d. rate: H = {H_iid:.4f} nats')
check_true('Markov entropy rate < i.i.d. rate', H_rate < H_iid)
print(f'Reduction: {H_iid - H_rate:.4f} nats (temporal structure reduces uncertainty)')
# Part (d): Simulation
np.random.seed(42)
T = 10000
state = np.random.choice(3, p=mu)
nll_log = []
for _ in range(T):
next_state = np.random.choice(3, p=P[state])
nll_log.append(-np.log(P[state, next_state]))
state = next_state
ppl_estimate = np.exp(np.mean(nll_log))
ppl_true = np.exp(H_rate)
error_pct = abs(ppl_estimate - ppl_true) / ppl_true * 100
check_close('Simulated PPL ≈ true PPL', ppl_estimate, ppl_true, tol=0.1)
print(f'True PPL: {ppl_true:.4f}, Simulated PPL ({T} steps): {ppl_estimate:.4f}, Error: {error_pct:.2f}%')
print('\nTakeaway: Markov chain entropy rate is always <= i.i.d. rate because temporal '
'structure makes the next state predictable given the current state. '
'LLMs exploit this long-range structure to achieve low perplexity.')
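A quick cross-check on part (a), using the same transition matrix: for an ergodic chain every row of $P^k$ converges to the stationary distribution, so $\mu$ can also be read off a high matrix power:

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])

# For an ergodic chain, each row of P^k -> mu as k grows
Pk = np.linalg.matrix_power(P, 100)
print('Rows of P^100:')
print(Pk.round(6))
# All rows agree, and each is (numerically) stationary
assert np.allclose(Pk[0], Pk[1]) and np.allclose(Pk[1], Pk[2])
assert np.allclose(Pk[0] @ P, Pk[0])
```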
Exercise 5 ★★ — Huffman Coding and Shannon Source Coding
Part (a). Given source distribution $p = (0.4, 0.3, 0.2, 0.1)$:
- Compute $H(X)$ in bits
- Construct the optimal Huffman code
- Verify $H(X) \le \bar{L} < H(X) + 1$
Part (b). Implement huffman_code(symbols, probs) returning a dict of codewords.
Part (c). For the dyadic distribution $p_k = 2^{-k}$ for $k = 1, \dots, n-1$, with $p_n = 2^{-(n-1)}$: show that Huffman coding achieves $\bar{L} = H(X)$ exactly.
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
# Exercise 5: Solution
import heapq
def huffman_code_sol(symbols, probs):
heap = [[p, [s, '']] for s, p in zip(symbols, probs)]
heapq.heapify(heap)
while len(heap) > 1:
lo = heapq.heappop(heap)
hi = heapq.heappop(heap)
for pair in lo[1:]: pair[1] = '0' + pair[1]
for pair in hi[1:]: pair[1] = '1' + pair[1]
heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
return {pair[0]: pair[1] for pair in sorted(heap[0][1:])}
def header(t): print('\n'+'='*len(t)+'\n'+t+'\n'+'='*len(t))
def check_close(n,g,e,tol=1e-6): ok=np.allclose(g,e,atol=tol,rtol=tol); print(f"{'PASS' if ok else 'FAIL'} — {n}"); return ok
def check_true(n,c): print(f"{'PASS' if c else 'FAIL'} — {n}"); return c
header('Exercise 5: Huffman Coding and Source Coding Theorem')
# Part (a) & (b)
symbols = ['A', 'B', 'C', 'D']
probs = [0.4, 0.3, 0.2, 0.1]
code = huffman_code_sol(symbols, probs)
H_X = sum(-p * np.log2(p) for p in probs)
Lbar = sum(probs[i] * len(code[s]) for i, s in enumerate(symbols))
print('Huffman code:')
for s, p in zip(symbols, probs):
cw = code[s]
print(f' {s}: p={p:.2f}, code={cw}, len={len(cw)}, ideal={-np.log2(p):.2f}')
print(f'H(X) = {H_X:.4f} bits')
print(f'L_bar = {Lbar:.4f} bits')
check_true('H(X) <= L_bar', H_X <= Lbar + 1e-10)
check_true('L_bar < H(X) + 1', Lbar < H_X + 1 + 1e-10)
# Part (c): Dyadic distribution achieves H(X) exactly
n = 5
probs_dyadic = [2**(-k) for k in range(1, n)]
probs_dyadic.append(2**(-(n-1))) # last term same as second-to-last
symbols_dyadic = [f'x{k}' for k in range(1, n+1)]
code_dyadic = huffman_code_sol(symbols_dyadic, probs_dyadic)
H_dyadic = sum(-p * np.log2(p) for p in probs_dyadic)
Lbar_dyadic = sum(probs_dyadic[i] * len(code_dyadic[s]) for i, s in enumerate(symbols_dyadic))
print(f'\nDyadic distribution (n={n}):')
print(f' H(X) = {H_dyadic:.4f} bits')
print(f' L_bar = {Lbar_dyadic:.4f} bits')
check_close('Dyadic: L_bar = H(X) exactly', Lbar_dyadic, H_dyadic, tol=0.01)
print('\nTakeaway: Huffman coding always satisfies H <= L_bar < H+1. '
'For dyadic distributions (p = 2^-k), it achieves H exactly. '
'Arithmetic coding removes the rounding loss for non-dyadic distributions.')
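A complementary check: every binary prefix code obeys the Kraft inequality $\sum_i 2^{-\ell_i} \le 1$, with equality when the code is complete. For the part (a) source the Huffman codeword lengths are (1, 2, 3, 3):

```python
# Kraft inequality check for the part (a) Huffman code
lengths = [1, 2, 3, 3]  # codeword lengths for p = (0.4, 0.3, 0.2, 0.1)
kraft_sum = sum(2.0 ** (-l) for l in lengths)
print(f'Kraft sum = {kraft_sum}')
# A complete prefix code uses the full code space: the sum is exactly 1
assert kraft_sum == 1.0
```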
Exercise 6 ★★ — Differential Entropy and Gaussian MaxEnt
Part (a). Compute the differential entropy $h(X)$ for:
- (i) $X \sim \mathrm{Uniform}(0,1)$, (ii) $X \sim \mathrm{Uniform}(0,2)$, (iii) $X \sim \mathrm{Uniform}(0,0.5)$
- (iv) $X \sim \mathcal{N}(0,1)$, (v) $X \sim \mathcal{N}(0,4)$
- (vi) $X \sim \mathrm{Exp}(1)$
Verify that $\mathrm{Uniform}(0,0.5)$ has negative differential entropy.
Part (b). Prove numerically that $h(aX) = h(X) + \log\lvert a\rvert$ for $X \sim \mathcal{N}(0,1)$ and $a = 2$.
Part (c). Numerically verify the Gaussian MaxEnt theorem: show that for $\sigma^2 = 2$, the Gaussian has strictly higher differential entropy than the Laplace and uniform distributions with the same variance.
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
# Exercise 6: Solution
from scipy.stats import norm, laplace
from scipy.special import entr
def diff_entropy_gaussian_sol(mu, sigma):
return 0.5 * np.log(2 * np.pi * np.e * sigma**2)
def diff_entropy_uniform_sol(a, b):
return np.log(b - a)
def diff_entropy_exp_sol(lam):
return 1 - np.log(lam)
def header(t): print('\n'+'='*len(t)+'\n'+t+'\n'+'='*len(t))
def check_close(n,g,e,tol=1e-4): ok=np.allclose(g,e,atol=tol,rtol=tol); print(f"{'PASS' if ok else 'FAIL'} — {n}"); return ok
def check_true(n,c): print(f"{'PASS' if c else 'FAIL'} — {n}"); return c
header('Exercise 6: Differential Entropy and Gaussian MaxEnt')
# Part (a)
cases = [
('Uniform(0,1)', diff_entropy_uniform_sol(0, 1), 0.0),
('Uniform(0,2)', diff_entropy_uniform_sol(0, 2), np.log(2)),
('Uniform(0,0.5)', diff_entropy_uniform_sol(0, 0.5), np.log(0.5)),
('Normal(0,1)', diff_entropy_gaussian_sol(0, 1), 0.5*np.log(2*np.pi*np.e)),
('Normal(0,4)', diff_entropy_gaussian_sol(0, 2), 0.5*np.log(2*np.pi*np.e*4)),
('Exp(1)', diff_entropy_exp_sol(1), 1.0),
]
for name, h_val, h_expected in cases:
print(f' {name}: h = {h_val:.4f}' + (' [NEGATIVE]' if h_val < 0 else ''))
check_close(f'h({name})', h_val, h_expected)
# Part (b): h(aX) = h(X) + log|a|
sigma = 1.0; a = 2.0
h_X = diff_entropy_gaussian_sol(0, sigma)
h_aX = diff_entropy_gaussian_sol(0, a * sigma)
check_close('h(aX) = h(X) + log|a|', h_aX, h_X + np.log(abs(a)))
# Part (c): Gaussian MaxEnt under variance constraint
sigma_sq = 2.0; sigma_val = np.sqrt(sigma_sq)
h_gaussian = diff_entropy_gaussian_sol(0, sigma_val)
# Laplace with same variance: Var = 2b^2, so b = sqrt(sigma^2/2)
b_laplace = np.sqrt(sigma_sq / 2)
h_laplace = 1 + np.log(2 * b_laplace)
# Uniform with same variance: Var = (b-a)^2/12, so b-a = sqrt(12*sigma^2)
width = np.sqrt(12 * sigma_sq)
h_uniform = np.log(width)
print(f'\nVariance = {sigma_sq}:')
print(f' h(Gaussian) = {h_gaussian:.4f} nats')
print(f' h(Laplace) = {h_laplace:.4f} nats')
print(f' h(Uniform) = {h_uniform:.4f} nats')
check_true('Gaussian > Laplace (same var)', h_gaussian > h_laplace)
check_true('Gaussian > Uniform (same var)', h_gaussian > h_uniform)
print('\nTakeaway: The Gaussian maximizes differential entropy under variance constraint. '
'This makes it the "most uncertain" distribution given only a variance budget, '
'justifying Gaussian priors in Bayesian ML and Gaussian noise in diffusion models.')
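When no closed form is available, differential entropy can also be estimated as an expectation: $h(X) = \mathbb{E}[-\log p(X)]$. A minimal Monte Carlo sketch for the standard normal, cross-checking the formula used above:

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo estimate of differential entropy: h(X) = E[-log p(X)].
# For X ~ N(0,1) this should approach 0.5*log(2*pi*e) ≈ 1.4189 nats.
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
h_mc = -np.mean(norm.logpdf(x))
h_true = 0.5 * np.log(2 * np.pi * np.e)
print(f'MC estimate: {h_mc:.4f} nats, closed form: {h_true:.4f} nats')
assert abs(h_mc - h_true) < 0.01
```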
Exercise 7 ★★★ — Rényi Entropy Family
Part (a). Implement renyi_entropy(p, alpha) that computes $H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha$ in nats for any $\alpha \ge 0$ (including the limiting cases $\alpha = 0$, $\alpha = 1$, $\alpha = \infty$).
Part (b). For the distribution $p = (0.5, 0.3, 0.15, 0.05)$, compute $H_\alpha$ for $\alpha \in \{0, 0.5, 1, 2, 5, \infty\}$. Verify the ordering $H_0 \ge H_{0.5} \ge H_1 \ge H_2 \ge H_5 \ge H_\infty$.
Part (c). Compute the min-entropy $H_\infty = -\log \max_x p(x)$ directly and verify it matches your implementation as $\alpha \to \infty$.
Part (d). Compute the guessing advantage for this distribution: $\max_x p(x) = e^{-H_\infty}$.
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
# Exercise 7: Solution
from scipy.special import entr
def renyi_entropy_sol(p, alpha):
p = np.asarray(p, dtype=float)
p = p[p > 0] # remove zeros (don't contribute)
if np.isclose(alpha, 1.0):
return np.sum(entr(p))
if np.isinf(alpha):
return -np.log(np.max(p))
if np.isclose(alpha, 0.0):
return np.log(len(p))
return (1.0 / (1.0 - alpha)) * np.log(np.sum(p ** alpha))
def header(t): print('\n'+'='*len(t)+'\n'+t+'\n'+'='*len(t))
def check_close(n,g,e,tol=1e-6): ok=np.allclose(g,e,atol=tol,rtol=tol); print(f"{'PASS' if ok else 'FAIL'} — {n}"); return ok
def check_true(n,c): print(f"{'PASS' if c else 'FAIL'} — {n}"); return c
header('Exercise 7: Rényi Entropy Family')
p = np.array([0.5, 0.3, 0.15, 0.05])
alphas = [0, 0.5, 1.0, 2.0, 5.0, np.inf]
names = ['H_0 (Hartley)', 'H_0.5', 'H_1 (Shannon)', 'H_2 (Collision)', 'H_5', 'H_inf (Min)']
# Part (b)
H_vals = [renyi_entropy_sol(p, a) for a in alphas]
print('Rényi entropies for p = (0.5, 0.3, 0.15, 0.05):')
for a, h, name in zip(alphas, H_vals, names):
a_str = 'inf' if np.isinf(a) else str(a)
print(f' alpha={a_str:>4}: H = {h:.4f} nats [{name}]')
# Verify ordering H_0 >= H_0.5 >= H_1 >= H_2 >= H_5 >= H_inf
monotone = all(H_vals[i] >= H_vals[i+1] - 1e-10 for i in range(len(H_vals)-1))
check_true('H_0 >= H_0.5 >= H_1 >= H_2 >= H_5 >= H_inf', monotone)
# Part (c): Min-entropy direct
H_inf_direct = -np.log(p.max())
check_close('H_inf direct = renyi(p, inf)', H_inf_direct, renyi_entropy_sol(p, np.inf))
# Part (d): Guessing advantage
guessing_advantage = p.max()
guessing_from_minH = np.exp(-renyi_entropy_sol(p, np.inf))
print(f'\nGuessing advantage: max_x p(x) = {guessing_advantage:.4f}')
print(f'From min-entropy: exp(-H_inf) = {guessing_from_minH:.4f}')
check_close('Guessing advantage = exp(-H_inf)', guessing_advantage, guessing_from_minH)
# Bonus: Uniform vs concentrated distribution comparison
p_uniform = np.ones(4) / 4
print(f'\nUniform (4): all H_alpha = log(4) = {np.log(4):.4f}')
for a in [0, 1.0, 2.0, np.inf]:
h_u = renyi_entropy_sol(p_uniform, a)
h_p = renyi_entropy_sol(p, a)
a_str = 'inf' if np.isinf(a) else str(a)
print(f' alpha={a_str}: uniform={h_u:.4f}, concentrated={h_p:.4f}')
print('\nTakeaway: Rényi entropy interpolates between Hartley (pure support size), '
'Shannon (expected surprise), and min-entropy (worst-case surprise). '
'Min-entropy is the cryptographic metric: high H_inf = hard to guess.')
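Collision entropy also has a direct operational meaning: $H_2 = -\log \Pr[\text{two i.i.d. draws collide}]$. A small simulation with the same $p$ cross-checks this:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.15, 0.05])

# Analytic collision entropy: H_2 = -log(sum p_i^2)
H2 = -np.log(np.sum(p ** 2))

# Empirical: draw i.i.d. pairs and measure the collision rate
n = 200_000
a = rng.choice(4, size=n, p=p)
b = rng.choice(4, size=n, p=p)
H2_emp = -np.log(np.mean(a == b))
print(f'H_2 analytic: {H2:.4f}, empirical: {H2_emp:.4f}')
assert abs(H2 - H2_emp) < 0.02
```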
Exercise 8 ★★★ — MaxEnt RL: Entropy Bonus in Policy Optimization
This exercise implements a simplified version of the soft policy gradient underlying SAC.
Part (a). Implement train_bandit(r, alpha, lr, steps) for a 5-action bandit with reward vector $r = (0.8, 0.6, 0.4, 0.2, 0.0)$. The objective is $J(\pi) = \mathbb{E}_{a \sim \pi}[r(a)] + \alpha H(\pi)$.
Part (b). Train for $\alpha \in \{0, 0.1, 0.5, 1.0\}$ and 500 steps. For each $\alpha$, verify the final policy satisfies $\pi^*(a) \propto e^{r(a)/\alpha}$ (the MaxEnt optimal policy).
Part (c). Show that as $\alpha \to 0$, the optimal policy converges to greedy (argmax); as $\alpha \to \infty$, it converges to uniform.
Part (d). Plot policy entropy over training for all $\alpha$ values.
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
# Exercise 8: Solution
from scipy.special import entr
import matplotlib.pyplot as plt
r = np.array([0.8, 0.6, 0.4, 0.2, 0.0])
n_actions = len(r)
def softmax_vec(logits):
e = np.exp(logits - logits.max())
return e / e.sum()
def train_bandit_sol(r, alpha=0.0, lr=0.1, steps=500, seed=0):
np.random.seed(seed)
theta = np.zeros(len(r))
entropies = []
for _ in range(steps):
pi = softmax_vec(theta)
H_pi = np.sum(entr(pi))
# Soft Q-value: r(a) + alpha * (-log pi(a))
if alpha > 0:
q = r + alpha * (-np.log(np.maximum(pi, 1e-15)))
else:
q = r
baseline = np.dot(pi, q)
grad = pi * (q - baseline) # policy gradient
theta += lr * grad
entropies.append(H_pi)
return np.array(entropies), softmax_vec(theta)
def header(t): print('\n'+'='*len(t)+'\n'+t+'\n'+'='*len(t))
def check_close(n,g,e,tol=1e-2): ok=np.allclose(g,e,atol=tol,rtol=tol); print(f"{'PASS' if ok else 'FAIL'} — {n}"); return ok
def check_true(n,c): print(f"{'PASS' if c else 'FAIL'} — {n}"); return c
header('Exercise 8: MaxEnt RL — Entropy Bonus in Policy Optimization')
alphas = [0.0, 0.1, 0.5, 1.0]
colors = ['#CC3311', '#EE7733', '#009988', '#0077BB']
results = {}
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
for alpha, color in zip(alphas, colors):
entropies, final_pi = train_bandit_sol(r, alpha=alpha)
results[alpha] = final_pi
axes[0].plot(entropies, color=color, alpha=0.8, label=f'alpha={alpha}')
# Part (b): Verify optimal policy = softmax(r/alpha)
print('Final policies and MaxEnt optimal verification:')
print(f"{'alpha':>6} {'Final policy':>45} {'matches softmax(r/a)':>22}")
print('-' * 80)
for alpha in alphas:
final = results[alpha]
if alpha > 0:
pi_opt = softmax_vec(r / alpha)
match = np.allclose(final, pi_opt, atol=0.05)
else:
# alpha=0: should be near-greedy
match = final.argmax() == r.argmax()
print(f"{alpha:>6.1f} {str(final.round(3)):>45} {str(match):>22}")
# Part (c): Limiting behavior
_, pi_greedy = train_bandit_sol(r, alpha=1e-4, steps=2000)
# Large alpha needs a small learning rate: the update is unstable unless lr*alpha is small
_, pi_uniform = train_bandit_sol(r, alpha=1e4, lr=5e-4, steps=2000)
check_true('alpha->0: policy ≈ greedy (argmax=0)', pi_greedy.argmax() == 0)
check_close('alpha->inf: policy ≈ uniform', pi_uniform, np.ones(n_actions)/n_actions, tol=0.05)
axes[0].set_title('Policy entropy during MaxEnt RL training')
axes[0].set_xlabel('Training step')
axes[0].set_ylabel('$H(\\pi)$ (nats)')
axes[0].legend()
# Final policy bar chart
x = np.arange(n_actions)
width = 0.2
for i, (alpha, color) in enumerate(zip(alphas, colors)):
axes[1].bar(x + i*width, results[alpha], width, color=color,
label=f'alpha={alpha}', alpha=0.85)
axes[1].set_title('Final policy for each alpha')
axes[1].set_xlabel('Action')
axes[1].set_ylabel('$\\pi(a)$')
axes[1].set_xticks(x + width*1.5)
axes[1].set_xticklabels([f'a{i}\nr={ri}' for i, ri in enumerate(r)])
axes[1].legend()
fig.tight_layout()
plt.show()
print('\nTakeaway: Entropy regularization (alpha > 0) prevents greedy collapse. '
'The optimal MaxEnt RL policy is pi^*(a) ∝ exp(r(a)/alpha) — '
'a Boltzmann/softmax distribution. Higher alpha = more exploration = '
'higher entropy = better coverage of suboptimal actions. '
'This is the foundation of SAC in robotics and RLHF in LLMs.')
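The closed form can also be checked without any training. A self-contained sketch with the same reward vector: the entropy of $\pi^*_\alpha = \mathrm{softmax}(r/\alpha)$ increases monotonically with $\alpha$:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

r = np.array([0.8, 0.6, 0.4, 0.2, 0.0])
alphas = [0.05, 0.1, 0.5, 1.0, 5.0]
Hs = [entropy(softmax(r / a)) for a in alphas]
for a, h in zip(alphas, Hs):
    print(f'alpha={a}: H(pi*) = {h:.4f} nats')
# Entropy of the optimal MaxEnt policy grows with the temperature alpha
assert all(Hs[i] < Hs[i + 1] for i in range(len(Hs) - 1))
```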
What to Review After Finishing
- Ex 1 — compute entropy for all standard distributions; understand shape
- Ex 2 — derive MaxEnt distribution via Lagrange multipliers from scratch
- Ex 3 — verify chain rule numerically
- Ex 4 — implement stationary distribution and entropy rate for Markov chains
- Ex 5 — implement Huffman coding; verify source coding theorem bound
- Ex 6 — compute differential entropy; prove Gaussian MaxEnt numerically
- Ex 7 — implement Rényi family; understand min-entropy as guessing metric
- Ex 8 — implement MaxEnt policy gradient; verify convergence to softmax optimal
References
- Cover & Thomas (2006) — Elements of Information Theory, Chapters 1-5
- MacKay (2003) — Information Theory, Inference, and Learning Algorithms, Chapters 1-5
- Haarnoja et al. (2018) — Soft Actor-Critic. arXiv:1801.01290
- Jaynes (1957) — Information Theory and Statistical Mechanics. Physical Review 106(4)
Exercise 9 — Huffman Length Versus Entropy
Compute entropy and expected Huffman code length for a small source. Verify the source coding inequality $H(X) \le \bar{L} < H(X) + 1$ in bits.
Code cell 30
# Your Solution
print("Compare entropy with an expected prefix-code length.")
Code cell 31
# Solution
header("Exercise 9: Huffman Length Versus Entropy")
p = np.array([0.4, 0.25, 0.2, 0.15])
lengths = np.array([1, 2, 3, 3])
L = float(np.sum(p * lengths))
Hb = H_bits(p)
print("entropy bits:", round(Hb, 6))
print("expected length:", round(L, 6))
check_true("source-coding bound", Hb <= L < Hb + 1)
print("Takeaway: entropy is the ideal expected code length; prefix codes pay less than one extra bit.")
Exercise 10 — Entropy Bonus in a Policy
For a softmax policy, increase the temperature and verify that policy entropy increases while the distribution becomes less peaked.
Code cell 33
# Your Solution
print("Compare policy entropy at two temperatures.")
Code cell 34
# Solution
header("Exercise 10: Entropy Bonus")
logits = np.array([3.0, 1.0, -0.5])
p_cold = softmax(logits, T=0.7)
p_warm = softmax(logits, T=2.0)
H_cold, H_warm = H(p_cold), H(p_warm)
print("cold policy:", np.round(p_cold, 4), "H=", round(float(H_cold), 6))
print("warm policy:", np.round(p_warm, 4), "H=", round(float(H_warm), 6))
check_true("higher temperature increases entropy", H_warm > H_cold)
print("Takeaway: entropy regularization encourages exploration by flattening the action distribution.")
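Pushing Exercise 10 to its limits (a self-contained sketch that re-defines the two helpers): as $T \to 0$ the policy collapses to one-hot with entropy near 0, and as $T \to \infty$ it flattens toward uniform with entropy $\log 3$:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

logits = np.array([3.0, 1.0, -0.5])
H_low = entropy(softmax(logits, T=0.01))    # near-greedy: entropy ~ 0
H_high = entropy(softmax(logits, T=100.0))  # near-uniform: entropy ~ log(3)
print(f'T=0.01: H={H_low:.6f}  T=100: H={H_high:.6f}  log(3)={np.log(3):.6f}')
assert H_low < 1e-6
assert abs(H_high - np.log(3)) < 1e-3
```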