🎯 Interview Preparation Guide
Common mathematical interview questions for ML/AI positions with detailed solutions.
Table of Contents
Linear Algebra Questions
Q1: What is the difference between eigenvalue decomposition and SVD?
Answer:
| Aspect | Eigendecomposition | SVD |
|---|---|---|
| Applies to | Square matrices only | Any matrix (m×n) |
| Formula | $A = P\Lambda P^{-1}$ | $A = U\Sigma V^T$ |
| Components | Eigenvalues, eigenvectors | Singular values, left/right singular vectors |
| Requires | Diagonalizable matrix | Always exists |
Key Insight: For symmetric positive semi-definite matrices, singular values equal eigenvalues, and $U = V$.
ML Application:
- Eigendecomposition: PCA on covariance matrix
- SVD: Recommender systems, image compression, pseudoinverse
Q2: Explain the geometric interpretation of eigenvalues and eigenvectors.
Answer:
Eigenvectors are directions that remain unchanged (except for scaling) when a linear transformation is applied. Eigenvalues are the scaling factors.
Before transformation: After transformation (A):
↑ v ↑ λv
│ │ (stretched by λ)
│ │
─────┼─────→ ─────┼─────→
│ │
ML Application:
- In PCA, eigenvectors point in directions of maximum variance
- Large eigenvalue = high variance in that direction
- We keep eigenvectors with largest eigenvalues
Q3: What makes a matrix positive definite? Why does it matter?
Answer:
A symmetric matrix $A$ is positive definite if:
- All eigenvalues are positive, OR
- $\mathbf{x}^T A \mathbf{x} > 0$ for all non-zero $\mathbf{x}$
Why it matters:
- Covariance matrices are positive semi-definite
- Hessian being PD guarantees local minimum
- Optimization: Convex quadratic functions have PD Hessians
- Numerical stability: PD matrices are invertible
Q4: Explain the null space and column space. How do they relate to linear regression?
Answer:
- Column space (range): All possible outputs $A\mathbf{x}$
- Null space (kernel): All inputs $\mathbf{x}$ where $A\mathbf{x} = \mathbf{0}$
In Linear Regression:
Finding ŷ = Xw that's closest to y
y
↗
/
/
/ ŷ (projection onto column space of X)
●─────────────────→ Column space of X
The residual (y - ŷ) is perpendicular to column space
Insight: If null space is non-trivial, there are infinitely many solutions (need regularization).
Q5: What is the rank of a matrix and why is it important?
Answer:
Rank = number of linearly independent rows (or columns)
| Rank Property | Implication |
|---|---|
| rank(A) = min(m,n) | Full rank, unique solution possible |
| rank(A) < min(m,n) | Rank deficient, singular |
| rank(X^TX) < n | Multicollinearity in regression |
ML Applications:
- Low-rank matrix approximation (compression)
- Detecting redundant features
- Matrix completion (recommender systems)
Calculus Questions
Q6: Derive the gradient of the sigmoid function.
Answer:
Given: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Using quotient rule:
$$\frac{d\sigma}{dx} = \frac{0 \cdot (1+e^{-x}) - 1 \cdot (-e^{-x})}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2}$$Simplify:
$$= \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x) \cdot \frac{1+e^{-x}-1}{1+e^{-x}} = \sigma(x)(1-\sigma(x))$$Key Result: $\sigma'(x) = \sigma(x)(1-\sigma(x))$
Why it matters: This makes backpropagation through sigmoid efficient!
Q7: Explain the chain rule and its role in backpropagation.
Answer:
Chain Rule: For $h(x) = f(g(x))$:
$$\frac{dh}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$In Neural Networks:
x → [Layer 1] → z₁ → [Activation] → a₁ → [Layer 2] → z₂ → [Loss] → L
∂L/∂W₁ = ∂L/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁
Key Insight: Backprop is just repeated application of chain rule, computed efficiently by caching intermediate values.
Q8: What is the Jacobian matrix? When do you need it?
Answer:
The Jacobian is the matrix of all first-order partial derivatives for a vector-valued function:
$$ J = $$When needed:
- Backprop through layers with multiple outputs
- Change of variables in probability (normalizing flows)
- Sensitivity analysis: How outputs change with inputs
Q9: What is the Hessian and how is it used in optimization?
Answer:
The Hessian is the matrix of second-order partial derivatives:
$$ H = \nabla^2 f = $$Uses: | Condition | Meaning | |-----------|---------| | $H$ positive definite | Local minimum | | $H$ negative definite | Local maximum | | $H$ indefinite | Saddle point |
In Optimization:
- Newton's method: $x_{n+1} = x_n - H^{-1} \nabla f$
- Approximate Hessian: Adam, BFGS
Q10: Explain Taylor series and its application in optimization.
Answer:
Taylor expansion around point $a$:
$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots$$In Optimization:
First-order approximation (gradient descent):
$$f(\theta + \Delta\theta) \approx f(\theta) + \nabla f(\theta)^T \Delta\theta$$Second-order approximation (Newton's method):
$$f(\theta + \Delta\theta) \approx f(\theta) + \nabla f^T \Delta\theta + \frac{1}{2}\Delta\theta^T H \Delta\theta$$Probability & Statistics Questions
Q11: Derive Bayes' theorem and explain its components.
Answer:
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$| Component | Name | Meaning |
|---|---|---|
| $P(A\|B)$ | Posterior | Updated belief after seeing B |
| $P(B\|A)$ | Likelihood | Probability of B given A |
| $P(A)$ | Prior | Initial belief about A |
| $P(B)$ | Evidence | Normalizing constant |
ML Example - Naive Bayes:
$$P(\text{spam}|\text{words}) = \frac{P(\text{words}|\text{spam}) \cdot P(\text{spam})}{P(\text{words})}$$Q12: What is the difference between MLE and MAP?
Answer:
Maximum Likelihood Estimation (MLE):
$$\hat{\theta}_{MLE} = \arg\max_\theta P(D|\theta)$$- Only considers data likelihood
- Can overfit
Maximum A Posteriori (MAP):
$$\hat{\theta}_{MAP} = \arg\max_\theta P(D|\theta) \cdot P(\theta)$$- Includes prior belief $P(\theta)$
- Acts as regularization
Key Connection:
- Gaussian prior → L2 regularization
- Laplace prior → L1 regularization
Q13: Explain the bias-variance tradeoff mathematically.
Answer:
For any estimator, the expected error can be decomposed:
$$E[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$| Component | Meaning | High when... |
|---|---|---|
| Bias² | Systematic error | Model too simple (underfitting) |
| Variance | Sensitivity to training data | Model too complex (overfitting) |
| σ² | Irreducible noise | Inherent in data |
Error
│
│╲ ╱
│ ╲ Total Error ╱
│ ╲ ┌────┐ ╱
│ ╲ │ │ ╱
│ ╲──│────│──╱
│ Bias² │
│ │ Variance
└────────────┴──────────→ Model Complexity
Simple Complex
Q14: What is the Central Limit Theorem and why does it matter?
Answer:
CLT: The sum (or average) of many independent random variables tends toward a normal distribution, regardless of the original distribution.
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{d} \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$Why it matters:
- Confidence intervals: We can use normal distribution
- Hypothesis testing: t-tests, z-tests
- SGD convergence: Gradient estimates are approximately normal
- Batch normalization: Large batch → normal activations
Q15: Explain covariance and correlation. What's the difference?
Answer:
Covariance: Measures how two variables change together
$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$- Unbounded
- Units depend on X and Y
Correlation: Normalized covariance
$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$- Always between -1 and 1
- Unitless
| Value | Meaning |
|---|---|
| ρ = 1 | Perfect positive linear relationship |
| ρ = 0 | No linear relationship |
| ρ = -1 | Perfect negative linear relationship |
Optimization Questions
Q16: Explain gradient descent and its variants.
Answer:
Basic Gradient Descent:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$| Variant | Update Rule | Pros/Cons |
|---|---|---|
| Batch GD | Full dataset gradient | Stable but slow |
| SGD | Single sample gradient | Fast but noisy |
| Mini-batch | Batch of samples | Best of both |
| Momentum | Accumulate velocity | Faster convergence |
| Adam | Adaptive learning rates | Works well in practice |
Q17: What makes a function convex? Why does convexity matter?
Answer:
Convexity Definition:
$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$$for all $x$, $y$ and $\lambda \in [0, 1]$.
Visual:
Convex: Non-convex:
╱╲ ╱╲
╱ ╲ ╱ ╲
╱ ╲ ╱ ╲
╱ ★ ╲ ╱ ★ ╲ ★
╱ ╲ ╱ ╲ ╱╲
local minima!
Why it matters:
- Convex functions have global minimum = local minimum
- Gradient descent guaranteed to converge
- Linear regression loss is convex
- Neural network loss is non-convex (hence harder to optimize)
Q18: Explain the Adam optimizer mathematically.
Answer:
Adam combines momentum and RMSprop:
# Initialize
m = 0 # First moment (momentum)
v = 0 # Second moment (RMSprop)
for t in range(1, num_iterations):
g = compute_gradient(theta)
# Update moments
m = beta1 * m + (1 - beta1) * g # Momentum
v = beta2 * v + (1 - beta2) * g**2 # RMSprop
# Bias correction
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
# Update parameters
theta = theta - lr * m_hat / (sqrt(v_hat) + epsilon)
Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
Q19: What are the KKT conditions?
Answer:
For constrained optimization:
$$\min f(x) \text{ subject to } g_i(x) \leq 0, h_j(x) = 0$$KKT Conditions:
- Stationarity: $\nabla f + \sum_i \lambda_i \nabla g_i + \sum_j \nu_j \nabla h_j = 0$
- Primal feasibility: $g_i(x) \leq 0$, $h_j(x) = 0$
- Dual feasibility: $\lambda_i \geq 0$
- Complementary slackness: $\lambda_i g_i(x) = 0$
ML Application: SVM optimization satisfies KKT conditions.
Information Theory Questions
Q20: What is entropy and how is it used in ML?
Answer:
Entropy: Measure of uncertainty/randomness
$$H(X) = -\sum_x P(x) \log P(x)$$| Distribution | Entropy |
|---|---|
| Certain (one outcome) | 0 |
| Uniform | Maximum |
| More spread out | Higher |
ML Uses:
- Decision tree splitting (maximize information gain)
- Regularization in neural networks
- Measuring model confidence
Q21: Explain cross-entropy loss. Why is it used for classification?
Answer:
Cross-Entropy:
$$H(p, q) = -\sum_x p(x) \log q(x)$$For classification:
- $p$ = true distribution (one-hot)
- $q$ = predicted probabilities
Why use it:
- Gradients: Better gradients than MSE for classification
- Probability interpretation: Natural for probability outputs
- Convexity: Convex for logistic regression
Q22: What is KL divergence and how is it related to cross-entropy?
Answer:
KL Divergence: Measures how different distribution Q is from P
$$D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$Relation to Cross-Entropy:
$$D_{KL}(P || Q) = H(P, Q) - H(P)$$ $$\text{KL Divergence} = \text{Cross-Entropy} - \text{Entropy of P}$$Key Properties:
- $D_{KL} \geq 0$ (Gibbs' inequality)
- $D_{KL} = 0$ iff $P = Q$
- Not symmetric: $D_{KL}(P||Q) \neq D_{KL}(Q||P)$
ML Uses:
- VAE loss function
- Knowledge distillation
- Policy gradient (KL constraint)
Applied ML Math Questions
Q23: Derive the gradient for logistic regression.
Answer:
Model: $\hat{y} = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}$
Loss (single sample): $L = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]$
Gradient derivation:
$$\frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}}$$Where $z = \mathbf{w}^T\mathbf{x}$:
- $\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$
- $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$
- $\frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}$
Result: $\nabla_\mathbf{w} L = (\hat{y} - y)\mathbf{x}$
Q24: Explain the math behind batch normalization.
Answer:
Forward Pass:
- Compute batch statistics:
- $\mu_B = \frac{1}{m}\sum_{i=1}^m x_i$
- $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$
-
Normalize:
- $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
-
Scale and shift:
- $y_i = \gamma \hat{x}_i + \beta$
Why it works:
- Reduces internal covariate shift
- Allows higher learning rates
- Acts as regularization
- Smoother loss landscape
Q25: Derive the attention mechanism math in Transformers.
Answer:
Scaled Dot-Product Attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$Step by step:
- Compute similarities: $QK^T$ gives attention scores
- Scale: Divide by $\sqrt{d_k}$ to prevent softmax saturation
- Softmax: Convert to probability distribution
- Weighted sum: Multiply by V
Multi-Head Attention:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
Deep Learning Math Questions
Q21: Explain the math behind the Transformer's attention mechanism.
Answer:
Scaled dot-product attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$Why $\sqrt{d_k}$? Without scaling, dot products grow with dimension $d_k$, pushing softmax into saturation (near 0 or 1 gradients). Scaling keeps variance ~1.
Multi-head attention allows attending to information from different representation subspaces:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$Complexity: $O(n^2 d)$ where $n$ is sequence length — the quadratic bottleneck motivating efficient attention.
Q22: Derive the backpropagation equations for a simple 2-layer network.
Answer:
Network: $z_1 = W_1x + b_1$, $a_1 = \sigma(z_1)$, $z_2 = W_2a_1 + b_2$, $\hat{y} = \text{softmax}(z_2)$
Loss: $L = -\sum_k y_k \log \hat{y}_k$ (cross-entropy)
Backward pass (using chain rule): 1. $\delta_2 = \hat{y} - y$ (softmax + cross-entropy simplification) 2. $\frac{\partial L}{\partial W_2} = \delta_2 a_1^T$ 3. $\frac{\partial L}{\partial b_2} = \delta_2$ 4. $\delta_1 = (W_2^T \delta_2) \odot \sigma'(z_1)$ (element-wise) 5. $\frac{\partial L}{\partial W_1} = \delta_1 x^T$ 6. $\frac{\partial L}{\partial b_1} = \delta_1$
Key insight: Each layer's gradient depends on the downstream gradient $(W^T \delta)$ modulated by the local derivative $\sigma'(z)$.
Q23: What causes vanishing/exploding gradients and how are they addressed?
Answer:
For deep network with $L$ layers, gradient magnitude scales as:
$$\prod_{l=1}^L \|W_l\| \cdot |\sigma'(z_l)|$$- Vanishing: $\|W_l\| \cdot |\sigma'| < 1$ repeatedly → gradient → 0
- Exploding: $\|W_l\| \cdot |\sigma'| > 1$ repeatedly → gradient → ∞
Solutions: | Technique | How it helps | |-----------|-------------| | ReLU | $\sigma'(x) = 1$ for $x > 0$ (no shrinkage) | | Residual connections | Gradient flows through skip: $\frac{\partial}{\partial x}(x + F(x)) = 1 + F'(x)$ | | Batch normalization | Keeps activations well-conditioned | | Xavier/He initialization | Sets $\text{Var}(W) = 1/n_{\text{in}}$ or $2/n_{\text{in}}$ | | Gradient clipping | Caps gradient norm to threshold |
Q24: Explain batch normalization mathematically. Why does it help?
Answer:
Forward pass (for mini-batch $\mathcal{B}$):
$$\mu_\mathcal{B} = \frac{1}{m}\sum_i x_i, \quad \sigma^2_\mathcal{B} = \frac{1}{m}\sum_i(x_i - \mu_\mathcal{B})^2$$ $$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B} + \epsilon}}, \quad y_i = \gamma\hat{x}_i + \beta$$Why it helps: 1. Reduces internal covariate shift (distribution of inputs to each layer stabilizes) 2. Allows higher learning rates (smoother loss landscape) 3. Acts as regularizer (batch statistics add noise) 4. Makes the loss landscape smoother: $\|\nabla L\|$ varies less
Generative Models Math Questions
Q25: Derive the ELBO for VAEs.
Answer:
Start with log-likelihood:
$$\log p(x) = \log \int p(x|z)p(z)dz$$Introduce variational distribution $q(z|x)$:
$$\log p(x) = \underbrace{E_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))}_{\text{ELBO}} + D_{KL}(q(z|x) \| p(z|x))$$Since KL ≥ 0: $\log p(x) \geq \text{ELBO}$
ELBO = Reconstruction - KL: - Reconstruction: How well can we decode $z$ back to $x$? - KL: How close is the encoder to the prior?
Q26: What is the GAN objective and what does the optimal discriminator look like?
Answer:
$$\min_G \max_D \; E_{x \sim p_{\text{data}}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))]$$Optimal discriminator (for fixed $G$):
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$$With $D^*$, the generator minimizes:
$$2 \cdot D_{JS}(p_{\text{data}} \| p_G) - \log 4$$where $D_{JS}$ is the Jensen-Shannon divergence.
Problem: When $D$ is too good, $\log(1 - D(G(z)))$ saturates → vanishing gradients for $G$.
Quick Review Checklist
Before Your Interview, Make Sure You Can:
Linear Algebra:
- [ ] Multiply matrices and explain the dimensions
- [ ] Explain eigenvalue decomposition geometrically
- [ ] Describe SVD and its applications
- [ ] Define and identify positive definite matrices
- [ ] Explain rank and its implications
Calculus:
- [ ] Derive common function derivatives (sigmoid, softmax)
- [ ] Apply the chain rule for backpropagation
- [ ] Explain gradient, Jacobian, and Hessian
- [ ] Use Taylor series for approximations
Probability:
- [ ] Derive and apply Bayes' theorem
- [ ] Explain MLE vs MAP
- [ ] Describe common distributions
- [ ] Explain bias-variance tradeoff
- [ ] Define and compute expectation, variance, covariance
Optimization:
- [ ] Explain gradient descent variants
- [ ] Describe convexity and its importance
- [ ] Walk through Adam optimizer
- [ ] Explain regularization mathematically
Information Theory:
- [ ] Define and compute entropy
- [ ] Explain cross-entropy loss
- [ ] Describe KL divergence and its properties
Applied:
- [ ] Derive gradients for logistic regression
- [ ] Explain batch normalization math
- [ ] Describe attention mechanism math
Tips for Math Interviews
- Start with intuition before diving into formulas
- Draw pictures - geometric intuition is powerful
- Connect to ML applications - show you understand why it matters
- Be honest if you don't know something
- Practice derivations by hand
- Understand, don't memorize - interviewers test understanding
📅 Study Plan
Week 1: Foundations
- Day 1-2: Linear algebra (vectors, matrices, eigenvalues)
- Day 3-4: Calculus (derivatives, gradients, chain rule)
- Day 5-6: Probability (Bayes, distributions, expectation)
- Day 7: Review + practice problems
Week 2: Core ML Math
- Day 1-2: Optimization (gradient descent, convexity, Adam)
- Day 3-4: Information theory (entropy, KL, cross-entropy)
- Day 5-6: Statistics (MLE, MAP, hypothesis testing)
- Day 7: Review + mock interview
Week 3: Advanced Topics
- Day 1-2: Deep learning math (backprop, batch norm, attention)
- Day 3-4: Generative models (VAE ELBO, GAN theory, diffusion)
- Day 5-6: Graph theory + kernel methods
- Day 7: Full review + timed practice
Daily Practice Routine
- ⏰ 10 min: Review 5 formulas from cheatsheet
- 📝 20 min: Derive one key result by hand
- 💻 20 min: Implement one concept in code
- 🎯 10 min: Answer one interview question aloud
Good luck with your interviews! 🍀