Lab2-4. Softmax Regression via Gradient Descent

Author

Byungkook Oh

Modified

2026.03.22

Objectives

Extend binary logistic regression to multi-class classification using softmax regression (multinomial logistic regression).
Implement the cross-entropy loss and its gradient in both element-wise and matrix form.
Visualize how decision regions evolve during training, partitioning the feature space into \(K\) classes.

Background

From Sigmoid to Softmax

Logistic regression handles two classes by squeezing a single linear score through the sigmoid. When we have \(K > 2\) classes, the natural generalization is softmax regression. Instead of one weight vector \(\mathbf{w} \in \mathbb{R}^d\), we learn a weight matrix \(\mathbf{W} \in \mathbb{R}^{d \times K}\) — one column per class. The model computes \(K\) raw scores (logits) for each sample and converts them into a probability distribution using the softmax function.

Softmax Function

Given a logit vector \(\mathbf{z} = [z_1, \dots, z_K]^\top\), the softmax function assigns each class a probability:

\[\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K\]

All outputs are positive and sum to 1, forming a valid probability distribution. When \(K = 2\), softmax reduces exactly to sigmoid applied to the difference \(z_1 - z_2\).

Cross-Entropy Loss

Let \(\mathbf{y}_i\) be the one-hot encoding of the label for sample \(i\) — a vector of length \(K\) with a single 1 at the true class position and 0 elsewhere. The cross-entropy loss over \(n\) samples is:

\[\mathcal{L}_{\text{CE}}(\mathbf{W}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p_{ik}\]

where \(\mathbf{p}_i = \text{softmax}(\mathbf{X}_i \mathbf{W})\). Since \(\mathbf{y}_i\) is one-hot, only the term for the true class survives in the inner sum:

\[\mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \log p_{i,\, c_i}\]

where \(c_i\) is the true class of sample \(i\). This is the negative log-probability of the correct class, averaged over all samples.

Gradient (Not included in the exam)

In logistic regression, the gradient is

\[ \nabla_\mathbf{w} \mathcal{L}_{\text{BCE}} = \frac{1}{n}\mathbf{X}^\top (\mathbf{p} - \mathbf{y}) \]

Softmax regression keeps the same feature-weighted error pattern, but now there is one probability/error term per class, so the gradient becomes a matrix rather than a vector. We will not derive this from scratch here, but the stacking structure is useful to see.

If we define the error matrix \(\mathbf{E} = \mathbf{P} - \mathbf{Y}\), then each entry of the gradient is

\[\frac{\partial \mathcal{L}}{\partial W_{jk}} = \frac{1}{n}\sum_{i=1}^{n} e_{ik} \, x_{ij}\]

For a fixed class \(k\), collecting the partial derivatives with respect to all feature weights gives one gradient column:

\[ \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial W_{1k}} \\ \frac{\partial \mathcal{L}}{\partial W_{2k}} \\ \vdots \\ \frac{\partial \mathcal{L}}{\partial W_{dk}} \end{bmatrix} = \frac{1}{n} \begin{bmatrix} \sum_{i=1}^{n} e_{ik} x_{i1} \\ \sum_{i=1}^{n} e_{ik} x_{i2} \\ \vdots \\ \sum_{i=1}^{n} e_{ik} x_{id} \end{bmatrix} \]

Doing this for every class and placing those columns side by side gives the full gradient matrix:

\[ \nabla_\mathbf{W}\mathcal{L} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial W_{11}} & \frac{\partial \mathcal{L}}{\partial W_{12}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{1K}} \\ \frac{\partial \mathcal{L}}{\partial W_{21}} & \frac{\partial \mathcal{L}}{\partial W_{22}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{2K}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \mathcal{L}}{\partial W_{d1}} & \frac{\partial \mathcal{L}}{\partial W_{d2}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{dK}} \end{bmatrix} = \frac{1}{n}\mathbf{X}^\top \mathbf{E} = \frac{1}{n}\mathbf{X}^\top (\mathbf{P} - \mathbf{Y}) \]

where \(\mathbf{P}\) is the \(n \times K\) matrix of predicted probabilities and \(\mathbf{Y}\) is the \(n \times K\) one-hot label matrix.

This is the same stacking idea as before, now lifted from a vector to a matrix: in logistic regression we stacked partial derivatives into one gradient vector, while in softmax regression we stack one such vector per class to form the gradient matrix. Multiplying by \(\mathbf{X}^\top\) computes the feature-weighted average of these class-wise errors — exactly the same pattern as the binary case.

0. Setup

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs
import seaborn as sns

sns.set_theme(style='whitegrid', font_scale=1.2)
np.random.seed(42)

1. Softmax Regression via Gradient Descent

1.1 Generating Synthetic Data

We create the same 2-feature dataset from the logistic regression lab, but this time we keep all three original cluster labels instead of collapsing them to binary.

np.random.seed(42)

X_raw, y_multi = make_blobs(
    n_samples=250,
    centers=[(0, 0), (1.0, 0.0), (0.0, 1.0)],
    cluster_std=[0.3, 0.3, 0.3],
    random_state=42
)

K = 3   # number of classes
n = len(y_multi)

# Build the design matrix: prepend a column of ones for the bias term
# X_aug has shape (250, 3): each row is [1, x1_i, x2_i]
X_aug = np.column_stack([np.ones(n), X_raw])

# One-hot encode: Y_onehot has shape (n, K)
Y_onehot = np.zeros((n, K))
Y_onehot[np.arange(n), y_multi] = 1.0

print(f"X_aug shape    : {X_aug.shape}")       # (250, 3)
print(f"Y_onehot shape : {Y_onehot.shape}")     # (250, 3)
print(f"Classes        : {np.unique(y_multi)}")
print(f"Samples/class  : {[int(np.sum(y_multi == k)) for k in range(K)]}")

X_aug shape    : (250, 3)
Y_onehot shape : (250, 3)
Classes        : [0 1 2]
Samples/class  : [84, 83, 83]

The design matrix X_aug is identical to the logistic regression lab: each row is \([1, x_{i1}, x_{i2}]\). The key difference is the label representation — from a binary vector to a one-hot matrix \(\mathbf{Y}\) of shape \((n, K)\).

fig, ax = plt.subplots(figsize=(7, 5))

markers = ['s', 'o', '^']
colors  = ['tomato', 'steelblue', 'seagreen']
labels  = ['Class 0', 'Class 1', 'Class 2']

for k in range(K):
    mask = y_multi == k
    ax.scatter(X_raw[mask, 0], X_raw[mask, 1],
               marker=markers[k], color=colors[k], edgecolors='k',
               linewidths=0.4, label=labels[k], alpha=0.7)

ax.set_title("3-Class Classification Dataset")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.legend()
ax.grid(True, linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()

The three clusters are well-separated: Class 0 near the origin, Class 1 along the positive \(x_1\) axis, and Class 2 along the positive \(x_2\) axis. A linear model should be able to partition these three regions with high accuracy.

1.2 Implementing the Loss and Gradient

We implement the softmax, cross-entropy loss, and gradient following the formulas from the Background section. The weight matrix \(\mathbf{W}\) has shape \((d, K) = (3, 3)\): three input dimensions (bias, \(x_1\), \(x_2\)) times three classes.

Element-wise view. For a single sample \(i\) with true class \(c_i\), the predicted probability vector is \(\mathbf{p}_i = \text{softmax}(\mathbf{X}_i \mathbf{W})\). The error for class \(k\) is:

\[e_{ik} = p_{ik} - y_{ik} = \begin{cases} p_{ik} - 1 & \text{if } k = c_i \\ p_{ik} & \text{otherwise} \end{cases}\]

The gradient contribution from sample \(i\) to the weight connecting feature \(j\) to class \(k\) is \(e_{ik} \cdot x_{ij}\). Averaging over all samples gives \(\frac{\partial \mathcal{L}}{\partial W_{jk}} = \frac{1}{n} \sum_i e_{ik} \, x_{ij}\), which in matrix form is \(\frac{1}{n} \mathbf{X}^\top (\mathbf{P} - \mathbf{Y})\).

def softmax(Z):
    """
    Numerically stable softmax over rows.

    Subtracting the row-max prevents overflow in exp.
    Z has shape (n, K); returns P with the same shape.
    """
    Z_stable = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z_stable)
    return E / E.sum(axis=1, keepdims=True)


def ce_loss(X, Y, W):
    """
    Cross-entropy loss for softmax regression.

    X: (n, d),  Y: (n, K) one-hot,  W: (d, K)
    L = -(1/n) * sum_i sum_k Y_ik * log(P_ik)
    """
    P = softmax(X @ W)                   # (n, K)
    eps = 1e-12
    return -float(np.mean(np.sum(Y * np.log(P + eps), axis=1)))


def ce_gradient_elementwise(X, Y, W):
    """
    Compute the gradient by iterating over samples and features.
    This makes each step explicit for pedagogical clarity.

    Returns dL/dW of shape (d, K).
    """
    n, d = X.shape
    K = Y.shape[1]
    P = softmax(X @ W)         # (n, K)
    E = P - Y                  # (n, K) error matrix

    grad = np.zeros((d, K))
    for j in range(d):         # for each feature (including bias)
        for k in range(K):     # for each class
            grad[j, k] = (1.0 / n) * np.sum(E[:, k] * X[:, j])
    return grad


def ce_gradient(X, Y, W):
    """
    Same result in one line using matrix multiplication.
    (1/n) * X^T @ (P - Y) computes all d×K partial derivatives at once.
    """
    n = X.shape[0]
    P = softmax(X @ W)                    # (n, K)
    return (1.0 / n) * (X.T @ (P - Y))   # (d, K)

The structure mirrors the binary case almost exactly:

Binary (Logistic)	Multi-class (Softmax)
`sigmoid(X @ w)` → probabilities (scalar per sample)	`softmax(X @ W)` → probabilities (vector per sample)
`bce_loss`: \(-\frac{1}{n}\sum[y\log p + (1{-}y)\log(1{-}p)]\)	`ce_loss`: \(-\frac{1}{n}\sum_i \log p_{i,c_i}\)
`bce_gradient`: \(\frac{1}{n}\mathbf{X}^\top(\mathbf{p} - \mathbf{y})\)	`ce_gradient`: \(\frac{1}{n}\mathbf{X}^\top(\mathbf{P} - \mathbf{Y})\)
\(\mathbf{w} \in \mathbb{R}^d\) (vector)	\(\mathbf{W} \in \mathbb{R}^{d \times K}\) (matrix)

Let us verify that both gradient implementations produce the same result:

W_init = np.zeros((X_aug.shape[1], K))  # (3, 3)

grad_elem   = ce_gradient_elementwise(X_aug, Y_onehot, W_init)
grad_matrix = ce_gradient(X_aug, Y_onehot, W_init)

print(f"Initial CE loss : {ce_loss(X_aug, Y_onehot, W_init):.4f}")
print(f"Expected        : {-np.log(1.0 / K):.4f}  (= -log(1/K), uniform prediction)")
print(f"\nGradient (element-wise):\n{grad_elem.round(4)}")
print(f"\nGradient (matrix):\n{grad_matrix.round(4)}")
print(f"\nMatch: {np.allclose(grad_elem, grad_matrix)}")

Initial CE loss : 1.0986
Expected        : 1.0986  (= -log(1/K), uniform prediction)

Gradient (element-wise):
[[-0.0027  0.0013  0.0013]
 [ 0.1211 -0.2308  0.1097]
 [ 0.1095  0.1073 -0.2168]]

Gradient (matrix):
[[-0.0027  0.0013  0.0013]
 [ 0.1211 -0.2308  0.1097]
 [ 0.1095  0.1073 -0.2168]]

Match: True

At \(\mathbf{W} = \mathbf{0}\), softmax outputs \(p_k = 1/K = 1/3\) for every class and every sample, so the initial loss is \(-\log(1/3) \approx 1.0986\). The gradient for each class reflects the imbalance between the uniform prediction (\(1/3\)) and the actual class proportions, weighted by the mean feature values.

1.3 Writing the Training Loop

The training loop is structurally identical to train_logistic — the only difference is that W is a matrix instead of a vector.

def train_softmax(X, Y, lr=0.5, epochs=500):
    """
    Full-batch gradient descent for softmax regression.

    Parameters
    ----------
    X  : (n, d)  design matrix (with bias column)
    Y  : (n, K)  one-hot labels
    lr : float   learning rate
    epochs : int number of passes

    Returns
    -------
    W_history    : list of (d, K) arrays, length epochs+1
    loss_history : list of float, length epochs+1
    """
    d, K = X.shape[1], Y.shape[1]
    W = np.zeros((d, K))
    W_history    = [W.copy()]
    loss_history = [ce_loss(X, Y, W)]

    for _ in range(epochs):
        grad = ce_gradient(X, Y, W)
        W    = W - lr * grad          # gradient descent update
        W_history.append(W.copy())
        loss_history.append(ce_loss(X, Y, W))

    return W_history, loss_history

Comparing the Binary and Multi-Class Training Loops

The update rule \(\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_\mathbf{W}\mathcal{L}\) is exactly the same — NumPy handles the matrix subtraction element-wise. The .copy() is still necessary for the same reason as before: without it, every entry in W_history would point to the same (final) matrix.

1.4 Running the Optimizer

W_hist, loss_hist = train_softmax(X_aug, Y_onehot, lr=0.5, epochs=300)

W_final    = W_hist[-1]
P_final    = softmax(X_aug @ W_final)
pred_multi = np.argmax(P_final, axis=1)
accuracy   = float(np.mean(pred_multi == y_multi))

print(f"Initial loss : {loss_hist[0]:.4f}")
print(f"Final loss   : {loss_hist[-1]:.4f}")
print(f"Accuracy     : {accuracy * 100:.1f}%")
print(f"\nLearned W (rows = [bias, w1, w2], cols = classes 0, 1, 2):")
print(W_final.round(4))

Initial loss : 1.0986
Final loss   : 0.1788
Accuracy     : 95.6%

Learned W (rows = [bias, w1, w2], cols = classes 0, 1, 2):
[[ 2.1905 -1.0978 -1.0928]
 [-2.4749  4.3155 -1.8406]
 [-2.477  -1.7428  4.2199]]

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# --- Left: loss curve ---
axes[0].plot(loss_hist, color='steelblue', lw=1.5)
axes[0].set_title("Cross-Entropy Loss over Epochs")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("CE Loss")
axes[0].grid(True, linestyle='--', alpha=0.5)

# --- Right: weight convergence ---
param_names = ['b', '$w_1$', '$w_2$']
class_colors = ['tomato', 'steelblue', 'seagreen']
line_styles  = ['-', '--', ':']
W_arr = np.array(W_hist)   # (epochs+1, d, K)

for k in range(K):
    for j in range(W_arr.shape[1]):
        axes[1].plot(W_arr[:, j, k], color=class_colors[k],
                     ls=line_styles[j], lw=1.2,
                     label=f"Class {k}: {param_names[j]}")

axes[1].set_title("Weight Convergence (per class)")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Weight value")
axes[1].legend(fontsize=7, ncol=3, loc='best')
axes[1].grid(True, linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

The loss decreases monotonically — cross-entropy is convex, so gradient descent is guaranteed to converge (for a small enough learning rate). The weight matrix has \(d \times K = 9\) parameters evolving simultaneously. Each class develops its own bias and feature weights:

Class 0 (origin): should have small or negative weights for \(w_1\) and \(w_2\).
Class 1 (right): should have a large positive \(w_1\) and a small \(w_2\).
Class 2 (upper): should have a small \(w_1\) and a large positive \(w_2\).

1.5 Visualizing the Decision Regions

In the binary case, the decision boundary was a single line separating two half-planes. With \(K = 3\) classes, each pair of classes has a boundary where \(p_j = p_k\), and the feature space is partitioned into three regions. We visualize this by coloring every point in the plane according to the predicted class.

def draw_softmax_regions(X_aug, X_raw, y, W, ax, title="", resolution=200):
    """
    Fill the background with predicted class regions and overlay data points.
    """
    x1_min, x1_max = X_raw[:, 0].min() - 0.5, X_raw[:, 0].max() + 0.5
    x2_min, x2_max = X_raw[:, 1].min() - 0.5, X_raw[:, 1].max() + 0.5

    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, resolution),
                           np.linspace(x2_min, x2_max, resolution))
    X_grid = np.column_stack([np.ones(xx1.size), xx1.ravel(), xx2.ravel()])
    preds = np.argmax(softmax(X_grid @ W), axis=1).reshape(xx1.shape)

    # Background regions
    cmap_bg = ListedColormap(['#FADADD', '#D0E8FF', '#D5F5D5'])
    ax.contourf(xx1, xx2, preds, levels=[-0.5, 0.5, 1.5, 2.5],
                cmap=cmap_bg, alpha=0.6)

    # Data points
    markers = ['s', 'o', '^']
    colors  = ['tomato', 'steelblue', 'seagreen']
    labels  = ['Class 0', 'Class 1', 'Class 2']
    for k in range(3):
        mask = y == k
        ax.scatter(X_raw[mask, 0], X_raw[mask, 1],
                   marker=markers[k], color=colors[k], edgecolors='k',
                   linewidths=0.4, alpha=0.7, label=labels[k])

    ax.set_xlabel("$x_1$")
    ax.set_ylabel("$x_2$")
    ax.set_title(title)
    ax.legend(fontsize=8)
    ax.grid(True, linestyle='--', alpha=0.4)

fig, axes = plt.subplots(1, 4, figsize=(18, 4), sharex=True, sharey=True)

snapshot_epochs = [1, 5, 30, 300]
for ax, ep in zip(axes, snapshot_epochs):
    draw_softmax_regions(X_aug, X_raw, y_multi, W_hist[ep], ax,
                         title=f"Epoch {ep}")

plt.suptitle("Softmax Decision Regions during Training", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

At epoch 1, the regions are nearly uniform — the weights have barely moved from zero, so every point receives roughly \(p_k \approx 1/3\) for all classes. As training progresses, the three regions tighten around their respective clusters. By epoch 300, each cluster is enclosed in its own region, with the boundaries meeting at a central point — the Voronoi-like partition that softmax regression produces in feature space.

Unlike the binary case where the boundary was a single line, here we see three pairwise boundaries (one per pair of classes), each defined by the hyperplane \((\mathbf{w}_j - \mathbf{w}_k)^\top \mathbf{x} = 0\). All three boundaries intersect at a single point, creating a three-way junction.

Summary

	Logistic Regression	Softmax Regression
Classes	\(K = 2\)	\(K \geq 2\)
Activation	Sigmoid \(\sigma(z)\)	Softmax \(\text{softmax}(\mathbf{z})\)
Parameters	\(\mathbf{w} \in \mathbb{R}^d\) (vector)	\(\mathbf{W} \in \mathbb{R}^{d \times K}\) (matrix)
Loss	BCE: \(-\frac{1}{n}\sum[y\log p + (1{-}y)\log(1{-}p)]\)	CE: \(-\frac{1}{n}\sum_i \log p_{i,c_i}\)
Gradient	\(\frac{1}{n}\mathbf{X}^\top(\mathbf{p} - \mathbf{y})\)	\(\frac{1}{n}\mathbf{X}^\top(\mathbf{P} - \mathbf{Y})\)
Boundary	Single hyperplane	\(\binom{K}{2}\) pairwise hyperplanes

Softmax regression generalizes logistic regression to \(K\) classes. The sigmoid becomes softmax, the weight vector becomes a matrix, and BCE becomes cross-entropy — but the gradient retains the same \(\frac{1}{n}\mathbf{X}^\top(\mathbf{P} - \mathbf{Y})\) structure.
The decision boundary partitions the feature space into \(K\) convex regions. Each pairwise boundary is a hyperplane, and all boundaries meet at a central junction — a Voronoi-like partition induced by the linear model.
The gradient formula is identical in structure to the binary case. This is not a coincidence: softmax with cross-entropy is the canonical link for the multinomial distribution, just as sigmoid with BCE is canonical for the Bernoulli.

References

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. (Chapter 4.3: Multi-class logistic regression)
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. (Section 6.2.2.3: Softmax)