
Softmax Key Points in Three Minutes

Published: 2025/3/15

Personal website: 紅色石頭的機器學習之路
CSDN blog: 紅色石頭的專欄
Zhihu: 紅色石頭
Weibo: RedstoneWill的微博
GitHub: RedstoneWill的GitHub
WeChat official account: AI有道 (ID: redstonewill)

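1. What is Softmax

Softmax is used very widely in machine learning and deep learning. In multi-class classification problems (C > 2) in particular, the final output units of the classifier are passed through the Softmax function. Softmax is defined as follows: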

$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$

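Here $V_i$ is the output of the classifier's preceding output unit, $i$ is the class index, and $C$ is the total number of classes. $S_i$ is the ratio of the exponential of the current element to the sum of the exponentials of all elements. Softmax turns the raw multi-class outputs into relative probabilities, which are easier to interpret and compare. Consider the following example.

Take a multi-class problem with C = 4. The final output layer of a linear classifier produces four values: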

$$V = \begin{bmatrix} -3 \\ 2 \\ -1 \\ 0 \end{bmatrix}$$

After Softmax, the values become relative probabilities:

$$S = \begin{bmatrix} 0.0057 \\ 0.8390 \\ 0.0418 \\ 0.1135 \end{bmatrix}$$

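The Softmax outputs express the relative probabilities of the different classes. Counting class indices from 0, $S_1 = 0.8390$ is the largest probability, so class 1 is the most likely prediction. Converting continuous scores into relative probabilities makes the result easier to read.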
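As a quick check of the example above, here is a minimal NumPy sketch. The helper name `softmax` is chosen for illustration only, and this version has no overflow protection.

```python
import numpy as np

def softmax(v):
    """Plain softmax of a 1-D score vector (no overflow protection)."""
    exp_v = np.exp(v)
    return exp_v / np.sum(exp_v)

V = np.array([-3.0, 2.0, -1.0, 0.0])
S = softmax(V)
print(np.round(S, 4))   # close to the four probabilities quoted above
print(np.argmax(S))     # 1, the index of the most probable class
```

The probabilities sum to 1, and the largest one sits at index 1, matching the text.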

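In practice, Softmax needs protection against numerical overflow. Because it exponentiates its inputs, large values in V can easily overflow. The usual fix is to subtract the maximum of V from every element of V before exponentiating: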

$$D = \max(V)$$

$$S_i = \frac{e^{V_i - D}}{\sum_{j=1}^{C} e^{V_j - D}}$$

The corresponding Python example:

    import numpy as np

    # example with 3 classes, each having a large score
    scores = np.array([123, 456, 789])
    scores -= np.max(scores)  # scores becomes [-666, -333, 0]
    p = np.exp(scores) / np.sum(np.exp(scores))
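The shift is safe because the factor $e^{-D}$ appears in both numerator and denominator and cancels. The sketch below illustrates this and shows what happens without the shift. The function name `softmax_stable` is an assumption made for this example.

```python
import numpy as np

def softmax_stable(v):
    """Softmax with the max-subtraction trick described above."""
    shifted = v - np.max(v)          # largest entry becomes 0, so exp() <= 1
    exp_v = np.exp(shifted)
    return exp_v / np.sum(exp_v)

small = np.array([-3.0, 2.0, -1.0, 0.0])
# Adding a constant to every score leaves the probabilities unchanged.
same = np.allclose(softmax_stable(small), softmax_stable(small + 100.0))

big = np.array([123.0, 456.0, 789.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.sum(np.exp(big))   # exp(789) overflows to inf
stable = softmax_stable(big)

print(same)                    # True
print(np.isnan(naive).any())   # True: inf / inf gives nan
print(stable)                  # finite probabilities that sum to 1
```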

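2. Softmax loss function

The output of a linear classifier is the matrix product of the input x and the weights: s = Wx. For multi-class problems, Softmax is applied to this linear output. This section looks at the Softmax loss function.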

$$S_i = \frac{e^{s_{y_i}}}{\sum_{j=1}^{C} e^{s_j}}$$

Here $s_{y_i}$ is the linear score of the correct class, and $S_i$ is the Softmax output for the correct class.

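Because the log operator does not change monotonicity, we take the log of $S_i$:

$$\log S_i = \log\frac{e^{s_{y_i}}}{\sum_{j=1}^{C} e^{s_j}}$$

We want $S_i$, and therefore $\log S_i$, to be as large as possible, meaning the relative probability of the correct class should be as large as possible. Putting a minus sign in front gives the loss function:

$$L_i = -\log S_i = -\log\frac{e^{s_{y_i}}}{\sum_{j=1}^{C} e^{s_j}}$$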

Simplifying further, the log cancels the exponential in the numerator:

$$L_i = -\log\frac{e^{s_{y_i}}}{\sum_{j=1}^{C} e^{s_j}} = -\left(s_{y_i} - \log\sum_{j=1}^{C} e^{s_j}\right) = -s_{y_i} + \log\sum_{j=1}^{C} e^{s_j}$$

The Softmax loss now has a simple form.

As a simple example, take the linear output from the previous section:

$$V = \begin{bmatrix} -3 \\ 2 \\ -1 \\ 0 \end{bmatrix}$$

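Suppose class 1 (score 2, counting from 0) is the true class. Its loss is:

$$L_i = -2 + \log\left(e^{-3} + e^{2} + e^{-1} + e^{0}\right) = 0.1755$$

If class 0 (score $-3$) were the true class instead, the loss would be much larger:

$$L_i = 3 + \log\left(e^{-3} + e^{2} + e^{-1} + e^{0}\right) = 5.1755$$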
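The two loss values can be reproduced with a short sketch of the simplified formula. The helper name `softmax_loss_single` is hypothetical, and the max-shift is added only for numerical safety.

```python
import numpy as np

def softmax_loss_single(scores, true_class):
    """Cross-entropy loss -s_y + log(sum_j exp(s_j)) for one sample."""
    shifted = scores - np.max(scores)   # shift for numerical stability
    return -shifted[true_class] + np.log(np.sum(np.exp(shifted)))

V = np.array([-3.0, 2.0, -1.0, 0.0])
loss_class1 = softmax_loss_single(V, 1)   # true class has the highest score
loss_class0 = softmax_loss_single(V, 0)   # true class has the lowest score
print(round(loss_class1, 4), round(loss_class0, 4))  # roughly 0.1755 and 5.1755
```

The loss is small when the true class already has the highest score and grows quickly when it has the lowest.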

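3. Softmax backward gradient

With the Softmax loss derived, the next step is to differentiate it with respect to the weights.

In the Softmax linear classifier, the linear output is: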

$$s_i = W x_i$$

where the subscript $i$ denotes the $i$-th sample.
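For reference, here is a sketch of the derivative that the implementations compute. It follows the code's convention that W has shape (D, C), so the score of class $j$ is $s_j = x_i \cdot W_{:,j}$. The symbol $p_j$ for the Softmax probability of class $j$ is notation introduced here, and the regularizer is assumed to be $\frac{\lambda}{2}\lVert W \rVert^2$, as in the code.

```latex
\begin{aligned}
L_i &= -s_{y_i} + \log\sum_{k=1}^{C} e^{s_k}, \qquad s_j = x_i \cdot W_{:,j} \\
\frac{\partial L_i}{\partial s_j} &= \frac{e^{s_j}}{\sum_{k=1}^{C} e^{s_k}} - \mathbb{1}[j = y_i] = p_j - \mathbb{1}[j = y_i] \\
\frac{\partial L_i}{\partial W_{:,j}} &= \left(p_j - \mathbb{1}[j = y_i]\right) x_i \\
\nabla_W L &= \frac{1}{N}\sum_{i=1}^{N} \frac{\partial L_i}{\partial W} + \lambda W
\end{aligned}
```

In words: the gradient for each class column is the input scaled by the predicted probability, with 1 subtracted for the correct class.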

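The gradient can be implemented in two ways: with nested for loops, or directly with matrix operations.

With nested for loops, the function that differentiates with respect to W is: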

    import numpy as np

    def softmax_loss_naive(W, X, y, reg):
        """
        Softmax loss function, naive implementation (with loops)

        Inputs have dimension D, there are C classes, and we operate on minibatches
        of N examples.

        Inputs:
        - W: A numpy array of shape (D, C) containing weights.
        - X: A numpy array of shape (N, D) containing a minibatch of data.
        - y: A numpy array of shape (N,) containing training labels; y[i] = c means
          that X[i] has label c, where 0 <= c < C.
        - reg: (float) regularization strength

        Returns a tuple of:
        - loss as single float
        - gradient with respect to weights W; an array of same shape as W
        """
        # Initialize the loss and gradient to zero.
        loss = 0.0
        dW = np.zeros_like(W)

        num_train = X.shape[0]
        num_classes = W.shape[1]
        for i in range(num_train):  # range replaces the Python 2 xrange
            scores = X[i, :].dot(W)
            scores_shift = scores - np.max(scores)  # avoid overflow in exp
            right_class = y[i]
            loss += -scores_shift[right_class] + np.log(np.sum(np.exp(scores_shift)))
            for j in range(num_classes):
                softmax_output = np.exp(scores_shift[j]) / np.sum(np.exp(scores_shift))
                if j == y[i]:
                    dW[:, j] += (-1 + softmax_output) * X[i, :]
                else:
                    dW[:, j] += softmax_output * X[i, :]

        # Average over the minibatch and add L2 regularization.
        loss /= num_train
        loss += 0.5 * reg * np.sum(W * W)
        dW /= num_train
        dW += reg * W

        return loss, dW

With matrix operations, the function that differentiates with respect to W is:

    def softmax_loss_vectorized(W, X, y, reg):
        """
        Softmax loss function, vectorized version.

        Inputs and outputs are the same as softmax_loss_naive.
        """
        # Initialize the loss and gradient to zero.
        loss = 0.0
        dW = np.zeros_like(W)

        num_train = X.shape[0]
        num_classes = W.shape[1]
        scores = X.dot(W)
        scores_shift = scores - np.max(scores, axis=1).reshape(-1, 1)
        softmax_output = np.exp(scores_shift) / np.sum(np.exp(scores_shift), axis=1).reshape(-1, 1)
        loss = -np.sum(np.log(softmax_output[range(num_train), list(y)]))
        loss /= num_train
        loss += 0.5 * reg * np.sum(W * W)

        dS = softmax_output.copy()
        dS[range(num_train), list(y)] += -1
        dW = (X.T).dot(dS)
        dW = dW / num_train + reg * W

        return loss, dW
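A common way to gain confidence in an analytic gradient like this one is to compare it with centered finite differences. The sketch below does that on small synthetic data rather than CIFAR-10. The function names `softmax_loss` and `numerical_gradient`, the data shapes, and the step size are illustrative assumptions, and the loss is a compact restatement of the vectorized formulas above.

```python
import numpy as np

def softmax_loss(W, X, y, reg):
    """Vectorized softmax loss and gradient, same formulas as above."""
    num_train = X.shape[0]
    scores = X.dot(W)
    scores -= np.max(scores, axis=1, keepdims=True)
    probs = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(num_train), y]))
    loss += 0.5 * reg * np.sum(W * W)
    dS = probs.copy()
    dS[np.arange(num_train), y] -= 1
    dW = X.T.dot(dS) / num_train + reg * W
    return loss, dW

def numerical_gradient(f, W, h=1e-5):
    """Centered finite differences of a scalar function f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                      # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)            # synthetic data, not CIFAR-10
X = rng.standard_normal((20, 5))
y = rng.integers(0, 3, size=20)
W = 0.01 * rng.standard_normal((5, 3))

_, analytic = softmax_loss(W, X, y, reg=0.1)
numeric = numerical_gradient(lambda w: softmax_loss(w, X, y, 0.1)[0], W)
max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)   # should be tiny if the analytic gradient is right
```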

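In practice the matrix version runs much faster than the nested loops, especially with many training samples. We compare the two implementations on about 5000 samples from the CIFAR-10 dataset: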

    import time

    tic = time.time()
    loss_naive, grad_naive = softmax_loss_naive(W, X_train, y_train, 0.000005)
    toc = time.time()
    print('naive loss: %e computed in %fs' % (loss_naive, toc - tic))

    tic = time.time()
    loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_train, y_train, 0.000005)
    toc = time.time()
    print('vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic))

    # Compare the two results; the gradient difference uses the Frobenius norm.
    grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
    print('Loss difference: %f' % np.abs(loss_naive - loss_vectorized))
    print('Gradient difference: %f' % grad_difference)

The output is:

naive loss: 2.362135e+00 computed in 14.680000s

vectorized loss: 2.362135e+00 computed in 0.242000s

Loss difference: 0.000000

Gradient difference: 0.000000

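In this example the matrix version is about 60 times faster than the nested loops (14.68 s versus 0.242 s). When writing machine learning models, prefer matrix operations over nested loops wherever possible to improve speed.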

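4. Softmax and SVM

The loss of the Softmax linear classifier is computed from relative probabilities and is also called the cross-entropy loss. The main difference between a linear SVM classifier and a Softmax linear classifier is the loss function. The SVM uses the hinge loss, which cares about the margin between the correct class score and the incorrect class scores (Δ = 1). Once the margin exceeds Δ, the SVM no longer cares how large it is and ignores the details. In Softmax, by contrast, the score of every class affects the loss. For example, take C = 3 classes and two samples with scores [10, -10, -10] and [10, 9, 9], where the true label is class 0. For the SVM, both losses $L_i$ are 0. For Softmax, the two losses are 0.00 and 0.55, a large difference.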
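The comparison above can be reproduced with a short sketch. The helper names `hinge_loss` and `softmax_loss` are chosen for this example, and the hinge loss uses the standard multiclass form with Δ = 1.

```python
import numpy as np

def hinge_loss(scores, true_class, delta=1.0):
    """Multiclass SVM (hinge) loss for one sample."""
    margins = np.maximum(0.0, scores - scores[true_class] + delta)
    margins[true_class] = 0.0           # the true class itself does not count
    return np.sum(margins)

def softmax_loss(scores, true_class):
    """Cross-entropy loss for one sample, with a max-shift for stability."""
    shifted = scores - np.max(scores)
    return -shifted[true_class] + np.log(np.sum(np.exp(shifted)))

examples = [np.array([10.0, -10.0, -10.0]), np.array([10.0, 9.0, 9.0])]
hinge_losses = [hinge_loss(s, 0) for s in examples]
softmax_losses = [softmax_loss(s, 0) for s in examples]
print(hinge_losses)                   # both are zero
print(np.round(softmax_losses, 2))    # roughly 0.00 and 0.55
```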

The linear SVM classifier is covered in my previous article:

CIFAR-10 image classification with a linear SVM (基于線性SVM的CIFAR-10圖像集分類)

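Next, consider how the regularization parameter λ affects Softmax. Regularization limits the size of the weights W to prevent overfitting, and the larger λ is, the stronger the constraint on W. For example, suppose the linear output of a 3-class problem is [1, -2, 0], which gives the Softmax output [0.7, 0.04, 0.26]. If the correct class is class 0, then 0.7 is clearly much larger than 0.04 and 0.26.

With regularization, W is constrained, so the linear outputs shrink proportionally, for example to [0.5, -1, 0]. The corresponding Softmax output is [0.55, 0.12, 0.33]. The gap in relative probability between the correct class and the incorrect classes has narrowed.

In other words, the larger λ is, the closer the Softmax outputs of the different classes become. A large λ effectively evens out the relative probabilities of the correct and incorrect classes. Note, however, that the relative ordering of the probabilities does not change, so it does not affect the optimization algorithm used for the loss either.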
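A small sketch of this effect, scaling the same scores down and watching the probabilities flatten while their order stays the same. The scale 0.1 is an extra illustrative value that does not appear in the text.

```python
import numpy as np

def softmax(v):
    exp_v = np.exp(v - np.max(v))   # max-shift for numerical stability
    return exp_v / np.sum(exp_v)

scores = np.array([1.0, -2.0, 0.0])
p_full = softmax(scores)           # roughly [0.7, 0.04, 0.26]
p_half = softmax(0.5 * scores)     # roughly [0.55, 0.12, 0.33]
p_tenth = softmax(0.1 * scores)    # extra illustrative scale, flatter still

for p in (p_full, p_half, p_tenth):
    print(np.round(p, 2), np.argsort(p))   # the ranking of classes is unchanged
```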

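5. Softmax in practice

Here a Softmax linear classifier is used to classify the CIFAR-10 image set.

Cross-validation is used to select the best learning rate and regularization strength: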

    # Use the validation set to tune hyperparameters (regularization strength and
    # learning rate). You should experiment with different ranges for the learning
    # rates and regularization strengths; if you are careful you should be able to
    # get a classification accuracy of over 0.35 on the validation set.
    results = {}
    best_val = -1
    best_softmax = None
    learning_rates = [1.4e-7, 1.5e-7, 1.6e-7]
    regularization_strengths = [8000.0, 9000.0, 10000.0, 11000.0, 18000.0, 19000.0, 20000.0, 21000.0]

    for lr in learning_rates:
        for reg in regularization_strengths:
            softmax = Softmax()
            loss = softmax.train(X_train, y_train, learning_rate=lr, reg=reg, num_iters=3000)
            y_train_pred = softmax.predict(X_train)
            training_accuracy = np.mean(y_train == y_train_pred)
            y_val_pred = softmax.predict(X_val)
            val_accuracy = np.mean(y_val == y_val_pred)
            if val_accuracy > best_val:
                best_val = val_accuracy
                best_softmax = softmax
            results[(lr, reg)] = training_accuracy, val_accuracy

    # Print out results.
    for lr, reg in sorted(results):
        train_accuracy, val_accuracy = results[(lr, reg)]
        print('lr %e reg %e train accuracy: %f val accuracy: %f' % (lr, reg, train_accuracy, val_accuracy))

    print('best validation accuracy achieved during cross-validation: %f' % best_val)

After training, the classifier is evaluated on the test images:

    # Evaluate the best softmax on the test set
    y_test_pred = best_softmax.predict(X_test)
    test_accuracy = np.mean(y_test == y_test_pred)
    print('softmax on raw pixels final test set accuracy: %f' % (test_accuracy, ))

softmax on raw pixels final test set accuracy: 0.386000
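The `Softmax` class used in the tuning code is not shown in this article. The following is a minimal, hypothetical stand-in that exposes the same `train` and `predict` calls. The name `TinySoftmax`, the full-batch gradient descent (the original may well use mini-batch SGD), the default hyperparameters, and the synthetic 2-D data are all assumptions made for illustration.

```python
import numpy as np

class TinySoftmax:
    """Hypothetical stand-in for the Softmax class used above."""

    def __init__(self):
        self.W = None

    def train(self, X, y, learning_rate=0.05, reg=1e-3, num_iters=300):
        """Full-batch gradient descent on the softmax loss; returns loss history."""
        num_train, dim = X.shape
        num_classes = int(np.max(y)) + 1
        rng = np.random.default_rng(0)
        self.W = 0.001 * rng.standard_normal((dim, num_classes))
        loss_history = []
        for _ in range(num_iters):
            scores = X.dot(self.W)
            scores -= np.max(scores, axis=1, keepdims=True)
            probs = np.exp(scores)
            probs /= np.sum(probs, axis=1, keepdims=True)
            loss = -np.mean(np.log(probs[np.arange(num_train), y]))
            loss += 0.5 * reg * np.sum(self.W * self.W)
            loss_history.append(loss)
            probs[np.arange(num_train), y] -= 1      # probs now holds dL/dscores
            dW = X.T.dot(probs) / num_train + reg * self.W
            self.W -= learning_rate * dW
        return loss_history

    def predict(self, X):
        """Return the index of the highest-scoring class for each row of X."""
        return np.argmax(X.dot(self.W), axis=1)

# Synthetic, well-separated 2-D blobs instead of CIFAR-10 images.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 4.0], [-4.0, -3.0], [4.0, -3.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.standard_normal((300, 2))
X = np.hstack([X, np.ones((300, 1))])   # append a bias column

clf = TinySoftmax()
history = clf.train(X, y)
accuracy = np.mean(clf.predict(X) == y)
print(history[0], history[-1], accuracy)
```

On data this easy the loss drops quickly and training accuracy is high. On raw CIFAR-10 pixels a linear Softmax classifier is far weaker, as the test accuracy above shows.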

The code for visualizing the weights W is:

    import matplotlib.pyplot as plt

    # Visualize the learned weights for each class
    w = best_softmax.W[:-1, :]  # strip out the bias
    w = w.reshape(32, 32, 3, 10)

    w_min, w_max = np.min(w), np.max(w)

    classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
    for i in range(10):
        plt.subplot(2, 5, i + 1)

        # Rescale the weights to be between 0 and 255
        wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
        plt.imshow(wimg.astype('uint8'))
        plt.axis('off')
        plt.title(classes[i])

After training, W has clearly picked up some simple color and outline features of each class.

The complete code for this article is available through the "Source code" link.

Source code

References:

http://cs231n.github.io/linear-classify/
