Optimization Algorithms in Deep Learning: MBGD

Published: 2023/11/27


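Gradient descent was introduced earlier in https://blog.csdn.net/fengbingchun/article/details/75351323. It has three common forms: BGD, SGD, and MBGD. They differ in how much data is used to compute the gradient of the objective function.

Most deep learning algorithms involve optimization of some kind. Optimization is the task of changing x to minimize or maximize some function f(x). Most optimization problems are phrased in terms of minimizing f(x). The function to be minimized or maximized is called the objective function or criterion. When it is being minimized, it is also called the cost function, loss function, or error function.

Gradient descent is an optimization technique commonly used in deep learning. The gradient generalizes the slope of a function: it measures how much the output changes in response to a change in each input. Mathematically, the gradient of a function is the vector of its partial derivatives with respect to its input parameters. The greater the gradient, the steeper the slope. Starting from initial values, gradient descent runs iteratively to find parameter values that make the given cost function as small as possible.

Gradient descent is typically used to find the weights or coefficients of learning algorithms such as logistic regression. It works by having the model make predictions on training data and using the error on the predictions to update the model in such a way as to reduce the error.

The goal of the algorithm is to find the model parameters (e.g. coefficients or weights) that minimize the model's error on the training dataset. It does this by changing the model so that it moves down the gradient, or slope, of the error toward a minimum error value. This is where the name "gradient descent" comes from.

Gradient descent is a very popular optimization algorithm in deep learning. Its goal is to find the global minimum of the objective (cost) function. This is only guaranteed when the objective function is convex, which loosely means the function is bowl-shaped. For a non-convex function, gradient descent may instead settle in a nearby minimum that is not the global one; such a point is called a local minimum.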

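Gradient descent is a first-order optimization algorithm: when updating the parameters it only considers the first derivative of the function. In each iteration we step in the direction of steepest descent, that is, we update the parameters in the direction opposite to the gradient of the objective function.

Illustration: assume there is only a weight and no bias. If the slope at a particular value of the weight w is > 0, we are to the right of the optimum w*; the update is then negative and w moves toward w*. If the slope at w is < 0, the update is positive and increases w so that it converges toward w*. The procedure is repeated until the cost function converges. (The original illustration is from https://www.machinelearningman.com.)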

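A derivation of the formulas for gradient descent applied to binary classification is given in https://blog.csdn.net/fengbingchun/article/details/79370310.

MBGD (Mini-Batch Gradient Descent) splits the training dataset into small batches, each of which is used to compute the model error and update the model parameters. It seeks a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent, and it is the most common implementation of gradient descent in deep learning.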

Gradient descent is a way to minimize an objective function J(θ), where θ denotes the model parameters. The formulation follows https://arxiv.org/pdf/1609.04747.pdf.
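As a sketch, the update rules for the three variants can be written as below, in the notation of the cited overview paper. The symbols η (learning rate), (x^{(i)}, y^{(i)}) (one training example and its label), and n (mini-batch size) are not defined in the text above and are introduced here only for this summary.

```latex
\begin{aligned}
\text{BGD:}\quad  & \theta = \theta - \eta \, \nabla_\theta J(\theta) \\
\text{SGD:}\quad  & \theta = \theta - \eta \, \nabla_\theta J(\theta;\, x^{(i)};\, y^{(i)}) \\
\text{MBGD:}\quad & \theta = \theta - \eta \, \nabla_\theta J(\theta;\, x^{(i:i+n)};\, y^{(i:i+n)})
\end{aligned}
```

BGD uses the whole training set for every update, SGD uses a single example, and MBGD uses a batch of n examples.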


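When SGD is mentioned, what is actually meant is sometimes MBGD.

The mini-batch size is generally driven by the following factors:

(1). Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.

(2). Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time needed to process a mini-batch.

(3). If all examples in the batch are processed in parallel (as is typically the case), memory consumption scales with the batch size. For many hardware setups this is the limiting factor on batch size.

(4). Some kinds of hardware achieve better runtime with specific array sizes. Especially when using GPUs, batch sizes that are a power of 2 commonly give better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being tried for large models.

(5). Small batches can have a regularizing effect, perhaps because of the noise they add to the learning process. Generalization error is often best for a batch size of 1. Because of the high variance of the gradient estimate, training with such small batches requires a small learning rate to remain stable. The total runtime can then become very large, since both the reduced learning rate and the larger number of batches needed to traverse the whole training set lead to more steps.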
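The "less than linear returns" in item (1) follow from a standard statistical fact, sketched here under the assumption that the per-example gradients in a batch are independent draws with standard deviation σ. The symbols σ and n are introduced for this note only.

```latex
\mathrm{SE}(\hat{g}_n) = \frac{\sigma}{\sqrt{n}},
\qquad
\frac{\mathrm{SE}(\hat{g}_{100})}{\mathrm{SE}(\hat{g}_{10000})}
  = \sqrt{\frac{10000}{100}} = 10
```

Going from a batch of 100 examples to a batch of 10000 costs 100 times more computation per update but reduces the standard error of the gradient estimate only by a factor of 10.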

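It is also important that the mini-batches are selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires those samples to be independent. We also want two consecutive gradient estimates to be independent of each other, so two consecutive mini-batches of samples should be independent of each other as well. Many real datasets are naturally ordered in a way that makes consecutive samples highly correlated. In practice it is usually sufficient to shuffle the samples once and then store them in that order. This fixes the set of mini-batches of consecutive samples used by all models trained afterwards, and each individual model reuses this order on every pass over the training data.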

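Advantages:

(1). The model is updated more frequently than with batch gradient descent, which allows a more robust convergence and helps avoid local minima.

(2). Batched updates are computationally more efficient than stochastic gradient descent.

(3). Batching means that not all training data has to be held in memory at once, and it also allows efficient algorithm implementations.

Disadvantages:

(1). Mini-batch training requires configuring an additional "mini-batch size" hyperparameter for the learning algorithm.

(2). As in batch gradient descent, error information must be accumulated over the training samples of each mini-batch.

The "batch size" is typically a power of 2 such as 32, 64, 128, or 256. It acts as a slider on the learning process. Smaller values give a learning process that converges quickly, at the cost of noise in the training process. Larger values give a learning process that converges slowly but with accurate estimates of the error gradient.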

The material above mainly draws on:

1. https://arxiv.org/pdf/1609.04747.pdf

2. https://machinelearningmastery.com/

3. https://www.machinelearningman.com

The test code below is adapted from the logistic regression implementation in https://blog.csdn.net/fengbingchun/article/details/79346691:

logistic_regression2.hpp:

#ifndef FBC_SRC_NN_LOGISTIC_REGRESSION2_HPP_
#define FBC_SRC_NN_LOGISTIC_REGRESSION2_HPP_

#include <cstdlib>
#include <ctime>
#include <vector>
#include <string>
#include <memory>

namespace ANN {

enum class ActivationFunction {
    Sigmoid // logistic sigmoid function
};

enum class LossFunction {
    MSE // Mean Square Error
};

enum class Optimization {
    BGD, // Batch Gradient Descent
    SGD, // Stochastic Gradient Descent
    MBGD // Mini-batch Gradient Descent
};

struct Database {
    Database() = default;
    std::vector<std::vector<float>> samples; // training set
    std::vector<int> labels; // ground truth labels
};

class LogisticRegression2 { // two categories
public:
    LogisticRegression2(Optimization optim = Optimization::BGD, int batch_size = 1) : optim_(optim), batch_size_(batch_size) {}
    int init(std::unique_ptr<Database> data, int feature_length, float learning_rate = 0.00001, int epochs = 1000);
    int train(const std::string& model);
    int load_model(const std::string& model);
    float predict(const float* data, int feature_length) const; // y = 1/(1+exp(-(wx+b)))
    void set_error(float error) { error_ = error; }

private:
    int store_model(const std::string& model) const;
    float calculate_z(const std::vector<float>& feature) const;  // z(i)=w^T*x(i)+b
    float calculate_cost_function() const;
    static int generate_random(int i) { return std::rand()%i; }
    float calculate_activation_function(float value) const;
    float calculate_loss_function() const;
    float calculate_loss_function_derivative() const;
    float calculate_loss_function_derivative(float predictive_value, float true_value) const;
    void calculate_gradient_descent(int start = 0, int end = 0);

    std::unique_ptr<Database> data_; // train data(images, labels)
    std::vector<int> random_shuffle_; // shuffle the training data at every epoch
    std::vector<float> o_; // predict value
    int epochs_ = 100; // epochs
    int m_ = 0; // train samples num
    int feature_length_ = 0; // weights length
    float alpha_ = 0.00001; // learning rate
    std::vector<float> w_; // weights
    float b_ = 0.; // threshold
    float error_ = 0.00001;
    int batch_size_ = 1;
    ActivationFunction activation_func_ = ActivationFunction::Sigmoid;
    LossFunction loss_func_ = LossFunction::MSE;
    Optimization optim_ = Optimization::BGD;
}; // class LogisticRegression2

} // namespace ANN

#endif // FBC_SRC_NN_LOGISTIC_REGRESSION2_HPP_

logistic_regression2.cpp:

#include "logistic_regression2.hpp"
#include <cstdio>
#include <fstream>
#include <algorithm>
#include <random>
#include <cmath>
#include "common.hpp"

namespace ANN {

int LogisticRegression2::init(std::unique_ptr<Database> data, int feature_length, float learning_rate, int epochs)
{
    CHECK(data->samples.size() == data->labels.size());
    m_ = data->samples.size();
    if (m_ < 2) {
        fprintf(stderr, "logistic regression train samples num is too little: %d\n", m_);
        return -1;
    }
    if (learning_rate <= 0) {
        fprintf(stderr, "learning rate must be greater 0: %f\n", learning_rate);
        return -1;
    }
    if (epochs < 1) {
        fprintf(stderr, "number of epochs cannot be zero or a negative number: %d\n", epochs);
        return -1;
    }

    alpha_ = learning_rate;
    epochs_ = epochs;
    feature_length_ = feature_length;
    data_ = std::move(data);
    o_.resize(m_);

    return 0;
}

int LogisticRegression2::train(const std::string& model)
{
    w_.resize(feature_length_, 0.);
    generator_real_random_number(w_.data(), feature_length_, -0.01f, 0.01f, true);
    generator_real_random_number(&b_, 1, -0.01f, 0.01f);

    if (optim_ == Optimization::BGD) {
        for (int iter = 0; iter < epochs_; ++iter) {
            calculate_gradient_descent();
            auto cost_value = calculate_cost_function();
            fprintf(stdout, "epochs: %d, cost function: %f\n", iter, cost_value);
            if (cost_value < error_) break;
        }
    } else {
        random_shuffle_.resize(data_->samples.size(), 0);
        for (int i = 0; i < m_; ++i)
            random_shuffle_[i] = i;

        // std::random_shuffle was removed in C++17, and re-seeding std::srand with the
        // current time every epoch can repeat a shuffle within the same second;
        // use std::shuffle with an engine that is seeded once instead
        std::mt19937 rng(std::random_device{}());

        float cost_value = 0.;
        for (int iter = 0; iter < epochs_; ++iter) {
            std::shuffle(random_shuffle_.begin(), random_shuffle_.end(), rng);

            int loop = (m_ + batch_size_ - 1) / batch_size_; // number of mini-batches per epoch
            for (int i = 0; i < loop; ++i) {
                int start = i * batch_size_;
                int end = start + batch_size_ > m_ ? m_ : start + batch_size_;
                calculate_gradient_descent(start, end);

                // re-evaluate the whole training set to get the current cost
                for (int k = 0; k < m_; ++k)
                    o_[k] = calculate_activation_function(calculate_z(data_->samples[k]));

                cost_value = calculate_cost_function();
                fprintf(stdout, "epochs: %d, loop: %d, cost function: %f\n", iter, i, cost_value);
                if (cost_value < error_) break;
            }
            if (cost_value < error_) break;
        }
    }

    CHECK(store_model(model) == 0);
    return 0;
}

int LogisticRegression2::load_model(const std::string& model)
{
    std::ifstream file;
    file.open(model.c_str(), std::ios::binary);
    if (!file.is_open()) {
        fprintf(stderr, "open file fail: %s\n", model.c_str());
        return -1;
    }

    int length{ 0 };
    file.read((char*)&length, sizeof(length));
    w_.resize(length);
    feature_length_ = length;
    file.read((char*)w_.data(), sizeof(float)*w_.size());
    file.read((char*)&b_, sizeof(float));

    file.close();
    return 0;
}

float LogisticRegression2::predict(const float* data, int feature_length) const
{
    CHECK(feature_length == feature_length_);

    float value{0.};
    for (int t = 0; t < feature_length_; ++t) {
        value += data[t] * w_[t];
    }
    value += b_;

    return (calculate_activation_function(value));
}

int LogisticRegression2::store_model(const std::string& model) const
{
    std::ofstream file;
    file.open(model.c_str(), std::ios::binary);
    if (!file.is_open()) {
        fprintf(stderr, "open file fail: %s\n", model.c_str());
        return -1;
    }

    int length = w_.size();
    file.write((char*)&length, sizeof(length));
    file.write((char*)w_.data(), sizeof(float) * w_.size());
    file.write((char*)&b_, sizeof(float));

    file.close();
    return 0;
}

float LogisticRegression2::calculate_z(const std::vector<float>& feature) const
{
    float z{0.};
    for (int i = 0; i < feature_length_; ++i) {
        z += w_[i] * feature[i];
    }
    z += b_;

    return z;
}

float LogisticRegression2::calculate_cost_function() const
{
    /*// J+=-1/m([y(i)*loga(i)+(1-y(i))*log(1-a(i))])
    // Note: log0 is not defined
    float J{0.};
    for (int i = 0; i < m_; ++i)
        J += -(data_->labels[i] * std::log(o_[i]) + (1 - data_->labels[i]) * std::log(1 - o_[i]) );
    return J/m_;*/

    float J{0.};
    for (int i = 0; i < m_; ++i)
        J += 1./2*std::pow(data_->labels[i] - o_[i], 2);
    return J/m_;
}

float LogisticRegression2::calculate_activation_function(float value) const
{
    switch (activation_func_) {
        case ActivationFunction::Sigmoid:
        default: // Sigmoid
            return (1. / (1. + std::exp(-value))); // y = 1/(1+exp(-value))
    }
}

float LogisticRegression2::calculate_loss_function() const
{
    switch (loss_func_) {
        case LossFunction::MSE:
        default: // MSE
            float value = 0.;
            for (int i = 0; i < m_; ++i) {
                value += 1/2.*std::pow(data_->labels[i] - o_[i], 2);
            }
            return value/m_;
    }
}

float LogisticRegression2::calculate_loss_function_derivative() const
{
    switch (loss_func_) {
        case LossFunction::MSE:
        default: // MSE
            float value = 0.;
            for (int i = 0; i < m_; ++i) {
                value += o_[i] - data_->labels[i];
            }
            return value/m_;
    }
}

float LogisticRegression2::calculate_loss_function_derivative(float predictive_value, float true_value) const
{
    // Note: (a - y) is the gradient with respect to z of the cross-entropy loss with a
    // sigmoid output; the exact MSE gradient carries an extra a*(1-a) factor
    switch (loss_func_) {
        case LossFunction::MSE:
        default: // MSE
            return (predictive_value - true_value);
    }
}

void LogisticRegression2::calculate_gradient_descent(int start, int end)
{
    float db = 0.;
    std::vector<float> dw(feature_length_, 0.);

    switch (optim_) {
        case Optimization::SGD:
        case Optimization::MBGD: {
            // accumulate the gradient over samples [start, end) of the shuffled order
            int len = end - start;
            std::vector<float> z(len, 0), dz(len, 0);
            for (int i = start, x = 0; i < end; ++i, ++x) {
                z[x] = calculate_z(data_->samples[random_shuffle_[i]]);
                dz[x] = calculate_loss_function_derivative(calculate_activation_function(z[x]), data_->labels[random_shuffle_[i]]);

                for (int j = 0; j < feature_length_; ++j) {
                    dw[j] += data_->samples[random_shuffle_[i]][j] * dz[x]; // dw(i)+=x(i)(j)*dz(i)
                }
                db += dz[x]; // db+=dz(i)
            }

            // average over the mini-batch and update once
            for (int j = 0; j < feature_length_; ++j) {
                dw[j] /= len;
                w_[j] -= alpha_ * dw[j];
            }
            b_ -= alpha_*(db/len);
        }
            break;
        case Optimization::BGD:
        default: // BGD
            std::vector<float> z(m_, 0), dz(m_, 0);
            for (int i = 0; i < m_; ++i) {
                z[i] = calculate_z(data_->samples[i]);
                o_[i] = calculate_activation_function(z[i]);
                dz[i] = calculate_loss_function_derivative(o_[i], data_->labels[i]);

                for (int j = 0; j < feature_length_; ++j) {
                    dw[j] += data_->samples[i][j] * dz[i]; // dw(i)+=x(i)(j)*dz(i)
                }
                db += dz[i]; // db+=dz(i)
            }

            for (int j = 0; j < feature_length_; ++j) {
                dw[j] /= m_;
                w_[j] -= alpha_ * dw[j];
            }
            b_ -= alpha_*(db/m_);
    }
}

} // namespace ANN

test_logistic_regression2_gradient_descent: uses MNIST as the dataset and keeps only the digits 0 and 1. Training uses 5000 images per class from the training set, and prediction uses 900 images per class from the test set.

int test_logistic_regression2_gradient_descent()
{
    fprintf(stdout,"Warning: first generate test images: execute demo/DatasetToImage/DatasetToImage: MNISTtoImage\n");

    fprintf(stdout, "load train images ...\n");
#ifdef _MSC_VER
    const std::vector<std::string> image_path{ "E:/GitCode/NN_Test/data/tmp/MNIST/train_images/", "E:/GitCode/NN_Test/data/tmp/MNIST/test_images/"};
    const std::string model{ "E:/GitCode/NN_Test/data/logistic_regression2.model" };
#else
    const std::vector<std::string> image_path{ "data/tmp/MNIST/train_images/", "data/tmp/MNIST/test_images/"};
    const std::string model{ "data/logistic_regression2.model" };
#endif

    const int image_size = 28*28;
    const int samples_single_class_num = 5000;
    auto data1 = std::make_unique<ANN::Database>();
    data1->samples.resize(samples_single_class_num*2);
    data1->labels.resize(samples_single_class_num*2);
    if (read_images(image_path[0], samples_single_class_num, image_size, data1) == -1) return -1;

    fprintf(stdout, "start train ...\n");
    auto start = std::chrono::steady_clock::now();
    //ANN::LogisticRegression2 lr(ANN::Optimization::BGD, samples_single_class_num * 2); // Batch Gradient Descent, epochs = 10000, correct rate: 0.997778
    //ANN::LogisticRegression2 lr(ANN::Optimization::SGD, 1); // Stochastic Gradient Descent,  epochs = 5, correct rate: 0.998889
    ANN::LogisticRegression2 lr(ANN::Optimization::MBGD, 128); // Mini-batch Gradient Descent,  epochs = 100, correct rate: 0.997778
    lr.set_error(0.0002);
    int ret = lr.init(std::move(data1), image_size, 0.00001, 5);
    if (ret != 0) {
        fprintf(stderr, "logistic regression init fail: %d\n", ret);
        return -1;
    }

    ret = lr.train(model);
    if (ret != 0) {
        fprintf(stderr, "logistic regression train fail: %d\n", ret);
        return -1;
    }
    auto end = std::chrono::steady_clock::now();
    // count() returns a 64-bit integer type on common platforms, so print it as long long
    fprintf(stdout, "train elapsed time: %lld seconds\n", static_cast<long long>(std::chrono::duration_cast<std::chrono::seconds>(end - start).count()));

    fprintf(stdout, "start predict ...\n");
    const int test_single_class_num = 900;
    const std::vector<std::string> prefix_name {"0_", "1_"};
    ANN::LogisticRegression2 lr2;
    lr2.load_model(model);
    int count = 0;
    for (int i = 1; i <= test_single_class_num; ++i) {
        for (const auto& prefix : prefix_name) {
            // build the zero-padded file name, e.g. 0_00001.jpg
            std::string name = std::to_string(i);
            if (i < 10) {
                name = "0000" + name;
            } else if (i < 100) {
                name = "000" + name;
            } else if (i < 1000) {
                name = "00" + name;
            }

            name = image_path[1] + prefix + name + ".jpg";
            cv::Mat mat = cv::imread(name, 0);
            if (mat.empty()) {
                fprintf(stderr, "read image fail: %s\n", name.c_str());
                return -1;
            }
            if (mat.cols * mat.rows != image_size || mat.channels() != 1) {
                fprintf(stderr, "image size fail: width: %d, height: %d, channels: %d\n", mat.cols, mat.rows, mat.channels());
                return -1;
            }

            mat.convertTo(mat, CV_32F);
            float probability = lr2.predict((float*)mat.data, image_size);
            int label = prefix == "0_" ? 0 : 1;
            if ((probability > 0.5 &&  label == 1) || (probability < 0.5 && label == 0)) ++count;
        }
    }

    float correct_rate = count / (test_single_class_num * 2.);
    fprintf(stdout, "correct rate: %f\n", correct_rate);

    return 0;
}

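Results: during training, the MBGD cost-function error does not decrease steadily as it does with BGD. It fluctuates occasionally but still decreases overall. With the same error threshold, MBGD takes much less training time than SGD. The prediction accuracy is 99.83%.

GitHub: https://github.com/fengbingchun/NN_Test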
