Deep Learning Method for Object Detection: R-CNN Explained

Introduction
CNNs have been used extensively to classify images, but detecting an object in an image and drawing a bounding box around it is a harder problem. To solve it, the R-CNN algorithm was published in 2014. After R-CNN, many variants such as Fast R-CNN, Faster R-CNN and Mask R-CNN improved on the object detection task. To understand these later variants it is important to have a clear understanding of R-CNN; once that is in place, the other variants follow easily.
This post assumes the reader is familiar with SVMs, image classification using CNNs, and linear regression.
Overview
The R-CNN paper[1] was published in 2014. It was the first paper to show that CNNs can deliver high performance in object detection. The algorithm detects objects in the following way:
Figure (from the original paper): the R-CNN pipeline extracts roughly 2000 region proposals, computes CNN features for each warped region, and classifies each region.

The rest of this post dives into the details of how the model is trained and how it predicts bounding boxes.
Calculating Region Proposals
Region Proposals | Source: Image by author

Region proposals are bounding boxes that may contain an object. Each is represented by a tuple of four numbers (x, y, h, w), where (x, y) are the coordinates of the centre of the bounding box and (h, w) are its height and width. These region proposals are computed by an algorithm called selective search[2]. For each image, approximately 2000 region proposals are extracted.
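Selective search is available in OpenCV's contrib modules. The sketch below (assuming opencv-contrib-python is installed; the get_region_proposals helper is a name chosen here for illustration) shows how the roughly 2000 proposals could be generated. Note that OpenCV returns boxes as (x, y, w, h) with (x, y) being the top-left corner rather than the centre.

```python
import cv2

def get_region_proposals(image_path, max_proposals=2000):
    """Generate region proposals with selective search (OpenCV contrib module)."""
    img = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()   # "quality" mode yields more proposals but is slower
    rects = ss.process()               # each rect is (x, y, w, h) with top-left (x, y)
    return rects[:max_proposals]
```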
Training the CNN Feature Extractor
Image source: Pinterest, edited by author

Pretrained network: To train the CNN for feature extraction, an architecture such as VGG-16 is initialized with weights pre-trained on ImageNet data. The 1000-class output layer is chopped off, so when a region proposal image (warped to 224x224) is passed through the network we get a 4096-dimensional feature vector, as shown in the image above. In this way, each region proposal is represented by a 4096-dimensional feature vector.
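As a rough sketch of such a feature extractor (not the authors' code), one could take a pre-trained VGG-16 from torchvision and drop its final 1000-way layer so that it emits the 4096-dimensional fc7 vector. The extract_feature helper and preprocessing pipeline below are illustrative assumptions; older torchvision versions use pretrained=True instead of the weights argument.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# VGG-16 pre-trained on ImageNet, with the 1000-class output layer removed
vgg = models.vgg16(weights="IMAGENET1K_V1")
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),   # warp the region proposal to the input size of VGG-16
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_feature(crop):
    """crop: HxWx3 uint8 region proposal image -> 4096-dimensional feature vector."""
    with torch.no_grad():
        x = preprocess(crop).unsqueeze(0)   # add the batch dimension
        return vgg(x).squeeze(0).numpy()
```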
The next step is to fine-tune the weights of the network with the region proposal images. To understand this, a new metric called intersection-over-union (IoU) will be introduced.
Intersection-Over-Union (IoU)
To measure the performance of a classification model we generally use metrics like accuracy, recall and precision. But how do we measure the performance of object detection? In object detection we have to evaluate two things: whether the predicted class is correct, and how well the predicted bounding box localizes the object.
The IoU score measures how close the predicted box is to the ground truth. It is the ratio of the area common to the ground truth and predicted boxes to the total area enclosed by both boxes.
IoU examples | Source: Gitbook

In the left-most image the predicted box is not close to the ground truth, so the IoU score is only 35%, while in the right-most image the predicted box overlaps the ground truth box almost completely, hence a very high value of 95%. The IoU value ranges from 0 to 1.
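A minimal IoU helper, assuming boxes are given as (x1, y1, x2, y2) corners (the centre/width/height format described earlier can be converted to corners first):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```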
Fine-tuning the network: To fine-tune the model, the 1000-class output layer is replaced with an (N+1)-class softmax layer and the rest of the model is kept unchanged. N is the number of distinct object classes to be detected, plus 1 for the background class.
Next, data is required for fine-tuning. Region proposals whose IoU with a ground truth box is greater than 50% are labelled as the positive class for that object, and the rest are labelled as background. In the image of the balls above, the region proposal with a 35% IoU score would be labelled as background while the other boxes would be labelled as ball. These images (region proposals) are warped (resized) to a size compatible with the CNN, which for VGG-16 is 224x224. Using these images, the weights of the network are fine-tuned.
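A sketch of this labelling step, reusing the iou helper above (the label_proposal name and the convention of index 0 for background are assumptions for illustration):

```python
def label_proposal(proposal_box, ground_truths, iou_threshold=0.5):
    """ground_truths: list of (class_index, box) pairs for the image.
    Returns the class index used for fine-tuning, with 0 reserved for background."""
    best_iou, best_cls = 0.0, 0
    for cls, gt_box in ground_truths:
        overlap = iou(proposal_box, gt_box)
        if overlap > best_iou:
            best_iou, best_cls = overlap, cls
    return best_cls if best_iou > iou_threshold else 0
```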
Training Class-Specific SVMs
Once the 4096-dimensional feature is obtained for each region proposal, the next task is to train a binary SVM for each class. For example, if the detection model is to detect three distinct objects (cat, dog and person), then three SVMs need to be trained, one for each class.
Data preparation: All the object proposals belonging to a particular class are grouped together. The manually labelled ground truth boxes for the class are treated as positive examples, while object proposals with IoU < 30% for that class are treated as negative examples. The same is done for each class, and N SVM models are trained to classify each region proposal into one of the N classes.
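A minimal sketch of the per-class SVM training using scikit-learn (the data layout in features_per_class is an assumption for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_class_svms(features_per_class):
    """features_per_class: dict mapping class name to (positive_feats, negative_feats),
    each an array of 4096-dimensional CNN features. Returns one binary SVM per class."""
    svms = {}
    for cls, (pos, neg) in features_per_class.items():
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        svms[cls] = LinearSVC(C=1.0).fit(X, y)
    return svms
```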
Bounding Box Regression
Image source: MyStart website, edited by author

The region proposal bounding box predicted by the selective search[2] algorithm might not capture the entire object. To fine-tune the predicted bounding box, bounding box regression is used.
Consider the ground truth box G and the region proposal P produced by the selective search[2] algorithm.
To make the prediction of G scale-invariant, the following transformation is applied so that the target for the regression is t.
Transformation equation:

tx = (Gx - Px) / Pw
ty = (Gy - Py) / Ph
tw = log(Gw / Pw)
th = log(Gh / Ph)

The input to the regression model is the feature from the last pooling layer of the CNN. We train 4 regression models per class, with t as the target and the last pooling layer feature from the CNN as the input, to learn the regression parameters w.
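As a sketch, the targets t can be computed directly from matched proposal/ground-truth pairs and fed to one regressor per coordinate. The original paper uses regularized least squares (ridge regression); the helper names and the alpha value below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def encode_targets(P, G):
    """P, G: boxes as (x_centre, y_centre, w, h). Returns the scale-invariant targets t."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return np.array([(Gx - Px) / Pw, (Gy - Py) / Ph,
                     np.log(Gw / Pw), np.log(Gh / Ph)])

def train_bbox_regressors(pool_features, proposals, ground_truths):
    """Train 4 regressors (one per coordinate of t) for a single class.
    pool_features: (M, D) last-pooling-layer features for M matched proposals."""
    t = np.array([encode_targets(P, G) for P, G in zip(proposals, ground_truths)])
    return [Ridge(alpha=1000.0).fit(pool_features, t[:, k]) for k in range(4)]
```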
Regression equation:

d*(P) = w* · φ(P)

Here * is a placeholder for (x, y, w, h) and φ(P) is the last pooling layer feature corresponding to proposal P. So to predict the ground truth G, we use the regression equation to compute t from the region proposal P, and then substitute t and P into the (inverted) transformation equation to obtain G.
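Inverting the transformation gives the predicted box; a small helper for this (the function name is illustrative) could look like:

```python
import numpy as np

def decode_box(P, d):
    """Apply predicted offsets d = (dx, dy, dw, dh) to a proposal P = (Px, Py, Pw, Ph)
    to estimate the ground-truth box G = (Gx, Gy, Gw, Gh)."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return np.array([Pw * dx + Px, Ph * dy + Py,
                     Pw * np.exp(dw), Ph * np.exp(dh)])
```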
Prediction
Once the different parts of R-CNN are trained, the next step is to perform object detection.
Prediction pipeline | Source: Image by author

1. For a test image, approximately 2000 region proposals are extracted using selective search.

2. Each region proposal is warped to 224x224 and passed through the fine-tuned CNN to obtain its 4096-dimensional feature vector.

3. Each region proposal is classified using the SVM for each class. For N classes, the SVM weights (each 4096-dimensional) are stacked into a matrix and multiplied with the feature matrix. This yields a matrix that assigns each region proposal a score for every class.
4. Each proposal is assigned the class that receives the maximum score, so all 2000 region proposals (bounding boxes) in the image end up with a class label (a sketch of steps 3 and 4 is shown below).
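A sketch of steps 3 and 4 with NumPy (the function and argument names are assumptions for illustration):

```python
import numpy as np

def classify_proposals(features, svm_weights, svm_biases):
    """features: (M, 4096) proposal features; svm_weights: (4096, N) stacked per-class
    SVM weight vectors; svm_biases: (N,). Returns per-class scores and the argmax label."""
    scores = features @ svm_weights + svm_biases   # (M, N) score matrix (step 3)
    labels = scores.argmax(axis=1)                 # step 4: highest-scoring class per proposal
    return scores, labels
```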
5. Many of these bounding boxes are redundant and overlap heavily, and they need to be removed. To accomplish this, the non-maximum suppression algorithm is used.
Non-maximum suppression algorithm:
Source: Non-maximum suppression blog

Non-maximum suppression is a greedy algorithm that works on one class at a time. For a particular class, it picks the box with the maximum SVM score, then computes the IoU of that box with all other boxes belonging to the same class. Boxes with an IoU score greater than 70% are removed; in other words, boxes with very high overlap are discarded. The next highest-scoring box is then chosen, and so on, until all overlapping boxes for that class have been processed. This is done for every class to obtain the result shown above.
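A minimal sketch of per-class non-maximum suppression, reusing the iou helper defined earlier (the 70% threshold follows the description above):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.7):
    """boxes: (M, 4) corner boxes for one class; scores: (M,) SVM scores.
    Returns the indices of the boxes kept after suppression."""
    order = np.argsort(scores)[::-1]              # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = np.array([iou(boxes[best], boxes[i]) for i in rest])
        order = rest[overlaps <= iou_threshold]   # drop boxes overlapping more than 70%
    return keep
```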
6. Once the labelled bounding boxes are obtained, the next task is to fine-tune their locations using regression.
Fine-tuning region proposals through regression | Source: Image by author

During training, 4 regression models were trained for each class. So for a particular bounding box, the region proposal image is passed through the CNN to obtain the feature φ(P), which is fed to the regression model of the predicted class. The regression model outputs the scale-invariant offsets (dx(P), dy(P), dw(P), dh(P)). These are combined with the region proposal coordinates (Px, Py, Pw, Ph) to obtain the adjusted final coordinates (Gx, Gy, Gw, Gh) using the formula below:

Gx = Pw · dx(P) + Px
Gy = Ph · dy(P) + Py
Gw = Pw · exp(dw(P))
Gh = Ph · exp(dh(P))
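A sketch of step 6, reusing decode_box from above (the regressors dict, with one list of four per-coordinate models per class as trained earlier, is an illustrative assumption):

```python
def refine_boxes(proposal_boxes, pool_features, labels, regressors):
    """proposal_boxes: (M, 4) boxes as (x_centre, y_centre, w, h); pool_features: (M, D);
    labels: (M,) predicted class per box; regressors: {class: [4 regressors]}."""
    refined = []
    for box, feat, cls in zip(proposal_boxes, pool_features, labels):
        d = [reg.predict(feat.reshape(1, -1))[0] for reg in regressors[cls]]
        refined.append(decode_box(box, d))
    return refined
```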
Results from the Original Paper
Results table | Source: original paper

Conclusion
Although R-CNN is good at detecting objects, it has shortcomings: every one of the roughly 2000 region proposals must be passed through the CNN independently, which makes both training and inference slow, and the pipeline is trained in multiple separate stages (CNN fine-tuning, the SVMs, and the bounding box regressors).
These shortcomings are addressed in the later improvements of R-CNN: Fast R-CNN, Faster R-CNN and Mask R-CNN. A good understanding of R-CNN makes it easy and intuitive to understand these other variants.
Translated from: https://towardsdatascience.com/deep-learning-method-for-object-detection-r-cnn-explained-ecdadd751d22