Segmentation and Object Detection - Part 1
FAU Lecture Notes on Deep Learning
These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video with matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created largely automatically with deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!
Navigation
Previous Lecture / Watch this Video / Top Level / Next Lecture
Welcome back to deep learning! So today, we want to discuss a couple more application-oriented topics. We want to look into image processing and, in particular, into segmentation and object detection.
Semantic segmentation in road scenes. Image created using gifify. Source: YouTube

So let’s see what I have here for you. Here’s the outline of the next five videos: We will first introduce the topic, of course. Then, we’ll talk about segmentation. So, we’ll motivate it and discuss where the problems with segmentation are. Then, we want to go into several techniques that allow you to do good image segmentation. You will see that there are actually very interesting methods that are quite powerful and can be applied to a wide variety of tasks. After that, we want to continue and talk about object detection, which is a related topic. With object detection, we then want to look into different methods of how you can find objects in scenes and how you can actually identify which object belongs to which class. So, let’s start with the introduction.
Image under CC BY 4.0 from the Deep Learning Lecture.

So far, we looked into image classification. Essentially, the problem is that you simply get the classification “cat”, but you cannot extract any information about the spatial relation of objects to each other. An improvement is image segmentation. In semantic segmentation, you try to find the class of every pixel in the image. So here, you can see in red that we marked all of the pixels that belong to the class “cat”. Now, if we want to talk about object detection, we have to look in a slightly different direction. Here, the idea is essentially to identify the area where the object of interest is. You can already see that if we use, for example, the methods that we learned in visualization, we will probably not be very happy, because we would simply identify pixels that are related to that class. So, this has to be done in a different way, because we are actually interested in finding different instances. We want to be able to distinguish different cats in a single image and then find bounding boxes. This is essentially the task of object detection and instance recognition. Lastly, when we have mastered those two ideas, we also want to talk about the problem of instance segmentation. Here, you not only find all pixels that show cats, but you actually want to differentiate different cats and assign the segmentations to different instances. This is instance segmentation, which will be covered in the last video on these topics.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, let’s go ahead and talk a bit about ideas towards image segmentation. In image segmentation, we want to find exactly which pixels belong to a specific class. We essentially want to delineate the boundaries of meaningful objects. So, all of the regions within a boundary should have the same label, and they belong to the same category. Each pixel gets a semantic class, and we want to generate pixel-wise dense labels. These concepts are, of course, shown here on images, but technically you can also do similar things on sound when you, for example, look into spectrograms. The idea in images is that we want to produce the right-hand image from the left-hand image. You can see already that we find the region occupied by the airplane here, and we find its boundary.
Semantic edge segmentation finds the boundaries of semantic classes directly. Image created using gifify. Source: YouTube

Of course, this is a simpler task. You can also think about more complex scenes, like this example from autonomous driving. Here, we are, for example, interested in where the street is, where persons are, where pedestrians are, where vehicles are, and so on. We want to mark them in this complex scene.
Image under CC BY 4.0 from the Deep Learning Lecture.

Similar tasks can also be found in medical imaging, for example, if you’re interested in the identification of different organs, i.e., where the liver is, where the vessels are, or where cells are. There are, of course, many, many more applications that we won’t talk about here: aerial imaging, if you process satellite images, autonomous robotics, and also image editing, where these kinds of techniques have very useful properties.
Image under CC BY 4.0 from the Deep Learning Lecture.

Of course, if we want to do so, we need to talk a bit about evaluation metrics. We have to be able to measure the usefulness of a segmentation algorithm somehow. This depends on several factors like the execution time, the memory footprint, and the quality. The quality of a method needs to be assessed with different metrics. The main problem here is that, very often, the classes are not equally distributed, and we have to account for that somehow. We can also expand the set of classes with a background class. Then, we can determine, for example, p_ij, the number of pixels of class i that are inferred to belong to class j. For example, p_ii would then represent the number of true positives of class i.
Image under CC BY 4.0 from the Deep Learning Lecture.

This then brings us to several metrics: for example, the pixel accuracy, which is the ratio between the number of correctly classified pixels and the total number of pixels, and the mean pixel accuracy, which is the average ratio of correctly classified pixels on a per-class basis.
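To make these definitions concrete, here is a minimal NumPy sketch (not code from the lecture; the toy label arrays are made up for illustration) that computes both metrics for a two-class prediction:

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Ratio of correctly classified pixels to the total number of pixels."""
    return float(np.mean(pred == gt))

def mean_pixel_accuracy(pred, gt):
    """Average per-class ratio of correctly classified pixels."""
    accs = []
    for c in np.unique(gt):
        mask = gt == c                     # all pixels of class c in the ground truth
        accs.append(np.mean(pred[mask] == c))
    return float(np.mean(accs))

gt = np.array([[0, 0, 1],
               [0, 1, 1],
               [0, 1, 1]])
pred = np.array([[0, 0, 1],
                 [0, 0, 1],                # one class-1 pixel misclassified as 0
                 [0, 1, 1]])

print(pixel_accuracy(pred, gt))       # 8/9 = 0.888...
print(mean_pixel_accuracy(pred, gt))  # (4/4 + 4/5) / 2 = 0.9
```

Note how the two metrics diverge: the rarer the class, the more a per-class error is amplified in the mean pixel accuracy, which is exactly the imbalance issue mentioned above.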
Image under CC BY 4.0 from the Deep Learning Lecture.

More common for evaluating segmentations are measures like the mean intersection over union, which averages, over all classes, the ratio between the intersection and the union of the predicted and ground-truth pixel sets, and the frequency-weighted intersection over union, a balanced version that also incorporates the class frequency into the measure. With these measures, we can then figure out what a good segmentation is.
Image under CC BY 4.0 from the Deep Learning Lecture.

Then, we go ahead and, of course, follow the idea of using fully convolutional networks for segmentation. So far, when we used convolutional networks, we essentially had a high-dimensional input, the image, and then we used the CNN for feature extraction. The outputs were essentially distributions over the different classes. Thus, we had essentially a vector encoding the class probabilities.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, you could also transform it into a fully convolutional neural network, where you then essentially parse the entire image and transform it into a heat map. We’ve seen similar ideas already in visualization when we talked about the different activations. We could essentially also follow this line of interpretation, and then we would get a very low-dimensional, very coarse heat map for the class “tabby cat”. This is, of course, one way to go, but you will not be able to identify all the pixels that belong to that specific class in great detail. So, what you have to do is somehow get the segmentation or class information back to the original image resolution.
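The conversion of a classifier into a fully convolutional network can be illustrated with a small NumPy sketch (not from the lecture; the weights and feature map are random placeholders): a fully connected head trained on a fixed-size feature map is reused as a convolution kernel and slid over a larger input, which yields exactly the coarse class heat map described above.

```python
import numpy as np

rng = np.random.default_rng(0)
# A classifier head trained on flattened 4x4 feature maps, for 3 classes.
W = rng.normal(size=(3, 16))

def fc_head(feat4x4):
    """The original fully connected classifier on a 4x4 feature map."""
    return W @ feat4x4.ravel()

def conv_head(feat, k=4):
    """The same weights applied as a k x k convolution: at every spatial
    position we obtain class scores, i.e., a coarse heat map per class."""
    h, w = feat.shape
    out = np.zeros((3, h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            out[:, i, j] = W @ feat[i:i + k, j:j + k].ravel()
    return out

feat = rng.normal(size=(8, 8))   # a larger feature map than the head was trained on
heat = conv_head(feat)           # shape (3, 5, 5): one coarse heat map per class
```

At every position, the convolutional head reproduces what the fully connected classifier would say about the corresponding crop, which is why this reinterpretation needs no retraining.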
Image under CC BY 4.0 from the Deep Learning Lecture.

Here, the key idea is not just to use a CNN as an encoder, but to also use a decoder. So, we end up with a structure that looks like this kind of CNN (you could even say hourglass), where we have an encoder and a decoder that does the upsampling again. This is, by the way, not an autoencoder, because the input is the image but the output is the segmentation mask. The encoder part of the network is essentially a CNN, and this is very similar to techniques that we have already talked about quite a bit.
Image under CC BY 4.0 from the Deep Learning Lecture.

On the other side, we need a decoder. This decoder is then used to upsample the information again. There are actually several approaches for how to do this. One of the early ones is Long et al.’s Fully Convolutional Network [13]. There’s also SegNet [1], and I think the most popular one is U-net [21]. This is also the paper that I hinted at that has many citations. U-net is really popular; you can check the citation count every day and watch it grow.
Image under CC BY 4.0 from the Deep Learning Lecture.

Well, let’s discuss how we can do this. The main issue is the upsampling part. Here, we want a decoder that somehow creates a pixel-wise prediction. Different options are possible: one is, for example, unpooling. You can also use transposed convolutions, which do not use the idea of pooling but the idea of convolution, transposed such that you increase the resolution instead of subsampling.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, let’s look at those upsampling techniques in some more detail. Of course, you can do something like nearest-neighbor interpolation, where you simply take the low-resolution information and unpool by copying the nearest neighbor. There’s also the bed-of-nails approach, which takes just a single value and puts it at one of the output locations, so the remaining image will look like a bed of nails. The idea here is, of course, that you put the information at the position where you know it belongs. Then, the remaining missing entries should be filled up by a learnable part that is introduced in a later step of the network.
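These two unpooling variants can be written down in a few lines of NumPy (an illustrative sketch, not code from the lecture):

```python
import numpy as np

def nearest_neighbor_upsample(x, factor=2):
    """Copy each input value into a factor x factor block of the output."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_upsample(x, factor=2):
    """Place each value at the top-left corner of its block; zeros elsewhere."""
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

x = np.array([[1, 2],
              [3, 4]])

print(nearest_neighbor_upsample(x))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]

print(bed_of_nails_upsample(x))
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```

In the bed-of-nails output, the zero entries are exactly the “missing entries” that a subsequent learnable layer is expected to fill in.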
Image under CC BY 4.0 from the Deep Learning Lecture.

Another approach is using max-pooling indices. Here, the idea is that in the encoder path, you perform max pooling and save the indices of where the maximum actually occurred. Then, you can use this information in the upsampling step and write the value exactly at the place where the maximum came from. This is very similar to what you would do in the backpropagation step of max pooling.
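A minimal NumPy sketch of pooling with saved indices and the corresponding unpooling (illustrative only, not the actual SegNet implementation):

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """size x size max pooling that also records where each maximum came from
    (as a flat index into the input array)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    pooled = np.zeros((h, w), dtype=x.dtype)
    indices = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            k = int(np.argmax(window))
            pooled[i, j] = window.flat[k]
            di, dj = divmod(k, size)
            indices[i, j] = (i * size + di) * x.shape[1] + (j * size + dj)
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """Write each pooled value back to the position its maximum came from;
    all other positions stay zero."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1, 5, 2, 1],
              [3, 2, 1, 4],
              [0, 1, 7, 2],
              [6, 2, 3, 1]])
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx, x.shape)  # maxima back at their original spots
```

The restored map is sparse, like bed-of-nails unpooling, but the nonzero values sit at the locations that actually produced the maxima, which preserves some spatial detail from the encoder.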
Image under CC BY 4.0 from the Deep Learning Lecture.

Of course, there are also learnable techniques like the transposed convolution. Here, you learn an upsampling, which is sometimes also called deconvolution. What you actually do is use a filter that moves, for example, two pixels in the output for every one pixel in the input; you control the upsampling factor with the stride. So, let’s look at this example. We have a single pixel that gets unpooled. Here, we produce a 3 x 3 transposed convolution, shown with a stride of two. Then, we move to the next pixel, and you can see that an overlap area emerges in this case. You have to do something about this overlap area. For example, you could simply sum the contributions up and hope that, in the subsequent processing, your network learns how to deal with this inconsistency in the upsampling step.
Image under CC BY 4.0 from the Deep Learning Lecture.

We can go ahead and do this for the other two pixels in this example. Then, you see that we get this cross-shaped area. So, the transposed convolution results in an uneven overlap when the kernel size is not divisible by the stride. These uneven overlaps on the two axes multiply and create the characteristic checkerboard artifact. In principle, as mentioned before, the network should be able to learn how to remove those artifacts in subsequent layers. In practice, this causes trouble, and we recommend avoiding it completely.
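The overlap behavior can be made visible with a naive NumPy implementation of the transposed convolution (an illustrative sketch, not optimized library code). With a kernel size of 3 and a stride of 2, a constant input produces exactly the uneven coverage pattern described above:

```python
import numpy as np

def transposed_conv2d(x, kernel, stride):
    """Naive transposed convolution: every input pixel pastes a scaled copy of
    the kernel into the output; overlapping contributions are summed up."""
    kh, kw = kernel.shape
    out = np.zeros(((x.shape[0] - 1) * stride + kh,
                    (x.shape[1] - 1) * stride + kw))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out

# Kernel size 3 is not divisible by stride 2 -> uneven overlap.
y = transposed_conv2d(np.ones((3, 3)), np.ones((3, 3)), stride=2)
print(y)
# Every other row/column is covered by two kernel copies; the coverage
# counts multiply across the axes into a checkerboard of 1s, 2s, and 4s:
# [[1. 1. 2. 1. 2. 1. 1.]
#  [1. 1. 2. 1. 2. 1. 1.]
#  [2. 2. 4. 2. 4. 2. 2.]
#  [1. 1. 2. 1. 2. 1. 1.]
#  [2. 2. 4. 2. 4. 2. 2.]
#  [1. 1. 2. 1. 2. 1. 1.]
#  [1. 1. 2. 1. 2. 1. 1.]]
```

A perfectly flat input thus comes out with a periodic intensity pattern; this is the checkerboard artifact in its purest form.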
Image under CC BY 4.0 from the Deep Learning Lecture.

So, how can this be avoided? Well, you choose an appropriate kernel size, i.e., one that is divisible by the stride. Alternatively, you can separate the upsampling from the convolution that computes the features. For example, you could resize the image using nearest-neighbor or bilinear interpolation and then add a convolution layer. This would be a typical approach.
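The second remedy, resize-then-convolve, can be sketched in NumPy as well (illustrative code, not from the lecture): nearest-neighbor upsampling followed by an ordinary zero-padded “same” convolution gives every interior output position the same coverage, so a constant input stays constant instead of turning into a checkerboard.

```python
import numpy as np

def upsample_then_conv(x, kernel):
    """2x nearest-neighbor upsampling followed by a zero-padded 'same'
    convolution; upsampling and feature computation are decoupled."""
    up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1).astype(float)
    kh, kw = kernel.shape
    padded = np.pad(up, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(up)
    for i in range(up.shape[0]):
        for j in range(up.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# A constant input with an averaging kernel stays constant in the interior:
z = upsample_then_conv(np.ones((3, 3)), np.ones((3, 3)) / 9.0)
print(z[2, 2])  # 1.0: uniform coverage, no checkerboard
```

Only the zero-padded border deviates from the constant value; there is no periodic pattern in the interior, which is the whole point of separating the resize from the convolution.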
Image under CC BY 4.0 from the Deep Learning Lecture.

Okay. So until now, we have covered all of the basic steps that we need in order to perform image segmentation. In the next video, we will talk about how to actually integrate the encoder and the decoder to get good segmentation masks. I can already tell you, there’s a specific trick that you have to use. If you don’t use this trick, you will probably not be able to get a very good segmentation result. So, please stay tuned for the next video, because there you will see how you can do good segmentations and learn all the details of these advanced segmentation techniques. Thank you very much for listening, and see you in the next video. Bye-bye!
Segmentation results on chest X-rays. Image created using gifify. Source: YouTube

If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.
Translated from: https://towardsdatascience.com/segmentation-and-object-detection-part-1-b8ef6f101547