
Face Recognition: How MTCNN Works

Published: 2023/12/10


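Face detection means locating the faces in an image. The input is an image that may contain faces, and the output is a rectangular bounding box for each face.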

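Face alignment. Faces in the original image can differ widely in pose and position, so they are "straightened" before any further processing. This requires detecting facial landmarks, such as the positions of the eyes, the nose, and the mouth, and points along the face contour. Given these landmarks, an affine transformation can warp every face to a common layout, removing as much pose-induced error as possible.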
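The alignment step can be sketched as follows. This is a minimal illustration, not a specific library's method: `TEMPLATE`, `estimate_affine`, and `apply_affine` are hypothetical names, and the template coordinates are made up for a 112×112 crop. The sketch fits a 2×3 affine matrix to five landmarks by least squares; warping the actual pixels with that matrix would be left to an image library.

```python
import numpy as np

# Hypothetical target positions (x, y) in a 112x112 aligned crop:
# left eye, right eye, nose, left mouth corner, right mouth corner.
TEMPLATE = np.array([
    [38.0, 52.0], [74.0, 52.0], [56.0, 72.0], [42.0, 92.0], [70.0, 92.0],
])

def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix M such that dst ~= M @ [x, y, 1]."""
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    A = np.hstack([src, np.ones((src.shape[0], 1))])  # (N, 3)
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)       # solves A @ X = dst
    return X.T                                        # (2, 3)

def apply_affine(M, pts):
    """Apply the 2x3 affine matrix M to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=np.float64)
    return pts @ M[:, :2].T + M[:, 2]

# Simulate landmarks from a face that is rotated, scaled, and shifted.
theta = np.deg2rad(20.0)
R = 1.5 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
detected = TEMPLATE @ R.T + np.array([30.0, -10.0])

M = estimate_affine(detected, TEMPLATE)
aligned = apply_affine(M, detected)  # landmarks mapped back onto the template
```

With real detections the fit is not exact, so the least-squares solution spreads the residual error over all five landmarks.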

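MTCNN Network Structure

MTCNN consists of three neural networks: P-Net, R-Net, and O-Net. Before these networks are applied, the original image is rescaled to several different sizes to form an "image pyramid", and each scale is then passed through the networks. The reason is that faces in the original image appear at different sizes: some are large and some are small. Small faces can be detected on an enlarged copy of the image, and large faces on a shrunken copy. In this way, all faces can be detected at one uniform scale.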
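The pyramid construction described above can be sketched as a list of scale factors. The defaults here (`min_face_size=20`, `factor=0.709`) are assumptions commonly seen in public MTCNN implementations, not values given in this article, and `pyramid_scales` is a hypothetical helper name.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, net_input=12):
    """Return the scale factors at which the image is resized for P-Net."""
    # First scale: a face of min_face_size pixels becomes net_input pixels.
    scale = net_input / min_face_size
    min_side = min(height, width) * scale
    scales = []
    # Keep shrinking until the image is smaller than one P-Net window.
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales

scales = pyramid_scales(480, 640)
# Resized (height, width) at each pyramid level.
sizes = [(round(480 * s), round(640 * s)) for s in scales]
```

If `min_face_size` is set below 12, the first scale is greater than 1, which corresponds to detecting small faces on an enlarged image.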

P-Net

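The input to P-Net is a 3-channel RGB image that is 12 pixels wide and 12 pixels high. The network decides whether this 12×12 image contains a face, and it outputs the position of the face box and of the landmarks.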

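The output has three parts:

Face classification. This part decides whether the image is a face. The output has shape 1×1×2 and holds the probabilities that the image is, and is not, a face.

Box regression. This part gives the precise position of the box. The 12×12 patch fed to P-Net may not coincide with the ideal face box: the face may not be exactly square, or the patch may sit slightly to the left or right of it. The network therefore outputs the offset of the current box relative to the ideal face box. A box in an image can be described by four numbers: the x coordinate of its top-left corner, the y coordinate of its top-left corner, its width, and its height. Box regression accordingly outputs the relative offset of the top-left x coordinate, the relative offset of the top-left y coordinate, the error in the width, and the error in the height. The output has shape 1×1×4.

Landmark localization. This part gives the positions of five facial landmarks: the left eye, the right eye, the nose, the left mouth corner, and the right mouth corner. Each landmark needs an x and a y coordinate, so the output has 10 dimensions in total (1×1×10).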
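To make the box-regression output concrete, here is a minimal sketch of applying the four predicted values to a candidate window. The exact parameterization is an assumption: the article only says the offsets are relative, so this sketch treats them as fractions of the window's width and height. `apply_box_regression` is a hypothetical name.

```python
def apply_box_regression(box, offsets):
    """Refine a candidate window with the network's regression output.

    box:     (x, y, w, h) of the window in image pixels.
    offsets: (dx, dy, dw, dh), assumed to be fractions of w and h.
    """
    x, y, w, h = box
    dx, dy, dw, dh = offsets
    return (x + dx * w,        # shift the top-left corner horizontally
            y + dy * h,        # shift the top-left corner vertically
            w * (1.0 + dw),    # correct the width
            h * (1.0 + dh))    # correct the height

# A 40x40 window nudged right and up, and widened by 20%.
refined = apply_box_regression((100.0, 60.0, 40.0, 40.0), (0.1, -0.05, 0.2, 0.0))
```

Expressing the offsets relative to the window size keeps the regression target independent of the pyramid scale at which the window was found.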

R-Net

Every region that P-Net flags as a possible face is resized to 24×24 and fed into R-Net for a second, stricter check.

O-Net

The regions that survive R-Net are resized to 48×48 and fed into the final network, O-Net.

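From P-Net to R-Net and finally to O-Net, the input image gets larger, the convolutional layers have more channels, and the networks have more layers, so face classification should become increasingly accurate. At the same time, P-Net is the fastest, R-Net is slower, and O-Net is the slowest. Three networks are used because applying O-Net directly to every region of the image would be very slow. In practice, P-Net performs a first round of filtering, its survivors are filtered again by R-Net, and only what remains is passed to O-Net, the most accurate but slowest network. Each stage reduces the number of candidates the next stage has to examine, which cuts the total processing time considerably.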
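The filtering idea can be illustrated with a toy cascade. This is only a sketch of the control flow: the scoring functions and thresholds below are hypothetical stand-ins for the three networks, and each "candidate" is just a number playing the role of a face score.

```python
def run_cascade(candidates, stages):
    """Filter candidates through stages ordered from fastest to slowest.

    stages: list of (name, score_fn, threshold).
    Returns the survivors and how many candidates each stage had to score.
    """
    evaluated = {}
    for name, score_fn, threshold in stages:
        evaluated[name] = len(candidates)
        candidates = [c for c in candidates if score_fn(c) >= threshold]
    return candidates, evaluated

# Toy data: 1000 candidate windows whose "score" is the value itself.
candidates = [i / 1000 for i in range(1000)]
stages = [
    ("P-Net", lambda c: c, 0.6),  # hypothetical thresholds
    ("R-Net", lambda c: c, 0.7),
    ("O-Net", lambda c: c, 0.9),
]
final, evaluated = run_cascade(candidates, stages)
```

In this toy run the slowest stage scores only the candidates that passed the two cheaper stages, which is where the time saving comes from.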

Center Loss

Reference paper: A Discriminative Feature Learning Approach for Deep Face Recognition (http://ydwen.github.io/papers/WenECCV16.pdf)

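Ideally, the distance between two "vector representations" should directly reflect how similar the two faces are:

For two face images of the same person, the Euclidean distance between their vectors should be small.
For two face images of different people, the Euclidean distance between their vectors should be large.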

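The original CNN model uses the Softmax loss. Softmax is a between-class loss, and for faces each class is one person. Although the Softmax loss can tell individuals apart, it places no explicit requirement on the distances between the vector representations of each class.

Center loss does not optimize distances directly. It keeps the original classification model but assigns a class center to every class (person). The features of images from the same class should lie as close as possible to their own class center, while the centers of different classes should stay far apart, a separation that the Softmax term is relied on to provide.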

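Let the input face image be $x_i$ and its class be $y_i$. Each class is given a class center, written $c_{y_i}$. The feature $f(x_i)$ of every face image should be as close as possible to its center $c_{y_i}$, so the center loss of one image is defined as

$$L_i = \frac{1}{2} \left\| f(x_i) - c_{y_i} \right\|_2^2$$

The center loss over several images is the sum of their individual values:

$$L_{center} = \sum_i L_i$$

This is a very simple definition, but one question remains: how is the center of each class determined? In theory, the best center of a class is the mean of the features of all its images. Under that definition, however, every gradient descent step would require recomputing $c_j$ over all images, which is far too expensive. An approximation is used instead. Each $c_j$ is initialized randomly; then, within every batch, the gradient of $L_{center}$ with respect to the centers $c_j$ that appear in the batch is computed and used to update them. In addition, a classification model cannot be trained with the center loss alone. The Softmax loss is still needed, so the final loss has two parts, $L = L_{softmax} + \lambda L_{center}$, where λ is a hyperparameter.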

The figure in the original post shows that the larger the center-loss weight λ, the more pronounced the "cohesion" of the learned features.

    import tensorflow as tf  # TensorFlow 1.x API (tf.get_variable, tf.scatter_sub)


    def center_loss(features, label, alfa, nrof_classes):
        """Center loss based on the paper "A Discriminative Feature Learning Approach
        for Deep Face Recognition" (http://ydwen.github.io/papers/WenECCV16.pdf).

        :param features: features extracted by the deep CNN, [batch_size, feature_dim]
        :param label: class labels, [batch_size, 1]
        :param alfa: fraction of the old center that is kept at each update
        :param nrof_classes: total number of classes, int
        :return: (loss, centers)
        """
        nrof_features = features.get_shape()[1]
        # One non-trainable center per class, initialized to zero.
        centers = tf.get_variable('centers', [nrof_classes, nrof_features], dtype=tf.float32,
                                  initializer=tf.constant_initializer(0), trainable=False)
        label = tf.reshape(label, [-1])
        centers_batch = tf.gather(centers, label)
        diff = (1 - alfa) * (centers_batch - features)  # scaled step toward the features
        centers = tf.scatter_sub(centers, label, diff)  # update the class centers
        loss = tf.reduce_mean(tf.square(features - centers_batch))
        return loss, centers
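The TensorFlow 1.x function above no longer runs unchanged on current TensorFlow, so the same update rule can be checked with a small NumPy sketch. `center_loss_step` is a hypothetical name, and the toy features and labels are made up. As in the function above, the loss is a mean of squared differences and the centers start at zero.

```python
import numpy as np

def center_loss_step(features, labels, centers, alfa):
    """Return (loss, updated_centers) for one batch.

    features: (batch, dim), labels: (batch,) ints,
    centers:  (num_classes, dim), alfa: fraction of the old center kept.
    """
    centers_batch = centers[labels]              # center of each sample's class
    loss = np.mean(np.square(features - centers_batch))
    diff = (1.0 - alfa) * (centers_batch - features)
    new_centers = centers.copy()
    # Unbuffered subtraction: contributions accumulate for repeated labels.
    np.subtract.at(new_centers, labels, diff)
    return loss, new_centers

features = np.array([[1.0, 1.0], [3.0, 3.0], [-2.0, 0.0]])
labels = np.array([0, 0, 1])
centers = np.zeros((2, 2))

loss, centers = center_loss_step(features, labels, centers, alfa=0.5)
loss_after, _ = center_loss_step(features, labels, centers, alfa=0.5)
```

After one update the centers have moved toward the features of their class, so the loss on the same batch is smaller the second time.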

Triplet Loss

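Each training step draws three face images from the training data: the first is written $x_i^a$, the second $x_i^p$, and the third $x_i^n$. In such a "triplet", $x_i^a$ and $x_i^p$ are images of the same person, while $x_i^n$ is a face image of a different person. The distance $\| f(x_i^a) - f(x_i^p) \|_2$ should therefore be small, and the distance $\| f(x_i^a) - f(x_i^n) \|_2$ should be large. Strictly speaking, the triplet loss requires the following inequality to hold:

$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2$$

That is, the squared distance between faces of the same person must be smaller than the squared distance between faces of different people by at least $\alpha$. The loss function is designed accordingly:

$$L_i = \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \right]_+$$

When the distances of a triplet satisfy $\| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha < \| f(x_i^a) - f(x_i^n) \|_2^2$, no loss is incurred and $L_i = 0$. When the inequality is violated, the loss equals $\| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha - \| f(x_i^a) - f(x_i^n) \|_2^2$. In addition, training fixes $\| f(x) \|_2 = 1$ so that the features cannot drift infinitely far apart.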
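A small numeric check of the loss just defined, written in NumPy. The embeddings are made-up 2-D vectors and the function names are hypothetical. The first triplet already satisfies the margin and contributes no loss, while the second violates it.

```python
import numpy as np

def l2_normalize(x):
    """Scale every row to unit L2 norm, as the training constraint requires."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def triplet_loss_np(anchor, positive, negative, alpha):
    """Per-triplet hinge loss on squared Euclidean distances."""
    pos_dist = np.sum(np.square(anchor - positive), axis=1)
    neg_dist = np.sum(np.square(anchor - negative), axis=1)
    return np.maximum(pos_dist - neg_dist + alpha, 0.0)

anchor   = l2_normalize(np.array([[1.0, 0.0], [1.0, 0.0]]))
positive = l2_normalize(np.array([[1.0, 0.0], [0.0, 1.0]]))
negative = l2_normalize(np.array([[-1.0, 0.0], [1.0, 1.0]]))

losses = triplet_loss_np(anchor, positive, negative, alpha=0.2)
mean_loss = losses.mean()  # a batch loss would average the per-triplet values
```

Only triplets that violate the margin produce a gradient, which is why the choice of triplets matters so much during training.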

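The triplet loss optimizes distances directly, so it addresses the problem of learning face feature representations. During training, however, choosing the triplets takes considerable care. If triplets are always chosen at random, the model converges correctly but does not reach its best performance. If "hard example mining" is used, meaning that the hardest-to-separate triplets are chosen every time, the model often fails to converge. A compromise is to train on "semi-hard" triplets, which lets the model converge while still performing well. Training a face model with the triplet loss also usually requires a very large face dataset to achieve good results.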
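One common reading of "semi-hard" is a negative that is farther from the anchor than the positive but still inside the margin. The sketch below selects such negatives with NumPy. The definition is stated here as an assumption, and `semi_hard_negatives` and the toy embeddings are hypothetical. Note that the condition `neg_dists_sqr - pos_dist_sqr < alpha` used in the article's `select_triplets` code is looser: it also admits negatives that are closer than the positive.

```python
import numpy as np

def semi_hard_negatives(embeddings, labels, a_idx, p_idx, alpha):
    """Indices n with d(a,p) < d(a,n) < d(a,p) + alpha and a different label."""
    # Squared Euclidean distance from the anchor to every embedding.
    dists = np.sum(np.square(embeddings - embeddings[a_idx]), axis=1)
    pos_dist = dists[p_idx]
    is_negative = labels != labels[a_idx]
    mask = is_negative & (dists > pos_dist) & (dists < pos_dist + alpha)
    return np.flatnonzero(mask)

embeddings = np.array([
    [0.0, 0.0],   # 0: anchor (person 0)
    [0.5, 0.0],   # 1: positive (person 0), squared distance 0.25
    [0.3, 0.0],   # 2: hard negative, closer than the positive
    [0.6, 0.0],   # 3: semi-hard negative, inside the margin
    [1.0, 0.0],   # 4: easy negative, already beyond the margin
])
labels = np.array([0, 0, 1, 1, 2])

selected = semi_hard_negatives(embeddings, labels, a_idx=0, p_idx=1, alpha=0.2)
```

Here only index 3 qualifies: index 2 is a hard negative and index 4 already satisfies the margin.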

    import itertools
    import time

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API

    import facenet  # helper module referenced by train(); not shown in this article


    def triplet_loss(anchor, positive, negative, alpha):
        """Calculate the triplet loss according to the FaceNet paper

        Args:
          anchor: the embeddings for the anchor images.
          positive: the embeddings for the positive images.
          negative: the embeddings for the negative images.

        Returns:
          the triplet loss according to the FaceNet paper as a float tensor.
        """
        with tf.variable_scope('triplet_loss'):
            pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), 1)
            neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), 1)
            basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
            loss = tf.reduce_mean(tf.maximum(basic_loss, 0.0), 0)
        return loss


    def select_triplets(embeddings, nrof_images_per_class, image_paths, people_per_batch, alpha):
        """Select the triplets for training

        :param embeddings: image feature vectors from the deep network, [?, embedding_dim]
        :param nrof_images_per_class: list with the number of images of each person
        :param image_paths: image paths, in the same order as embeddings
        :param people_per_batch: number of classes (people) in each batch
        :param alpha: margin
        :return: (triplets, num_trips, len(triplets))
        """
        trip_idx = 0
        emb_start_idx = 0
        num_trips = 0
        triplets = []
        for i in range(people_per_batch):
            nrof_images = int(nrof_images_per_class[i])
            for j in range(1, nrof_images):
                a_idx = emb_start_idx + j - 1  # anchor index
                # Squared distance from the anchor to every other face
                neg_dists_sqr = np.sum(np.square(embeddings[a_idx] - embeddings), 1)
                for pair in range(j, nrof_images):
                    p_idx = emb_start_idx + pair  # positive index
                    # Squared distance between the anchor and the positive
                    pos_dist_sqr = np.sum(np.square(embeddings[a_idx] - embeddings[p_idx]))
                    # Mask distances to faces of the anchor's own class with NaN
                    neg_dists_sqr[emb_start_idx:emb_start_idx + nrof_images] = np.nan
                    # Keep negatives that violate the margin: d(a,n)^2 - d(a,p)^2 < alpha
                    all_neg = np.where(neg_dists_sqr - pos_dist_sqr < alpha)[0]
                    nrof_random_negs = all_neg.shape[0]
                    if nrof_random_negs > 0:
                        # Pick one qualifying negative at random
                        rnd_idx = np.random.randint(nrof_random_negs)
                        n_idx = all_neg[rnd_idx]
                        triplets.append((image_paths[a_idx], image_paths[p_idx], image_paths[n_idx]))
                        trip_idx += 1
                    num_trips += 1
            emb_start_idx += nrof_images
        np.random.shuffle(triplets)
        return triplets, num_trips, len(triplets)


    def train(args, sess, dataset, epoch, image_paths_placeholder, labels_placeholder, labels_batch,
              batch_size_placeholder, learning_rate_placeholder, phase_train_placeholder, enqueue_op,
              input_queue, global_step, embeddings, loss, train_op, summary_op, summary_writer,
              learning_rate_schedule_file, embedding_size, anchor, positive, negative, triplet_loss):
        batch_number = 0
        step = 0  # defined up front so it exists even if no batch is run
        if args.learning_rate > 0.0:
            lr = args.learning_rate
        else:
            lr = facenet.get_learning_rate_from_file(learning_rate_schedule_file, epoch)
        while batch_number < args.epoch_size:
            # Randomly sample people_per_batch*images_per_person images,
            # keeping images of the same person together.
            # sample_people is a helper that is not shown in this article.
            image_paths, num_per_class = sample_people(dataset, args.people_per_batch, args.images_per_person)
            print('Running forward pass on sampled images: ', end='')
            start_time = time.time()
            nrof_examples = args.people_per_batch * args.images_per_person
            labels_array = np.reshape(np.arange(nrof_examples), (-1, 3))
            image_paths_array = np.reshape(np.expand_dims(np.array(image_paths), 1), (-1, 3))
            # Enqueue the people_per_batch*images_per_person images
            sess.run(enqueue_op, {image_paths_placeholder: image_paths_array, labels_placeholder: labels_array})
            emb_array = np.zeros((nrof_examples, embedding_size))
            nrof_batches = int(np.ceil(nrof_examples / args.batch_size))
            # Compute the embedding of every sampled image; the images are dequeued
            # as they are processed, so the queue is empty afterwards
            for i in range(nrof_batches):
                batch_size = min(nrof_examples - i * args.batch_size, args.batch_size)
                emb, lab = sess.run([embeddings, labels_batch],
                                    feed_dict={batch_size_placeholder: batch_size,
                                               learning_rate_placeholder: lr,
                                               phase_train_placeholder: True})
                emb_array[lab, :] = emb
            print('%.3f' % (time.time() - start_time))
            # Select the "semi-hard" data for training
            print('Selecting suitable triplets for training')
            triplets, nrof_random_negs, nrof_triplets = select_triplets(
                emb_array, num_per_class, image_paths, args.people_per_batch, args.alpha)
            selection_time = time.time() - start_time
            print('(nrof_random_negs, nrof_triplets) = (%d, %d): time=%.3f seconds' %
                  (nrof_random_negs, nrof_triplets, selection_time))
            # Perform training on the selected triplets
            nrof_batches = int(np.ceil(nrof_triplets * 3 / args.batch_size))
            triplet_paths = list(itertools.chain(*triplets))
            labels_array = np.reshape(np.arange(len(triplet_paths)), (-1, 3))
            triplet_paths_array = np.reshape(np.expand_dims(np.array(triplet_paths), 1), (-1, 3))
            # Enqueue the "semi-hard" data
            sess.run(enqueue_op, {image_paths_placeholder: triplet_paths_array, labels_placeholder: labels_array})
            nrof_examples = len(triplet_paths)
            train_time = 0
            i = 0
            emb_array = np.zeros((nrof_examples, embedding_size))
            loss_array = np.zeros((nrof_triplets,))
            # Train batch by batch
            while i < nrof_batches:
                start_time = time.time()
                batch_size = min(nrof_examples - i * args.batch_size, args.batch_size)
                feed_dict = {batch_size_placeholder: batch_size, learning_rate_placeholder: lr,
                             phase_train_placeholder: True}
                err, _, step, emb, lab = sess.run(
                    [loss, train_op, global_step, embeddings, labels_batch], feed_dict=feed_dict)
                emb_array[lab, :] = emb
                loss_array[i] = err
                duration = time.time() - start_time
                print('Epoch: [%d][%d/%d]\tTime %.3f\tLoss %2.3f' %
                      (epoch, batch_number + 1, args.epoch_size, duration, err))
                batch_number += 1
                i += 1
                train_time += duration
            # Add the triplet selection time to the summary
            summary = tf.Summary()
            # pylint: disable=maybe-no-member
            summary.value.add(tag='time/selection', simple_value=selection_time)
            summary_writer.add_summary(summary, step)
        return step
