Face Crawling (Collecting a Face Dataset)

Published: 2024/3/24 | pytorch | 豆豆

A face dataset is the key ingredient in any face-related processing task. This article describes how to crawl one.

1. Get the artist names

① Get the full URL

Search Baidu for “中国艺人” (Chinese artists) to open the results page.

Inspecting the requests made by that page shows that the full URL is:

"https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php?resource_id=28266&from_mid=500&format=json&ie=utf-8&oe=utf-8&query=%E4%B8%AD%E5%9B%BD%E8%89%BA%E4%BA%BA&sort_key=&sort_type=1&stat0=&stat1=&stat2=&stat3=&pn="+pn+"&rn=100&_=1580457480665"

Here pn is the paging parameter. The code below advances it by 100 per request, matching rn=100.
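The same query string can also be assembled from a dictionary, which makes the paging parameter easier to see. This is only a minimal sketch, assuming the endpoint and parameters shown above still behave as described. `build_artist_url` is a hypothetical helper name that does not appear in the original code, and the `_` value is copied verbatim from the URL above.

```python
from urllib.parse import urlencode

BASE = "https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php"

def build_artist_url(pn, rn=100):
    """Build the paged search URL (hypothetical helper, not from the article)."""
    params = {
        "resource_id": "28266",
        "from_mid": "500",
        "format": "json",
        "ie": "utf-8",
        "oe": "utf-8",
        "query": "中国艺人",  # urlencode percent-encodes this as UTF-8
        "sort_key": "",
        "sort_type": "1",
        "stat0": "",
        "stat1": "",
        "stat2": "",
        "stat3": "",
        "pn": str(pn),   # paging parameter
        "rn": str(rn),   # results per request
        "_": "1580457480665",
    }
    return BASE + "?" + urlencode(params)

print(build_artist_url(100))
```

Building the URL this way avoids string concatenation mistakes when more than one parameter has to change.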

② Parse the full list of artist names

Fetch the URL above with requests, then parse the artist names out of the JSON response.

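import json
import requests

def get_person_name():
    person_list = []
    pn_i = 0
    while True:
        pn = str(pn_i)
        pn_i += 100
        url = "https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php?resource_id=28266&from_mid=500&format=json&ie=utf-8&oe=utf-8&query=%E4%B8%AD%E5%9B%BD%E8%89%BA%E4%BA%BA&sort_key=&sort_type=1&stat0=&stat1=&stat2=&stat3=&pn=" + pn + "&rn=100&_=1580457480665"
        res = requests.get(url, timeout=10)
        try:
            figs = json.loads(res.text)['data'][0]['result']
        except (json.JSONDecodeError, KeyError, IndexError, TypeError):
            # The original code used "continue" here and had no exit condition,
            # so it could loop forever; stop on a malformed response instead.
            break
        if not figs:
            break  # an empty page means there are no more results
        for i in figs:
            name = i['ename']
            print(name)
            person_list.append(name)
    return person_list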
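The parsing step can be checked without touching the network by separating it from the request. The sketch below assumes the response has the shape implied by the code above (`data[0].result[i].ename`). `parse_names` is a hypothetical helper, and the sample string is hand-written for illustration, not real API output.

```python
import json

def parse_names(text):
    """Return the 'ename' values from one response body, or [] if the shape differs."""
    try:
        payload = json.loads(text)
        results = payload["data"][0]["result"]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return []
    return [item["ename"] for item in results if "ename" in item]

# Hand-written sample in the assumed shape; not real API output.
sample = '{"data": [{"result": [{"ename": "A"}, {"ename": "B"}, {"other": 1}]}]}'
print(parse_names(sample))      # ['A', 'B']
print(parse_names("not json"))  # []
```

An empty list from `parse_names` is also a natural signal for the paging loop to stop.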

2. Crawl photos of each artist

① Get the image URLs

import socket
import time
import urllib.error
import urllib.request

# Fragment from a method of the crawler class. self.headers, self.time_sleep,
# self.__amount, word_quote, pn and image_url_list are set up elsewhere, as are
# the helpers find_all_sub and find_in_list.
# Note: pn is not part of the URL as written, so every iteration requests the same page.
while pn < self.__amount:
    url = "https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1639129009987_R&pv=&ic=&nc=1&z=&hd=&latest=&copyright=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&dyTabStr=MCwzLDEsNiwyLDQsNSw3LDgsOQ%3D%3D&ie=utf-8&sid=&word=" + word_quote
    page = None  # lets the finally block run safely if urlopen fails
    try:
        time.sleep(self.time_sleep)
        req = urllib.request.Request(url=url, headers=self.headers)
        page = urllib.request.urlopen(req)
        rsp = str(page.read())
        # Split the page into chunks, each starting at one "objURL" occurrence.
        index_list = find_all_sub("objURL", rsp)
        img_root = []
        for i in range(len(index_list)):
            if i == len(index_list) - 1:
                temp = rsp[index_list[i]:]
            else:
                temp = rsp[index_list[i]:index_list[i + 1]]
            img_root.append(temp)
        for img_root_path in img_root:
            temp_url = img_root_path[9:]  # skip the 9 characters of objURL":"
            end = temp_url.find('"')
            image_temp_url = temp_url[:end]
            if not find_in_list(image_url_list, image_temp_url):
                image_url_list.append(image_temp_url)
    except UnicodeDecodeError as e:
        print(e)
        print('-----UnicodeDecodeError url:', url)
    except urllib.error.URLError as e:
        print(e)
        print("-----urlError url:", url)
    except socket.timeout as e:
        print(e)
        print("-----socket timeout:", url)
    else:
        # Read the next page
        print("Downloading next page")
        pn += 60
    finally:
        if page is not None:
            page.close()
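The fragment above calls `find_all_sub` and `find_in_list`, which the article does not show. The sketch below gives one plausible version of each, plus a small function that applies the same objURL extraction to a string. All three bodies are assumptions written for illustration, and the sample text is hand-made rather than a real Baidu response.

```python
def find_all_sub(sub, text):
    """Return the start index of every occurrence of sub in text (assumed behavior)."""
    indexes = []
    start = text.find(sub)
    while start != -1:
        indexes.append(start)
        start = text.find(sub, start + 1)
    return indexes

def find_in_list(items, value):
    """Return True if value is already in items (assumed behavior)."""
    return value in items

def extract_obj_urls(page_text):
    """Collect the unique values of "objURL":"..." in order of appearance."""
    urls = []
    marker = '"objURL":"'
    for idx in find_all_sub(marker, page_text):
        begin = idx + len(marker)
        end = page_text.find('"', begin)
        if end == -1:
            continue  # unterminated value, skip it
        url = page_text[begin:end]
        if not find_in_list(urls, url):
            urls.append(url)
    return urls

# Hand-made sample; the third entry repeats the first and is dropped.
sample = ('{"objURL":"http://example.com/a.jpg","x":1,'
          '"objURL":"http://example.com/b.jpg",'
          '"objURL":"http://example.com/a.jpg"}')
print(extract_obj_urls(sample))
```

Keeping the extraction in its own function makes it possible to test the string handling separately from the HTTP request.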

② Download the images

import os
import socket
import urllib.request

# word, number and image_url_list come from the surrounding crawler code.
image_root_path = "./" + word
if not os.path.exists(image_root_path):
    os.mkdir(image_root_path)
for img_url in image_url_list:
    number += 1
    filepath = image_root_path + "/" + str(word) + "_" + str(number) + ".jpg"
    print(filepath)
    count = 1
    try:
        urllib.request.urlretrieve(img_url, filepath)
    except socket.timeout:
        # Retry up to three more times if the download times out.
        while count <= 3:
            try:
                urllib.request.urlretrieve(img_url, filepath)
                break
            except socket.timeout:
                count += 1
    finally:
        # Display the raw URL of the image.
        print('\t%d\t%s' % (number, img_url))
        if count > 3:
            print('\t%d\t%s failed' % (number, img_url))
print("Download finished")
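The retry logic above can be factored into a small function that takes the download call as an argument, which also makes it testable without a network connection. This is a sketch only: `download_with_retry` and `flaky_fetch` are hypothetical names, and the fake fetcher simulates timeouts rather than downloading anything.

```python
import socket

def download_with_retry(fetch, url, path, retries=3):
    """Call fetch(url, path), retrying on socket.timeout up to `retries` more times.

    Returns True on success and False if every attempt timed out.
    """
    for _ in range(1 + retries):
        try:
            fetch(url, path)
            return True
        except socket.timeout:
            continue  # try again until the attempts run out
    return False

# Fake fetcher that times out twice and then succeeds; used only for this demo.
calls = []

def flaky_fetch(url, path):
    calls.append(url)
    if len(calls) < 3:
        raise socket.timeout("simulated timeout")

ok = download_with_retry(flaky_fetch, "http://example.com/a.jpg", "a.jpg")
print(ok, len(calls))  # True 3
```

In the crawler itself the call would be `download_with_retry(urllib.request.urlretrieve, img_url, filepath)`. Because `urlretrieve` has no timeout argument, a timeout is only raised if a default has been set, for example with `socket.setdefaulttimeout(10)`.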
