Python Web Scraping in Practice: Scraping Job Listings from 51job (前程无忧)

Published: 2023/12/14 python 39 豆豆


First, press F12 to open the browser's developer tools and inspect the 51job search results page.

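The page structure turns out to be fairly simple: the basic information for each posting sits under p tags.
In a case like this, a regular expression can pull the information out easily.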
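To make the idea concrete, here is a minimal sketch of regex extraction with `re.S`. The `SAMPLE_HTML` markup, the `example.com` URLs, and the names `JOB_PATTERN` and `extract_jobs` are all assumptions made up for illustration; the real 51job markup may differ.

```python
import re

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE_HTML = '''
<div class="el">
  <p class="t1 ">
    <a target="_blank" title="Python Developer" href="https://example.com/job/1.html">Python Developer</a>
  </p>
</div>
<div class="el">
  <p class="t1 ">
    <a target="_blank" title="Data Analyst" href="https://example.com/job/2.html">Data Analyst</a>
  </p>
</div>
'''

# re.S lets "." match newlines, so one pattern can span several lines of HTML.
# The lazy ".*?" stops at the first place where the rest of the pattern fits.
JOB_PATTERN = re.compile(
    r'class="t1 ">.*?<a target="_blank" title="(.*?)" href="(.*?)"',
    re.S,
)


def extract_jobs(html):
    """Return a list of (title, url) tuples found in the HTML."""
    return JOB_PATTERN.findall(html)


if __name__ == '__main__':
    for title, url in extract_jobs(SAMPLE_HTML):
        print(title, url)
```

Each capture group is bounded by literal text on both sides (`title="` and `"`, `href="` and `"`), which is what lets a lazy group capture something non-empty.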

The code is as follows:

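import re
import urllib.parse
import urllib.request


# Fetch the HTML source of one search results page
def get_content(page, name):
    name = urllib.parse.quote(name)  # percent-encode the keyword for the URL
    url = 'http://search.51job.com/list/000000,000000,0000,00,9,99,' + name + ',2,' + str(page) + '.html'
    a = urllib.request.urlopen(url)  # open the URL
    html = a.read().decode('gbk')  # read the source and decode it from GBK into a str
    return html


def get(html):
    # URL of each posting's detail page
    reg1 = re.compile(r'class="t1 ">.*?<a target="_blank" title=".*?" href="(.*?)".*?', re.S)
    detail_url = re.findall(reg1, html)
    print(detail_url)
    # Basic information.
    # As written, the last three groups have no delimiting markup around them,
    # so they capture empty strings; they need the tags that surround the
    # fields they are meant to capture.
    reg = re.compile(r'class="t1 ">.*? <a target="_blank" title="(.*?)".*?<a target="_blank" title="(.*?)" href="(.*?)".*?(.*?).*?(.*?).*? (.*?)', re.S)
    items = re.findall(reg, html)
    return items, detail_url


def run():
    name = input('Enter the job title to scrape: ')
    # Handle multiple pages and write the results to a file.
    # The file is opened once; the with block closes it automatically.
    with open('51job.txt', 'a', encoding='utf-8') as f:
        for j in range(1, 3):
            print("Scraping page " + str(j) + "...")
            html = get_content(j, name)  # fetch the page source
            items, detail_url = get(html)
            for i, c in zip(items, detail_url):
                # print(i[0], i[1], i[2], i[3], i[4])
                f.write(i[0] + '\t' + i[1] + '\t' + i[3] + '\t' + i[4] + '\t' + i[5] + '\t' + i[2] + '\t' + c + '\n')


if __name__ == '__main__':
    run()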
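The URL construction and the tab-separated output in the code above can be pulled out into small helpers that are easy to check without touching the network. This is only a sketch: `BASE`, `build_search_url`, and `write_rows` are hypothetical names, and using the standard `csv` module in place of manual string concatenation is a substitution, not the author's approach.

```python
import csv
import io
import urllib.parse

# Same URL prefix as in the code above.
BASE = 'http://search.51job.com/list/000000,000000,0000,00,9,99,'


def build_search_url(name, page):
    """Percent-encode the keyword and build the search URL for one page."""
    return BASE + urllib.parse.quote(name) + ',2,' + str(page) + '.html'


def write_rows(rows, out):
    """Write rows as tab-separated lines to a text file object.

    csv.writer quotes any field that itself contains a tab or a newline,
    which plain '\t'.join(...) would not do.
    """
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerows(rows)


if __name__ == '__main__':
    print(build_search_url('数据', 1))
    buffer = io.StringIO()
    write_rows([('title', 'company', 'url')], buffer)
    print(buffer.getvalue(), end='')
```

When writing to a real file with `csv.writer`, open it with `newline=''` as the `csv` documentation recommends.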

Demo run:

The resulting txt file:

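Summary: this code only performs a simple scrape of the job postings on the search results page. A follow-up will scrape the job details from each detail_url page, clean the data, and mine and analyze the resulting text.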
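As a rough idea of what a first cleaning pass over the tab-separated output could look like, here is a small sketch. The function name `clean_rows` and the specific rules (strip whitespace, skip empty rows, drop exact duplicates) are assumptions for illustration, not the cleaning the follow-up article performs.

```python
def clean_rows(lines):
    """Strip whitespace from each tab-separated field and drop duplicate rows.

    Returns a list of tuples, one per unique non-empty row, in input order.
    """
    seen = set()
    cleaned = []
    for line in lines:
        fields = tuple(field.strip() for field in line.rstrip('\n').split('\t'))
        # Skip rows whose fields are all empty, and rows already seen.
        if not any(fields) or fields in seen:
            continue
        seen.add(fields)
        cleaned.append(fields)
    return cleaned


if __name__ == '__main__':
    sample = ['Python Developer \t Example Co\n', 'Python Developer\tExample Co\n', '\n']
    for row in clean_rows(sample):
        print(row)
```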

**For scraping 51job detail pages and generating an Excel file, see this article:** https://blog.csdn.net/weixin_43746433/article/details/90490227
