Python Web Scraping: Crawling Python Job Listings

Published: 2025/3/15 python 27 豆豆

Goals of this article

Fetch the Ajax request and parse the required fields from the JSON
Save the data to Excel
Save the data to MySQL for easier analysis

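Brief analysis

Average salary for Python jobs in five cities

Education requirements for Python jobs

Industry distribution of Python jobs

Company size distribution for Python jobs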
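
The four charts above come down to two operations on the scraped rows: averaging a salary range per city, and counting how often each category value occurs. The following is a minimal sketch, not the author's analysis code. It assumes salary strings look like 15k-25k (an assumption about the site's format) and that each row uses the column order produced by get_json later in this article. The helper names salary_midpoint and average_salary_by_city and the sample rows are made up for illustration.

```python
import re
from collections import Counter, defaultdict


def salary_midpoint(salary):
    """Return the mean of the numbers in a string such as '15k-25k', or None."""
    numbers = [int(n) for n in re.findall(r'(\d+)\s*[kK]', salary)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


def average_salary_by_city(rows):
    """rows: [short name, full name, industry, size, salary, city, education]."""
    buckets = defaultdict(list)
    for row in rows:
        mid = salary_midpoint(row[4])
        if mid is not None:
            buckets[row[5]].append(mid)
    return {city: sum(values) / len(values) for city, values in buckets.items()}


# Made-up sample rows, for illustration only.
rows = [
    ['A', 'A Ltd', 'Internet', '50-150', '10k-20k', '北京', '本科'],
    ['B', 'B Ltd', 'Finance', '150-500', '20k-30k', '北京', '本科'],
    ['C', 'C Ltd', 'Internet', '50-150', '8k-12k', '杭州', '大專'],
]
averages = average_salary_by_city(rows)              # salary in thousands per city
education_counts = Counter(row[6] for row in rows)   # education distribution
```

The same Counter call works for the industry and company size columns by changing the index.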

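Inspecting the page structure

Using Python as the search keyword and leaving the other filters at their defaults, click search to see all the Python job listings. Then open the browser's developer console and switch to the Network tab, where the following request appears:

Judging by the response, this request is exactly what we need, so from here on we can request this address directly. As the figure shows, the individual job listings sit under result.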
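
To make the structure concrete, here is a minimal sketch of pulling the job list out of the parsed response. The key path content, positionResult, result matches the code later in this article. The sample payloads and the helper name extract_positions are made up for illustration, and returning an empty list when a key is missing is a design choice of this sketch, not something the original code does.

```python
def extract_positions(payload):
    """Return the list under content.positionResult.result, or [] if absent."""
    node = payload
    for key in ('content', 'positionResult', 'result'):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Hypothetical payloads: one well-formed, one without the expected keys.
sample = {
    'content': {
        'positionResult': {
            'result': [
                {'companyShortName': 'Example', 'city': '北京', 'salary': '15k-25k'},
            ]
        }
    }
}
positions = extract_positions(sample)
missing = extract_positions({'msg': 'hypothetical payload without content'})
```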

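At this point we know where to request the data and where to read the results. However, the result list holds only the 15 entries of the first page. How do we get the other pages?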

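Analyzing the request parameters

Click the Params tab:

Three form fields are submitted. Clearly kd is the search keyword and pn is the current page number. first can be left at its default and ignored. What remains is to build the requests and download the data for all 30 pages.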
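
The three fields can be generated for every page with a small helper. This is a minimal sketch: the field names first, pn, and kd come from the request above, while the function names build_form_data and iter_form_data are hypothetical. The value of first is fixed to the one used in this article's code.

```python
def build_form_data(keyword, page):
    """Form fields for one results page."""
    # 'first' is kept at a fixed value, as the article suggests ignoring it.
    return {'first': 'false', 'pn': page, 'kd': keyword}


def iter_form_data(keyword, last_page=30):
    """Yield the form data for pages 1 through last_page."""
    for page in range(1, last_page + 1):
        yield build_form_data(keyword, page)


payloads = list(iter_form_data('python'))
```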

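Building the request and parsing the data

Building the request is straightforward with the requests library. First build the form data, data = {'first': 'false', 'pn': page, 'kd': lang_name}, then send it to the URL with requests and parse the JSON that comes back. Lagou is fairly strict about crawlers, so we need to include all the headers fields the browser sends and increase the interval between requests. I set it to 10-20 s in the code later on, after which the data came back normally.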
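
The 10-20 s pause can be separated from the request logic. The following is a minimal sketch with a hypothetical helper named paced: it yields items one at a time and waits a random number of seconds between consecutive items. The injectable sleep parameter is an addition of this sketch so the behaviour can be checked without actually waiting.

```python
import random
import time


def paced(items, delay_range=(10, 20), sleep=time.sleep):
    """Yield items, pausing a random whole number of seconds between them."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(random.randint(*delay_range))
        yield item


# Demonstration with a fake sleep that records the delays instead of waiting.
waits = []
pages = list(paced(range(1, 4), sleep=waits.append))
```

In a real crawl the loop would read for page in paced(range(1, 31)): and call the request function inside the loop body.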

import requests


def get_json(url, page, lang_name):
    # Request headers copied from the browser
    headers = {
        'Host': 'www.lagou.com',
        'Connection': 'keep-alive',
        'Content-Length': '23',
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': 'None',
        'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    # Form data: pn is the page number, kd is the search keyword
    data = {'first': 'false', 'pn': page, 'kd': lang_name}
    json = requests.post(url, data, headers=headers).json()
    list_con = json['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        # One row per job; '無' ("none") is the default for a missing field
        info = []
        info.append(i.get('companyShortName', '無'))
        info.append(i.get('companyFullName', '無'))
        info.append(i.get('industryField', '無'))
        info.append(i.get('companySize', '無'))
        info.append(i.get('salary', '無'))
        info.append(i.get('city', '無'))
        info.append(i.get('education', '無'))
        info_list.append(info)
    return info_list

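Fetching all the data

Now that we know how to parse the data, what remains is to request every page in turn. We write a function that requests all 30 pages.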

import random
import time

from openpyxl import Workbook


def main():
    # get_conn() and insert() are defined in the complete code below
    lang_name = 'python'
    wb = Workbook()
    conn = get_conn()
    for i in ['北京', '上海', '廣州', '深圳', '杭州']:
        page = 1
        ws1 = wb.active
        ws1.title = lang_name
        url = 'https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false'.format(i)
        while page < 31:
            info = get_json(url, page, lang_name)
            page += 1
            # Wait 10-20 seconds between requests
            time.sleep(random.randint(10, 20))
            for row in info:
                insert(conn, tuple(row))
                ws1.append(row)
    conn.close()
    wb.save('{}職位信息.xlsx'.format(lang_name))


if __name__ == '__main__':
    main()
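
The URL above places the Chinese city name straight into the query string. requests normally percent-encodes such characters itself, but the encoding can also be made explicit. This is a minimal sketch using the standard library; the names BASE_URL, CITIES, and build_url are hypothetical, and the parameter names are copied from the URL in the article.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = 'https://www.lagou.com/jobs/positionAjax.json'
CITIES = ['北京', '上海', '廣州', '深圳', '杭州']


def build_url(city):
    """Return the Ajax URL for one city with the query string percent-encoded."""
    query = urlencode({'city': city, 'needAddtionalResult': 'false'})
    return '{}?{}'.format(BASE_URL, query)


urls = [build_url(city) for city in CITIES]
# Decoding the query again recovers the original city name.
first_city = parse_qs(urlsplit(urls[0]).query)['city'][0]
```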

Complete code

import random
import time

import requests
from openpyxl import Workbook
import pymysql.cursors


def get_conn():
    '''Create the database connection'''
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='root',
                           db='python',
                           charset='utf8mb4',
                           cursorclass=pymysql.cursors.DictCursor)
    return conn


def insert(conn, info):
    '''Write one row to the database'''
    with conn.cursor() as cursor:
        sql = "INSERT INTO `python` (`shortname`, `fullname`, `industryfield`, `companySize`, `salary`, `city`, `education`) VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cursor.execute(sql, info)
    conn.commit()


def get_json(url, page, lang_name):
    '''Return the list of job info for the current page'''
    headers = {
        'Host': 'www.lagou.com',
        'Connection': 'keep-alive',
        'Content-Length': '23',
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': 'None',
        'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    data = {'first': 'false', 'pn': page, 'kd': lang_name}
    json = requests.post(url, data, headers=headers).json()
    list_con = json['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        info = []
        info.append(i.get('companyShortName', '無'))  # company name
        info.append(i.get('companyFullName', '無'))
        info.append(i.get('industryField', '無'))  # industry
        info.append(i.get('companySize', '無'))  # company size
        info.append(i.get('salary', '無'))  # salary
        info.append(i.get('city', '無'))
        info.append(i.get('education', '無'))  # education
        info_list.append(info)
    return info_list  # return the list


def main():
    lang_name = 'python'
    wb = Workbook()  # open an Excel workbook
    conn = get_conn()  # create the database connection; comment out if not storing to the database
    for i in ['北京', '上海', '廣州', '深圳', '杭州']:  # five cities
        page = 1
        ws1 = wb.active
        ws1.title = lang_name
        url = 'https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false'.format(i)
        while page < 31:  # 30 pages per city
            info = get_json(url, page, lang_name)
            page += 1
            time.sleep(random.randint(10, 20))
            for row in info:
                insert(conn, tuple(row))  # insert into the database; comment out if not storing
                ws1.append(row)
    conn.close()  # close the database connection; comment out if not storing to the database
    wb.save('{}職位信息.xlsx'.format(lang_name))


if __name__ == '__main__':
    main()
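
The insert function assumes that a table named python with seven columns already exists, but the article does not show its definition. The following is a minimal sketch of a matching schema. It uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server, which means it is not the setup the article uses: the TEXT column types are assumptions, sqlite3 takes ? placeholders where pymysql takes %s, and the helper names create_table and insert_rows are hypothetical.

```python
import sqlite3

# Column names copied from the INSERT statement in the complete code.
COLUMNS = ('shortname', 'fullname', 'industryfield',
           'companySize', 'salary', 'city', 'education')


def create_table(conn):
    """Create the table if it does not exist (all columns TEXT, an assumption)."""
    cols = ', '.join('"{}" TEXT'.format(name) for name in COLUMNS)
    conn.execute('CREATE TABLE IF NOT EXISTS "python" ({})'.format(cols))


def insert_rows(conn, rows):
    """Insert several rows with one parameterized statement."""
    names = ', '.join('"{}"'.format(name) for name in COLUMNS)
    placeholders = ', '.join('?' for _ in COLUMNS)
    sql = 'INSERT INTO "python" ({}) VALUES ({})'.format(names, placeholders)
    conn.executemany(sql, rows)
    conn.commit()


conn = sqlite3.connect(':memory:')
create_table(conn)
# Made-up sample row, for illustration only.
insert_rows(conn, [('A', 'A Ltd', 'Internet', '50-150', '10k-20k', '北京', '本科')])
stored = conn.execute('SELECT city, salary FROM "python"').fetchall()
conn.close()
```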
