
Scraping Zhilian Zhaopin job listings with Python and saving them to Excel

Tags: python, Zhilian Zhaopin, export to Excel


Introduction: A while back we were in the so-called "Golden March, Silver April" hiring season, and lots of people were switching jobs. I think everyone should make a plan for themselves and adjust it based on their own progress, rather than getting blindly tempted just because someone nearby got a raise. Generally speaking, settling into a new environment after a job change eats up quite a bit of time; if your current job is acceptable in terms of atmosphere and personal growth, other issues such as pay can often be negotiated with your company.

This article draws on yaoyefengchen's blog (article link), improves the district-based search, and changes the storage format from CSV to the more widely used Excel. Now, on to the main content.

First, the rough flow: after choosing your search criteria on Zhilian's job search page, you'll notice the URL looks like this:

http://sou.zhaopin.com/jobs/searchresult.ashx?jl=北京&kw=php高级工程师&sm=0&re=2006&isfilter=1&p=1&sf=10001&st=15000

The parameters in the URL break down as follows (the filter parameters can be omitted), and from them we construct the request data. The headers only need to be complete enough to access the page.

paras = {
    'jl': city,        # city to search in
    'kw': keyword,     # search keyword
    'isadv': 0,        # whether to enable advanced search options
    'isfilter': 1,     # whether to filter the results
    'p': page,         # page number
    're': region       # short for "region": the district code, e.g. 2005 is Haidian
}
# sf=10001&st=15000 is the salary range I filtered on; add these two
# parameters yourself if you need that.
url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?' + urlencode(paras)
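
As a quick sanity check (placeholder values, not from the original post), urlencode percent-encodes the Chinese parameters as UTF-8, which is exactly the shape seen in the search URL above:

from urllib.parse import urlencode

# Placeholder values only; note the Chinese text is percent-encoded as UTF-8.
print(urlencode({'jl': '北京', 'kw': 'php高级工程师', 'p': 1, 're': 2005}))
# jl=%E5%8C%97%E4%BA%AC&kw=php%E9%AB%98%E7%BA%A7%E5%B7%A5%E7%A8%8B%E5%B8%88&p=1&re=2005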

yaoyefengchen used regular expressions to extract the job title, salary, company, and so on, but did not provide the region codes for specific districts (Haidian versus Chaoyang, for example). I later used XPath to extract Beijing's districts into a dictionary, so you can simply pass in the district name in Chinese. As follows:

# Parse the search page to map a district name to its code, e.g. Haidian -> 2005
def parseHtmlToGetRegion(regionAddress):
    url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=北京&sm=0&isfilter=1&p=1&re=2006'
    html = getHtml(url)
    regionId = html.xpath('/html/body/div[3]/div[3]/div[1]/div[4]/div[1]/div[2]/a/@href')
    region = html.xpath('/html/body/div[3]/div[3]/div[1]/div[4]/div[1]/div[2]/a/text()')
    # Pull the code out of each href, skipping the first (invalid) entry
    regionList = {}
    for i, regionHref in enumerate(regionId):
        if i == 0:
            continue
        regionList[region[i]] = regionId[i][-4:]
    return regionList.get(regionAddress)
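
Hypothetical usage, assuming getHtml (defined in the full script below) is available and the site still matches those XPath expressions:

print(parseHtmlToGetRegion('海淀'))  # expected: '2005'
print(parseHtmlToGetRegion('火星'))  # unknown district names return None via dict.get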

Also, CSV files often show up as mojibake when opened in tools like Excel, and you have to convert them or download special software. I found that inconvenient, so I saved directly to Excel format. I have to say, when it comes to writing data to an Excel file, Python is vastly easier than PHP.
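
Incidentally, if you do want to stick with CSV: the mojibake usually comes from Excel not detecting UTF-8, and writing the file with the 'utf-8-sig' encoding prepends a BOM so Excel picks it up. A minimal sketch of that alternative (this helper is my own illustration, not part of the original script):

import csv

# Hypothetical alternative to the Excel writer below: 'utf-8-sig' adds a BOM,
# which lets Excel open the CSV with the correct encoding.
def write_csv_file(filename, headers, jobs):
    with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        for items in jobs:
            writer.writerow([items[q] for q in sorted(items)])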

# Write to Excel
def write_xls_file(filename, headers, jobs):
    table = xlwt.Workbook(encoding='utf8')
    table_page = table.add_sheet('jobs')
    for i, header in enumerate(headers):
        table_page.write(0, i, header)
    for j, items in enumerate(jobs, start=1):
        for q, item in items.items():
            table_page.write(j, q, item)
    table.save(filename)
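
A quick hypothetical call, with a made-up row in the {column_index: value} shape that parse_one_page yields:

headers = ['job', 'website', 'company', 'salary']
jobs = [{0: 'php高级工程师', 1: 'http://example.com', 2: '某公司', 3: '10001-15000'}]
write_xls_file('test.xls', headers, jobs)  # produces test.xls with one header row and one data row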

The complete code is below and can be used as-is. Don't forget to save the user_agents.py file at the very bottom of the article.

#-*- coding: utf-8 -*-
'''
Created on 2018-05-7

@author: Vinter_he
'''
import re
import requests
import xlwt
from tqdm import tqdm
from urllib.parse import urlencode
from requests.exceptions import RequestException
from lxml import etree
import user_agents
import random
import datetime


def get_one_page(city, keyword, region, page):
    '''Fetch the page's HTML and return it'''
    paras = {
        'jl': city,        # city to search in
        'kw': keyword,     # search keyword
        'isadv': 0,        # whether to enable advanced search options
        'isfilter': 1,     # whether to filter the results
        'p': page,         # page number
        're': region       # short for "region": the district code, e.g. 2005 is Haidian
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'sou.zhaopin.com',
        'Referer': 'https://www.zhaopin.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }

    url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?' + urlencode(paras)
    try:
        # Fetch the page content and return the HTML
        response = requests.get(url, headers=headers)
        # Use the status code to decide whether the request succeeded
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    '''Parse the HTML, extract the useful fields, and yield them'''
    # Parse with a regular expression
    pattern = re.compile(
        '<a style=.*? target="_blank">(.*?)</a>.*?'                       # job title
        '<td class="gsmc"><a href="(.*?)" target="_blank">(.*?)</a>.*?'   # company URL and company name
        '<td class="zwyx">(.*?)</td>', re.S)                              # monthly salary

    # Find everything that matches
    items = re.findall(pattern, html)
    for item in items:
        job_name = item[0]
        job_name = job_name.replace('<b>', '')
        job_name = job_name.replace('</b>', '')
        yield {
            0: job_name,
            1: item[1],
            2: item[2],
            3: item[3]
        }


# Write to Excel
def write_xls_file(filename, headers, jobs):
    table = xlwt.Workbook(encoding='utf8')
    table_page = table.add_sheet('jobs')
    for i, header in enumerate(headers):
        table_page.write(0, i, header)
    for j, items in enumerate(jobs, start=1):
        for q, item in items.items():
            table_page.write(j, q, item)
    table.save(filename)


def main(city, keyword, region, pages):
    '''Main entry point'''
    filename = '智联_' + datetime.date.today().strftime('%Y-%m-%d') + city + '_' + keyword + '.xls'
    headers = ['job', 'website', 'company', 'salary']
    # Resolve the district name to its numeric code once, before paging
    region = parseHtmlToGetRegion(region)
    jobs = []
    for i in tqdm(range(pages)):
        # Collect every job on this page
        html = get_one_page(city, keyword, region, i)
        items = parse_one_page(html)
        for item in items:
            jobs.append(item)
    # Write all the collected rows to the xls file
    write_xls_file(filename, headers, jobs)


def getHtml(url):
    response = requests.get(url=url,
                            headers={'User-Agent': random.choice(user_agents.user_agents)},
                            timeout=10).text
    html = etree.HTML(response)
    return html


# Parse the search page to map a district name to its code, e.g. Haidian -> 2005
def parseHtmlToGetRegion(regionAddress):
    url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=北京&sm=0&isfilter=1&p=1&re=2006'
    html = getHtml(url)
    regionId = html.xpath('/html/body/div[3]/div[3]/div[1]/div[4]/div[1]/div[2]/a/@href')
    region = html.xpath('/html/body/div[3]/div[3]/div[1]/div[4]/div[1]/div[2]/a/text()')
    # Pull the code out of each href, skipping the first (invalid) entry
    regionList = {}
    for i, regionHref in enumerate(regionId):
        if i == 0:
            continue
        regionList[region[i]] = regionId[i][-4:]
    return regionList.get(regionAddress)


if __name__ == '__main__':
    main('北京', 'php工程师', '朝阳', 10)
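
One caveat worth knowing before scraping at scale: the legacy .xls format that xlwt produces holds at most 65536 rows per sheet. Ten pages of listings are nowhere near that, but as a defensive measure you could split long result lists across sheets. A minimal sketch of that idea (my own illustration, not part of the original script):

import xlwt

# Hypothetical chunked variant of write_xls_file: each .xls sheet holds at most
# 65536 rows (one header row plus 65535 data rows here), so long lists are
# spread across numbered sheets.
def write_xls_chunked(filename, headers, rows, max_rows=65536):
    book = xlwt.Workbook(encoding='utf8')
    chunk = max_rows - 1  # reserve row 0 of each sheet for the header
    for n, start in enumerate(range(0, max(len(rows), 1), chunk)):
        sheet = book.add_sheet('jobs_%d' % n)
        for i, header in enumerate(headers):
            sheet.write(0, i, header)
        for j, row in enumerate(rows[start:start + chunk], start=1):
            for q, value in row.items():
                sheet.write(j, q, value)
    book.save(filename)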

Below is the same user_agents.py file as before. I won't include it again in future posts, so save your own copy for later use.

#!/usr/bin/python
# -*- coding:utf-8 -*-
'''
Created on 2018-04-27

@author: Vinter_he
'''

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1',
    'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5',
    'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24'
]
