Python crawler: Zhaopin (智联招聘) job listings, part 1
Development environment
Windows 7 or later, Python 3.4+
pymysql library, installed with: pip3 install pymysql
selenium library, Firefox 56.0, geckodriver.exe, and a basic knowledge of selenium
MySQL 5.5 database, with Navicat as the graphical client
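Crawling steps
1. Analyze the Zhaopin site and get the page information
Open "https://www.zhaopin.com/", choose the city "北京" (Beijing), type "GIS", and click "搜工作" (search jobs). The page then lists the GIS-related job postings in the Beijing area.
Press F12 to open the browser's developer tools. The HTML elements for the city field, the job keyword field, and the search button are "id=JobLocation", "id=KeyWord_kw2", and "class=doSearch" respectively (this is where the selenium knowledge comes in). With these, the script can move on to the results page automatically:
Code 1: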
def get_main_page(keyword, city):
    fox = webdriver.Firefox()
    url = 'https://www.zhaopin.com/'
    fox.get(url)
    time.sleep(1)
    # City input box
    jl = fox.find_element_by_id('JobLocation')
    jl.clear()
    jl.send_keys(city)
    # Job keyword input box
    zl = fox.find_element_by_id('KeyWord_kw2')
    zl.clear()
    zl.send_keys(keyword)
    # Click the search button and give the results page time to load
    fox.find_element_by_class_name('doSearch').click()
    time.sleep(3)
2. Analyze the job postings and extract the information
View the page source to locate each piece of information, as follows:
def get_everypage_info(fox, keyword, city):
    # The search results open in a new window; switch to the newest one
    fox.switch_to_window(fox.window_handles[-1])
    tables = fox.find_elements_by_tag_name('table')
    for i in range(0, len(tables)):
        if i == 0:
            # The first table is the header row:
            # 职位名称, 公司名称, 工作地点, 公司规模, 工作经验, 平均月薪, 学历要求, 职位描述
            continue
        address, develop, jingyan, graduate, require = " ", " ", " ", " ", " "
        job = tables[i].find_element_by_tag_name('a').text
        company = tables[i].find_element_by_css_selector('.gsmc a').text
        salary = tables[i].find_element_by_css_selector('.zwyx').text
        spans = tables[i].find_elements_by_css_selector('.newlist_deatil_two span')
        for j in range(0, len(spans)):
            # Each span starts with a label; the slice drops the label and its colon
            if "地点" in spans[j].get_attribute('textContent'):
                address = (spans[j].get_attribute('textContent'))[3:]
            elif "公司规模" in spans[j].get_attribute('textContent'):
                develop = (spans[j].get_attribute('textContent'))[5:]
            elif "经验" in spans[j].get_attribute('textContent'):
                jingyan = (spans[j].get_attribute('textContent'))[3:]
            elif "学历" in spans[j].get_attribute('textContent'):
                graduate = (spans[j].get_attribute('textContent'))[3:]
        require = (tables[i].find_element_by_css_selector('.newlist_deatil_last').get_attribute('textContent'))[8:]

The code above collects the following for every posting on every page: job title (职位名称), company name (公司名称), work location (工作地点), company size (公司规模), work experience (工作经验), average monthly salary (平均月薪), education requirement (学历要求), and job description (职位描述).
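The fixed slice offsets used above ([3:], [5:]) depend on the exact length of each label. As a minimal sketch, assuming each span's text has the form "label:value" (inferred from those offsets, not confirmed against the live page), the same parsing can be done by splitting on the colon instead. The helper name parse_labeled_fields is hypothetical and not part of the original script.

```python
def parse_labeled_fields(texts):
    """Map label-prefixed strings such as '地点:北京' to named fields.

    Assumption: each text looks like 'label:value', with either an ASCII
    or a full-width colon. Field names mirror the variables in the scraper.
    """
    # Checked in the same order as the if/elif chain in the scraper.
    labels = [
        ("地点", "address"),
        ("公司规模", "develop"),
        ("经验", "jingyan"),
        ("学历", "graduate"),
    ]
    # Same default as the scraper: a single space for missing fields.
    fields = {name: " " for _, name in labels}
    for text in texts:
        # Normalise the full-width colon, then split once at the first colon.
        label, sep, value = text.replace("\uff1a", ":").partition(":")
        if not sep:
            continue  # no colon, so there is no label to match
        for key, name in labels:
            if key in label:
                fields[name] = value.strip()
                break
    return fields
```

In the scraper this would be called with the list of textContent strings read from the spans, which avoids having to adjust a slice offset whenever a label changes length.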
3. Store the information in MySQL
Connect to MySQL and create a new table, write the data into it row by row, and at the same time append the job description (职位描述) to a txt file.
Connect to MySQL:
table_name = city + '_' + keyword
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', passwd='', db='python', charset='utf8')
cursor = conn.cursor()
Create the new table:
sql = """CREATE TABLE IF NOT EXISTS %s(
    职位名称 CHAR(100),
    公司名称 CHAR(100),
    工作地点 CHAR(100),
    公司规模 CHAR(100),
    工作经验 CHAR(100),
    平均月薪 CHAR(100),
    学历要求 CHAR(100)
    )default charset=UTF8""" % (table_name)
cursor.execute(sql)
Write the information to MySQL and to the txt file:
insert_row = ('insert into {0}(职位名称,公司名称,工作地点,公司规模,工作经验,平均月薪,学历要求) VALUES(%s,%s,%s,%s,%s,%s,%s)'.format(table_name))
insert_data = (job, company, address, develop, jingyan, salary, graduate)
cursor.execute(insert_row, insert_data)
conn.commit()
# Append the job description to a per-table text file
with open('%s职位描述.txt' % (table_name), 'a', encoding='utf-8') as f:
    f.write(require)
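Because the table name is built from the city and keyword and formatted straight into the SQL text, it is worth checking it before use. The following is a rough sketch, not the author's code: the helper names quote_identifier and build_insert_sql are hypothetical, and the check simply assumes that names made only of letters, digits, Chinese characters, and underscores are acceptable. Values still go through the %s placeholders that cursor.execute fills in.

```python
import re

# Column names taken from the CREATE TABLE statement above.
COLUMNS = ("职位名称", "公司名称", "工作地点", "公司规模",
           "工作经验", "平均月薪", "学历要求")


def quote_identifier(name):
    """Wrap a table or column name in backticks after a conservative check."""
    # In Python 3, \w matches Unicode letters (including Chinese), digits, and "_".
    if not re.fullmatch(r"\w+", name):
        raise ValueError("unsafe identifier: %r" % name)
    return "`%s`" % name


def build_insert_sql(table_name, columns=COLUMNS):
    """Return an INSERT statement with one %s placeholder per column."""
    cols = ", ".join(quote_identifier(c) for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO %s (%s) VALUES (%s)" % (
        quote_identifier(table_name), cols, placeholders)
```

The returned string can be passed to cursor.execute together with the tuple of values, exactly as insert_row is above. Backticks are MySQL's identifier quoting, so names are never wrapped in the single or double quotes that values receive.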
4. Paging through the job listings
The HTML element of the "next page" button is located and clicked with the following code:
count = 0
while count <= 10:
    try:
        # Click the "next page" button; this raises if the button is not found
        fox.find_element_by_class_name('pagesDown-pos').click()
        break
    except:
        time.sleep(8)
        count += 1
        continue
if count > 10:
    # Every attempt failed: treat this as the last page and close the browser
    fox.close()
else:
    time.sleep(1)
    get_everypage_info(fox, keyword, city)

Note: this part is essential. The while loop is how the script decides whether it has reached the last page. If fox.find_element_by_class_name('pagesDown-pos').click() keeps failing until count exceeds 10 (11 attempts, 8 seconds apart), the loop ends, the "if" branch runs, and the browser is closed. If the click succeeds, "break" leaves the loop and the "else" branch scrapes the next page.
5. Loop over cities in "main"
if __name__ == "__main__":
    citys = ['上海', '深圳', '广州', '武汉', '杭州', '南京', '成都', '青岛']  # '北京' has already been crawled
    job = '数据挖掘分析'
    for city in citys:
        print(" ")
        get_main_page(job, city)
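Once the job postings for one city have been crawled, the script moves on to the next city in the list automatically.
6. Notes and issues
(1) Creating the MySQL table: declare the table's character set with "default charset=UTF8", otherwise writing the data raises an error.
(2) Writing data to the MySQL table: substitute the table name into the statement first, using .format(table_name). Table and column names in an INSERT statement must not be wrapped in single or double quotes, and formatting them in ahead of time avoids that. Anything passed as a parameter alongside the values is quoted automatically, which is only correct for values: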
insert_row = ('insert into {0}(职位名称,公司名称,工作地点,公司规模,工作经验,平均月薪,学历要求) VALUES(%s,%s,%s,%s,%s,%s,%s)'.format(table_name))
insert_data = (job, company, address, develop, jingyan, salary, graduate)
cursor.execute(insert_row, insert_data)
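(3) The time.sleep() durations depend on network speed and machine performance. On a fast connection and machine they can be short; on a slow one they need to be longer, otherwise the code will fail to find the HTML elements.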
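Rather than tuning fixed sleeps by hand, one option is to poll until the element is available. The sketch below is an illustration only: wait_until is a hypothetical helper, not something the original script contains, and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until(probe, timeout=10.0, interval=0.5):
    """Call probe() repeatedly until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    Exceptions raised by probe() are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = probe()
        except Exception:
            result = None  # e.g. the element has not been rendered yet
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

For example, wait_until(lambda: fox.find_element_by_id('JobLocation')) would return the element as soon as the lookup succeeds, so a fast machine does not wait longer than it has to. Selenium also ships its own explicit-wait class, WebDriverWait, which serves the same purpose.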
Full code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pymysql


def get_main_page(keyword, city):
    fox = webdriver.Firefox()
    url = 'https://www.zhaopin.com/'
    fox.get(url)
    time.sleep(1)
    # City input box
    jl = fox.find_element_by_id('JobLocation')
    jl.clear()
    jl.send_keys(city)
    # Job keyword input box
    zl = fox.find_element_by_id('KeyWord_kw2')
    zl.clear()
    zl.send_keys(keyword)
    # Click the search button and give the results page time to load
    fox.find_element_by_class_name('doSearch').click()
    time.sleep(3)
    get_everypage_info(fox, keyword, city)


def get_everypage_info(fox, keyword, city):
    # The search results open in a new window; switch to the newest one
    fox.switch_to_window(fox.window_handles[-1])
    tables = fox.find_elements_by_tag_name('table')
    table_name = city + '_' + keyword
    conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', passwd='', db='python', charset='utf8')
    cursor = conn.cursor()
    sql = """CREATE TABLE IF NOT EXISTS %s(
        职位名称 CHAR(100),
        公司名称 CHAR(100),
        工作地点 CHAR(100),
        公司规模 CHAR(100),
        工作经验 CHAR(100),
        平均月薪 CHAR(100),
        学历要求 CHAR(100)
        )default charset=UTF8""" % (table_name)
    cursor.execute(sql)
    for i in range(0, len(tables)):
        if i == 0:
            # The first table is the header row:
            # 职位名称, 公司名称, 工作地点, 公司规模, 工作经验, 平均月薪, 学历要求, 职位描述
            continue
        address, develop, jingyan, graduate, require = " ", " ", " ", " ", " "
        job = tables[i].find_element_by_tag_name('a').text
        company = tables[i].find_element_by_css_selector('.gsmc a').text
        salary = tables[i].find_element_by_css_selector('.zwyx').text
        spans = tables[i].find_elements_by_css_selector('.newlist_deatil_two span')
        for j in range(0, len(spans)):
            # Each span starts with a label; the slice drops the label and its colon
            if "地点" in spans[j].get_attribute('textContent'):
                address = (spans[j].get_attribute('textContent'))[3:]
            elif "公司规模" in spans[j].get_attribute('textContent'):
                develop = (spans[j].get_attribute('textContent'))[5:]
            elif "经验" in spans[j].get_attribute('textContent'):
                jingyan = (spans[j].get_attribute('textContent'))[3:]
            elif "学历" in spans[j].get_attribute('textContent'):
                graduate = (spans[j].get_attribute('textContent'))[3:]
        require = (tables[i].find_element_by_css_selector('.newlist_deatil_last').get_attribute('textContent'))[8:]
        row = [job, company, address, develop, jingyan, salary, graduate, require]
        # The table name is formatted in first; the values are passed as parameters
        insert_row = ('insert into {0}(职位名称,公司名称,工作地点,公司规模,工作经验,平均月薪,学历要求) VALUES(%s,%s,%s,%s,%s,%s,%s)'.format(table_name))
        insert_data = (job, company, address, develop, jingyan, salary, graduate)
        cursor.execute(insert_row, insert_data)
        conn.commit()
        # Append the job description to a per-table text file
        with open('%s职位描述.txt' % (table_name), 'a', encoding='utf-8') as f:
            f.write(require)
    print('Page scraped...')
    conn.close()
    # Try to click "next page"; repeated failure means this was the last page
    count = 0
    while count <= 10:
        try:
            fox.find_element_by_class_name('pagesDown-pos').click()
            break
        except:
            time.sleep(8)
            count += 1
            continue
    if count > 10:
        fox.close()
    else:
        time.sleep(1)
        get_everypage_info(fox, keyword, city)


if __name__ == "__main__":
    citys = ['上海', '深圳', '广州', '武汉', '杭州', '南京', '成都', '青岛']  # '北京' has already been crawled
    job = '数据挖掘分析'
    for city in citys:
        print(" ")
        get_main_page(job, city)
The data finally obtained is shown in the figure.