python - Static Web Page Scraping
Static Web Page Scraping
Contents
- Static Web Page Scraping
- Preface
- 1. Analyzing the Static Website
- 2. Required Dependencies and Plugins
- 3. Code Walkthrough
- 4. Thanks; Discussion and Exchange Are Always Welcome
Preface
A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator, and worm.
1. Analyzing the Static Website
Example site: http://fitness.39.net/food/
Task requirements:
Step 1: get the major food categories.
Step 2: get every ingredient across all pages within each major category.
Step 3: get the nutritional composition and nutrient percentages for every ingredient on every page of every category.
2. Required Dependencies and Plugins
1. Importing libraries
The code is as follows (all of these are installable with pip: requests, mysql-connector-python, lxml, beautifulsoup4, selenium):
import datetime

import requests
import mysql.connector
from lxml import etree
from bs4 import BeautifulSoup
from selenium import webdriver
Plugin: phantomjs.exe
Plugin: XPath
About the plugins:
PhantomJS is a scriptable headless browser.
A headless browser is a complete browser core, including a JavaScript engine, a rendering engine, request handling, and so on, but without a display or any user-facing interactive pages.
2. When to use PhantomJS
PhantomJS is applicable wherever a headless browser is. Headless browsers are typically used for page automation, web monitoring, web crawling, and the like (a minimal usage sketch follows this list):
Automated page testing: automatically log in to a site, perform some actions, and check whether the results are as expected.
Web monitoring: periodically open a page to check whether the site loads correctly, whether the loaded result matches expectations, how fast it loads, and so on.
Web crawling: extract information that the page downloads and renders with JavaScript, or resolve the real address behind a link that redirects via JavaScript.
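As a minimal sketch of driving PhantomJS from Python through Selenium (this assumes an older Selenium release that still ships webdriver.PhantomJS; the executable path below is an assumption you would adjust for your machine):

from selenium import webdriver

# Assumed location of phantomjs.exe; change this to your install path
browser = webdriver.PhantomJS(executable_path=r'D:\python\phantomjs.exe')
browser.get("http://fitness.39.net/food/")
print(browser.title)             # the page title, after JavaScript has run
print(len(browser.page_source))  # size of the fully rendered HTML
browser.quit()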
————————————————
Copyright notice: the passage above is quoted from an original article by the CSDN blogger 「violetgo」, under the CC 4.0 BY-SA license; reproduction must include the original source link and this notice.
Original link: https://blog.csdn.net/violetgo/article/details/48105593
3. Code Walkthrough
Step 1: get the major food categories.
Use the XPath browser plugin to locate the HTML elements to scrape.
response = requests.get(url="http://fitness.39.net/food/")
response.encoding = 'utf-8'
html = etree.HTML(response.text)
print(html.xpath("//dl[@id='food_ele_dl_id']//dd//a"))

# Every <a> under the category list carries the category name and its URL
items = html.xpath("//dl[@id='food_ele_dl_id']//dd//a")
data_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
val = []
val1 = []
for item in items:
    title = "".join(item.xpath("./@title"))
    href = "".join(item.xpath("./@href"))
    val.append((title, href, data_time))
print(val)

# Refresh the p_food_type table: clear it, then bulk-insert the categories
mydb = mysql.connector.connect(
    host="11.11.11.111", user="root", passwd="123456", database="python"
)
mycursor = mydb.cursor()

sql = "DELETE FROM p_food_type"
mycursor.execute(sql)
mydb.commit()
print(mycursor.rowcount, "rows deleted")

sql = "INSERT INTO p_food_type (title, url, create_time) VALUES (%s, %s, %s)"
mycursor.executemany(sql, val)
mydb.commit()
print(mycursor.rowcount, "rows inserted")
Step 2: get all the subcategory items within each major category, including every page.
Let's work out how to handle pagination.
Take the grain (谷物) category as an example.
Comparing its pagination links reveals the URL pattern.
We can therefore loop over each category URL, splice in a page number to build each page's URL, and keep requesting until a spliced URL returns no data; that marks the last page, so we break out of the current category and move on to the next one. (A standalone sketch of the splicing follows; the full code comes after it.)
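As a standalone sketch of that URL splicing (the category URL below is a hypothetical example; the real URLs come from the p_food_type table, with the page number sitting six characters from the end):

# Hypothetical category URL; the page number is the 6th character from the end
url = "http://fitness.39.net/food/guwu/1.html"
for page in range(1, 10):      # this trick only handles single-digit pages
    chars = list(url)
    chars[-6] = str(page)      # overwrite the page-number character
    print(''.join(chars))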
setsql = "SELECT * FROM p_food_type"
mycursor.execute(setsql)
myresult = mycursor.fetchall()

n = 100  # safety cap on the number of pages per category
for x in myresult:
    counter = 1
    while counter <= n:
        print(x[2])
        # Splice the page number into the category URL: it sits 6 characters
        # from the end (works for single-digit page numbers)
        s1 = list(x[2])
        print(s1[-6])
        s1[-6] = str(counter)
        s = ''.join(s1)
        print("paginated link: " + s)
        response = requests.get(url=s)
        response.encoding = 'utf-8'
        html = etree.HTML(response.text)
        items = html.xpath("//ul[@id='sort_list_show']/li/a")
        print(items)
        if not items:
            break  # an empty page means we are past the last page
        data_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        for item in items:
            val1 = []
            title = "".join(item.xpath("./@title"))
            href = "".join(item.xpath("./@href"))
            print(href)
            print(title)
            if title:
                print("current category: " + x[1])
                # Static parts of the detail page: nutrients, serving info, GI
                response2 = requests.get(url=href)
                response2.encoding = 'utf-8'
                html2 = etree.HTML(response2.text)
                items2 = html2.xpath("//div[@class='ck-show']//ul//li/i")
                nameb = "".join(html2.xpath("//span[@class='title-left']/text()"))
                jfzs = "".join(html2.xpath("//span[@class='ju-box']/cite/text()"))
                print(jfzs)
                yl = "".join(html2.xpath("//span[@class='ju']/p[1]/text()"))
                print(yl)
                wl = "".join(html2.xpath("//span[@class='ju']/p[2]/text()"))
                print(wl)
                gi = "".join(html2.xpath("//span[@class='gi']/b/text()"))
                print(gi)
                # The percentage table is rendered by JavaScript, so load the
                # page in PhantomJS and parse the rendered source with BeautifulSoup
                browser = webdriver.PhantomJS(executable_path=r'D:\python\phantomjs.exe')
                browser.get(href)
                page_html = browser.page_source
                browser.quit()  # release the PhantomJS process
                soup = BeautifulSoup(page_html, 'lxml')
                kll = soup.find('div', {'class': "bing"}).findAll('td')
                p1, p2, p3 = kll[1].get_text().partition('%')
                zfzb = p1 + p2   # fat percentage
                print(zfzb)
                p1, p2, p3 = kll[4].get_text().partition('%')
                tshhw = p1 + p2  # carbohydrate percentage
                print(tshhw)
                p1, p2, p3 = kll[7].get_text().partition('%')
                dbz = p1 + p2    # protein percentage
                print(dbz)
                print(nameb)
                # Collect the 44 nutrient fields; fields 30 and 31 can be
                # empty on some pages, so fall back to "-" as a placeholder
                texts = ["".join(i.xpath("./text()")) for i in items2[:44]]
                texts[30] = texts[30] or "-"
                texts[31] = texts[31] or "-"
                val1.append((x[1], title, href, data_time, *texts,
                             nameb, jfzs, yl, wl, gi, zfzb, tshhw, dbz))
                print(val1)
            else:
                break
        counter += 1
4. Thanks; Discussion and Exchange Are Always Welcome