python - Static Web Page Scraping
Static Web Page Scraping
Contents
- Static Web Page Scraping
- Preface
- 1. Analyzing the Static Website
- 2. Required Dependencies and Plugins
- 3. Code Walkthrough
- 4. Thanks; Discussion and Exchange Are Always Welcome
Preface
A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator, and worm.
1. Analyzing the Static Website
Example site: http://fitness.39.net/food/
Task requirements:
Step 1: get the major food categories.
Step 2: get every ingredient across all pages within each major category.
Step 3: get the nutritional composition and nutrient percentages for every ingredient on every page of every category.
2. Required Dependencies and Plugins
1. Importing libraries
The code is as follows (all of these are installable with pip: requests, mysql-connector-python, lxml, beautifulsoup4, selenium):
import datetime

import requests
import mysql.connector
from lxml import etree
from bs4 import BeautifulSoup
from selenium import webdriver
Plugin: phantomjs.exe
Plugin: XPath
About the plugins:
PhantomJS is a scriptable headless browser.
A headless browser is a complete browser core, including a JavaScript engine, a rendering engine, request handling, and so on, but without a display or any user-facing interactive pages.
2. When to use PhantomJS
PhantomJS is applicable wherever a headless browser is. Headless browsers are typically used for page automation, web monitoring, web crawling, and the like (a minimal usage sketch follows this list):
Automated page testing: automatically log in to a site, perform some actions, and check whether the results are as expected.
Web monitoring: periodically open a page to check whether the site loads correctly, whether the loaded result matches expectations, how fast it loads, and so on.
Web crawling: extract information that the page downloads and renders with JavaScript, or resolve the real address behind a link that redirects via JavaScript.
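As a minimal sketch of driving PhantomJS from Python through Selenium (this assumes an older Selenium release that still ships webdriver.PhantomJS; the executable path below is an assumption you would adjust for your machine):

from selenium import webdriver

# Assumed location of phantomjs.exe; change this to your install path
browser = webdriver.PhantomJS(executable_path=r'D:\python\phantomjs.exe')
browser.get("http://fitness.39.net/food/")
print(browser.title)             # the page title, after JavaScript has run
print(len(browser.page_source))  # size of the fully rendered HTML
browser.quit()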
————————————————
Copyright notice: the passage above is quoted from an original article by the CSDN blogger 「violetgo」, under the CC 4.0 BY-SA license; reproduction must include the original source link and this notice.
Original link: https://blog.csdn.net/violetgo/article/details/48105593
3. Code Walkthrough
Step 1: get the major food categories.
Use the XPath browser plugin to locate the HTML elements to scrape.
response = requests.get(url="http://fitness.39.net/food/")
response.encoding = 'utf-8'
html = etree.HTML(response.text)
print(html.xpath("//dl[@id='food_ele_dl_id']//dd//a"))

# Every <a> under the category list carries the category name and its URL
items = html.xpath("//dl[@id='food_ele_dl_id']//dd//a")
data_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
val = []
val1 = []
for item in items:
    title = "".join(item.xpath("./@title"))
    href = "".join(item.xpath("./@href"))
    val.append((title, href, data_time))
print(val)

# Refresh the p_food_type table: clear it, then bulk-insert the categories
mydb = mysql.connector.connect(
    host="11.11.11.111", user="root", passwd="123456", database="python"
)
mycursor = mydb.cursor()

sql = "DELETE FROM p_food_type"
mycursor.execute(sql)
mydb.commit()
print(mycursor.rowcount, "rows deleted")

sql = "INSERT INTO p_food_type (title, url, create_time) VALUES (%s, %s, %s)"
mycursor.executemany(sql, val)
mydb.commit()
print(mycursor.rowcount, "rows inserted")
Step 2: get all the subcategory items within each major category, including every page.
Let's work out how to handle pagination.
Take the grain (谷物) category as an example.
Comparing its pagination links reveals the URL pattern.
We can therefore loop over each category URL, splice in a page number to build each page's URL, and keep requesting until a spliced URL returns no data; that marks the last page, so we break out of the current category and move on to the next one. (A standalone sketch of the splicing follows; the full code comes after it.)
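As a standalone sketch of that URL splicing (the category URL below is a hypothetical example; the real URLs come from the p_food_type table, with the page number sitting six characters from the end):

# Hypothetical category URL; the page number is the 6th character from the end
url = "http://fitness.39.net/food/guwu/1.html"
for page in range(1, 10):      # this trick only handles single-digit pages
    chars = list(url)
    chars[-6] = str(page)      # overwrite the page-number character
    print(''.join(chars))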
setsql = "SELECT * FROM p_food_type"
mycursor.execute(setsql)
myresult = mycursor.fetchall()

n = 100  # safety cap on the number of pages per category
for x in myresult:
    counter = 1
    while counter <= n:
        print(x[2])
        # Splice the page number into the category URL: it sits 6 characters
        # from the end (works for single-digit page numbers)
        s1 = list(x[2])
        print(s1[-6])
        s1[-6] = str(counter)
        s = ''.join(s1)
        print("paginated link: " + s)
        response = requests.get(url=s)
        response.encoding = 'utf-8'
        html = etree.HTML(response.text)
        items = html.xpath("//ul[@id='sort_list_show']/li/a")
        print(items)
        if not items:
            break  # an empty page means we are past the last page
        data_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        for item in items:
            val1 = []
            title = "".join(item.xpath("./@title"))
            href = "".join(item.xpath("./@href"))
            print(href)
            print(title)
            if title:
                print("current category: " + x[1])
                # Static parts of the detail page: nutrients, serving info, GI
                response2 = requests.get(url=href)
                response2.encoding = 'utf-8'
                html2 = etree.HTML(response2.text)
                items2 = html2.xpath("//div[@class='ck-show']//ul//li/i")
                nameb = "".join(html2.xpath("//span[@class='title-left']/text()"))
                jfzs = "".join(html2.xpath("//span[@class='ju-box']/cite/text()"))
                print(jfzs)
                yl = "".join(html2.xpath("//span[@class='ju']/p[1]/text()"))
                print(yl)
                wl = "".join(html2.xpath("//span[@class='ju']/p[2]/text()"))
                print(wl)
                gi = "".join(html2.xpath("//span[@class='gi']/b/text()"))
                print(gi)
                # The percentage table is rendered by JavaScript, so load the
                # page in PhantomJS and parse the rendered source with BeautifulSoup
                browser = webdriver.PhantomJS(executable_path=r'D:\python\phantomjs.exe')
                browser.get(href)
                page_html = browser.page_source
                browser.quit()  # release the PhantomJS process
                soup = BeautifulSoup(page_html, 'lxml')
                kll = soup.find('div', {'class': "bing"}).findAll('td')
                p1, p2, p3 = kll[1].get_text().partition('%')
                zfzb = p1 + p2   # fat percentage
                print(zfzb)
                p1, p2, p3 = kll[4].get_text().partition('%')
                tshhw = p1 + p2  # carbohydrate percentage
                print(tshhw)
                p1, p2, p3 = kll[7].get_text().partition('%')
                dbz = p1 + p2    # protein percentage
                print(dbz)
                print(nameb)
                # Collect the 44 nutrient fields; fields 30 and 31 can be
                # empty on some pages, so fall back to "-" as a placeholder
                texts = ["".join(i.xpath("./text()")) for i in items2[:44]]
                texts[30] = texts[30] or "-"
                texts[31] = texts[31] or "-"
                val1.append((x[1], title, href, data_time, *texts,
                             nameb, jfzs, yl, wl, gi, zfzb, tshhw, dbz))
                print(val1)
            else:
                break
        counter += 1
4. Thanks; Discussion and Exchange Are Always Welcome