
Fetching a site's home page (HTML source) with a Python crawler

Published: 2023/12/14


1. Fetching the site's source code

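Overview of urllib: urllib is a package that collects several modules for working with URLs.

urllib.request opens and reads the content behind a URL
urllib.error contains the exceptions raised by urllib.request
urllib.parse parses URLs
urllib.robotparser parses robots.txt files

    import urllib.request

    class GetHtml(object):  # define a class; object is the base class (optional in Python 3)
        def __init__(self, URL):  # __init__ initializes a new instance; self is that instance
            self.url = URL  # store the URL on the instance

        def get_index(self):  # method that fetches the page
            self.response = urllib.request.urlopen(self.url)  # send the request
            return self.response.read()  # response body as bytes

    html = GetHtml("http://product.yesky.com/keyboard/")
    print(html.get_index())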

Output: the site's source code is retrieved.
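The overview also lists urllib.parse and urllib.robotparser, which the example above does not use. The following is a minimal sketch of both that runs without any network access. The robots.txt rules in SAMPLE_RULES are hypothetical and are not taken from the real site; a real crawler would load the file from the address built in robots_url.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

url = "http://product.yesky.com/keyboard/"

# urllib.parse: split the URL into its components.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Build the robots.txt address for the same host.
robots_url = urljoin(url, "/robots.txt")
print(robots_url)

# urllib.robotparser: parse rules supplied as text.
# These rules are made up for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(SAMPLE_RULES.splitlines())

# Ask whether a crawler with any user-agent may fetch each address.
allowed_keyboard = rp.can_fetch("*", url)
allowed_private = rp.can_fetch("*", "http://product.yesky.com/private/page.html")
print(allowed_keyboard, allowed_private)
```

Checking can_fetch before requesting a page is one way to stay within what a site permits crawlers to do.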

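By default urllib identifies itself with a user-agent of the form Python-urllib/3.x, so the server's access log shows that the request came from a Python script. A site administrator can easily spot this and ban the IP, so the user-agent header needs to be changed: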

    import urllib.request

    class GetHtml(object):
        def __init__(self, URL, HEAD):
            self.url = URL    # page to fetch
            self.head = HEAD  # user-agent string to send

        def get_index(self):
            self.request = urllib.request.Request(self.url)  # build the request (nothing is sent yet)
            self.request.add_header("user-agent", self.head)  # add the header
            self.response = urllib.request.urlopen(self.request)  # send the request
            return self.response.read()

    # Replace the user-agent that the server sees
    html = GetHtml("http://product.yesky.com/keyboard/", "Mozilla/5.0 (Windows NT 8.1; Win32; x32; rv:65.0) Gecko/20100101 Firefox/65.0")
    print(html.get_index())
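The header can also be checked locally before anything is sent. The sketch below prints urllib's default identification and then the value stored on a Request object; it passes the header through the headers argument of Request, which is an alternative to add_header and not the form used above. Note that Request stores header names capitalized, so the lookup key is "User-agent".

```python
import urllib.request

# Headers a default opener adds when the request sets none of its own.
default_headers = urllib.request.build_opener().addheaders
print(default_headers)  # a list holding one ('User-agent', 'Python-urllib/<version>') pair

UA = "Mozilla/5.0 (Windows NT 8.1; Win32; x32; rv:65.0) Gecko/20100101 Firefox/65.0"

# Build the request with the header attached; nothing is sent here.
request = urllib.request.Request(
    "http://product.yesky.com/keyboard/",
    headers={"User-Agent": UA},
)

# Header names are stored in capitalized form ("User-agent").
stored_ua = request.get_header("User-agent")
print(stored_ua)
```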

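Result: with the user-agent header set, the server-side log shows the modified user-agent string, and the script still retrieves the site's source code.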
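The examples in this article let any network failure propagate as an exception. As a hedged sketch of how urllib.error from the overview might be used, the hypothetical helper fetch below (its name and its timeout parameter are assumptions, not part of the original code) returns None when the request fails.

```python
import urllib.error
import urllib.request


def fetch(url, user_agent, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", err.code)
    except urllib.error.URLError as err:
        # No usable response, for example a DNS failure or a refused connection.
        print("Connection problem:", err.reason)
    return None
```

Calling fetch("http://product.yesky.com/keyboard/", "Mozilla/5.0 (Windows NT 8.1; Win32; x32; rv:65.0) Gecko/20100101 Firefox/65.0") would then give either the page bytes or None, so the caller can decide whether to retry or skip the page.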
