A Complete Python Crawler Implementation for Downloading Website Templates

Published: 2024/8/1 | python | 豆豆

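As a hobby, I like tinkering with my personal website. I was not happy with my old blog template, and after seeing how good other people's sites look, I wanted to download one to learn from, for personal use only. Saving the files one by one is tedious, and finding a tool for the job takes effort too, so I decided to do it in Python, which is well suited to crawling.

My previous blog is hosted on Huawei Cloud, at this address: Personal Blog

Below is the complete implementation for grabbing a website template, which I have tested and found to work. (Note: for hobbyist study only; do not use it for anything illegal.)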

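Environment Setup

I use a 64-bit Python 3 environment. Install the third-party libraries listed below.

BeautifulSoup (the bs4 package) is a widely used crawling library that extracts data from HTML or XML files and makes it easy to read DOM tags and attribute values.

lxml is an HTML/XML parser for Python. It is very fast, and its main job is parsing and extracting data from XML and HTML.

urllib is built into Python, so it normally does not need to be installed. It is used to issue network requests. The third-party requests library, which is built on urllib3, is generally recommended instead. Note the differences between Python 2 and Python 3: Python 2 has no urllib.request, so a Python 2 call such as urllib2.urlopen() becomes urllib.request.urlopen() in Python 3.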
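As a minimal sketch of the Python 3 call mentioned above, the helper below fetches a page with urllib.request.urlopen() and decodes it to text. The name fetch_text, the utf-8 fallback, and the timeout value are assumptions made for illustration; they are not part of the original script.

```python
from urllib.request import urlopen


def fetch_text(url, fallback_encoding="utf-8", timeout=10):
    """Fetch url and return the response body as text.

    fetch_text and fallback_encoding are illustrative names, not part of
    the original script.
    """
    with urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset declared in the Content-Type header when present.
        charset = resp.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="ignore")


if __name__ == "__main__":
    # A data: URL keeps the demo offline; an http(s) URL is fetched the same way.
    print(fetch_text("data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E"))
```

Reading the charset from the response header avoids hard-coding one encoding for every site, although some pages declare their encoding only inside the HTML itself.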

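Installing the Libraries

The default package index is often unreachable from mainland China, so a mirror is needed for downloads to succeed. With Python 3, use pip or pip3 to install; when Python 2 and Python 3 are both installed on the same machine, pip2 is reserved for Python 2.

To use a mirror for a single command: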

$ pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package  # Tsinghua mirror
$ pip3 install -i http://pypi.douban.com/simple some-package  # Douban mirror

To upgrade pip through a domestic mirror:

$ pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U

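If setting the mirror on every command is inconvenient, put it in pip's configuration file instead.

On Linux, the file is located at:

$HOME/.config/pip/pip.conf
# or
$HOME/.pip/pip.conf

On Windows, the file is located at:

%APPDATA%\pip\pip.ini
# or
%HOME%\pip\pip.ini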
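As a sketch of what goes into that file, the snippet below renders a pip configuration that makes the Tsinghua mirror the default index. The [global] section and the index-url key are pip's own configuration names; the helper name render_pip_config is hypothetical, and the snippet only prints the text instead of touching your real configuration file.

```python
import configparser
import io


def render_pip_config(index_url):
    """Return the text of a pip config file that sets the default index.

    render_pip_config is an illustrative helper; [global] and index-url
    are the section and key that pip reads.
    """
    config = configparser.ConfigParser()
    config["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Paste the printed text into pip.conf (Linux) or pip.ini (Windows).
    print(render_pip_config("https://pypi.tuna.tsinghua.edu.cn/simple"))
```

Once the file is in place, a plain pip3 install some-package should use the mirror without the -i option.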

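How It Works

Start by analyzing the page. The principle is simple and much like visiting the site in a browser: if you can open the page, you can view its source and find the links, JavaScript files, and CSS files it references. A template is essentially the page itself plus its related CSS and JS files. The crawler therefore fetches the page, finds its script, link, and a href tags, extracts the URLs and saves them to a file, removes duplicates, and then calls urlretrieve() to download each remote resource to local disk. To download a page or file, you only need to call urlretrieve() with the right arguments.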

urlretrieve(url, filename=None, reporthook=None, data=None)

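For example, saving the Baidu home page only takes:

#!/usr/bin/env python
# coding=utf-8
import os
from urllib.request import urlretrieve


def cbk(a, b, c):
    '''Progress callback.
    @a: number of data blocks downloaded so far
    @b: size of each data block
    @c: size of the remote file
    '''
    if c <= 0:  # the server did not report a size
        return
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)


url = 'http://www.baidu.com'
cur_dir = os.path.abspath('.')
work_path = os.path.join(cur_dir, 'baidu.html')
urlretrieve(url, work_path, cbk)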
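The link-collection step described earlier (finding the script, link, and a tags before anything is downloaded) can be sketched with the standard library alone. This is not the author's method: the original script uses BeautifulSoup, and html.parser is swapped in here only so the example runs without third-party packages. LinkCollector, extract_links, and the example.com URLs are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect URLs from <link href>, <script src> and <a href> tags.

    LinkCollector is an illustrative class; the original script uses
    BeautifulSoup for this step.
    """

    # Which attribute carries the URL for each tag of interest.
    WANTED = {"link": "href", "script": "src", "a": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.WANTED.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative paths against the page URL.
            self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the de-duplicated links found in html, in first-seen order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


if __name__ == "__main__":
    page = (
        '<html><head><link href="css/site.css" rel="stylesheet">'
        '<script src="/js/app.js"></script></head>'
        '<body><a href="about/">About</a><a href="about/">Again</a></body></html>'
    )
    for link in extract_links(page, "http://example.com/"):
        print(link)
```

Using dict.fromkeys() removes duplicates while keeping the order in which links were found, which makes the saved link file easier to compare against the page source than the unordered result of set().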

Complete Source Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# by yangyongzhen
# 2016-12-06
# Requires the third-party packages beautifulsoup4 and lxml
import os
import re
import sys
import time
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

rootpath = os.getcwd() + u'/抓取的模板/'  # download root ("captured templates")
pat2 = re.compile(r'http')  # matches links that are already absolute


def makedir(path):
    if not os.path.isdir(path):
        os.makedirs(path)


# Create the root directory for the download
makedir(rootpath)


# Show download progress
def Schedule(a, b, c):
    '''
    a: number of data blocks downloaded so far
    b: size of each data block
    c: size of the remote file
    '''
    if c <= 0:  # size unknown, nothing to report
        return
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)


def collect(content, pat, url, listhref, myfile, same_site_only=True):
    '''Append the links that pat finds in the tags of content to listhref.'''
    for item in content:
        h = pat.search(str(item))
        if not h:  # tag without the attribute
            continue
        href = h.group(1)
        if pat2.search(href):
            ans = href
        else:
            ans = urljoin(url, href)  # avoids the '//' that url + href can produce
        if same_site_only:
            # Do not fetch links that point to other sites
            if url not in ans:
                continue
            # Append the home-page file name
            if ans.endswith('/'):
                ans += 'index.html'
        listhref.append(ans)
        myfile.write(ans + '\n')
        print(ans)


def grabHref(url, listhref, localfile):
    html = urllib.request.urlopen(url).read()
    # Assumes the page is gb2312-encoded; change this for other sites
    html = str(html, 'gb2312', 'ignore')
    soup = BeautifulSoup(html, features="lxml")
    href_pat = re.compile(r'href="([^"]*)"')
    src_pat = re.compile(r'src="([^"]*)"')
    with open(localfile, 'w', encoding='utf-8') as myfile:
        collect(soup.find_all('link'), href_pat, url, listhref, myfile)
        # Scripts are collected even when they are hosted on another site
        collect(soup.find_all('script'), src_pat, url, listhref, myfile,
                same_site_only=False)
        collect(soup.find_all('a'), href_pat, url, listhref, myfile)


def _progress(block_num, block_size, total_size):
    '''Alternative progress callback that updates a single console line.
    @block_num: number of data blocks downloaded so far
    @block_size: size of each data block
    @total_size: size of the remote file
    '''
    if total_size <= 0:
        return
    sys.stdout.write('\r>> Downloading %.1f%%' %
                     (float(block_num * block_size) / float(total_size) * 100.0))
    sys.stdout.flush()


def main():
    url = "http://www.helongx.com/"  # address of the page to crawl
    listhref = []  # collected links
    localfile = 'ahref.txt'  # name of the local file the links are saved to
    grabHref(url, listhref, localfile)
    listhref = list(set(listhref))  # remove duplicate links
    start = time.perf_counter()
    for item in listhref:
        curpath = rootpath
        name = item.split('/')[-1]
        fdir = item.split('/')[3:-1]
        for i in fdir:
            curpath += i
            curpath += '/'
        print(curpath)
        makedir(curpath)
        local = curpath + name
        print('name:' + name)
        if len(name) == 0:
            continue
        if 'www' in name:
            continue
        if '?' in name:
            continue
        print(local)
        try:
            urllib.request.urlretrieve(item, local, Schedule)  # download to a local file
        except Exception as e:
            print(e)
    end = time.perf_counter()
    print('Template download finished!')
    print('Total time:', end - start, 'seconds')


if __name__ == "__main__":
    main()
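The script derives each local file path by splitting the URL on '/'. As a hedged alternative sketch, the same mapping can be written with urllib.parse, which separates the host and query string for you. The helper name local_path_for and the templates directory are hypothetical. Unlike the script, which skips any name containing '?', this sketch drops the query string and keeps the file.

```python
import os
from urllib.parse import urlparse


def local_path_for(url, root):
    """Map url to a file path under root, mirroring the URL's directory layout.

    local_path_for is an illustrative helper, not part of the original script.
    Returns None for URLs without a file name (path empty or ending in '/').
    """
    path = urlparse(url).path  # drops the scheme, host and ?query
    # Ignore empty segments and '.'/'..' so the result stays inside root.
    parts = [p for p in path.split("/") if p and p not in (".", "..")]
    if not parts or path.endswith("/"):
        return None
    return os.path.join(root, *parts)


if __name__ == "__main__":
    print(local_path_for("http://example.com/css/site.css?v=3", "templates"))
    print(local_path_for("http://example.com/", "templates"))
```

Create the parent directory with os.makedirs(os.path.dirname(path), exist_ok=True) before passing the path to urlretrieve().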

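Notes

Different sites need their page source analyzed to find the pattern in their links. For example, some sites serve the home page at www.xxx.xxx without an index.html suffix, or use a different suffix such as index.aspx or index.php. You can modify the script to add special handling, for example to append the home-page file name automatically and to fetch only content from the same site:

for item in content:
    h = pat.search(str(item))
    if not h:  # tag without the attribute
        continue
    href = h.group(1)
    if pat2.search(href):
        ans = href
    else:
        ans = urljoin(url, href)  # urljoin comes from urllib.parse
    # Do not fetch links that point to other sites
    if url not in ans:
        continue
    # Append the home-page file name
    if ans.endswith('/'):
        ans += 'index.html'
    listhref.append(ans)
    myfile.write(ans + '\n')
    print(ans)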
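The two special cases above can also be isolated in one small function, which makes the index file name easy to change per site. This is a sketch under assumptions: normalize_link and index_name are hypothetical names, and the same-site test compares host names with urlparse() instead of the substring check the script uses.

```python
from urllib.parse import urljoin, urlparse


def normalize_link(page_url, href, index_name="index.html"):
    """Resolve href against page_url; return None for off-site links.

    normalize_link and index_name are illustrative names. Pass e.g.
    "index.php" or "index.aspx" for sites that use those.
    """
    absolute = urljoin(page_url, href)
    # Compare host names rather than testing whether one string contains the other.
    if urlparse(absolute).netloc != urlparse(page_url).netloc:
        return None
    if absolute.endswith("/"):
        absolute += index_name
    return absolute


if __name__ == "__main__":
    page = "http://example.com/"
    for href in ["blog/", "/css/a.css", "http://cdn.example.org/a.js"]:
        print(href, "->", normalize_link(page, href))
```

Non-HTTP links such as javascript: or mailto: have no host name, so they also come back as None and are skipped.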
