Data Scraping in Web Crawlers

Published: 2025/7/25

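The following are reading notes on the book "用Python写网络爬虫" (Web Scraping with Python).

1. Installing Firebug Lite

Firebug Lite is a handy tool for inspecting and debugging the HTML, CSS, and JavaScript of a web page. It can be installed in both Chrome and Firefox, and an installation guide is available for each browser.

2. Three ways to scrape a page

(1) Regular expressions

Regular expressions are the most basic way to extract data. If you are not familiar with them, review the basics of regular expression usage first.

We can download the HTML first and then match the data in it with a regular expression. Here is a simple example: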

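import re
# download() is the helper from the author's Chapter1 module
from Chapter1.Background_Research import download

url = "http://www.cnblogs.com/xudong-bupt/p/3586889.html"
html = download(url)
# Non-greedy match of everything inside each BlogStats div
matches = re.findall('<div class="BlogStats">(.*?)</div>', html)
print(matches[0])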

This example prints the content between the first <div class="BlogStats"> and its closing </div> in the HTML file.
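The example relies on a download helper from the author's Chapter1 module, which is not shown in these notes. As a rough stand-in, assuming Python 3 and only the standard library, such a helper might look like the sketch below. The signature, the default user_agent value "example-crawler", and the error handling are assumptions made for illustration, not the book's implementation.

```python
import urllib.request
from urllib.error import URLError


def download(url, user_agent="example-crawler", timeout=10):
    """Return the page at `url` as text, or None if the request fails.

    Hypothetical stand-in for the Chapter1 helper; not the original code.
    """
    try:
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.read()
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
    except (URLError, ValueError) as error:
        # URLError covers network and HTTP failures, ValueError a malformed URL.
        print("Download error:", error)
        return None
    return raw.decode(charset, errors="replace")
```

Returning None on failure keeps the caller simple, but it does mean the result should be checked before it is passed to re.findall.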

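The advantage of extracting data with regular expressions is that the approach is simple. The drawback is that a robust regular expression is hard to write: when the page changes even slightly, the expression may stop matching.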
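The fragility can be seen with a small sketch. Both HTML strings below are invented for illustration. The second carries the same content as the first, but adds an attribute and uses single quotes, which is enough to defeat the pattern. The first_match helper is also an addition of this sketch. It returns None instead of raising IndexError when nothing matches.

```python
import re

# Pattern from the example above: it expects the tag to be written exactly this way.
PATTERN = re.compile(r'<div class="BlogStats">(.*?)</div>')


def first_match(html):
    """Return the first captured group, or None when nothing matches."""
    found = PATTERN.findall(html)
    return found[0] if found else None


# Hypothetical markup, invented for this sketch.
original = '<div class="BlogStats">posts - 10</div>'
# Same content, but with an extra attribute and single quotes around the class.
changed = "<div id='stats' class='BlogStats'>posts - 10</div>"

print(first_match(original))  # posts - 10
print(first_match(changed))   # None
```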

(2) Beautiful Soup

Before starting, we need to install the beautifulsoup module for Python. I installed it with pip install beautifulsoup4. Its methods are described in the official documentation.

Now let's run a small example:

from bs4 import BeautifulSoup
from Chapter1.Background_Research import download


def test_example():
    """Parse a piece of broken HTML with Beautiful Soup and print the repaired markup."""
    broken_html = "<ul class=country><li>Area <li>Population</ul>"
    # Use Beautiful Soup to parse the broken HTML
    soup = BeautifulSoup(broken_html, 'html.parser')
    fixed_html = soup.prettify()
    print(fixed_html)


def find_text(url, id_name):
    """Find the text of the element whose id equals id_name.

    :param url: URL of the HTML page
    :param id_name: id of the element to locate
    :return: the text inside that element
    """
    html = download(url)
    soup = BeautifulSoup(html, "html.parser")
    tr = soup.find(attrs={'id': id_name})
    text = tr.text
    return text


text = find_text("http://www.cnpythoner.com/post/300.html", 'title')
print(text)

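(3) Lxml

Lxml is a Python wrapper around libxml2, an XML parsing library. The module is written in C and parses faster than Beautiful Soup, but it is also more complicated to install; refer to the latest installation instructions.

A convenient feature of Lxml is that it can extract data with CSS selectors. It already implements most CSS3 selectors, though some are not supported. See its documentation for details.

Here is a small example of using Lxml: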

import lxml.html
from Chapter1.Background_Research import download


def test_lxml():
    """Parse a piece of broken HTML with lxml and print the repaired markup."""
    broken_html = '<ul class="country"><li>Area <li>Population</ul>'
    # Use lxml to parse the broken HTML
    parse_html = lxml.html.fromstring(broken_html)
    # encoding='unicode' returns text rather than bytes
    fixed_html = lxml.html.tostring(parse_html, pretty_print=True, encoding='unicode')
    print(fixed_html)


test_lxml()


def find_text(url, id_name):
    """Get the text of every <a> element directly under the div whose id is id_name.

    :param url: URL of the page
    :param id_name: id of the div to search in
    :return: list with the text of each matching link
    """
    html = download(url)
    tree_html = lxml.html.fromstring(html)
    # 'div#<id> > a' selects only <a> elements that are direct children of the div
    td = tree_html.cssselect('div#' + id_name + ' > a')
    values = []
    for d in td:
        values.append(d.text_content())
    return values


values = find_text("http://www.cnpythoner.com/post/300.html", 'bdshare')
for value in values:
    print(value)
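To see what a selector such as div#bdshare > a picks out without fetching a page, the same query can be tried on a small snippet. The sketch below deliberately swaps lxml for the standard library's xml.etree.ElementTree, whose element API lxml largely follows, and uses the XPath counterpart of the CSS selector. The markup and the direct_child_links helper are invented for illustration. ElementTree only accepts well-formed XML, so it is not a replacement for lxml on real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this sketch.
SNIPPET = """
<html><body>
  <div id="bdshare">
    <a>QQ</a>
    <span><a>nested link</a></span>
    <a>Weibo</a>
  </div>
  <div id="other"><a>elsewhere</a></div>
</body></html>
"""


def direct_child_links(markup, id_name):
    """Return the text of every <a> that is a direct child of div#id_name."""
    root = ET.fromstring(markup.strip())
    # XPath counterpart of the CSS selector 'div#<id_name> > a'.
    links = root.findall(".//div[@id='{}']/a".format(id_name))
    return ["".join(link.itertext()) for link in links]


print(direct_child_links(SNIPPET, "bdshare"))  # ['QQ', 'Weibo']
```

The link nested inside the span is skipped because the child combinator only matches direct children. Dropping the > from the selector would include it.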

Reposted from: https://www.cnblogs.com/whatyouknow123/p/7725119.html
