Scraping websites with Python

Published: 2025/4/14 · python · 豆豆

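Notes on scraping sites with Python (Xiami, Baidu, Douban, Sina Weibo)

Libraries used (all Python 2): urllib, urllib2, BeautifulSoup, cookielib, mechanize, re

Watch the requests in Firebug and imitate what the browser does.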

1. Xiami

Xiami needs no login and has no IP restrictions, so it is the simplest case. To scrape the play counts from Xiami radio, I used:

import urllib2

content = urllib2.urlopen('http://www.xiami.com/artist/top/id/1234').read()

Crawl two or three pages per artist and record every song with a play count of 10,000 or more. There are roughly 110,000 artist ids.
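The filtering step might look roughly like the sketch below. It is written for Python 3 and uses only the standard library. The markup in SAMPLE_HTML, the class name song_hot, and the helper popular_songs are all hypothetical, since the real Xiami page layout is not shown in these notes.

```python
import re

# Hypothetical markup: the real Xiami page structure is not shown in the post.
SAMPLE_HTML = """
<tr><td class="song_name"><a href="/song/1">Song A</a></td><td class="song_hot">25310</td></tr>
<tr><td class="song_name"><a href="/song/2">Song B</a></td><td class="song_hot">870</td></tr>
<tr><td class="song_name"><a href="/song/3">Song C</a></td><td class="song_hot">10000</td></tr>
"""

# One match per table row: link target, link text, and play count.
ROW_RE = re.compile(
    r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a></td>'
    r'<td class="song_hot">(?P<plays>\d+)</td>'
)

def popular_songs(html, threshold=10000):
    """Return (title, url, plays) for songs with at least `threshold` plays."""
    results = []
    for m in ROW_RE.finditer(html):
        plays = int(m.group("plays"))
        if plays >= threshold:
            results.append((m.group("title"), m.group("url"), plays))
    return results

print(popular_songs(SAMPLE_HTML))
```

A regular expression is enough for a fixed row layout like this one; for messier pages a real HTML parser is usually the safer choice.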

To process the page, split, regular expressions (re.compile), or BeautifulSoup all work.

(Beautiful Soup documentation in Chinese: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html)

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(content)

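Now the soup object can be queried. Regular expressions are supported, for example:

soup.find('p', align=re.compile('^b.*'))['id'] extracts secondpara from <p id="secondpara" align="blah">. The same approach retrieves the URL inside a href="".

soup.find("b", { "class" : "lime" }) finds <b class="lime">Lime</b>

soup.find("b", { "class" : "lime" }).string returns the text between the opening and closing tags, Lime.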
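When BeautifulSoup is not installed, the two lookups above can be approximated with the standard library's html.parser. This is a swapped-in technique, not what these notes use, and the TagFinder class and find_all helper are hypothetical names. It is a minimal Python 3 sketch that does not handle nested tags of the same name.

```python
import re
from html.parser import HTMLParser

class TagFinder(HTMLParser):
    """Collect [attrs, text] for every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.found = []      # list of [attrs_dict, text]
        self._open = False   # True while inside a matching tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.found.append([dict(attrs), ""])
            self._open = True

    def handle_data(self, data):
        if self._open:
            self.found[-1][1] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._open = False

def find_all(html, tag):
    parser = TagFinder(tag)
    parser.feed(html)
    parser.close()
    return parser.found

doc = ('<p id="firstpara" align="center">One</p>'
       '<p id="secondpara" align="blah">Two</p>'
       '<b class="lime">Lime</b>')

# Rough equivalent of soup.find('p', align=re.compile('^b.*'))['id']
para_id = next(a["id"] for a, _ in find_all(doc, "p")
               if re.match(r"^b.*", a.get("align", "")))

# Rough equivalent of soup.find("b", {"class": "lime"}).string
lime_text = next(t for a, t in find_all(doc, "b") if a.get("class") == "lime")

print(para_id, lime_text)
```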

2. Baidu

Operations in Baidu Space (空間) such as collecting 米粒 require logging in. mechanize works well for this.

import mechanize
import cookielib

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
r = br.open('https://passport.baidu.com/?login&tpl=mn')

br.select_form(nr=0)
br.form['username']='abcabcabc' # username
br.form.find_control("password").readonly = False
br.form['password']='123123123' # password
br.submit()

To check that the login worked, open a Baidu page and see whether your username appears in it:

content = br.open('http://www.baidu.com').read()

print 'abcabcabc' in content # True if the logged-in username appears in the page
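For readers on Python 3, where cookielib and urllib2 no longer exist under those names, the cookie jar and User-agent setup above can be approximated with http.cookiejar and urllib.request. This sketch covers only that setup; it does not fill in forms the way mechanize does, and make_opener is a hypothetical helper name.

```python
import http.cookiejar
import urllib.request

def make_opener(user_agent):
    """Build an opener that stores cookies in a jar and sends a custom User-agent."""
    jar = http.cookiejar.LWPCookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", user_agent)]
    return opener, jar

opener, jar = make_opener(
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Firefox/3.0.1")

# opener.open(url) would now send the header above and keep any
# cookies the server sets in `jar` for later requests.
print(opener.addheaders, len(jar))
```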

3. Sina Weibo

Logging in to Sina Weibo is troublesome. There are many write-ups online on simulating the SinaSSO login, for example:

http://chen.yi.bo.blog.163.com/blog/static/15062110920120151191189/

http://denger.iteye.com/blog/1039052

http://community.itbbs.cn/thread/19120/

http://blog.csdn.net/xiaojianpitt/article/details/6440561

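The problem is that SinaSSO changes versions too quickly: login code from six months ago no longer works.

Previously I mostly scraped the members of Sina Weibo groups (微群), where there is essentially no limit per IP or on the number of requests, so Firefox's cookies can be reused directly with sqlite2cookie: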

def sqlite2cookie(filename):
    from cStringIO import StringIO
    from pysqlite2 import dbapi2 as sqlite
    con = sqlite.connect(filename)
    cur = con.cursor()
    cur.execute("select host, path, isSecure, expiry, name, value from moz_cookies")
    ftstr = ["FALSE","TRUE"]
    s = StringIO()
    s.write("""\
# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This is a generated file!  Do not edit.
""")
    for item in cur.fetchall():
        s.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (
            item[0], ftstr[item[0].startswith('.')], item[1],
            ftstr[item[2]], item[3], item[4], item[5]))
    s.seek(0)
    cookie_jar = cookielib.MozillaCookieJar()
    cookie_jar._really_load(s, '', True, True)
    return cookie_jar

Then:

cookiejar = sqlite2cookie(r'C:\Documents and Settings\a\Application Data\Mozilla\Firefox\Profiles\8mi38ldc.default\cookies.sqlite') # path to cookies.sqlite in the Firefox profile
br = mechanize.Browser()
# Browser options
br.set_cookiejar(cookiejar)

The group member list is loaded through ajax, using the POST method.

r = br.open('http://q.weibo.com/ajax/members/page',
            urllib.urlencode({'page':str(page),'gid':gid}),
            timeout=30).read()
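The reason the call above is a POST is that a request body is supplied as the second argument. A minimal Python 3 sketch of the same idea with urllib.request follows; members_request is a hypothetical helper and the gid value is made up for illustration. Nothing is sent over the network here.

```python
import urllib.parse
import urllib.request

def members_request(page, gid):
    """Build (but do not send) the POST request for one page of group members."""
    body = urllib.parse.urlencode({"page": str(page), "gid": gid}).encode("ascii")
    # Supplying `data` is what turns the request into a POST.
    return urllib.request.Request(
        "http://q.weibo.com/ajax/members/page", data=body)

req = members_request(2, "12345")   # "12345" is a made-up group id
print(req.get_method(), req.data)
```

Passing the request to an opener's open() method with a timeout would then perform the actual fetch.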

Search works like this:

searchq = '韓寒' # declare UTF-8 at the top of the file (# -*- coding:utf-8 -*-)
r = br.open('http://s.weibo.com/weibo/' + urllib.quote(searchq)).read()

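Sina search allows one request every 10 seconds, and a single IP can search as 10 users at the same time. At most 50 pages of results are returned, but the time range can be narrowed to a particular hour of a particular day, so this is generally enough.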
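One way to respect the 10-second limit is a small throttle that sleeps before each request when the previous one was too recent. The sketch below is Python 3 and the Throttle class and search_url helper are hypothetical names; the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time
import urllib.parse

def search_url(query):
    """Percent-encode the query as UTF-8 and append it to the search URL."""
    return "http://s.weibo.com/weibo/" + urllib.parse.quote(query)

class Throttle:
    """Ensure at least `interval` seconds between successive wait() calls."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self.last = None   # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now

print(search_url("韓寒"))
```

In a crawl loop, calling wait() on a Throttle(10) immediately before each search request keeps one account within the limit described above.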

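4. Douban

Scraping Douban short comments needs no login, but the mobile site bans an IP that makes too many requests. On the web version, a single cookie on a single IP is asked for a CAPTCHA after about an hour of scraping, so rotate between three cookies (none of the three needs to be logged in).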
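The rotation can be as simple as cycling through the cookie jars in order. In this Python 3 sketch the strings stand in for real cookie jars and page URLs, and assign_jars is a hypothetical helper name.

```python
import itertools

def assign_jars(urls, jars):
    """Pair each URL with a cookie jar, cycling through the jars in order."""
    return list(zip(urls, itertools.cycle(jars)))

# Stand-ins for three cookie jars and five pages to fetch.
jars = ["jar_a", "jar_b", "jar_c"]
urls = ["page%d" % i for i in range(1, 6)]

plan = assign_jars(urls, jars)
print(plan)
```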

The sqlite2cookie() function above sends out every cookie in the file. It is also possible to send only the cookies for a particular domain:

def sqlite2cookiehost(filename,host):
    # same imports as sqlite2cookie(), which only imports them locally
    from cStringIO import StringIO
    from pysqlite2 import dbapi2 as sqlite
    con = sqlite.connect(filename)
    con.text_factory = str
    cur = con.cursor()
    # keep only rows whose host contains the given string
    cur.execute("select host, path, isSecure, expiry, name, value from moz_cookies where host like ?"
            ,['%%%s%%' % host])
    ftstr = ["FALSE","TRUE"]
    s = StringIO()
    s.write("""\
# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This is a generated file! Do not edit.
""")
    for item in cur.fetchall():
        s.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (
            item[0], ftstr[item[0].startswith('.')], item[1],
            ftstr[item[2]], item[3], item[4], item[5]))
    s.seek(0)
    cookie_jar = cookielib.MozillaCookieJar()
    cookie_jar._really_load(s, '', True, True)
    return cookie_jar

Usage:

cookiejar = sqlite2cookiehost(r'C:\Documents and Settings\a\Application Data\Mozilla\Firefox\Profiles\8mi38ldc.default\cookies.sqlite', 'douban')
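On Python 3 the same idea can be sketched without the private _really_load method by building Cookie objects directly. This is an alternative approach, not the code these notes use. The firefox_cookies helper is a hypothetical name, the table layout is taken from the query above and may differ in other Firefox versions, and the demo database is created on the fly purely for illustration.

```python
import http.cookiejar
import os
import sqlite3
import tempfile

def firefox_cookies(filename, host=None):
    """Load cookies from a Firefox-style cookies.sqlite into a CookieJar.

    If `host` is given, keep only rows whose host contains that string.
    """
    jar = http.cookiejar.CookieJar()
    con = sqlite3.connect(filename)
    try:
        sql = "select host, path, isSecure, expiry, name, value from moz_cookies"
        args = ()
        if host is not None:
            sql += " where host like ?"
            args = ("%" + host + "%",)
        for h, path, secure, expiry, name, value in con.execute(sql, args):
            jar.set_cookie(http.cookiejar.Cookie(
                version=0, name=name, value=value,
                port=None, port_specified=False,
                domain=h, domain_specified=h.startswith("."),
                domain_initial_dot=h.startswith("."),
                path=path, path_specified=True,
                secure=bool(secure), expires=expiry, discard=False,
                comment=None, comment_url=None, rest={}))
    finally:
        con.close()
    return jar

# Build a tiny demo database with made-up cookies.
demo_path = os.path.join(tempfile.mkdtemp(), "cookies.sqlite")
con = sqlite3.connect(demo_path)
con.execute("create table moz_cookies (host, path, isSecure, expiry, name, value)")
con.executemany("insert into moz_cookies values (?, ?, ?, ?, ?, ?)", [
    (".douban.com", "/", 0, 4102444800, "bid", "abc"),
    (".weibo.com", "/", 1, 4102444800, "SUB", "xyz"),
])
con.commit()
con.close()

all_names = sorted(c.name for c in firefox_cookies(demo_path))
douban_names = [c.name for c in firefox_cookies(demo_path, "douban")]
print(all_names, douban_names)
```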

Loading Chrome's cookies

def sqlite2cookieChrome(filename):
    # Cookies file comes from C:\Users\JiangHonglei\AppData\Local\Google\Chrome\User Data\Default\Cookies
    import sqlite3
    from cStringIO import StringIO
    con = sqlite3.connect(filename)
    # make sqlite3 always return bytestrings
    con.text_factory = str

    cur = con.cursor()
    # Firefox equivalent: select host, path, isSecure, expiry, name, value from moz_cookies
    cur.execute("select host_key, path, secure, expires_utc, name, value from cookies")

    ftstr = ["FALSE","TRUE"]

    s = StringIO()
    s.write("""\
# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This is a generated file!  Do not edit.
""")
    for item in cur.fetchall():
        try:
            s.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (
                item[0], ftstr[item[0].startswith('.')], item[1],
                ftstr[item[2]], item[3], item[4], item[5]))
        except UnicodeError:
            # skip cookies that cannot be written as a bytestring
            continue
    s.seek(0)

    cookie_jar = cookielib.MozillaCookieJar()
    cookie_jar._really_load(s, '', True, True)
    return cookie_jar

Then:

cookiejarChrome = sqlite2cookieChrome(r'C:\Documents and Settings\a\Local Settings\Application Data\Google\Chrome\User Data\Default\Cookies')
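One detail worth knowing: Chrome's expires_utc column counts microseconds since 1601-01-01 UTC, not seconds since 1970 as the Netscape cookie file format expects, so the raw value written above looks like a date extremely far in the future. A hedged sketch of the conversion follows (Python 3, chrome_to_unix is a hypothetical helper). Newer Chrome versions may also store the cookie value encrypted, which this sketch does not address.

```python
# Seconds between 1601-01-01 and 1970-01-01 (both UTC).
WINDOWS_TO_UNIX_EPOCH_SECONDS = 11644473600

def chrome_to_unix(expires_utc):
    """Convert a Chrome expires_utc value to Unix seconds.

    A value of 0 marks a session cookie, returned here as None.
    """
    if expires_utc == 0:
        return None
    return expires_utc // 1000000 - WINDOWS_TO_UNIX_EPOCH_SECONDS

print(chrome_to_unix(13000000000000000))
```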

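IE's cookies are a special case: each domain's cookies are stored in a separate file and cannot be read directly, so find the cookie file in Temporary Internet Files and copy it out.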

cookiejarIE = mechanize.MSIECookieJar(delayload=True)
cookiejarIE.load_cookie_data(r'F:\BJ1KF314.txt')

5. Notes

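Scraping Douban short comments ran into a few problems. For example, some comments raise an error because they cannot be decoded as UTF-8.

Pass ignore or replace as the error handler when decoding the content. For more explanation, see the YouTube video watch?v=sgHbC6udIqc.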

response = br.open(nextpageurl, timeout = 20)
content = response.read()
content = content.decode('utf-8','ignore')
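The difference between the two error handlers is easiest to see on a deliberately broken byte string. This Python 3 sketch truncates a UTF-8 character to simulate the kind of damaged input described above; the sample text is made up.

```python
# "豆瓣" encoded as UTF-8, with the last byte of the second character cut off.
raw = "豆瓣".encode("utf-8")[:-1] + b" ok"

try:
    raw.decode("utf-8")          # default is strict error handling
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = raw.decode("utf-8", "ignore")     # drops the undecodable bytes
replaced = raw.decode("utf-8", "replace")   # substitutes U+FFFD for them

print(strict_ok, ignored, replaced)
```

With ignore the damage disappears silently, while replace leaves a visible marker, which can make it easier to find the affected comments later.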

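It is best to wrap page fetches in except (urllib2.URLError, IOError): to handle failures. Sometimes a page cannot be fetched because the site has banned the IP; other times the page really does not exist. For example, the film 見過大爺 has no comments and no rating.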
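A minimal Python 3 sketch of such a wrapper is shown below. It assumes, as an illustration only, that a missing page is reported with HTTP 404 and treats everything else as a generic failure; how a given site signals a ban is not something these notes specify. The fetch function and its open_func parameter are hypothetical names.

```python
import urllib.error

def fetch(open_func, url):
    """Call open_func(url) and classify the outcome.

    Returns ("ok", body), ("missing", None) for HTTP 404,
    or ("failed", None) for any other network or I/O error.
    """
    try:
        return "ok", open_func(url)
    except urllib.error.HTTPError as e:
        # Assumption: a page that does not exist answers with 404.
        return ("missing" if e.code == 404 else "failed"), None
    except (urllib.error.URLError, IOError):
        return "failed", None

# Example with a stand-in opener that always succeeds.
print(fetch(lambda u: b"<html></html>", "http://example.invalid/page"))
```

Here open_func could be something like lambda u: br.open(u, timeout=20).read(), so that the crawl loop can retry later on "failed" and skip the page on "missing".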

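Clean up mechanize and BeautifulSoup objects by hand after use, otherwise memory usage keeps growing.

When a page is no longer needed, clear mechanize's history: br.clear_history()

Likewise, discard BeautifulSoup content once you are finished with it: soup.decompose()

Without clearing the history, crawling 1000 pages can easily consume 1 GB of memory.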
