Python-crawler-citeulike
BeautifulSoup was already installed earlier; this time lxml is also needed. Install it with easy_install: go to the python/Scripts directory and run `easy_install lxml`, which installs it automatically.
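A quick way to confirm the install worked (assuming both bs4 and lxml are on the path) is to ask BeautifulSoup to use the lxml parser; if lxml is missing, bs4 raises `FeatureNotFound` here:

```python
from bs4 import BeautifulSoup

# Parse a trivial document with the lxml backend; a failure here
# means lxml is not installed or not importable
soup = BeautifulSoup("<p>hello</p>", "lxml")
print(soup.p.string)  # hello
```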
----------- section break --------------
Calling urlopen(url) directly was refused with a 403 Forbidden.
To mimic a real browser session, add cookie handling and spoofed request headers (adapted from http://www.yihaomen.com/article/python/210.htm):
```python
# -*- coding: utf-8 -*-
# Python 2 code: urllib2/cookielib were merged into urllib.request/http.cookiejar in Python 3
import random
import socket
import urllib2
import cookielib

ERROR = {
    '0': 'Can not open the url, check your net',
    '1': 'Create download dir error',
    '2': 'The image links is empty',
    '3': 'Download failed',
    '4': 'Build soup error, the html is empty',
    '5': 'Can not save the image to your disk',
}

class BrowserBase(object):

    def __init__(self):
        socket.setdefaulttimeout(20)

    def speak(self, name, content):
        print '[%s]%s' % (name, content)

    def openurl(self, url):
        """Open a page with cookie support and a random User-Agent."""
        cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
        self.opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
        urllib2.install_opener(self.opener)
        user_agents = [
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
            'Opera/9.25 (Windows NT 5.1; U; en)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
            'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
            'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
            'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0',
        ]
        agent = random.choice(user_agents)
        self.opener.addheaders = [
            ('User-agent', agent),
            ('Accept', '*/*'),
            ('Referer', 'http://www.google.com'),
        ]
        try:
            res = self.opener.open(url)
            # print res.read()
        except Exception, e:
            self.speak('openurl', str(e) + url)  # speak() takes (name, content)
            raise
        else:
            return res
```
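For reference, a minimal Python 3 port of the same idea (a hedged sketch, not part of the original post): `urllib2` and `cookielib` became `urllib.request` and `http.cookiejar`. No request is actually sent below; the opener is just built and its headers inspected.

```python
import random
import socket
import urllib.request
import http.cookiejar

socket.setdefaulttimeout(20)

# A shortened UA pool; any real browser strings would do
user_agents = [
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Safari/535.7',
    'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0',
]

cookie_support = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = urllib.request.build_opener(cookie_support)
# Assigning addheaders replaces the default 'Python-urllib/3.x' User-Agent
opener.addheaders = [
    ('User-Agent', random.choice(user_agents)),
    ('Accept', '*/*'),
    ('Referer', 'http://www.google.com'),
]

# opener.open(url) would now send the spoofed headers; not called here
header_names = [name for name, _ in opener.addheaders]
print(header_names)  # ['User-Agent', 'Accept', 'Referer']
```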
---------------- section break -------------------
Parse the HTML with BeautifulSoup (tutorial: http://beautifulsoup.readthedocs.org/zh_CN/latest/#).
soup = BeautifulSoup(res, "lxml") creates a BeautifulSoup object: a tree whose nodes are the tags in the HTML.
```python
soup = BeautifulSoup(res, "lxml")
tag = soup.find(id="showtexform")  # equivalently: soup.body.form(id="showtexform")
return tag.contents[1].contents[1]['value']
```

BeautifulSoup's search methods, find() and find_all(), accept several kinds of filters:
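The snippet above pulls a value out of a form with id "showtexform". A self-contained sketch of the same pattern, run against made-up HTML (the form markup below is hypothetical, not citeulike's actual page, and `html.parser` stands in for "lxml" so the example has no extra dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the fetched page
html = '<html><body><form id="showtexform"><input name="tex" value="@article{demo}"/></form></body></html>'

soup = BeautifulSoup(html, "html.parser")
tag = soup.find(id="showtexform")       # locate the form by its id attribute
value = tag.find('input')['value']      # then read an attribute off a child tag
print(value)  # @article{demo}
```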
1. String: matches tag names exactly; soup.find_all('b') finds all <b> tags.
2. Regular expression: tag names are matched with the pattern's match(); soup.find_all(re.compile('^b')) finds all tags whose names start with "b".
3. List
......
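The three filter types above can be demonstrated on a small document (a minimal sketch; `html.parser` is used here only to avoid the lxml dependency):

```python
import re
from bs4 import BeautifulSoup

html = '<html><body><b>bold</b><blockquote>quote</blockquote><i>italic</i></body></html>'
soup = BeautifulSoup(html, "html.parser")

# 1. String filter: exact tag-name match
print([t.name for t in soup.find_all('b')])               # ['b']
# 2. Regex filter: names matched with re's match(), so every tag starting with "b"
print([t.name for t in soup.find_all(re.compile('^b'))])  # ['body', 'b', 'blockquote']
# 3. List filter: any tag whose name is in the list
print([t.name for t in soup.find_all(['b', 'i'])])        # ['b', 'i']
```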
A tag's attributes can be read and written like a dictionary: tag['value'].
A tag's .contents attribute returns its direct child nodes as a list.
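Both points in one small sketch (made-up markup, `html.parser` again used to stay dependency-free):

```python
from bs4 import BeautifulSoup

html = '<form><input name="q" value="42"/></form>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find('input')

# Dictionary-style attribute access on a tag
print(tag['value'])        # 42
# .contents lists the form's direct children: a one-element list holding the <input>
print(soup.form.contents)
```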
Reposted from: https://www.cnblogs.com/yuchenkit/p/5369763.html