
...Learning Python

Published: 2024/4/17 python 41 豆豆


5.14

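...The last time I studied python seems to have been about a month ago..

Writing a few things down to keep a record..

Right now I'm following the blog written by Boss Li (李老大)..I may well end up just copying the code...

The crawlers I write myself never seem to work; the Douban movie-review crawler I wrote earlier is still half-broken...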

1. The simplest case: fetching a web page

import urllib2
html = urllib2.urlopen('http://music.163.com/')
print html.read()
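The snippet above is Python 2; the `urllib2` module does not exist in Python 3, where the same call lives in `urllib.request`. A minimal Python 3 sketch of the same fetch follows. The helper name `fetch` and the 10-second timeout are assumptions made for illustration and are not part of the original code.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the raw bytes served at url (hypothetical helper)."""
    # urlopen returns a response object; the with-block closes it afterwards.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `fetch('http://music.163.com/')` would return the page as bytes, which can then be printed or written to a file.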

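2. Saving the fetched page

But apparently because I had used that web page mapping tool earlier, the html that gets generated now shows the contents of the current directory instead of the page I fetched...sigh..

import urllib2

response = urllib2.urlopen('http://music.163.com/')
html = response.read()
with open('testt.html', "w") as f:
    f.write(html)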

3. Downloading a wallpaper from ZOL

import urllib2
import re  # library used for regular expressions

# URL of the page that contains the image we want to download
url = 'http://desk.zol.com.cn/bizhi/6377_78500_2.html'
response = urllib2.urlopen(url)
# get the page content
html = response.read()

# a regular expression that locates the address of the image
reg = re.compile(r'<img id="bigImg" src="(.*?jpg)" .*>')
imgurl = reg.findall(html)[0]

# open the image and save it as haha.jpg
imgsrc = urllib2.urlopen(imgurl).read()
open("haha.jpg", "w").write(imgsrc)

This is copied straight from the boss's code
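To see what the regular expression does without touching the network, it can be run against a small HTML fragment. The sketch below is Python 3; the sample markup and the image URL in it are made up for illustration and only mimic the `bigImg` tag the pattern expects, and `find_image_url` is a hypothetical helper name.

```python
import re

# Same pattern as in the wallpaper script: capture the src of the tag with
# id="bigImg", stopping at the first "jpg" that is followed by a quote.
BIG_IMG = re.compile(r'<img id="bigImg" src="(.*?jpg)" .*>')

# Hypothetical page fragment; the real ZOL markup may differ.
SAMPLE_HTML = (
    '<div class="photo">\n'
    '<img id="bigImg" src="http://example.com/wallpaper/78500.jpg" width="960" height="600">\n'
    '</div>\n'
)


def find_image_url(html):
    """Return the first captured image URL, or None if nothing matches."""
    match = BIG_IMG.search(html)
    return match.group(1) if match else None
```

Using `search` and checking for `None` avoids the `IndexError` that `findall(html)[0]` raises when the page layout changes and nothing matches.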

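However, the wallpaper I downloaded came out like this [image not preserved]

I don't get it

Solved......?

http://m.ithao123.cn/content-6589593.html

The file should be opened with "wb". In text mode ("w") on Windows, Python 2 rewrites every newline byte as "\r\n", which corrupts binary data such as a JPEG.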

import urllib2
import re

url = 'http://desk.zol.com.cn/bizhi/6377_78500_2.html'
response = urllib2.urlopen(url)
html = response.read()

reg = re.compile(r'<img id="bigImg" src="(.*?jpg)" .*>')
imgurl = reg.findall(html)[0]

imgsrc = urllib2.urlopen(imgurl).read()
open("haha.jpg", "wb").write(imgsrc)

And then

the wallpaper finally shows up properly, I'm so moved!!!
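Why "wb" matters can be reproduced on any platform. The Python 3 sketch below imitates the Windows text-mode newline translation with `newline="\r\n"` and compares the result with binary mode. The byte string is a made-up stand-in for image data, and all function names are illustrative.

```python
import os
import tempfile

# Made-up stand-in for image bytes; it contains two 0x0A (newline) bytes.
PAYLOAD = b"\xff\xd8\x0a\x00\x0a\xff\xd9"


def write_like_windows_text_mode(path, data):
    """Write data through a text-mode file that turns LF into CR LF."""
    # latin-1 maps each byte to one character and back, so only the newline
    # translation changes the bytes that reach the disk.
    with open(path, "w", encoding="latin-1", newline="\r\n") as f:
        f.write(data.decode("latin-1"))


def write_binary(path, data):
    """Write data unchanged, as a file opened with "wb" does."""
    with open(path, "wb") as f:
        f.write(data)


def written_bytes(writer, data):
    """Run writer on a temporary file and return what ended up on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        writer(path, data)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

With this payload, the binary writer stores the 7 bytes unchanged, while the text-mode writer stores 9 bytes because each newline byte gains a carriage return in front of it.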

5.15

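Today I tried Boss Li's code for scraping ZOL wallpapers; the folder it produced was empty...and the file names were garbled..

But Boss Li did say that it only works on linux

So I started reading Cui Qingcai's (崔慶才) tutorial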

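1. Scraping a Tieba thread

Result screenshot [image not preserved]

While copying the code I ran into three problems

1) Printing Chinese text raises an error

One fix is this

#!/usr/bin/python
#coding:utf-8

as explained in a blog post [link not preserved]

2) After fixing the problem above

another error shows up, like the one below [screenshot not preserved]

The fix is the one described in the screenshot

3) The last problem is that Tieba has changed its page layout

so the title pattern has to be changed to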

<h3 class="core_title_txt".*?>(.*?)</h3>

The final code looks like this:

#!/usr/bin/python
#coding:utf-8
__author__ = 'CQC'

import urllib
import urllib2
import re


# cleans the HTML tags out of a post body
class Tool:
    # remove img tags and runs of 7 spaces
    removeImg = re.compile('<img.*?>| {7}')
    # remove hyperlink tags
    removeAddr = re.compile('<a.*?>|</a>')
    # tags that become a newline
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # table cells become a tab
    replaceTD = re.compile('<td>')
    # paragraph openings become a newline plus indent
    replacePara = re.compile('<p.*?>')
    # single or double line breaks become a newline
    replaceBR = re.compile('<br><br>|<br>')
    # strip whatever tags are left
    removeExtraTag = re.compile('<.*?>')

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replacePara, "\n ", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        return x.strip()


class BDTB:
    def __init__(self, baseUrl, seeLZ, floorTag):
        self.baseURL = baseUrl
        self.seeLZ = '?see_lz=' + str(seeLZ)
        self.tool = Tool()
        self.file = None
        self.floor = 1
        self.defaultTitle = u"百度貼吧"
        self.floorTag = floorTag

    # fetch one page of the thread and return it as unicode
    def getPage(self, pageNum):
        try:
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            return response.read().decode('utf-8', 'ignore')
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Baidu Tieba, reason:", e.reason
                return None

    # extract the thread title
    def getTitle(self, page):
        # pattern updated for the redesigned Tieba page (problem 3 above)
        pattern = re.compile('<h3 class="core_title_txt".*?>(.*?)</h3>', re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    # extract the total number of pages
    def getPageNum(self, page):
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    # extract the cleaned text of every post on the page
    def getContent(self, page):
        pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
        items = re.findall(pattern, page)
        contents = []
        for item in items:
            content = "\n" + self.tool.replace(item) + "\n"
            contents.append(content.encode('utf-8'))
        return contents

    def setFileTitle(self, title):
        if title is not None:
            self.file = open(title + ".txt", "w+")
        else:
            self.file = open(self.defaultTitle + ".txt", "w+")

    def writeData(self, contents):
        for item in contents:
            if self.floorTag == '1':
                floorLine = "\n" + str(self.floor) + u"-----------------------------------------------------------------------------------------\n"
                self.file.write(floorLine)
            self.file.write(item)
            self.floor += 1

    def start(self):
        indexPage = self.getPage(1)
        pageNum = self.getPageNum(indexPage)
        title = self.getTitle(indexPage)
        self.setFileTitle(title)
        if pageNum == None:
            print "The URL is no longer valid, please try again"
            return
        try:
            print "This thread has " + str(pageNum) + " pages"
            for i in range(1, int(pageNum) + 1):
                print "Writing page " + str(i)
                page = self.getPage(i)
                contents = self.getContent(page)
                self.writeData(contents)
        except IOError, e:
            print "Write error, reason: " + e.message
        finally:
            print "Writing finished"


print u"Enter the thread ID"
baseURL = 'http://tieba.baidu.com/p/' + str(raw_input(u'http://tieba.baidu.com/p/'))
seeLZ = raw_input("Only fetch the original poster's posts? Enter 1 for yes, 0 for no\n")
floorTag = raw_input("Write floor numbers? Enter 1 for yes, 0 for no\n")
bdtb = BDTB(baseURL, seeLZ, floorTag)
bdtb.start()

I still don't understand how it works; I'll go through it again
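One way to see how the `Tool` class works is to run its substitutions on a small fragment. The Python 3 sketch below keeps only four of the seven patterns, and the sample post markup is invented for illustration; real Tieba HTML is messier.

```python
import re

# A subset of the patterns used by the Tool class above.
REMOVE_IMG = re.compile(r'<img.*?>')
REMOVE_ADDR = re.compile(r'<a.*?>|</a>')
REPLACE_BR = re.compile(r'<br><br>|<br>')
REMOVE_EXTRA_TAG = re.compile(r'<.*?>')


def clean(fragment):
    """Strip images and links, turn <br> into newlines, drop other tags."""
    fragment = REMOVE_IMG.sub("", fragment)
    fragment = REMOVE_ADDR.sub("", fragment)
    fragment = REPLACE_BR.sub("\n", fragment)
    # Must run last: it would otherwise eat the <br> tags before they
    # are converted to newlines.
    fragment = REMOVE_EXTRA_TAG.sub("", fragment)
    return fragment.strip()


# Hypothetical post body, not taken from a real thread.
SAMPLE = '<img src="a.png">first line<br><a href="#">link text</a><br><br><b>bold</b>'
```

The order of the substitutions is the important part: the specific replacements run first, and the catch-all `<.*?>` pattern removes whatever tags are still left at the end.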

5.18

A class for scraping Tieba content

#!/usr/bin/python
#coding:utf-8
import urllib
import urllib2
import re


class bdtb:
    def __init__(self, baseurl, seelz):
        self.baseurl = baseurl
        self.seelz = '?see_lz=' + str(seelz)

    def getPage(self, pagenum):
        try:
            url = self.baseurl + self.seelz + '&pn=' + str(pagenum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            # read() can only be consumed once, so keep the result
            page = response.read()
            print page
            return page
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Baidu Tieba, reason:", e.reason
                return None


baseurl = 'http://tieba.baidu.com/p/3138733512'
bb = bdtb(baseurl, 1)
bb.getPage(1)
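The class builds the page URL by string concatenation. A slightly more robust option is to let the standard library encode the query string. The Python 3 sketch below does that; `build_page_url` is a hypothetical helper, and the `see_lz` and `pn` parameter names are taken from the code above.

```python
from urllib.parse import urlencode


def build_page_url(base_url, see_lz, page_num):
    """Return base_url with see_lz and pn appended as a query string."""
    # urlencode escapes the values and joins the pairs with "&".
    query = urlencode({"see_lz": see_lz, "pn": page_num})
    return base_url + "?" + query


print(build_page_url("http://tieba.baidu.com/p/3138733512", 1, 1))
# http://tieba.baidu.com/p/3138733512?see_lz=1&pn=1
```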

5.19

Simulating a login to the school's information portal

The grades only show up in IE, but IE doesn't let me see the form, i.e. the form data

So for that part I switched back to Sogou

#coding=utf-8
import urllib
import urllib2
import cookielib
import re


class CHD:
    def __init__(self):
        self.loginUrl = 'http://bksjw.chd.edu.cn/loginAction.do'
        self.cookies = cookielib.CookieJar()
        # form data; the zjh and mm values are redacted as xxxx in the original
        self.postdata = urllib.urlencode({
            'dllx': 'dldl',
            'zjh': 'xxxx',
            'mm': 'xxxx'
        })
        # opener that keeps the cookies set by the login response
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookies))

    def getPage(self):
        request = urllib2.Request(url=self.loginUrl, data=self.postdata)
        result = self.opener.open(request)
        print result.read().decode('gbk')


chd = CHD()
chd.getPage()
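For reference, the same cookie-aware login could look like this in Python 3, where `cookielib` became `http.cookiejar` and `urllib2` was split into `urllib.request` and `urllib.parse`. This is a sketch under assumptions: the helper names are invented, `dldl` is assumed to be a plain string value, and the credentials are passed in as arguments because the original redacts them as `xxxx`.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

LOGIN_URL = "http://bksjw.chd.edu.cn/loginAction.do"


def make_opener():
    """Return an opener that stores cookies, plus its cookie jar."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


def make_login_request(account, password):
    """Build the POST request; field names follow the original form data."""
    form = {"dllx": "dldl", "zjh": account, "mm": password}
    # Passing data makes urllib send a POST; Python 3 requires bytes here.
    return Request(LOGIN_URL, data=urlencode(form).encode("ascii"))


def login(account, password):
    """Send the login request and return the decoded page (needs network)."""
    opener, _jar = make_opener()
    with opener.open(make_login_request(account, password)) as response:
        return response.read().decode("gbk")
```

Separating request construction from sending makes it possible to inspect the encoded form data without contacting the server.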

Reposted from: https://www.cnblogs.com/wuyuewoniu/p/5491979.html
