Python Web Scraping Notes (1)
Based on Python 2.7.
GET vs. POST:

```python
url = "http://zzk.cnblogs.com"
urllib.urlopen(url)                  # GET request
data = urllib.urlencode({"k": "b"})
urllib.urlopen(url, data)            # POST request: pass the encoded data as the second argument
```

If the form's `method` in the developer tools is `post`, you must use a POST request.
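As a side note, `urllib` was reorganized in Python 3; a minimal sketch of the same GET/POST distinction there, reusing the URL above (no request is actually sent):

```python
from urllib.parse import urlencode
from urllib.request import Request

data = urlencode({"k": "b"})

# GET: the parameters ride along in the URL's query string
get_req = Request("http://zzk.cnblogs.com/s?" + data)

# POST: the parameters go in the request body, passed as bytes
post_req = Request("http://zzk.cnblogs.com/s", data=data.encode("utf-8"))

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```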
urllib:
```python
import urllib
import re

response = urllib.urlopen("https://www.baidu.com")  # open the page; returns a response object
response_code = response.getcode()                  # get the status code
response_body = response.read()                     # get the page content
# save the content at a URL directly to a file
save = urllib.urlretrieve("https://www.baidu.com", filename="/home/guido/python/baidu.html")
images = re.findall(r"src='(.*?\.jpg)'", response_body)  # extract data with a regular expression
urllib.urlretrieve(images[0], filename="/home/guido/python/baidu_images.jpg")
```

Building a query string:
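Note that the regular expression above only matches `src` attributes wrapped in single quotes; here is a quick self-contained check against some dummy markup (the example.com URLs are placeholders, not from the original post):

```python
import re

body = "<img src='http://img.example.com/a.jpg'><img src='http://img.example.com/b.jpg'>"
images = re.findall(r"src='(.*?\.jpg)'", body)  # capture every single-quoted .jpg URL
print(images)
```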
```python
import urllib

params = urllib.urlencode({"t": "b", "w": "ios"})
url = "http://zzk.cnblogs.com/s?" + params
print(url)
```

Output: http://zzk.cnblogs.com/s?t=b&w=ios
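`urlencode` also percent-escapes reserved characters, which is why building query strings by hand is error-prone; a small Python 3 illustration (in 2.7 the function lives at `urllib.urlencode`):

```python
from urllib.parse import urlencode  # Python 2.7: urllib.urlencode

params = urlencode({"w": "hello world", "t": "a&b"})
print(params)  # spaces become '+', '&' becomes '%26'
```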
urllib2:
```python
import urllib2

url = "http://www.phpno.com"

# forge a browser request header
send_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "ASPSESSIONIDCCTRDBQT=OJNFDDEANPLCEFLECFILODNN; Hm_lvt_39dcd5bd05965dcfa70b1d2457c6dcae=1484820976,1484821014,1484821053; Hm_lpvt_39dcd5bd05965dcfa70b1d2457c6dcae=1484821053",
    "Host": "www.nm3dp.com",
    "Referer": "https://www.baidu.com/link?url=Q_AEn1rb05AX6miw616Tx5bIWILq5K_FpUQl_eyJ7TS&wd=&eqid=cb712bbf00052caf00000003588091e9",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
}

req = urllib2.Request(url, headers=send_headers)  # bundle the URL and headers into one request
r = urllib2.urlopen(req)
print(r.read())
```
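`urllib2` exists only in Python 2; in Python 3 the same header-spoofing pattern lives in `urllib.request`. A minimal sketch (the URL and User-Agent string here are placeholders, and no request is actually sent):

```python
from urllib.request import Request

url = "http://www.phpno.com"
send_headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}  # pretend to be a browser

req = Request(url, headers=send_headers)  # bundle URL and headers, as urllib2.Request did
# urllib.request normalizes header names internally, hence "User-agent"
print(req.get_header("User-agent"))
```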
Beautiful Soup
```python
import urllib
from bs4 import BeautifulSoup

response = urllib.urlopen("http://www.3jy.com/")
html = response.read()
```

Create a BeautifulSoup object:

```python
soup = BeautifulSoup(html)
```

Pretty-print the contents of the soup object:

```python
print(soup.prettify())
```
Find tags:

```python
soup.title
soup.head
soup.b
soup.a
```

Find attributes:

```python
soup.p.attrs
```

Get the text:

```python
soup.p.string
```
CSS selectors:

```python
soup.select('title')              # find by tag name
soup.select('.sister')            # find by class name
soup.select('#link1')             # find by id
soup.select('p #link1')           # combined search
soup.select('head > title')       # direct-child search
soup.select("a[class='sister']")  # find by attribute
soup.p['class']                   # get the value of an attribute on a tag
```

Indexing into the list that `select` returns gives you back an object that itself supports `select`, so you can keep drilling down:

```python
aa = soup.select('body')
bb = aa[0]
cc = bb.select("a[class='sister']")
```
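All of the lookups above can be tried on an inline HTML snippet instead of a live page; a self-contained sketch (the markup is invented for illustration, and the `beautifulsoup4` package is assumed to be installed):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><head><title>Demo</title></head>
<body><p class="story">
  <a class="sister" id="link1" href="http://example.com/a">A</a>
  <a class="sister" id="link2" href="http://example.com/b">B</a>
</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                      # tag lookup
print(len(soup.select(".sister")))            # class-name lookup
print(soup.select("#link1")[0]["href"])       # id lookup plus attribute access
print(soup.select("head > title")[0].string)  # direct-child lookup
```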
Reprinted from: https://www.cnblogs.com/Guido-admirers/p/6307739.html