04 - Regex Parsing + bs4 Basic Crawler
Regex

Regular expressions match patterns in strings.

Essential regex skills:

Metacharacters (each matches a single character):
1. Ordinary characters
2. Special characters
   .      matches any character except a newline
   \w     matches a digit, letter, or underscore
   \d     matches a digit
   \W     matches anything except a digit, letter, or underscore
   \D     matches anything except a digit
   ^      matches the start of the string
   $      matches the end of the string
   [abc]  character class: matches any one of the listed characters
   [^abc] negated character class: matches any character not listed
3. Quantifiers
   A quantifier modifies how many times the preceding unit may occur:
   *   zero or more times
   +   one or more times
   ?   zero or one time
   {n} exactly n times  # \d{3} matches exactly 3 digits
4. Greedy matching
   By default, quantifiers (*, +, ?) match as much content as possible.
5. Lazy matching
   a.*?x matches from a up to the first following x. Lazy matching consumes as little as possible (implemented with backtracking).
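A quick sketch of the greedy/lazy distinction and the {n} quantifier using Python's re module (the sample strings are made up for illustration):

```python
import re

s = "aXXXxYYYx"

# Greedy: .* grabs as much as possible, so the match runs to the LAST x.
print(re.findall(r"a.*x", s))    # ['aXXXxYYYx']

# Lazy: .*? stops at the FIRST x after a.
print(re.findall(r"a.*?x", s))   # ['aXXXx']

# {n}: exactly n occurrences of the preceding unit.
print(re.findall(r"\d{3}", "12345"))  # ['123']
```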
```python
import re

# findall returns every match as a list of strings
lst = re.findall(r"\d+", "哈哈99呵呵88嘿嘿77")
print(lst)  # ['99', '88', '77']
```
```python
# finditer returns an iterator of Match objects
it = re.finditer(r"\d+", "哈哈99呵呵88嘿嘿77")
for item in it:
    print(item.group())

it = re.finditer(r"\d+", "1哈哈哈sdasd12123ha4562哈哈哈78922")
obj = it.__next__()   # pull the first match manually
print(obj.group())    # 1
for item in it:       # iterate over the remaining matches
    print(item.group())
```
```python
# match only matches at the beginning of the string
result = re.match(r"\d+", "123456哈哈99呵呵88嘿嘿77")
print(result.group())  # 123456

# search finds the first match anywhere in the string
result1 = re.search(r"\d+", "哈哈123456哈哈99呵呵88嘿嘿77")
print(result1.group())  # 123456
```
```python
# precompile a pattern for reuse
obj = re.compile(r"www\.baidu\.com")
it = obj.finditer("www.baidu.com,www.jd.com")
for item in it:
    print(item.group())

# named group: (?P<gpc>...)
obj = re.compile(r"www\.(?P<gpc>baidu|jd)\.com")
it = obj.finditer("www.baidu.com,www.jd.com")
for item in it:
    print(item.group("gpc"))

# findall returns only the contents of the capture group
obj = re.compile(r"www\.(baidu|jd)\.com")
it = obj.findall("www.baidu.com,www.jd.com")
print(it)  # ['baidu', 'jd']
```
```python
from urllib.request import urlopen
import re

content = urlopen("https://news.163.com/").read().decode("gbk")
obj = re.compile(r'<div class="mod_top_news2" id="js_top_news">.*?">(?P<title>.*?)</a>', re.S)
lst = obj.findall(content)
print(lst)
```
Note: how does the page source shown in the browser dev tools differ between the Elements tab and the Network tab?
- Elements shows the complete page source after all of the current page's data has finished loading, including dynamically loaded data.
- Network shows only the response data of one individual request, which does not include dynamically loaded data.
Conclusion: data parsing always requires analyzing the page layout first. If the site has no dynamically loaded data, you can analyze the layout directly in Elements; otherwise you can only analyze the page data through Network.
Static site (figure omitted):
Dynamic loading (figure omitted):
Extracting image links from the xiaohua site (www.521609.com) with regex
```python
import requests
import urllib.request  # needed for urlretrieve
import re
import os

dirName = "校花圖片"
if not os.path.exists(dirName):
    os.mkdir(dirName)

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
url = "http://www.521609.com/daxuemeinv/"
page_text = requests.get(url=url, headers=headers).text

ex = '<li>.*?<img src="(.*?)" width=.*?</li>'
img_src_list = re.findall(ex, page_text, re.S)
for src in img_src_list:
    print(src)
    src = "http://www.521609.com" + src  # the src in the page is a relative path
    imgpath = dirName + '/' + src.split('/')[-1]
    urllib.request.urlretrieve(src, imgpath)
    print(imgpath, "download complete")
```
bs4

- How bs4 parsing works:
  - Instantiate a BeautifulSoup object and load the page source to be parsed into it.
  - Call the BeautifulSoup object's methods and attributes to locate tags and extract text data.
- Environment setup:
  - pip install lxml  # parser
  - pip install bs4
- Instantiating a BeautifulSoup object:
  - BeautifulSoup(fp, 'lxml'): parses the data in a locally stored HTML file
  - BeautifulSoup(page_text, 'lxml'): parses page source requested from the internet
- Tag location:
  - soup.tagName: locates only the first occurrence of the tagName tag
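The local-file constructor variant above can be sketched against a tiny throwaway file (the file name demo.html is made up for illustration; html.parser is Python's built-in parser, used here so the sketch needs no lxml install):

```python
from bs4 import BeautifulSoup

# Write a tiny HTML file so the local-file constructor has something to read.
with open("demo.html", "w", encoding="utf-8") as f:
    f.write("<html><body><p>hello</p></body></html>")

# BeautifulSoup(fp, ...) variant: pass an open file handle.
with open("demo.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup.p.string)  # hello
```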
```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
print(soup.li)  # only the first <li> on the page
```
- soup.find('tagName', attrName='value'): locates a tag by attribute
```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
# class_ has a trailing underscore because class is a Python keyword
print(soup.find('div', class_="book-mulu"))
```
```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
print(soup.find('div', id="top_right_nav"))  # locate by id
print(soup.find('p', class_="des"))          # locate by class
```
```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
print(soup.find('p', class_="des"))     # first match only
print(soup.findAll('p', class_="des"))  # all matches, as a list (spelled find_all in newer bs4)
```
- soup.select("CSS selector"):
  - > denotes one direct-child level
  - a space denotes any number of descendant levels
For example:
```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
print(soup.select('.book-mulu > ul > li > a'))  # direct-child chain
print(soup.select('.book-mulu a'))              # any descendant <a>
```
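The difference between > and a space can be checked offline on a made-up snippet (the markup below is illustrative, not from the real site):

```python
from bs4 import BeautifulSoup

html = """
<div class="book-mulu">
  <ul><li><a href="/a">direct</a></li></ul>
  <p><a href="/b">nested deeper</a></p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# '>' requires each step to be a direct child: only the <a> inside <li> matches.
print(len(soup.select('.book-mulu > ul > li > a')))  # 1

# A space matches descendants at any depth: both <a> tags match.
print(len(soup.select('.book-mulu a')))              # 2
```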
文本數(shù)據(jù)提取:
- .text : 返回的是改標(biāo)簽下的所有文本內(nèi)容
- .string: 返回的是改標(biāo)簽直系下的文本內(nèi)容
```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_href = soup.select('.book-mulu > ul > li > a')
for a in a_href:
    print(a.string)
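The .text / .string contrast can also be seen on a made-up snippet: .string returns None whenever the tag has more than one child.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>hello<span>world</span></div>', 'html.parser')

print(soup.div.text)     # helloworld  (all descendant text, concatenated)
print(soup.div.string)   # None        (div has two children, so no single string)
print(soup.span.string)  # world       (span has exactly one text child)
```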
Getting the href attribute:
```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_href = soup.select('.book-mulu > ul > li > a')
for a in a_href:
    print(a['href'])  # a tag behaves like a dict for its attributes
```
```python
import requests
from bs4 import BeautifulSoup

fp = open("三國演義.txt", "w", encoding="utf-8")
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
for a in a_list:
    title = a.string
    detail_url = "http://www.shicimingju.com" + a['href']
    page_text_detail = requests.get(url=detail_url, headers=headers).text
    soup = BeautifulSoup(page_text_detail, 'lxml')  # parse the chapter detail page
    div_tag = soup.find('div', class_="chapter_content")
    content = div_tag.text
    fp.write(title + ':' + content + "\n")
    print(title, "saved successfully!")
fp.close()
```