
04 - Regex Parsing + bs4 Basic Crawler


Regex

Matching strings

Essential regex skills:

1. Ordinary characters match themselves.
2. Special characters (each matches a single character):
   .      matches any character except a newline
   \w     matches a digit, letter, or underscore
   \d     matches a digit
   \W     matches any character that is not a digit, letter, or underscore
   \D     matches any non-digit character
   ^      matches the start of the string
   $      matches the end of the string
   [abc]  character class: matches any one of a, b, c
   [^abc] negated character class: matches any character other than a, b, c
3. Quantifiers
   A quantifier controls how many times the preceding unit may occur:
   *      zero or more times
   +      one or more times
   ?      zero or one time
   {n}    exactly n times   # \d{3} matches exactly three digits
4. Greedy matching
   By default, quantifiers (*, +, ?) match as much text as possible.
5. Lazy matching
   a.*?x matches as little as possible (it stops at the first x after an a); lazy matching works via backtracking.
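To make the greedy/lazy difference concrete, a small sketch (the sample string is made up for illustration):

import re

text = "a123x456x"

# greedy: .* grabs as much as possible, so the match runs to the last x
print(re.findall(r"a.*x", text))   # ['a123x456x']

# lazy: .*? grabs as little as possible, so the match stops at the first x
print(re.findall(r"a.*?x", text))  # ['a123x']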

import re  # module for working with regular expressions

lst = re.findall(r"\d+", "哈哈99呵呵88嘿嘿77")
print(lst)  # \d matches digits -> ['99', '88', '77']

it = re.finditer(r"\d+", "哈哈99呵呵88嘿嘿77")
for item in it:
    print(item.group())
# 99
# 88
# 77

it = re.finditer(r"\d+", "1哈哈哈sdasd12123ha4562哈哈哈78922")
obj = it.__next__()    # the iterator yields the first match first
print(obj.group())     # 1
# iterate over the remaining matches
for item in it:
    print(item.group())
# 12123
# 4562
# 78922

# match only matches at the beginning of the string, and only once
result = re.match(r"\d+", "123456哈哈99呵呵88嘿嘿77")
print(result.group())  # 123456

# search returns as soon as it finds the first match anywhere in the string
result1 = re.search(r"\d+", "哈哈123456哈哈99呵呵88嘿嘿77")
print(result1.group())  # 123456

obj = re.compile(r"www\.baidu\.com")
it = obj.finditer("www.baidu.com,www.jd.com")
for item in it:
    print(item.group())  # www.baidu.com

# parentheses capture the part of the match you actually want;
# (?P<name>...) gives the capture group a name
obj = re.compile(r"www\.(?P<gpc>baidu|jd)\.com")
it = obj.finditer("www.baidu.com,www.jd.com")
for item in it:
    print(item.group("gpc"))
# baidu
# jd

obj = re.compile(r"www\.(baidu|jd)\.com")
it = obj.findall("www.baidu.com,www.jd.com")
print(it)  # ['baidu', 'jd']

from urllib.request import urlopen  # stdlib module for opening URLs
import re

# open the URL and grab the page source
content = urlopen("https://news.163.com/").read().decode("gbk")
# print(content)
obj = re.compile(r'<div class="mod_top_news2" id="js_top_news">.*?">(?P<title>.*?)</a>', re.S)
lst = obj.findall(content)
print(lst)


Side note: how does the page source shown in the DevTools Elements tab differ from what the Network tab shows?

  • Elements shows the page source after all of the page's data has finished loading, i.e. the complete rendered page, including dynamically loaded data.
  • Network shows only the response data of one individual request, which does not include dynamically loaded data.
    Conclusion: when you need to analyze the page layout before parsing, you can analyze it directly in Elements if the site has no dynamically loaded data; otherwise you must analyze the actual response data in Network.
A static site (screenshot omitted):

Dynamically loaded data (screenshot omitted):
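A rough way to check this programmatically, as a sketch (the URL and keyword below are placeholders you would substitute): fetch the raw page with requests and test whether text visible in the rendered page actually appears in the response.

import requests

# placeholders: substitute the target URL and a string you can see on the rendered page
url = "https://example.com/page"
keyword = "some visible text"

raw_html = requests.get(url).text
if keyword in raw_html:
    print("found in the raw response -> not dynamically loaded")
else:
    print("missing from the raw response -> loaded dynamically; inspect the Network/XHR tab")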

Using regex to extract the image links from 校花網 (www.521609.com)

# -*- coding: utf-8 -*-
import requests
import urllib.request
import re
import os

dirName = "校花圖片"                 # directory to save the images into
if not os.path.exists(dirName):     # check whether the directory already exists
    os.mkdir(dirName)               # create it if it does not

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}  # spoof a browser User-Agent
url = "http://www.521609.com/daxuemeinv/"                # target url
page_text = requests.get(url=url, headers=headers).text  # fetch the page source of the first page

ex = '<li>.*?<img src="(.*?)" width=.*?</li>'   # regex to extract image paths from the page source
img_src_list = re.findall(ex, page_text, re.S)  # re.S lets . also match newlines
for src in img_src_list:
    print(src)
    src = "http://www.521609.com" + src           # the extracted path is relative, so prepend the domain
    imgpath = dirName + '/' + src.split('/')[-1]  # name the file after the last path segment
    urllib.request.urlretrieve(src, imgpath)      # download the image
    print(imgpath, "download complete")

bs4

  • How bs4 parsing works
    • Instantiate a BeautifulSoup object and load the page source to be parsed into that object
    • Call the BeautifulSoup object's methods and attributes to locate tags and extract text data
  • Environment setup
    • pip install lxml  # the parser
    • pip install bs4
  • Instantiating a BeautifulSoup object:
    • BeautifulSoup(fp, 'lxml'): parses the data in a locally stored HTML document
    • BeautifulSoup(page_text, 'lxml'): parses page source fetched from the internet
  • Tag locating:

- soup.tagName: locates only the first occurrence of the tagName tag

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
# with open("三國.html", "w", encoding="utf-8") as f:
#     f.write(page_text)
soup = BeautifulSoup(page_text, 'lxml')  # instantiate a BeautifulSoup object
print(soup.li)  # soup.tagName: locates only the first occurrence of the tag
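The instantiation bullets above also mention the local-file form, BeautifulSoup(fp, 'lxml'). A minimal sketch, assuming the page was previously saved as 三國.html (as in the commented-out lines above):

from bs4 import BeautifulSoup

# parse a locally stored HTML document instead of a live response
with open("三國.html", "r", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'lxml')  # bs4 accepts an open file object directly
print(soup.li)  # the tag-locating API is the same as with page_text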


- soup.find('tagName', attrName='value'): locates a tag by attribute

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')  # instantiate a BeautifulSoup object
print(soup.find('div', class_="book-mulu"))  # soup.find('tagName', attrName='value'): attribute-based locating


# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')  # instantiate a BeautifulSoup object
print(soup.find('div', id="top_right_nav"))  # locating by the id attribute
print(soup.find('p', class_="des"))

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')  # instantiate a BeautifulSoup object
print(soup.find('p', class_="des"))     # find returns only the first match
print(soup.findAll('p', class_="des"))  # findAll locates by attribute like find, but returns a list of all matches (alias of find_all)


soup.select("selector")

  • class selectors
  • id selectors
  • hierarchy selectors

> : denotes one level (a direct child)

a space denotes multiple levels (any descendant)

For example:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')  # instantiate a BeautifulSoup object
print(soup.select('.book-mulu > ul > li > a'))  # soup.select("selector"); > denotes one level
print(soup.select('.book-mulu a'))              # a space denotes any number of levels

文本數(shù)據(jù)提取:

  • .text : 返回的是改標(biāo)簽下的所有文本內(nèi)容
  • .string: 返回的是改標(biāo)簽直系下的文本內(nèi)容
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')  # instantiate a BeautifulSoup object
a_href = soup.select('.book-mulu > ul > li > a')
for a in a_href:
    print(a.string)  # or a.text
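To see the .text / .string difference in isolation, a minimal sketch with an inline HTML snippet (made up for illustration):

from bs4 import BeautifulSoup

html = '<div><p>hello <b>world</b></p></div>'
soup = BeautifulSoup(html, 'lxml')

p = soup.find('p')
print(p.text)    # 'hello world' -> all text under the tag, including the nested <b>
print(p.string)  # None -> the <p> has more than one child, so there is no single direct string

b = soup.find('b')
print(b.string)  # 'world' -> a tag with exactly one string child returns that string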


Extracting the url attribute

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')  # instantiate a BeautifulSoup object
a_href = soup.select('.book-mulu > ul > li > a')
for a in a_href:
    print(a['href'])  # extract an attribute by subscripting the tag

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

fp = open("三國演義.txt", "w", encoding="utf-8")
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
url = "http://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')  # instantiate a BeautifulSoup object
a_list = soup.select('.book-mulu > ul > li > a')
for a in a_list:
    title = a.string                                       # chapter title
    detail_url = "http://www.shicimingju.com" + a['href']  # the extracted href is relative, so prepend the domain
    # request the detail page and parse out the chapter content
    page_text_detail = requests.get(url=detail_url, headers=headers).text
    detail_soup = BeautifulSoup(page_text_detail, 'lxml')
    div_tag = detail_soup.find('div', class_="chapter_content")
    content = div_tag.text
    fp.write(title + ':' + content + "\n")
    print(title, "saved successfully!!!")
fp.close()
