Python Crawler in Practice: Automated Scraping of Quora Text and Regex-Based Filtering

Published: 2023/12/16 · python · 豆豆


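Web crawling and regex matching

How it works

Use requests to fetch the page and get its HTML, then use re regular expressions to match and clean up the text.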
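
The matching step can be tried offline before any request is sent. The sketch below is only an illustration: sample_html is a made-up snippet standing in for response.text (it is not real Quora markup), and extract is a hypothetical helper name.

```python
import re

# Made-up HTML standing in for response.text; not real Quora markup.
sample_html = (
    '<html><head><title>Is online education overrated? - Quora</title></head>'
    '<body>{"text": "First answer"} {"text": "Second answer"}</body></html>'
)

def extract(html):
    """Return (title, texts) pulled out of an HTML string with regular expressions."""
    # The non-greedy (.*?) stops at the first closing tag or quote, not the last one.
    title = ' '.join(re.findall(r'<title>(.*?)</title>', html))
    texts = re.findall(r'"text": "(.*?)"', html)
    return title, texts

title, texts = extract(sample_html)
print(title)
print(texts)
```

re.findall returns a list of the captured groups, which is why the title is joined back into a single string.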

Code

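# -*- coding: utf-8 -*-
# The line above declares the source file's encoding (optional in Python 3, where UTF-8 is the default).
import re
import requests

response = requests.get('https://www.quora.com/Is-online-education-overrated')  # page to scrape
data = response.text  # page source as text; set response.encoding first to change the decoding
title = ' '.join(re.findall('<title>(.*?)</title>', data))  # page title
# Find the contents of the "text" fields, both plain and backslash-escaped.
# The tags to look for depend on how the page's HTML is structured.
result_list = re.findall('"text": "(.*?)."', data) + re.findall(r'''\\\\\\"text\\\\\\": \\\\\\"(.*?).\\\\\\",''', data)
print(title)  # show the title

with open("words.txt", "a") as f:  # "a" appends, creating the file if needed
    f.write('\n')  # blank line to separate this run from earlier ones
    for result in result_list:  # post-process each match
        result = result.replace(r'\u2019', "'")  # convert the escaped apostrophe by hand
        result = result.replace('\\\\\\', "")  # drop runs of three backslashes
        result = result.replace('/', "")  # drop forward slashes
        result = result.replace(r'\n', "")  # drop literal "\n" sequences
        result = result.replace('\\', "")  # drop remaining single backslashes
        # Split into whitespace-separated tokens and discard image-modifier
        # fragments and tokens of 13 or more characters. Building a new list
        # avoids the skipped items caused by removing during iteration.
        check = [ele for ele in result.split()
                 if not ('"modifiers": {"image": ' in ele or len(ele) >= 13)]
        result = ' '.join(check)  # back to a single string
        f.write(result + ", " + title + "\n")  # one line per match, tagged with the title
        print(result)  # show what was written
# The with block closes (and so saves) the file automatically.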
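
A note on the token-filtering step: calling list.remove() inside a for loop over the same list makes the loop skip the element right after each removal, so it is safer to build a new list. The sketch below uses made-up tokens, and buggy_filter and safe_filter are hypothetical names chosen for illustration.

```python
# Made-up tokens; two adjacent ones exceed the length limit.
tokens = ["learning", "averyveryverylongtoken", "anotherverylongtoken1", "is", "fun"]

def buggy_filter(words, max_len=13):
    """Remove long tokens while iterating: the item after each removal is skipped."""
    words = list(words)  # copy so the caller's list is left alone
    for w in words:
        if len(w) >= max_len:
            words.remove(w)
    return words

def safe_filter(words, max_len=13):
    """Build a new list holding only the tokens shorter than max_len."""
    return [w for w in words if len(w) < max_len]

print(buggy_filter(tokens))  # the second long token survives
print(safe_filter(tokens))   # both long tokens are gone
```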

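Possible extensions

To demonstrate regex matching and crawling, the code includes a number of steps that are not strictly necessary, and the matching is not perfect. Most of the output is fairly accurate, for example: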

Culture of Qualit "On quality Terry Anderson emphasized that ""learning- knowledge- assessment- and educational experiences will result in high levels of learning by all He also believes that the ""integration of the new tools and affordances of the educational Semantic Web and emerging social software solutions will further enhance and make more accessible and affordable quality online learning experiences" Since I have titled this observation as ODeL Xperitu [from Latin experitu = experienced tested proven] let me say that learning must progress to maturity; to function well as social innovators promoting excellence through Capacity Building and Development. Yes this is the Quality Assurance (QA) principle that defines and determ "Michael Moore even says that this is a fact of distance education wherein ""teaching is hardly ever an individual act but a process joining together the expertise of a number of specialists."

However, some odd characters still slip through:

#NAME? #NAME? #NAME? #NAME? #NAME? For any Query or Enquiry Please Call u2013 Hiren Harwani - 9712186969 (you can join us in Whatsapp also
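
The stray "u2013" above looks like what remains of a \u2013 escape (an en dash) once its backslash has been stripped, although that is an inference from the output rather than something the page confirms. One possible fix, sketched below under that assumption, is to decode JSON-style escapes before removing any backslashes. decode_json_escapes is a hypothetical helper, and the sample string is made up.

```python
import json

def decode_json_escapes(raw):
    """Decode JSON string escapes (unicode escapes, newlines, ...) in a captured fragment."""
    try:
        # Wrapping the fragment in quotes turns it into a JSON string literal.
        return json.loads('"' + raw + '"')
    except json.JSONDecodeError:
        # Not a valid JSON string body (e.g. an unescaped quote): leave it unchanged.
        return raw

raw = r"Please call \u2013 it\u2019s open"  # made-up fragment with escaped characters
print(decode_json_escapes(raw))
```

Fragments that fail to parse are returned untouched, so the manual replace() cleanup can still be applied to them afterwards.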

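See whether you can improve the program yourself!
Also, when scraping Chinese web pages, remember to set the encoding to utf-8.
There are many other third-party libraries out there as well, such as Beautiful Soup 4, that are well worth exploring.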
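
On the encoding point, the sketch below shows with the standard library alone what goes wrong when UTF-8 bytes are decoded with the wrong codec. The page content is made up, and ISO-8859-1 is used only as an example of a wrong guess.

```python
# Bytes as a server might send them for a UTF-8 encoded Chinese page (made-up content).
page_bytes = "<title>網頁爬蟲</title>".encode("utf-8")

# Decoding with the wrong codec raises no error; it silently yields mojibake.
garbled = page_bytes.decode("iso-8859-1")

# Decoding with the right codec recovers the original text.
correct = page_bytes.decode("utf-8")

print(garbled)
print(correct)
```

With requests, the equivalent step is to assign response.encoding = 'utf-8' before reading response.text, since response.text is decoded using response.encoding.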
