Python Web Scraping in Practice: Automatically Scraping Text from Quora and Filtering It with Regular Expressions
Web scraping and regular expression matching
How it works
Use requests to fetch the page's HTML, then use re regular expressions to match and clean up the text.
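The matching step can be tried without a network connection. Below is a minimal sketch that runs the same kind of patterns on a hard-coded string; `sample_html` is made-up markup for illustration, not real Quora source.

```python
import re

# Hypothetical page source standing in for response.text; not real Quora markup.
sample_html = (
    '<html><head><title>Demo page</title></head>'
    '<body>{"text": "first answer"} {"text": "second answer"}</body></html>'
)

# The non-greedy group captures everything between the title tags.
title = ' '.join(re.findall(r'<title>(.*?)</title>', sample_html))

# Capture the value of each "text" field up to the next double quote.
answers = re.findall(r'"text": "(.*?)"', sample_html)

print(title)
print(answers)
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` would run from the first opening quote to the last closing quote on the line and merge both answers into one match.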
Code
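# -*- coding: utf-8 -*-
# The line above declares the source file encoding, so Chinese text in the file causes no problems.
import re
import requests

# Page to scrape
response = requests.get('https://www.quora.com/Is-online-education-overrated')
data = response.text  # page source as text; the encoding can be changed

# Page title
title = ' '.join(re.findall('<title>(.*?)</title>', data))

# These patterns are fairly involved. They look for the contents of "text" fields,
# in plain form and in backslash-escaped form. Pages with different HTML need different targets.
# Note: the "." before the closing quote consumes the last character of each match.
result_list = re.findall('"text": "(.*?)."', data) + re.findall(r'''\\\\\\"text\\\\\\": \\\\\\"(.*?).\\\\\\",''', data)

print(title)  # print the title

# Mode "a" creates the file if it does not exist and appends to it
with open("words.txt", "a", encoding="utf-8") as f:
    f.write('\n')  # start on a new line
    for result in result_list:  # post-process each match
        result = result.replace(r'\u2019', "'")  # convert this escape sequence by hand
        result = result.replace('\\\\\\', "")    # remove runs of three backslashes
        result = result.replace(r'/', "")        # remove forward slashes
        result = result.replace(r'\n', "")       # remove literal \n sequences
        result = result.replace('\\', "")        # remove remaining single backslashes
        words = result.split()                   # split into whitespace-separated tokens
        # Drop unrelated tokens. Building a new list avoids removing items from the list
        # while iterating over it, which skips elements. The "modifiers" check is kept
        # from the original, although a whitespace-split token cannot contain it.
        words = [w for w in words if '"modifiers": {"image": ' not in w and len(w) < 13]
        result = ' '.join(words)                 # back to a single string
        f.write(result + ", " + title + "\n")    # write to the file
        print(result)                            # print what was written
# The with block closes (and saves) the file automatically

Extending the functionality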
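To demonstrate both regex matching and scraping, the code includes a fair amount of non-essential code, and the matching is not perfect. Most of the output is reasonably accurate, for example: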
Culture of Qualit "On quality Terry Anderson emphasized that ""learning- knowledge- assessment- and educational experiences will result in high levels of learning by all He also believes that the ""integration of the new tools and affordances of the educational Semantic Web and emerging social software solutions will further enhance and make more accessible and affordable quality online learning experiences" Since I have titled this observation as ODeL Xperitu [from Latin experitu = experienced tested proven] let me say that learning must progress to maturity; to function well as social innovators promoting excellence through Capacity Building and Development. Yes this is the Quality Assurance (QA) principle that defines and determ "Michael Moore even says that this is a fact of distance education wherein ""teaching is hardly ever an individual act but a process joining together the expertise of a number of specialists."
However, some odd characters still are not removed:
#NAME? #NAME? #NAME? #NAME? #NAME? For any Query or Enquiry Please Call u2013 Hiren Harwani - 9712186969 (you can join us in Whatsapp also
See whether you can improve the program yourself!
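One possible improvement, as a rough sketch: the leftover `u2013` above appears to be a `\u2013` escape whose backslash was stripped, since the script only converts `\u2019` by hand. Decoding every `\uXXXX` sequence before removing backslashes would cover such cases. The helper name `decode_unicode_escapes` and the sample string are assumptions made for illustration.

```python
import re

def decode_unicode_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they encode."""
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# Raw string, so the backslashes are literal, as they are in the page source.
raw = r"Please Call \u2013 it\u2019s free"
cleaned = decode_unicode_escapes(raw)
print(cleaned)
```

This would run before the `replace('\\', "")` step. It does not combine surrogate pairs, so characters outside the Basic Multilingual Plane would need extra handling.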
Also, when scraping Chinese pages, remember to set the encoding to utf-8.
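Why the encoding matters can be shown with plain bytes, no network needed. This is only an illustrative sketch: the sample text is made up, and latin-1 stands in for whatever wrong codec might be guessed.

```python
original = "在线教育"

# Bytes as a server might send them for a Chinese page (UTF-8 encoded).
body = original.encode("utf-8")

# Decoding with the wrong codec raises no error; it silently yields garbled text.
wrong = body.decode("latin-1")

# Decoding as UTF-8 recovers the original text.
right = body.decode("utf-8")

print(wrong == original)
print(right == original)

# With requests, the equivalent step is setting response.encoding = "utf-8"
# before reading response.text.
```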
There are also many other third-party libraries, such as Beautiful Soup 4, that are well worth exploring.