Web Scraping in Practice (3): Fetching News and Poems from List Pages

Published: 2023/12/31


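The first two parts of this article aim to do the following: given a link, fetch the titles and content of the news items on a paginated list. Part of the code is adapted from Web Scraping in Practice (2): Sina News Content Details. The site being scraped is Sina's international news channel.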

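1. A single news article

Fetch the content of a single item from the latest international news.
The program from the article linked above needed only small parameter changes; the main one is how the comment count is fetched.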

# Given a news URL, return its comment count. The comment endpoint differs
# between articles only in the news id.
import json
import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup

commentURL = (
    "https://comment.sina.com.cn/page/info?version=1&format=json"
    "&channel=gj&newsid=comos-i{}&group=0&compress=0&ie=utf-8&oe=utf-8&page=1"
    "&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user"
    "&callback=jsonp_1601956837238&_=1601956837238"
)


def getCommentCounts(newsurl):
    m = re.search(r'doc-ii(.+)\.shtml', newsurl)
    newsid = m.group(1)  # news id taken from the article URL
    comments = requests.get(commentURL.format(newsid))
    # The response is JSONP: keep only what sits between the outer parentheses.
    # str.strip removes a set of characters, not a prefix, so it is avoided here.
    text = comments.text
    jd = json.loads(text[text.index('(') + 1:text.rindex(')')])
    return jd["result"]["count"]["total"]  # comment count


# Input: article URL. Output: title, source, time, body text, editor, comment count.
def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select(".main-title")[0].text
    result['newssource'] = soup.select(".source")[0].text
    timesource = soup.select(".date")[0].text
    result['dt'] = datetime.strptime(timesource, "%Y年%m月%d日 %H:%M")
    paragraphs = soup.select("#article p")
    result['article'] = '\n'.join(p.text.strip() for p in paragraphs[:-1])
    # Drop the "editor:" label. Both the Simplified and the Traditional spelling
    # are accepted, because the page text and this listing may differ in script.
    result['editor'] = re.sub(r'^(责任编辑|責任編輯)：', '', paragraphs[-1].text.strip())
    result['comments'] = getCommentCounts(newsurl)
    return result


news = "https://news.sina.com.cn/w/2020-10-06/doc-iivhvpwz0572161.shtml"
getNewsDetail(news)
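The comment endpoint returns JSONP, that is, JSON wrapped in a callback call such as jsonp_1601956837238(...). Because str.strip takes a set of characters rather than a prefix, stripping the callback name can also eat characters that belong to the payload. The following is a minimal sketch of a more defensive unwrapping step. The helper name unwrap_jsonp and the sample payload are assumptions made for illustration; they are not part of the original program or of Sina's API.

```python
import json


def unwrap_jsonp(text):
    """Return the JSON payload of a JSONP response as a Python object.

    Hypothetical helper: it slices between the first '(' and the last ')'
    instead of relying on str.strip, which removes a set of characters.
    """
    start = text.index('(') + 1
    end = text.rindex(')')
    return json.loads(text[start:end])


# Made-up payload shaped like the fields the article's code reads
sample = 'jsonp_1601956837238({"result": {"count": {"total": 12}}})'
total = unwrap_jsonp(sample)["result"]["count"]["total"]
print(total)
```

The same helper would also work for a response that ends in ");", since only the text between the outermost parentheses is parsed.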

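2. News lists

Approach:
First, find the URL that controls pagination (the original post illustrated this with a screenshot).
Next, collect the links to all news items on each page.
Then fetch the content behind each link.
Finally, change the page number in the pagination URL and repeat.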
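The four steps above can be sketched as a single loop. This is only an outline under assumptions: crawl_pages, list_links and fetch_detail are hypothetical names, and the two callables are passed in as arguments so that the control flow can be tried without network access. In the article's program their roles are played by the list-page parsing and by getNewsDetail.

```python
def crawl_pages(url_template, first_page, last_page, list_links, fetch_detail):
    """Visit pages first_page..last_page (inclusive) and collect details."""
    details = []
    for page in range(first_page, last_page + 1):
        page_url = url_template.format(page)      # fill in the page number
        for link in list_links(page_url):         # links found on this page
            details.append(fetch_detail(link))    # content behind each link
    return details


# Stand-in callables (illustrative only) that fake a page with two items
def fake_list_links(page_url):
    return [page_url + "#a", page_url + "#b"]


def fake_fetch_detail(link):
    return {"url": link}


rows = crawl_pages("https://example.com/list?page={}", 1, 3,
                   fake_list_links, fake_fetch_detail)
print(len(rows))
```

Passing the fetching functions in as parameters is a design choice of this sketch; it keeps the paging logic separate from the site-specific parsing.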

import json

import pandas as pd
import requests


# Get the links on one list page, then call getNewsDetail (defined above) on each.
def parselistlink(url):
    newsdetails = []
    res = requests.get(url)
    # Remove the JSONP wrapper newsloadercallback(...); so the body parses as JSON
    text = res.text
    jd = json.loads(text[text.index('(') + 1:text.rindex(')')])
    for ent in jd['result']['data']:
        # Pass each article link on this page to getNewsDetail
        newsdetails.append(getNewsDetail(ent['url']))
    return newsdetails


url = 'https://interface.sina.cn/news/get_news_by_channel_new_v2018.d.html?cat_1=51923&show_num=27&level=1,2&page={}&callback=newsloadercallback&_=1601968313565'
result = pd.DataFrame()
# Fetch the first 5 pages (the upper bound of range is exclusive)
for i in range(1, 6):
    newsurl = url.format(i)
    newsary = parselistlink(newsurl)
    result = pd.concat([result, pd.DataFrame(newsary)], axis=0)
print(result)
# Drop repeated rows and renumber the index from 0
result1 = result.drop_duplicates(keep='first')
result1 = result1.reset_index(drop=True)
print(result1)
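The program removes duplicate rows with drop_duplicates after downloading, which suggests that the same item can show up on more than one page. Duplicates could also be filtered out before any article is requested, which would save requests. The sketch below is one possible refinement and not part of the original program; unique_links is a hypothetical helper and the URLs are made up.

```python
def unique_links(links):
    """Return links with duplicates removed, keeping first-seen order."""
    seen = set()
    ordered = []
    for link in links:
        if link not in seen:
            seen.add(link)
            ordered.append(link)
    return ordered


links = [
    "https://example.com/doc-1.shtml",
    "https://example.com/doc-2.shtml",
    "https://example.com/doc-1.shtml",  # repeated on a later page
]
print(unique_links(links))
```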

3. Poem lists

Poem list link: https://www.shicimingju.com/chaxun/zuozhe/9_2.html

1. First, get the links to the poems on one page

import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/chaxun/zuozhe/9.html'
base = 'https://www.shicimingju.com'
# Send browser-like headers so the server does not trivially identify the
# request as coming from a crawler
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
r = requests.get(url, headers=headers)
# Decode the raw bytes as UTF-8 (same effect as re-encoding r.text and decoding it)
html = r.content.decode('utf-8')
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', attrs={'class': 'card shici_card'})
# Each poem title is an <h3> that contains a link
hrefs = [h3.find('a')['href'] for h3 in div.find_all('h3')]
hrefs = [base + i for i in hrefs]
hrefs
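The hrefs are joined to the site root by string concatenation, which assumes that every href starts with a slash. A slightly more forgiving sketch uses urllib.parse.urljoin from the standard library, which also copes with hrefs that are already absolute. The sample href values below are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

base = 'https://www.shicimingju.com'

# Made-up href values of the kinds a list page might contain
raw_hrefs = [
    '/chaxun/list/1.html',                             # root-relative
    'https://www.shicimingju.com/chaxun/list/2.html',  # already absolute
]

full_links = [urljoin(base, href) for href in raw_hrefs]
print(full_links)
```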

2. Then get the links to all poems on every page

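import re

import requests
from bs4 import BeautifulSoup


def gethrefs(url):
    # Send browser-like headers so the server does not trivially identify the
    # request as coming from a crawler
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    base = 'https://www.shicimingju.com'
    nexturl = url
    ans = []
    while nexturl is not None:
        r = requests.get(nexturl, headers=headers)
        html = r.content.decode('utf-8')
        soup = BeautifulSoup(html, 'lxml')
        div = soup.find('div', attrs={'class': 'card shici_card'})
        hrefs = [h3.find('a')['href'] for h3 in div.find_all('h3')]
        hrefs = [base + i for i in hrefs]
        # Look for the "next page" link; either script of the label is accepted
        nextlink = soup.find('a', string=re.compile('^下一[页頁]$'))
        if nextlink is not None:
            nexturl = base + nextlink['href']
            print('Reading the next page')
        else:
            print('Already at the last page')
            nexturl = None
        ans.append(hrefs)  # one list of links per page
    return ans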
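gethrefs appends one list per page, so its return value is a list of lists. If a flat list of poem links is more convenient for the next step, itertools.chain.from_iterable can flatten it. A small sketch, with made-up links standing in for the real return value:

```python
from itertools import chain

# Stand-in for the value returned by gethrefs: one inner list per page
pages = [
    ['https://example.com/poem/1.html', 'https://example.com/poem/2.html'],
    ['https://example.com/poem/3.html'],
]

all_links = list(chain.from_iterable(pages))
print(len(all_links))
```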

3. Get the poem behind each link

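import os

import requests
from bs4 import BeautifulSoup


def writeotxt(url):
    # Send browser-like headers so the server does not trivially identify the
    # request as coming from a crawler
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    r = requests.get(url, headers=headers)
    # Pass the raw bytes and let BeautifulSoup work out the encoding
    soup = BeautifulSoup(r.content, 'lxml')
    # Data cleaning: title and body text of the poem
    title = soup.find('h1', id='zs_title').text
    content = soup.find('div', class_='item_content').text.strip()
    # Create the output folder first
    firedir = os.path.join(os.getcwd(), '蘇軾的詞')
    if not os.path.exists(firedir):
        os.mkdir(firedir)
    with open(os.path.join(firedir, '%s.txt' % title), mode='w+', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n')
    # The original message printed a counter i that is not defined in this
    # function, so the title is reported instead
    print('Saved poem: %s' % title)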
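Poem titles end up in file names, so characters that are invalid in paths are worth replacing before writing. The sketch below is a variant of the saving step under assumptions: safe_filename and save_poem are hypothetical helpers, the set of replaced characters is a conservative guess, and the sample title and text exist only to exercise the code.

```python
import os
import re
import tempfile


def safe_filename(title):
    """Replace characters that are not allowed in common file systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_')
    return cleaned or 'untitled'


def save_poem(folder, title, content):
    """Write one poem to <folder>/<cleaned title>.txt and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, safe_filename(title) + '.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n')
    return path


# Write into a temporary directory so the example leaves the working tree alone
out_dir = os.path.join(tempfile.mkdtemp(), 'poems')
saved = save_poem(out_dir, 'a/b: test', 'line one\nline two')
print(os.path.basename(saved))
```

os.makedirs with exist_ok=True replaces the separate existence check, and os.path.join avoids building paths by hand.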
