
Python: Scraping 2345 Movies and Writing the Results to a File

Published: 2025/3/21 · python · 豆豆 (collected by 生活随笔)
A simple crawler

1. Goal: scrape the latest 2017 movies from the 2345 movie site
2. Libraries used:

from bs4 import BeautifulSoup
import requests
import codecs

Test environment: Python 3.6.0
3. Target URL
http://dianying.2345.com/list/----2017---2.html
Click "next page" and observe how the URL changes from page to page.

4. Inspecting with the developer tools
All of the movie content sits inside the div with {'class': 'v_picConBox mt15'}; inside that div is a ul, and under the ul are the li tags that hold each movie.
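That div > ul > li walk can be checked with BeautifulSoup on a synthetic fragment first, as in this minimal sketch (the class names come from the page; the movie titles are made up, and html.parser is used here to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real listing page, using the same class names.
html = """
<div class="v_picConBox mt15">
  <ul class="v_picTxt pic180_240 clearfix">
    <li><span class="sTit">Movie A</span></li>
    <li><span class="sTit">Movie B</span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# The exact class string matches the div's full class attribute.
box = soup.find('div', attrs={'class': 'v_picConBox mt15'})
# Each movie is one <li>; the title lives in its span.sTit.
names = [li.find('span', attrs={'class': 'sTit'}).get_text()
         for li in box.find_all('li')]
```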

(1) First, write a function that fetches a whole page:

def getHTMLText(self, url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

(2) Next, work out the total number of pages; the pager can be found by inspecting the page elements:

def getPages(self):
    html = self.getHTMLText(self.urlBase)
    soup = BeautifulSoup(html, 'lxml')
    # tag = soup.find('div', attrs={'class': 'v_picConBox mt15'})
    tag = soup.find('div', attrs={'class': 'v_page'})
    subTags = tag.find_all('a')
    # get the number of pages
    return int(subTags[-2].get_text())
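The subTags[-2] indexing assumes the pager's last link is a "next page" link and the one just before it shows the final page number. The idea, isolated on illustrative link texts:

```python
# Link texts of a typical pager: page numbers, then a "next page" link.
# (Illustrative data; the real texts come from the <a> tags in div.v_page.)
link_texts = ['1', '2', '3', '...', '23', '下一页']

# The second-to-last link text is the last page number.
last_page = int(link_texts[-2])
```

If the pager ever drops the "next page" link on the final page, this index would point at the wrong element, which is worth keeping in mind.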

(3) A function to assemble the URLs
pages comes from the getPages() function above.

def getUrls(self, pages):
    urlHead = 'http://dianying.2345.com/list/----2017---'
    urlEnd = '.html'
    # pages comes from getPages() above; generates the numbers 1 through 23
    for i in range(1, pages + 1):
        url = urlHead + str(i) + urlEnd
        self.urls.append(url)
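The same assembly can be pulled out of the class and checked in isolation (build_urls is a hypothetical helper name, not part of the original program):

```python
def build_urls(pages):
    """Assemble the per-page listing URLs, mirroring getUrls() above."""
    urlHead = 'http://dianying.2345.com/list/----2017---'
    urlEnd = '.html'
    # range(1, pages + 1) yields 1..pages inclusive.
    return [urlHead + str(i) + urlEnd for i in range(1, pages + 1)]

urls = build_urls(3)
```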

(4) The scraping function itself, which extracts three fields per movie: movieName, movieScore and movieStaring.
A MovieItem class is defined beforehand to hold them:

class MovieItem(object):
    movieName = None
    movieScore = None
    movieStaring = None

def spider(self, urls):
    # urls now holds the 23 assembled page URLs
    for url in urls:
        # fetch each page in turn
        htmlContent = self.getHTMLText(url)
        soup = BeautifulSoup(htmlContent, 'lxml')
        anchorTag = soup.find('ul', attrs={'class': 'v_picTxt pic180_240 clearfix'})
        tags = anchorTag.find_all('li')
        for tag in tags:
            item = MovieItem()
            item.movieName = tag.find('span', attrs={'class': 'sTit'}).getText()
            item.movieScore = tag.find('span', attrs={'class': 'pRightBottom'}).em.get_text().replace('分:', '')
            item.movieStaring = tag.find('span', attrs={'class': 'sDes'}).get_text().replace('主演:', '')
            self.items.append(item)



(5) The final function, save()
The codecs module is imported here because it lets you choose the encoding when writing. Earlier versions of the program had to convert every string to UTF-8 before writing it to the txt file; with codecs you only need to open the file with

codecs.open(filename,'w','utf8')

and every string subsequently written through the handle is saved as UTF-8 automatically.
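A minimal, self-contained sketch of that behaviour (the file name and text are illustrative):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'codecs_demo.txt')

# codecs.open encodes written strings on the fly, so no manual
# .encode('utf-8') is needed before each write.
with codecs.open(path, 'w', 'utf-8') as fp:
    fp.write('电影')

# Reading the raw bytes back shows they were stored as UTF-8.
with open(path, 'rb') as raw:
    data = raw.read()

os.remove(path)
```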

def save(self, items):
    count = 0
    fileName = '2017热门电影.txt'.encode('GBK')
    # format template; two approaches were tried here
    tplt = "{0:^10}\t{1:<10}\t{2:^10}"
    # use the codecs module imported earlier to handle the encoding
    with codecs.open(fileName, 'w', 'utf-8') as fp:
        # items is the list of scraped MovieItem objects
        for item in items:
            # fp.write('%s \t %s \t %s \r\n' % (item.movieName, item.movieScore, item.movieStaring))
            # tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
            fp.write(tplt.format(item.movieName, item.movieScore, item.movieStaring))
            count = count + 1
            print('\r当前进度:{:.2f}%'.format(count * 100 / len(items)), end='')
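The carriage-return progress display can be pulled out into a small helper (progress_line is a hypothetical name; note that end='' has to be passed to print, not to str.format, so the cursor stays on the same line):

```python
def progress_line(done, total):
    """Format a progress string; the leading '\r' rewinds to line start."""
    return '\r当前进度:{:.2f}%'.format(done * 100 / total)

for i in range(1, 5):
    # end='' suppresses the newline, so each print overwrites the last.
    print(progress_line(i, 4), end='')
print()
```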

Complete code

from bs4 import BeautifulSoup
import requests
import codecs


class MovieItem(object):
    movieName = None
    movieScore = None
    movieStaring = None


class GetMovie(object):
    def __init__(self):
        self.urlBase = 'http://dianying.2345.com/list/----2017--.html'
        self.pages = self.getPages()
        self.urls = []   # holds the assembled URLs
        self.items = []
        self.getUrls(self.pages)
        self.spider(self.urls)
        self.save(self.items)

    def getHTMLText(self, url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return ""

    def getPages(self):
        html = self.getHTMLText(self.urlBase)
        soup = BeautifulSoup(html, 'lxml')
        # tag = soup.find('div', attrs={'class': 'v_picConBox mt15'})
        tag = soup.find('div', attrs={'class': 'v_page'})
        subTags = tag.find_all('a')
        # get the number of pages
        return int(subTags[-2].get_text())

    def getUrls(self, pages):
        urlHead = 'http://dianying.2345.com/list/----2017---'
        urlEnd = '.html'
        for i in range(1, pages + 1):
            url = urlHead + str(i) + urlEnd
            self.urls.append(url)

    def spider(self, urls):
        for url in urls:
            htmlContent = self.getHTMLText(url)
            soup = BeautifulSoup(htmlContent, 'lxml')
            anchorTag = soup.find('ul', attrs={'class': 'v_picTxt pic180_240 clearfix'})
            # print(anchorTag)
            tags = anchorTag.find_all('li')
            for tag in tags:
                item = MovieItem()
                item.movieName = tag.find('span', attrs={'class': 'sTit'}).getText()
                item.movieScore = tag.find('span', attrs={'class': 'pRightBottom'}).em.get_text().replace('分:', '')
                item.movieStaring = tag.find('span', attrs={'class': 'sDes'}).get_text().replace('主演:', '')
                self.items.append(item)

    def save(self, items):
        count = 0
        fileName = '2017热门电影.txt'.encode('GBK')
        tplt = "{0:^10}\t{1:<10}\t{2:^10}"
        with codecs.open(fileName, 'w', 'utf-8') as fp:
            for item in items:
                # fp.write('%s \t %s \t %s \r\n' % (item.movieName, item.movieScore, item.movieStaring))
                # tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
                fp.write(tplt.format(item.movieName, item.movieScore, item.movieStaring))
                # the GBK-encoded filename is handled poorly here; TODO fix later
                count = count + 1
                print('\r当前进度:{:.2f}%'.format(count * 100 / len(items)), end='')


if __name__ == '__main__':
    GM = GetMovie()

Ad-stripped, corrected version

from bs4 import BeautifulSoup
import requests
import codecs


class MovieItem(object):
    movieName = None
    movieScore = None
    movieStaring = None


class GetMovie(object):
    def __init__(self):
        self.urlBase = 'http://dianying.2345.com/list/----2018--.html'
        self.pages = self.getPages()
        self.urls = []   # holds the assembled URLs
        self.items = []
        self.getUrls(self.pages)
        self.spider(self.urls)
        self.save(self.items)

    def getHTMLText(self, url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return ""

    def getPages(self):
        html = self.getHTMLText(self.urlBase)
        soup = BeautifulSoup(html, 'lxml')
        tag = soup.find('div', attrs={'class': 'v_page'})
        subTags = tag.find_all('a')
        # the second-to-last pager link holds the number of pages
        return int(subTags[-2].get_text())

    def getUrls(self, pages):
        urlHead = 'http://dianying.2345.com/list/----2018---'
        urlEnd = '.html'
        for i in range(1, pages + 1):
            url = urlHead + str(i) + urlEnd
            self.urls.append(url)

    def spider(self, urls):
        for url in urls:
            htmlContent = self.getHTMLText(url)
            soup = BeautifulSoup(htmlContent, 'lxml')
            anchorTag = soup.find('ul', attrs={'class': 'v_picTxt pic180_240 clearfix'})
            tags = anchorTag.find_all('li')
            tags.pop(9)  # drop the advertisement <li> at index 9
            for tag in tags:
                try:
                    item = MovieItem()
                    item.movieName = tag.find('span', attrs={'class': 'sTit'}).get_text().strip()
                    item.movieScore = tag.find('span', attrs={'class': 'pRightBottom'}).em.get_text().replace('分:', '')
                    item.movieStaring = tag.find('span', attrs={'class': 'sDes'}).get_text().replace('主演:', '')
                    self.items.append(item)
                except Exception as e:
                    raise e

    def save(self, items):
        count = 0
        fileName = '2018热门电影.txt'
        tplt = "{0:^10}\t{1:^10}\t{2:^10}"
        # tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
        with codecs.open(fileName, 'w', 'utf-8') as fp:
            for item in items:
                fp.write(tplt.format(item.movieName, item.movieScore, item.movieStaring) + '\n')
                count = count + 1
                print('\r当前进度:{:.2f}%'.format(count * 100 / len(items)), end='')


if __name__ == '__main__':
    GM = GetMovie()
