
Web Scraping: Autohome (汽车之家) / Chouti Hot List (抽屉新热榜) / Jandan (煎蛋网)

Published: 2023/12/20


alex: http://www.cnblogs.com/alex3714/articles/5465198.html

one

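Tutorial reference: http://www.runoob.com/python3/python3-tutorial.html

Python is a programming language. Before starting to learn it, be clear on three questions: what is the purpose of programming, what is a programming language, and what is programming?

The purpose of programming: computers were invented so that machines could replace or free up human labor. Programming expresses a human thought process in a form the computer can recognize, so that the computer can carry it out automatically, much as a person would.

What is a programming language: the form of expression mentioned above, the one a computer can recognize. A language is a medium of communication, and a programming language is the medium between a programmer and a computer. In the world of programming the computer is more like a servant, and people program in order to tell it what work to do.

Python was created by Guido van Rossum. Over Christmas 1989, to pass the time in Amsterdam, he decided to develop a new script interpreter. Python is a language with an elegant design philosophy, which you can read by running:

import this

Python is used in many fields, such as data analysis, component integration, network services, image processing, numerical computing and scientific computing. Almost every large and mid-sized internet company uses Python, for example Youtube, Dropbox, BT, Quora (comparable to Zhihu in China), Douban, Zhihu, Google, Yahoo!, Facebook, NASA, Baidu, Tencent, Autohome and Meituan. Internet companies mostly use Python for operations automation, test automation, big data analysis, web crawlers and web development.

Python is an interpreted language: the interpreter reads Python code line by line and executes it. Go is a compiled language: the compiler translates every statement of the source program into machine language and saves it as a binary file, so the computer can run the program directly as machine code, which is fast.

Setup:
- Install the Python 3.6 interpreter
- Install PyCharm
- Test the interpreter in interactive mode and write a first hello world

Variables:
- A variable name may contain only letters, digits and underscores
- The first character of a variable name cannot be a digit
- Keywords cannot be used as variable names
- Every variable has three attributes: type, id and value

print(type(a), id(a), a)

Other basics covered: lists, dictionaries, constants, input, comments.

bool: a comparison such as a > b evaluates to True or False.
# 1. None, 0 and empty values (empty string, empty list, empty dict, etc.) are False
# 2. Everything else is True

Formatted output: %s, %d
Operators: + - * / +=

if / elif / else:

# Xiaohong
age = 19
age = 33
if age > 40:
    print('Goodbye, auntie')
elif age < 30:
    print('Are you free, miss?')
else:
    print('Somewhere in between...')

while:

while True:
    print('aa')
    print('bb')
    print('cc')
print('dd')  # this line is never reached

count = 0
while count < 20:
    count += 1
    print(count)
    if count == 8:
        break

for:

for i in [1, 2, 3, 4, 5]:
    print(i)

for i in {'name': 'chaoge', 'age': 18}:
    print(i)  # iterating a dict yields its keys

for i in range(10):
    if i == 8:
        continue
    print('the value of i is:', i)

Mutable types: list, dict

l1 = [1, 2, 3]
id(l1)
l1[0] = 4
id(l1)  # same id: the list was modified in place

Immutable types: numbers, strings, tuples

s1 = 'oldboy'
id(s1)
s1 = 'oldgirl'
id(s1)  # different id: the name now refers to a new object

Strings

Indexing:

s1 = 'oldboy'
# index 012345
s1[1]   # value at index 1, counting from the front
s1[-1]  # value at index -1, counting from the back

Slicing includes the start and excludes the end; the step defaults to 1:

s1 = 'oldboy'
s1[0:3:2]  # 'od'
s1[0:4]    # 'oldb'
s1[::]     # 'oldboy'
s1[::-1]   # 'yobdlo'

Length:

len(s1)
s1.__len__()

Membership, in and not in:

s1 = 'oldboy'
'c' in s1
'c' not in s1

Remove leading and trailing whitespace:

s1 = ' my name is yuchao '
s1.strip()
name = input().strip()

Split:

chaoge = 'chaoge|18|180|1800'
chaoge.split('|')  # returns a list

Upper and lower case:

chaoge = 'ABCDEFG'
chaoge.lower()
chaoge.upper()
addr = input().upper()

Formatted output with format:

info = 'my name is {},my age is {}'.format('yuchao', 18)
info = 'my name is {name},my age is {age}'.format(age=18, name='chao')
info = 'my name is {0},my age is {1},i am real age {1},really{1}'.format('chao', 18)

Concatenation:

s1 = 'old'
s2 = 'boy'
s1 + s2

Replace:

info = 'my name is chaoge,my name is chaoge'
info.replace('chaoge', 'yuchao')
info.replace('chaoge', 'yuchao', 1)  # replace only the first occurrence

Check that a string is all digits:

age = '123123e'
age = '123123'
age.isdigit()

age = input('input your age:')
if age.isdigit():
    age = int(age)
    print('your age is %s' % age)
else:
    print('invalid age, please enter it again')

Count occurrences:

info = 'my name is yuchao,my name is yuchao and yuchao'
info.count('yuchao')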
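The strip and isdigit notes above combine naturally into one small validation helper. The following is only a sketch: the function name parse_age and the sample inputs are made up for illustration, and it takes its text as a parameter instead of calling input() so that it can run unattended.

```python
def parse_age(raw):
    """Return the age as an int, or None if the text is not all digits."""
    text = raw.strip()  # drop surrounding whitespace, as input().strip() would
    if text.isdigit():  # an empty string is not considered all digits
        return int(text)
    return None


# Hypothetical sample inputs standing in for what a user might type
for sample in [' 18 ', '123123e', '']:
    age = parse_age(sample)
    if age is None:
        print('invalid age: %r' % sample)
    else:
        print('your age is %d' % age)
```

In an interactive script, the same function could be called as parse_age(input('input your age:')) inside a while loop that repeats until the result is not None.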

two

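print() appends a newline by default; end='' suppresses it:

print('aaa', end='')
print('bbb')

Lists

l1 = [1, 'chao', [6, 6], {'name': 'chao'}]

# read by index
l1[0]
l1[-1]

# modify by positive or negative index
l1[0] = 2
l1[-1] = 3

# modify the nested [6, 6]
l1[2][1] = 7

Slicing includes the start and excludes the end, with an optional step:

l2 = l1[0:3:2]

Length:

l1.__len__()
len(l1)

Membership, in and not in:

1 in l1
2 not in l1

Append to the end:

l1.append(3)
l1.append('chao')
l1.append([1, 2])

Delete:

a = 10
del a
del l1[1]
l1.remove(1)       # removes the first matching value; raises ValueError if it is absent
l1.remove('chao')
l1.pop()           # removes the last element

Iterate with for:

for i in l1:
    print(i)

for i in range(len(l1)):
    print(l1[i])

index and count:

l1.index(1)
l1.count('chao')

extend:

l1.extend([1, 2, 3])
l1.extend(l2)

clear empties the list:

l1.clear()

reverse reverses the list in place:

l1.reverse()

Tuples

age = (11, [1, 3, 4], 33, 44, 55)
age = tuple([11, 22, 33, 44])  # pass in any iterable

Tuples support indexing, slicing, len, membership tests and for loops.

Dictionaries

Keys must be of an immutable type; values can be of any type.

dic = {'name': 'chao', 'age': 18}
dic = dict({'name': 'chao', 'age': 18})
dic = dict(name='egon', age=18)

Access by key, for both reading and writing:

dic['addr'] = 'huaian'

Also: len for length, in and not in for membership (tested against the keys), clear to empty the dict.

dic.pop('addr')  # delete a key
dic.values()     # all values
dic.keys()       # all keys

Sets

set1 = {'alex', 'egon', 'wupeiqi'}
type(set1)
set1.add('yuanhao')  # add an element
set1.clear()         # empty the set

File handling

r means read and t means text mode:

f = open(r'C:\Users\admin\PycharmProjects\test\test', 'r', encoding='utf-8')
print(f)
data = f.read()
print(data)
f.close()

with open('/tmp/test.txt', 'r') as f:
    print(f.read())

Both examples above read. Next is w, write mode. Opening a file in w mode empties it, so be careful; if the file does not exist it is created.

with open('/Users/yuchao/PycharmProjects/oldboy_python/ghostdriver.log', 'w') as f:
    f.write('test file\n')
    f.write('test hello\n')

Because w empties the file on every open, the content is always just these two lines.

# use a mode to append instead
with open('/Users/yuchao/PycharmProjects/oldboy_python/ghostdriver.log', 'a') as f:
    f.write('test file\n')
    f.write('test hello\n')

# read image data and write it to a new image
with open('/Users/yuchao/PycharmProjects/oldboy_python/girl.jpg', 'rb') as f:
    with open('new_girl.JPG', 'wb') as f1:
        f1.write(f.read())

Functions

The def keyword defines a function. At definition time only the syntax is checked and the body is not run; the body runs when the function is called. A function must be defined before it is called.

Function without parameters:

def foo():
    print('hello foo')

foo()  # call

Empty function, using pass:

def foo():
    pass

Function with a parameter:

def foo2(arg):
    print('the argument passed to foo2 is', arg)

foo2('chaoge')

Return value:

def foo3(a, b):
    return a + b

foo3(3, 2)

The call above passes positional arguments. The same call with keyword arguments:

foo3(a=3, b=4)

A function without return returns None by default:

def foo4():
    pass

print(foo4())  # None

Variable-length positional arguments: *args collects the extra positional arguments into a tuple.

def foo(a, b, *args):
    print(a)
    print(b)
    print(args)

foo(1, 2, 3, 4, 5)

Variable-length keyword arguments: **kwargs collects the extra keyword arguments into a dict.

def foo(x, y, **kwargs):
    print(x)
    print(y)
    print(kwargs)

foo(1, 2, z=12, u=10)

A function name can be assigned like a variable:

a = foo
print(a)
a(1, 2, z=12, u=10)

A function name can be passed as an argument:

def foo():
    print('hello')

def foo2(func):
    func()

foo2(foo)

Nested functions:

def foo():
    def inner():
        print('i am inner')
    return inner

f = foo()
f()

Closure: an inner function that references a name from the enclosing scope, not from the global scope.

x = 100

def foo():
    x = 10
    def inner():
        print(x)
    return inner

f = foo()
f()  # prints 10, the x of the enclosing function

Modules

A module is a Python code file (or a package folder) that someone has already written. Import it with import module, or with from package import module.

import time
print(time.strftime('%Y-%m-%d %H:%M:%S'))

import random
print(random.randint(10, 20))

import os
os.remove('new_girl.JPG')

import os
print(os.mkdir('test'))  # creates the folder; mkdir itself returns None

Third-party modules are installed with pip:

pip3 install Flask

from flask import Flask

app = Flask(__name__)
app.debug = True

@app.route('/index')
def index():
    return 'hello huaian'

if __name__ == '__main__':
    app.run()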
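The difference between w and a mode described in the file-handling notes is easy to check in isolation. The sketch below is illustrative only: the helper name write_twice is made up, and it writes to a temporary file (an assumption made so the example does not touch any real log file).

```python
import os
import tempfile


def write_twice(mode):
    """Open one file twice in `mode`, write a line each time, return the final contents."""
    fd, path = tempfile.mkstemp(suffix='.log')  # throwaway file for the experiment
    os.close(fd)
    try:
        for line in ('first\n', 'second\n'):
            with open(path, mode, encoding='utf-8') as f:
                f.write(line)
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    finally:
        os.remove(path)  # clean up the temporary file


print(repr(write_twice('w')))  # 'second\n'        : each open in w mode empties the file
print(repr(write_twice('a')))  # 'first\nsecond\n' : a mode keeps what was already there
```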

three

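What is web crawling?

Picture the internet as a giant spider web. A crawler program is a robot spider on that web: from the moment it starts it keeps crawling across the network, grabbing whatever it comes across.

Which internet companies are best at crawlers? Baidu and Google: a search engine is one enormous crawler.

In the Toutiao (今日頭條) app, tapping a headline jumps to another page that shows the article. Much of that news was crawled by Toutiao from other companies' sites.

A university classmate of mine studied design and moved to Shanghai after graduating, to work at a social app company. She was in charge of photography: travelling around the country, taking pictures, writing articles and publishing them on the company's article pages. That cost a lot of money and was slow, so very few articles came out. The company's solution was to hire a strong crawler programmer to write programs that collected the images published by the leading travel companies across the internet, which the company then processed itself.

This is why a crawler has to stay within the law: you must not crawl other people's original, copyrighted work and use it commercially. Websites publish a robots.txt file stating which paths may be crawled and which may not, and crawlers are expected to follow it.

When we normally access a resource, the browser visits a URL, sends a request, and the other side returns data. A crawler therefore uses code to imitate the browser's request and fetch the content. The requests module sends the request, and the bs4 module parses the response.

Introduction to internet communication and web protocols: https://www.cnblogs.com/pyyu/p/9467256.html

Download an image:

import requests

data = requests.get('http://www.pythonav.cn/av/man.jpg')
with open('1.jpg', 'wb') as f:
    f.write(data.content)

With the HTTP and web basics covered, the requests module is explained here: http://www.cnblogs.com/wupeiqi/articles/6283017.html

Exercises:
- Crawl the Autohome news
- Crawl the Chouti news
- Upvote articles on Chouti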
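As a rough illustration of the robots.txt convention mentioned above, Python's standard library ships urllib.robotparser, which can answer whether a given URL may be fetched. This is a sketch under assumptions: the rules in SAMPLE_ROBOTS, the user agent name my-crawler and the example.com URLs are invented for the demonstration and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the site's own file
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS)
print(rules.can_fetch('my-crawler', 'https://example.com/news/'))           # True
print(rules.can_fetch('my-crawler', 'https://example.com/private/a.html'))  # False
```

Against a live site, one would typically call set_url() with the site's robots.txt address followed by read(), and then check can_fetch() before each request.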

Autohome news:

autohome.py

import os

import requests
from bs4 import BeautifulSoup

# Send the request the way a browser would
r1 = requests.get(url='https://www.autohome.com.cn/news/')
# print(type(r1))  # <class 'requests.models.Response'>
r1.encoding = 'gbk'  # fixes garbled characters
# print(r1.text)     # page text
# print(r1.content)  # page as bytes

# Parse the response body with bs4
soup = BeautifulSoup(r1.text, 'html.parser')
# print(soup)

# Find tags with bs4
container = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
li_list = container.find_all(name='li')
# print(type(li_list))  # <class 'bs4.element.ResultSet'>

os.makedirs('imgs', exist_ok=True)  # open() below fails if the folder is missing

for tag in li_list:
    # Keep only the entries that have a title in an <h3> tag
    title = tag.find(name='h3')
    if not title:
        continue
    # Summary
    summary = tag.find(name='p')
    # Article URL; a is the tag that was found, a.attrs is a dict
    a = tag.find(name='a')
    url = "https:" + a.attrs.get("href")
    # Image URL
    img = tag.find(name='img')
    img_url = "https:" + img.get("src")
    print('Title:', title.text)
    print('Summary:', summary.text)
    print('url:', url)
    print('Image URL:', img_url)
    print('-' * 100)
    # Save the image
    r2 = requests.get(url=img_url)
    file_name = img_url.rsplit('/', maxsplit=1)[1]
    file_path = os.path.join('imgs', file_name)
    with open(file_path, 'wb') as f:
        f.write(r2.content)

Fully commented version

import os

import requests
from bs4 import BeautifulSoup

r1 = requests.get(url="https://www.autohome.com.cn/news/")
r1.encoding = 'gbk'

# The article content starts in the element with id auto-channel-lazyload-article
# Use Python's bs4 module to process the Autohome response
# Create a soup object
soup = BeautifulSoup(r1.text, 'html.parser')
# Filter tags with the lookup methods bs4 provides
# Find the block that holds all the articles
# container = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
article_all = soup.find(name='div', attrs={'class': 'article-wrapper'})
# Filter further: every article sits inside an li tag
# find_all finds all li tags and returns a list-like result, so it can be iterated with for
li_list = article_all.find_all(name='li')

os.makedirs('img', exist_ok=True)  # make sure the target folder exists

for tag in li_list:
    # For each tag, find the title
    title = tag.find(name="h3")
    # Some entries have no title (None), so check for that
    if not title:
        # If there is no title, skip this iteration
        continue
    # print(title.text)
    # Article summary
    content = tag.find(name='p')
    # print(content.text)
    # Article URL: find the a tag, then read its href attribute; a.attrs is a dict of the tag's attributes
    a = tag.find(name='a')
    # print(a.attrs)
    # A dict can be read with get; the full URL still has to be assembled
    url = "https:" + a.attrs.get("href")
    # print(url)
    # Image URL
    img = tag.find(name="img")
    img_url = "https:" + img.attrs.get("src")
    # print(img_url)
    # Now the news item can be printed in a readable form
    print("Article title: ", title.text)
    print('Article summary: ', content.text)
    print('Article URL: ', url)
    print("Article image URL: ", img_url)
    print('-' * 50)
    # Download every image
    data = requests.get(url=img_url)
    # Image name: split the URL once from the right, giving 2 parts, and take index 1
    img_name = img_url.rsplit('/', maxsplit=1)[1]
    img_path = os.path.join('img', img_name)
    with open(img_path, 'wb') as f:
        f.write(data.content)

print("Download finished, the images are saved in the img folder")
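The scripts above build full addresses by prefixing "https:" to each href or src, which works only while every link on the page is protocol-relative (starting with //). A possible alternative, shown here as a sketch and not as what the original scripts do, is urllib.parse.urljoin from the standard library, which resolves any kind of link against the page URL. The helper name absolute_url and the sample links are assumptions made for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://www.autohome.com.cn/news/'


def absolute_url(page_url, link):
    """Resolve a link found in a page against the URL of that page."""
    return urljoin(page_url, link)


# Hypothetical links of the three kinds a page may contain
print(absolute_url(PAGE_URL, '//www.autohome.com.cn/news/201808/1.html'))  # protocol-relative
print(absolute_url(PAGE_URL, '/news/201808/2.html'))                       # site-relative
print(absolute_url(PAGE_URL, 'https://example.com/3.html'))                # already absolute
```

Whether this is needed depends on the markup of the site being crawled; for a page whose links are all protocol-relative the two approaches give the same result.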

Chouti Hot List news

Chouti news and images

import os

import requests
from bs4 import BeautifulSoup

# Without a User-Agent request header, Chouti rejects the request with 403
r1 = requests.get(
    url='https://dig.chouti.com/',
    headers={
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
)

soup = BeautifulSoup(r1.text, 'html.parser')
# Content area
container = soup.find(name='div', attrs={'class': "content-list"})
div_list = container.find_all(name='div')

os.makedirs("img2", exist_ok=True)  # make sure the target folder exists

for tag in div_list:
    title = tag.find(name="a", attrs={'class': "show-content color-chag"})
    if not title:
        continue
    summary = tag.find(name="div", attrs={"class": "area-summary"})
    if not summary:
        continue
    a = tag.find(name="a")
    url = "https:" + a.attrs.get("href")
    img = tag.find("img")
    # The original attribute holds the image source; it can be None, so check it
    img_url = img.get("original")
    if not img_url:
        continue
    img_url = "https:" + img_url
    # Download the thumbnail
    r2 = requests.get(url=img_url)
    file_name = img_url.rsplit("/", maxsplit=1)[1]
    # Drop the ?quality=80 query string; rstrip() would strip a set of characters, not a suffix
    file_name = file_name.split("?", maxsplit=1)[0]
    file_path = os.path.join("img2", file_name)
    with open(file_path, 'wb') as f:
        f.write(r2.content)
    print("Title:", title.text.strip())
    print("Summary:", summary.text)
    print("Article link:", url)
    print("Image link:", img_url)

Chouti upvoting

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
}

# Visit the site as a browser would, to obtain the cookies
r1 = requests.get(url="https://dig.chouti.com/", headers=HEADERS)
r1_cookie_dict = r1.cookies.get_dict()
print(r1_cookie_dict)

# Submit the login to https://dig.chouti.com/login
r2 = requests.post(
    url="https://dig.chouti.com/login",
    # Send the request headers
    headers=HEADERS,
    # The form data the Chouti login page submits; fill in your own account
    data={
        'phone': "86<your phone number>",
        "password": "<your password>",
        "oneMonth": 1
    },
    # Send the cookies
    cookies=r1_cookie_dict
)

### Upvoting
# Clicking upvote sends a POST to a URL such as https://dig.chouti.com/link/vote?linksId=21775786
# The URL for cancelling an upvote is https://dig.chouti.com/vote/cancel/vote.do
rsp = requests.get(url="https://dig.chouti.com/", headers=HEADERS)

# Parse with bs4 and find the tags
soup = BeautifulSoup(rsp.text, "html.parser")
div = soup.find(attrs={'id': 'content-list'})
items = div.find_all(attrs={"class": "item"})
for item in items:
    tag = item.find(attrs={"class": "part2"})
    nid = tag.get("share-linkid")
    # Upvote each article in turn, using its link id
    r3 = requests.post(
        url="https://dig.chouti.com/link/vote?linksId=%s" % nid,
        # To cancel an upvote, post to this address instead; the link id is required
        # url="https://dig.chouti.com/vote/cancel/vote.do",
        # data={
        #     "linksId": 21775786
        # },
        headers=HEADERS,
        # The same cookies as at login, which identify the session
        cookies=r1_cookie_dict
    )
    print(r3.text)

Commented Chouti version

import os

import requests
from bs4 import BeautifulSoup

r1 = requests.get(
    url="http://dig.chouti.com",
    headers={
        # Imitate a browser client
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    }
)
# print(r1.text)

# Parse with bs4
soup = BeautifulSoup(r1.text, 'html.parser')
# print(soup)
# Find the content area
content = soup.find(name="div", attrs={"class": "content-list"})
# print(content)
# Each div with class item is one article block; find_all returns a list-like result
item = content.find_all(name="div", attrs={"class": "item"})

os.makedirs("img2", exist_ok=True)  # make sure the target folder exists

for tag in item:
    # For each item block, the title is the text of the a tag
    title = tag.find(name='a')
    # print(title.text)
    # Summary
    summary = tag.find(name="span", attrs={"class": "summary"})
    if not summary:
        continue
    # print(summary.text)
    # Article link, read from the attrs dict
    url = tag.find(name="a").attrs.get("href")
    # print(url)
    # Thumbnail
    img_url = tag.find(name="img", attrs={"alt": "抽屜新熱榜"})
    img_url = "https:" + img_url.get("original")
    print("Article title: ", title.text.strip())
    print("Article summary: ", summary.text)
    print("Article link: ", url)
    print("Image link: ", img_url)
    print("-" * 80)
    # Download the image
    img1 = requests.get(url=img_url)
    img_name = img_url.rsplit("/", maxsplit=1)[1]
    img_name = img_name.rsplit("?quality=80")[0]
    print(img_name)
    img_path = os.path.join("img2", img_name)
    with open(img_path, 'wb') as f:
        f.write(img1.content)
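Both Chouti scripts need to turn an image URL ending in ?quality=80 into a plain file name. One point worth knowing here is that str.rstrip removes a set of characters, not a suffix, so it can eat into the name itself. The sketch below shows the difference and one way to take the file name from the URL path with the standard library; the helper name filename_from_url and the sample names are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


name = 'image_2018?quality=80'
print(name.rstrip('?quality=80'))  # image_201  : the trailing 8 of 2018 is stripped as well
print(name.split('?', 1)[0])       # image_2018 : cut at the first question mark instead

# Hypothetical image URL in the same shape as the ones on the page
print(filename_from_url('https://img.example.com/group/photo.jpg?quality=80'))  # photo.jpg
```

For names ending in .jpg or .png the rstrip version happens to give the right answer, because g is not among the stripped characters, which is why the problem is easy to miss.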

Logging in to GitHub

GitHub login

import requests
from bs4 import BeautifulSoup

# Visit the GitHub login page
r1 = requests.get(url="https://github.com/login")
r1_cookie_dict = r1.cookies.get_dict()
# print(r1.text)

# Parse the page with bs4 and read the hidden token from the login form
s1 = BeautifulSoup(r1.text, "html.parser")
token = s1.find(name="input", attrs={"name": "authenticity_token"}).get("value")
print(token)

r2 = requests.post(
    url="https://github.com/session",
    data={
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": token,
        "login": "",     # your username
        "password": ""   # your password
    },
    cookies=r1_cookie_dict
)
print(r2.text)
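The key step in the login above is reading the hidden authenticity_token field out of the form before posting. To make that step visible without any network access, here is a sketch that extracts the same kind of field from a hard-coded HTML fragment. It deliberately swaps bs4 for the standard library's html.parser so it runs with no third-party packages; the class name HiddenInputParser, the function extract_token and the sample form are all assumptions for illustration, not GitHub's real markup.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of every <input> tag in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attr_map = dict(attrs)
        name = attr_map.get('name')
        if name:
            self.fields[name] = attr_map.get('value', '')


def extract_token(html, field_name='authenticity_token'):
    """Return the value of the named input field, or None if it is absent."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields.get(field_name)


# Hypothetical login form, shaped like the one the script above expects
SAMPLE_FORM = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
</form>
'''

print(extract_token(SAMPLE_FORM))  # abc123
```

Returning None when the field is missing lets the caller stop with a clear message instead of posting a login without a token.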

Reposted from: https://www.cnblogs.com/pyyu/p/9560924.html
