Using the feapder crawler framework
Official documentation: feapder official documentation | feapder-document
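Install command:
pip install feapder
Install command for the full version:
pip3 install feapder[all]
The two differ little for basic use; the full version additionally supports in-memory deduplication.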
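Command to create a crawler project:
feapder create -p spider-douyin
This creates a crawler project named spider-douyin. Then cd into its spiders directory and create a crawler there:
spiders> feapder create -s douyin_pinglun    # create a crawler named douyin_pinglun
The command asks which template to use. Spider is the distributed crawler and AirSpider is the ordinary lightweight crawler; picking the first option, the lightweight AirSpider, is enough here.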
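The generated file is simple. The start_requests method supplies the initial URLs, and all parsing starts from there. If a request does not name a handler through callback, its response is passed to parse by default. What if you need to send extra parameters such as cookies or headers?
For that you define a custom middleware, download_midware:
Here you can set the proxy, headers, cookies and so on, each passed as a dictionary of key-value pairs. Every request that feapder sends goes through this method, so encrypted parameters can also be attached here.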
def download_midware(self, request):
    request.headers = {'User-Agent': ""}
    request.proxies = {"https": "https://12.12.12.12:6666"}
    request.cookies = {}
    return request
Create a table in the MySQL database with the fields you want to scrape, then open the crawler framework's settings:
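Enter the MySQL connection details there. Besides the MySQL connection, the settings also cover random User-Agents, crawler concurrency, logging and more.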
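As a rough illustration of the settings just mentioned, a setting.py fragment might look like the sketch below. The key names are the ones I recall from the setting.py template that feapder generates, so check them against your own file; every value (host, database name, credentials) is a placeholder and not taken from this project.

```python
# setting.py (fragment). All values are placeholders; replace them with your own.

# MySQL connection
MYSQL_IP = "127.0.0.1"
MYSQL_PORT = 3306
MYSQL_DB = "douyin"            # hypothetical database name
MYSQL_USER_NAME = "root"
MYSQL_USER_PASS = "change-me"

# Crawler behaviour
SPIDER_THREAD_COUNT = 1        # crawler concurrency
RANDOM_HEADERS = True          # use a random User-Agent for each request
LOG_LEVEL = "DEBUG"            # logging verbosity
```

Keeping the credentials in this one file means the crawler code itself never has to mention them.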
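Once that is configured, cd into items.
Run the command to create an item:
feapder create -i douyin_pinlun    # douyin_pinlun is the name of the MySQL table that stores the data
This creates an item file. Note that if the table uses an auto-increment id, you need to comment out self.id in the generated file.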
To use the item in a crawler, import it.
from items import douyin_pinlun_item    # the editor may flag this import as an error, but it still works
To use it, construct an instance:
item = douyin_pinlun_item.DouyinPinlunItem()
item['text_pinglun'] = i.get('text')
item['digg_count'] = i.get('digg_count', 'null')
item['nickname'] = i.get("user").get("nickname")
yield item
A project can contain one crawler file or several. A single crawler can be started by running its file directly; several crawlers need to be configured in main.
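The number of threads at startup is set by thread_count; whatever number you pass is the number of threads used.
For main, the simple approach is to import the crawler files and start them from there, which launches several crawlers with one command. Note that loading douyin_me.js can fail because the working directory is different; a workaround is to copy the file into the directory you launch from as well.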
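An alternative to keeping a second copy of douyin_me.js is to resolve the file relative to the crawler module instead of the current working directory. The helper below is a minimal sketch of that idea using only the standard library; script_path is a hypothetical name, not part of feapder.

```python
from pathlib import Path


def script_path(module_file, filename="douyin_me.js"):
    """Return the path of `filename` located next to `module_file`.

    The result does not depend on the directory the process was started from.
    """
    return Path(module_file).resolve().parent / filename


# Inside the crawler module one would call it with __file__, for example:
#     with open(script_path(__file__), encoding="utf-8") as f:
#         code = f.read()
example = script_path("/tmp/spider-douyin/spiders/douyin_pinglun.py")
print(example.name)
```

With this, the same crawler file can be run directly or imported from main without moving the JavaScript file.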
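Retrying on errors and handling data:
This needs another custom middleware, validate.
If the returned status code is not 200, raise an exception and the same URL is requested again; at that point you can switch proxies or check whether some other verification step is involved, such as a slider captcha.
If the response carries no data, return False to discard the current request so that it goes no further. Simple, basic checks like these can be written here.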
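@summary: Validation function; it can be used to check whether the response is correct.
If the function raises an exception, the request is retried.
If it returns True or None, the parse function is called.
If it returns False, the current request is discarded.
request.callback_name can be used to tell callbacks apart and write different validation logic for each.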
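The three outcomes can be tried out without the framework. In the sketch below, validate follows the rules described in the docstring, while dispatch is a hypothetical stand-in for what the framework does with the result, and SimpleNamespace objects stand in for real responses; none of this is feapder's own code.

```python
from types import SimpleNamespace


def validate(request, response):
    """Raise to retry, return False to drop, return True/None to parse."""
    if response.status_code != 200:
        raise Exception("response code not 200")  # raising asks for a retry
    if not response.json.get("comments"):
        return False  # no data: drop this request
    return True


def dispatch(request, response):
    """Hypothetical driver that maps validate()'s outcome to an action."""
    try:
        result = validate(request, response)
    except Exception:
        return "retry"
    return "drop" if result is False else "parse"


ok = SimpleNamespace(status_code=200, json={"comments": [{"text": "hi"}]})
empty = SimpleNamespace(status_code=200, json={"comments": None})
blocked = SimpleNamespace(status_code=403, json={})

print(dispatch(None, ok))       # parse
print(dispatch(None, empty))    # drop
print(dispatch(None, blocked))  # retry
```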
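The returned response can be queried directly with xpath, json, CSS selectors and so on.
url = response.xpath("./@href").extract()    # returns the matched text as a list
Without extract(), the result is still an iterable object; when looping over it you then call extract_first() on each element to get the content.
In short, extract() returns every match of the XPath as a list, while extract_first() returns only the first one.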
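The difference between "all matches" and "first match" can be mimicked with the standard library. This sketch does not use feapder's selectors; all_hrefs and first_href are hypothetical helpers that only mirror the list-versus-single-value behaviour described above.

```python
import xml.etree.ElementTree as ET

PAGE = "<div><a href='/a'>A</a><a href='/b'>B</a></div>"
root = ET.fromstring(PAGE)


def all_hrefs(node, tag="a", attr="href"):
    """Like extract(): every matching value, as a list (possibly empty)."""
    return [el.get(attr) for el in node.iter(tag) if el.get(attr) is not None]


def first_href(node, tag="a", attr="href"):
    """Like extract_first(): the first matching value, or None if there is none."""
    values = all_hrefs(node, tag, attr)
    return values[0] if values else None


print(all_hrefs(root))
print(first_href(root))
print(first_href(root, tag="img", attr="src"))
```

The practical consequence is the same as in the crawler: code that takes the first match has to handle None, while code that takes the list has to handle an empty list.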
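Custom requests: the log shows that every request is sent as GET. What if I want a POST or some other custom request, and do not want to go through the default download_midware?
Import Request:
from feapder import Request
Request is a wrapper around the request method and supports every parameter that method supports. A POST request is specified as follows:
Set method to POST in the request: method='POST'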
import feapder
from feapder import Request


class TestAirSpiders(feapder.AirSpider):
    __custom_setting__ = dict(
        USE_SESSION=True,
        TASK_MAX_CACHED_SIZE=10,
    )

    def start_requests(self):
        data = {'www': 'eee'}
        headers = {}
        yield Request('https://www.baidu.com/', method='POST', data=data, headers=headers, callback=self.start_callbacks)

    def start_callbacks(self, request, response):
        print(response.text)
        print("crawler started")


if __name__ == "__main__":
    TestAirSpiders(thread_count=1).start()
Requesting Baidu, the log shows that the parameters were sent. Note that a custom request uses callback to name the function that parses its response.
def start_callbacks(self, request, response)
Besides self, the callback must accept two more parameters (request and response); otherwise an error is raised here.
Code example:
# -*- coding: utf-8 -*-
"""
Created on 2023-04-01 10:56:27
---------
@summary:
---------
@author: 13008
"""
from py_mini_racer import MiniRacer

import feapder
from items import douyin_pinlun_item


class DouyinPinglun(feapder.AirSpider):
    def vmrun(self, url_):
        # Run get_cookie() from douyin_me.js on the URL and return its result
        ctx = MiniRacer()
        with open('./douyin_me.js', mode='r', encoding='utf-8') as f:
            code = f.read()
        ctx.eval(code)
        sig_url = ctx.call("get_cookie", url_)
        return sig_url

    def start_requests(self):
        self.aweme_id = '7177232220380286263'  # video id
        for i in range(0, 2000, 20):
            url = f"/aweme/v1/web/comment/list/?device_platform=webapp&aid=6383&channel=channel_pc_web&aweme_id={self.aweme_id}&cursor={str(i)}&count=20&item_type=0&insert_ids=&rcFT=&pc_client_type=1&version_code=170400&version_name=17.4.0&cookie_enabled=true&screen_width=1536&screen_height=864&browser_language=zh-CN&browser_platform=Win32&browser_name=Chrome&browser_version=97.0.4692.71&browser_online=true&engine_name=Blink&engine_version=97.0.4692.71&os_name=Windows&os_version=10&cpu_core_num=16&device_memory=8&platform=PC&downlink=10&effective_type=4g&round_trip_time=100&webid=7146895385552848424"
            url_2 = self.vmrun(url)
            yield feapder.Request("https://www.douyin.com" + url_2)

    def download_midware(self, request):
        # The two long values below came from the author's own logged-in browser
        # session; they are shown as placeholders and must be replaced with your own.
        request.headers = {
            'bd-ticket-guard-client-csr': '<bd-ticket-guard-client-csr value copied from your browser>',
            'sec-ch-ua-mobile': '?0',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
            'referer': f'https://www.douyin.com/discover?modal_id={self.aweme_id}',
            'cookie': '<cookie string copied from your browser>',
        }
        return request  # hand the modified request back, as in the earlier snippet

    def validate(self, request, response):
        if response.status_code != 200:
            raise Exception("response code not 200")  # raising an exception triggers a retry
        if response.json.get('comments', 'null') == 'null':
            return False  # not what we need: return False to discard the current request

    def parse(self, request, response):
        comments_list = response.json.get('comments')
        print('eee', comments_list)
        for i in comments_list:
            item = douyin_pinlun_item.DouyinPinlunItem()
            item['text_pinglun'] = i.get('text')
            cid = i.get('cid')
            item['digg_count'] = i.get('digg_count', 'null')
            item['nickname'] = i.get("user").get("nickname")
            url2 = f"/aweme/v1/web/comment/list/reply/?device_platform=webapp&aid=6383&channel=channel_pc_web&item_id={self.aweme_id}&comment_id={cid}&cursor=0&count=3&item_type=0&pc_client_type=1&version_code=170400&version_name=17.4.0&cookie_enabled=true&screen_width=1536&screen_height=864&browser_language=zh-CN&browser_platform=Win32&browser_name=Chrome&browser_version=97.0.4692.71&browser_online=true&engine_name=Blink&engine_version=97.0.4692.71&os_name=Windows&os_version=10&cpu_core_num=16&device_memory=8&platform=PC&downlink=10&effective_type=4g&round_trip_time=100&webid=7146895385552848424"
            url = self.vmrun(url2)
            yield feapder.Request("https://www.douyin.com" + url, callback=self.xpath)
            yield item

    def xpath(self, request, response):  # fetch second-level comments (replies)
        print('wwwwwww', response.text)
        comments = response.json.get('comments')
        for i in comments:
            item = douyin_pinlun_item.DouyinPinlunItem()
            item['text_pinglun'] = i.get('text')
            item['digg_count'] = i.get('digg_count', 'null')
            item['nickname'] = i.get("user").get("nickname")
            yield item


if __name__ == "__main__":
    DouyinPinglun(thread_count=1).start()
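The comment list is paged by stepping cursor in increments of count. As a minimal sketch of that paging, the generator below builds the paths with urllib.parse.urlencode instead of one long f-string. comment_list_urls is a hypothetical helper, and it only includes a handful of the query parameters from the full example, so the remaining ones would still need to be added to the dictionary before the paths are signed.

```python
from urllib.parse import urlencode

COMMENT_LIST_PATH = "/aweme/v1/web/comment/list/"


def comment_list_urls(aweme_id, total=2000, page_size=20):
    """Yield one unsigned comment-list path per page, stepping cursor by page_size."""
    for cursor in range(0, total, page_size):
        params = {
            "device_platform": "webapp",
            "aid": 6383,
            "aweme_id": aweme_id,
            "cursor": cursor,
            "count": page_size,
        }
        yield f"{COMMENT_LIST_PATH}?{urlencode(params)}"


for path in comment_list_urls("7177232220380286263", total=60):
    print(path)
```

Keeping the parameters in a dictionary makes it easier to see which values change per page (only cursor here) and which are fixed.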