Scraping app audio with Python: crawling the Douyin app in just ten lines of code

Published: 2024/10/8 python 31 豆豆

Environment

Python 3.7.1

CentOS 7.4

pip 10.0.1

Installation

[root@localhost ~]# python3.7 --version
Python 3.7.1
[root@localhost ~]# pip3 install douyin

Installation sometimes fails because of network problems. If it does, rerun the command above until it completes.

Importing the douyin module

[root@localhost ~]# python3.7
>>> import douyin
>>>

If the import raises an error, the douyin module was probably not installed successfully.
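Before starting a long crawl, it can be handy to check from a script whether the package is importable at all. The following is a minimal sketch using only the standard library; the helper name is_installed is my own and is not part of douyin.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module named `module_name` can be found."""
    return importlib.util.find_spec(module_name) is not None


# Check for the package before running anything that depends on it.
if not is_installed("douyin"):
    print("douyin is not installed; run `pip3 install douyin` first")
```

The check only looks the module up on the import path and does not import it, so it is cheap to run.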

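Now let's start crawling Douyin short videos and their music:

[root@localhost douyin]# python3.7 dou.py

A few minutes later, let's look at the results.

The background music of each video is saved as an mp3 file, and the videos themselves are saved as mp4 files.

Not bad.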

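The Python script

According to the library's author, it can crawl the videos under all of Douyin's trending topics and trending music and download them, download each video's background music separately, and also collect each video's metadata (publisher, like count, comment count, publish time, publish location and so on) and store it in a MongoDB database.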

import douyin
from douyin.structures import Topic, Music

# Define the handlers for video download, music download and MongoDB storage
video_file_handler = douyin.handlers.VideoFileHandler(folder='./videos')
music_file_handler = douyin.handlers.MusicFileHandler(folder='./musics')
# mongo_handler = douyin.handlers.MongoHandler()

# Define the downloader and pass the handlers to it
# downloader = douyin.downloaders.VideoDownloader([mongo_handler, video_file_handler, music_file_handler])
downloader = douyin.downloaders.VideoDownloader([video_file_handler, music_file_handler])

# Loop over Douyin's trending list, downloading and storing the results
for result in douyin.hot.trend():
    for item in result.data:
        # Crawl the videos under each trending topic or trending music,
        # at most 10 related videos per topic or music.
        downloader.download(item.videos(max=10))

I don't have MongoDB installed here, so I commented out the MongoDB-related lines.

==== The following is excerpted from the library author ====

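Code walkthrough

The library depends on the following other libraries:

aiohttp: asynchronous downloads, which speed up downloading

dateparser: conversion of dates in arbitrary formats

motor: asynchronous MongoDB storage, which speeds up storing

requests: basic HTTP request simulation

tqdm: progress bar display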

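Defining the data structures

When building a library, an important step is to give key information a structured definition, using object-oriented thinking to encapsulate the objects involved. Crawling Douyin is no exception.

Douyin has many kinds of objects, such as videos, music, topics, users and comments, and they are connected through relationships. A video that uses a piece of background music has a "uses" relationship with that music; a user who publishes a video has a "publishes" relationship with that video. Each kind of object can be encapsulated in a class. A video, for example, can be defined as follows: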

class Video(Base):

    def __init__(self, **kwargs):
        """
        init video object
        :param kwargs:
        """
        super().__init__()
        self.id = kwargs.get('id')
        self.desc = kwargs.get('desc')
        self.author = kwargs.get('author')
        self.music = kwargs.get('music')
        self.like_count = kwargs.get('like_count')
        self.comment_count = kwargs.get('comment_count')
        self.share_count = kwargs.get('share_count')
        self.hot_count = kwargs.get('hot_count')
        ...
        self.address = kwargs.get('address')

    def __repr__(self):
        """
        video to str
        :return: str
        """
        return '<Video: <%s, %s>>' % (self.id, self.desc[:10].strip() if self.desc else None)

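Here some key attributes are defined as part of the Video class, including id (identifier), desc (description), author (publisher) and music (background music). author and music are not plain strings; they are separately defined data structures. For example, author is an object of type User, and User is defined as follows: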

class User(Base):

    def __init__(self, **kwargs):
        """
        init user object
        :param kwargs:
        """
        super().__init__()
        self.id = kwargs.get('id')
        self.gender = kwargs.get('gender')
        self.name = kwargs.get('name')
        self.create_time = kwargs.get('create_time')
        self.birthday = kwargs.get('birthday')
        ...

    def __repr__(self):
        """
        user to str
        :return:
        """
        return '<User: <%s, %s>>' % (self.alias, self.name)

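So, by linking attributes, different objects can be related to one another. This keeps the logical structure clear, and we no longer need to maintain separate dictionaries to store everything. It is similar to how Item is defined in Scrapy.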
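To make the idea of linked objects concrete, here is a small self-contained sketch. The classes below are simplified stand-ins written with dataclasses for illustration; they are not the library's own User, Music and Video classes, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: str
    name: str


@dataclass
class Music:
    id: str
    name: str


@dataclass
class Video:
    id: str
    desc: str
    # Related objects are stored as attributes, not as loose dictionaries.
    author: Optional[User] = None
    music: Optional[Music] = None


author = User(id="u1", name="alice")
music = Music(id="m1", name="some song")
video = Video(id="v1", desc="demo clip", author=author, music=music)

# The "publishes" and "uses" relationships are followed through attributes.
print(video.author.name)
print(video.music.name)
```

Starting from a video, its publisher and its music are each one attribute access away, which is the benefit described above.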

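Requests and retries

The crawling itself needs little explanation: it uses the simplest packet-capture technique, capturing traffic directly with Charles. Once the traffic is captured, the corresponding API requests can be observed and then simulated.

That raises a question: do I have to write a request method for every API, each configuring headers, timeouts and so on? That would be far too much work, so the request logic is wrapped separately. Here I defined a _fetch method: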

import requests

def _fetch(url, **kwargs):
    """
    fetch api response
    :param url: fetch url
    :param kwargs: other requests params
    :return: json of response
    """
    response = requests.get(url, **kwargs)
    # treat any non-200 status as a connection failure
    if response.status_code != 200:
        raise requests.ConnectionError('Expected status code 200, but got {}'.format(response.status_code))
    return response.json()

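The method has one required parameter, url. All other settings are left as kwargs, so anything can be passed in, and it is forwarded unchanged to requests.get. The method also checks the status code and raises requests.ConnectionError if it is not 200; on success it returns the response parsed as JSON.

With this method defined, other methods only need to call _fetch and no longer have to deal with status checking or the return type.

So what happens when a request fails? The conventional approach is to wrap the call in another method that counts failed calls to _fetch and calls it again. There is a handier library for this, called retrying, which implements retries through a decorator.

For example, the retry decorator can be applied to _fetch like this: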

from retrying import retry

@retry(stop_max_attempt_number=retry_max_number, wait_random_min=retry_min_random_wait,
       wait_random_max=retry_max_random_wait, retry_on_exception=need_retry)
def _fetch(url, **kwargs):
    pass

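Four parameters of the decorator are used here:

stop_max_attempt_number: the maximum number of attempts; once it is reached, no further retries are made

wait_random_min: the minimum random wait before the next retry, in milliseconds

wait_random_max: the maximum random wait before the next retry, in milliseconds

retry_on_exception: a function that decides which exceptions trigger a retry

The retry_on_exception parameter is given a function named need_retry, defined as follows: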

def need_retry(exception):
    """
    need to retry
    :param exception:
    :return:
    """
    result = isinstance(exception, (requests.ConnectionError, requests.ReadTimeout))
    if result:
        print('Exception', type(exception), 'occurred, retrying...')
    return result

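The function returns True when the exception is a requests ConnectionError or ReadTimeout, in which case the call is retried; for any other exception it returns False and no retry is made.

That gives us request wrapping and automatic retries. Quite Pythonic, isn't it?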
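To show roughly what such a decorator does under the hood, here is a hand-written sketch that uses only the standard library. It is not the retrying library's implementation; simple_retry, its parameter names and the flaky function are all invented for illustration, and the built-in ConnectionError stands in for the requests exception.

```python
import functools
import random
import time


def simple_retry(max_attempts, wait_min_ms, wait_max_ms, retry_on_exception):
    """Retry the wrapped function up to `max_attempts` total attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            attempt = 1
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on a non-retryable error.
                    if attempt >= max_attempts or not retry_on_exception(exc):
                        raise
                    # Wait a random time between the two bounds (milliseconds).
                    time.sleep(random.uniform(wait_min_ms, wait_max_ms) / 1000)
                    attempt += 1
        return wrapper
    return decorator


calls = {"count": 0}


@simple_retry(max_attempts=3, wait_min_ms=1, wait_max_ms=5,
              retry_on_exception=lambda exc: isinstance(exc, ConnectionError))
def flaky():
    """Fail twice with a retryable error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
print(result, "after", calls["count"], "calls")
```

The predicate passed as retry_on_exception plays the same role as need_retry: exceptions it rejects are raised immediately instead of being retried.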

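Designing the downloader

To download videos we need a downloader that takes the video links that have already been crawled. Its input is batch after batch of video links; it downloads them, stores the videos in the right place, and can also store related information. The design has two requirements: downloads must be fast, and it must be easy to extend. Each is addressed in turn:

Fast downloads: this can be achieved with multiple threads or processes, or with asynchronous downloads. For an I/O-bound job like this one, the asynchronous approach is the better fit.

Extensibility: the downloader must handle audio and video and also support storage such as a database. To decouple these concerns, video download, audio download and database storage are split out as separate components, and the downloader is only responsible for the main logic of handling and dispatching the video links.

Fast downloads are implemented with the aiohttp library. Asynchronous downloading should not start too many downloads at once either, or network load fluctuates too much, so downloads are done in batches, which avoids a flood of simultaneous requests and network congestion. The main download function is as follows: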

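# requires `import types` at module level
def download(self, inputs):
    """
    download video or video lists
    :param inputs: a generator of videos, a list of videos, or a single video
    :return:
    """
    if isinstance(inputs, types.GeneratorType):
        temps = []
        for result in inputs:
            print('Processing', result, '...')
            temps.append(result)
            if len(temps) == self.batch:
                self.process_items(temps)
                temps = []
        # process the final partial batch, which the loop alone would drop
        if temps:
            self.process_items(temps)
    else:
        inputs = inputs if isinstance(inputs, list) else [inputs]
        self.process_items(inputs)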

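The download method accepts several input types: a generator, a single video object, or a list of video objects. It then calls process_items to perform the asynchronous download, implemented as follows: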
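Splitting a generator into fixed-size batches is a small pattern of its own. Here is one way to sketch it with itertools.islice; the helper batched and the stand-in generator video_ids are my own names and are not taken from the library.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items, including a final partial batch."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def video_ids():
    """Stand-in for a generator of crawled video objects."""
    for i in range(7):
        yield "video-%d" % i


batches = list(batched(video_ids(), 3))
print(batches)
```

Because the helper works on any iterable, a generator, a list and a one-element list can all go through the same code path.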

def process_items(self, objs):
    """
    process items
    :param objs: objs
    :return:
    """
    # define progress bar
    with tqdm(total=len(objs)) as self.bar:
        # init event loop
        loop = asyncio.get_event_loop()
        # get num of batches
        total_step = int(math.ceil(len(objs) / self.batch))
        # for every batch
        for step in range(total_step):
            start, end = step * self.batch, (step + 1) * self.batch
            print('Processing %d-%d of files' % (start + 1, end))
            # get batch of objs
            objs_batch = objs[start: end]
            # define tasks and run loop
            tasks = [asyncio.ensure_future(self.process_item(obj)) for obj in objs_batch]
            for task in tasks:
                task.add_done_callback(self.update_progress)
            loop.run_until_complete(asyncio.wait(tasks))

This uses asyncio for asynchronous processing, keeps traffic steady by processing the video links in batches, and uses tqdm to show a progress bar.
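The batch-by-batch pattern can be tried without any network access. The sketch below is my own simplification: asyncio.sleep stands in for a download, asyncio.gather replaces the task and callback bookkeeping, and the function names are hypothetical.

```python
import asyncio


async def process_item(item, done):
    # Stand-in for a real download; yields control like network I/O would.
    await asyncio.sleep(0.01)
    done.append(item)


async def process_in_batches(items, batch_size):
    done = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Everything in one batch runs concurrently; the next batch starts
        # only after the whole current batch has finished.
        await asyncio.gather(*(process_item(item, done) for item in batch))
    return done


finished = asyncio.run(process_in_batches(list(range(5)), 2))
print(sorted(finished))
```

At most batch_size items are in flight at any moment, which is what keeps the request rate bounded.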

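As we can see, the method that actually handles a download is process_item. It calls the video download, audio download and database storage components to do the work. Because processing is driven by asyncio, process_item must itself be a coroutine, defined as follows: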

async def process_item(self, obj):
    """
    process item
    :param obj: single obj
    :return:
    """
    if isinstance(obj, Video):
        print('Processing', obj, '...')
        for handler in self.handlers:
            if isinstance(handler, Handler):
                await handler.process(obj)

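As we can see, the real processing logic lives in the individual handlers. Each separate function is extracted into its own Handler, which gives good decoupling: to add or turn off a feature we only configure a different set of Handlers, without changing the downloader code. This is the decoupling idea found in design patterns, similar to the factory pattern.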
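A stripped-down sketch of this pluggable-handler idea follows. The classes here are toy stand-ins I wrote for illustration (PrintHandler and CollectHandler do not exist in the library), and a plain string plays the role of a video object.

```python
import asyncio


class Handler:
    """Base class: every handler exposes the same async process() method."""

    async def process(self, obj):
        raise NotImplementedError


class PrintHandler(Handler):
    async def process(self, obj):
        print("got", obj)


class CollectHandler(Handler):
    def __init__(self):
        self.seen = []

    async def process(self, obj):
        self.seen.append(obj)


async def process_item(obj, handlers):
    # The dispatcher knows nothing about what each handler does.
    for handler in handlers:
        await handler.process(obj)


collector = CollectHandler()
# Enabling or disabling a feature is just a matter of editing this list.
handlers = [PrintHandler(), collector]
asyncio.run(process_item("video-1", handlers))
```

Adding a new behaviour means writing one more Handler subclass and putting an instance in the list; process_item stays unchanged.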

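Designing the Handler

As mentioned, a Handler implements one specific function, such as video download, audio download or data storage, so each is defined as a separate Handler. Video download and audio download are both file downloads, so inheritance can be used to design a shared file-download Handler, defined as follows: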

from os.path import join, exists
from os import makedirs
from douyin.handlers import Handler
from douyin.utils.type import mime_to_ext
import aiohttp

class FileHandler(Handler):

    def __init__(self, folder):
        """
        init save folder
        :param folder:
        """
        super().__init__()
        self.folder = folder
        if not exists(self.folder):
            makedirs(self.folder)

    async def _process(self, obj, **kwargs):
        """
        download to file
        :param obj: object whose play_url is downloaded and whose id names the file
        :param kwargs:
        :return:
        """
        print('Downloading', obj, '...')
        kwargs.update({'ssl': False})
        kwargs.update({'timeout': 10})
        async with aiohttp.ClientSession() as session:
            async with session.get(obj.play_url, **kwargs) as response:
                if response.status == 200:
                    # pick the file extension from the Content-Type header
                    extension = mime_to_ext(response.headers.get('Content-Type'))
                    full_path = join(self.folder, '%s.%s' % (obj.id, extension))
                    with open(full_path, 'wb') as f:
                        f.write(await response.content.read())
                    print('Downloaded file to', full_path)
                else:
                    print('Cannot download %s, response status %s' % (obj.id, response.status))

    async def process(self, obj, **kwargs):
        """
        process obj
        :param obj:
        :param kwargs:
        :return:
        """
        return await self._process(obj, **kwargs)

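aiohttp is used here again because the downloader needs Handlers that support asynchronous operation. The download requests the file link directly, determines the file type from the Content-Type response header, and saves the file.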
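The article does not show how mime_to_ext is implemented. As a rough idea of how a Content-Type value might be turned into a file extension, here is a hypothetical sketch; the function name mime_to_ext_sketch, the lookup table and the "bin" fallback are all my assumptions and may differ from the library's helper.

```python
# Illustrative table; a real implementation would likely cover more types.
MIME_EXTENSIONS = {
    "video/mp4": "mp4",
    "audio/mpeg": "mp3",
    "audio/mp4": "m4a",
}


def mime_to_ext_sketch(content_type, default="bin"):
    """Map a Content-Type header value to a file extension without the dot."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return MIME_EXTENSIONS.get(mime, default)


print(mime_to_ext_sketch("video/mp4"))
print(mime_to_ext_sketch("Audio/MPEG; charset=binary"))
print(mime_to_ext_sketch(None))
```

Handling a missing header and unknown types explicitly avoids file names that end in ".None".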

The video download Handler only needs to inherit from this FileHandler:

from douyin.handlers import FileHandler
from douyin.structures import Video

class VideoFileHandler(FileHandler):

    async def process(self, obj, **kwargs):
        """
        process video obj
        :param obj:
        :param kwargs:
        :return:
        """
        if isinstance(obj, Video):
            return await self._process(obj, **kwargs)

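This just adds a type check to make sure the object has the expected type. Audio download works the same way.

Asynchronous MongoDB storage

The Handlers for video and audio were covered above. One more Handler remains, the one for MongoDB storage. We would normally use PyMongo for storage, but to keep things fast we need asynchronous operation, and there is a library for asynchronous MongoDB access called Motor. Usage is much the same: the connection object is no longer PyMongo's MongoClient but Motor's AsyncIOMotorClient, and the rest of the configuration is largely similar.

Storage uses the update_one method with the upsert parameter enabled, so an existing document is updated and a missing one is inserted, which keeps the data free of duplicates.

The complete MongoDB storage Handler is defined as follows: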

from douyin.handlers import Handler
from motor.motor_asyncio import AsyncIOMotorClient
from douyin.structures import *

class MongoHandler(Handler):

    def __init__(self, conn_uri=None, db='douyin'):
        """
        init mongodb connection
        :param conn_uri: mongodb connection string, defaults to 'localhost'
        :param db: database name
        """
        super().__init__()
        if not conn_uri:
            conn_uri = 'localhost'
        self.client = AsyncIOMotorClient(conn_uri)
        self.db = self.client[db]

    async def process(self, obj, **kwargs):
        """
        save obj to mongodb
        :param obj: object to store
        :param kwargs:
        :return:
        """
        # choose the collection from the object's type
        collection_name = 'default'
        if isinstance(obj, Video):
            collection_name = 'videos'
        elif isinstance(obj, Music):
            collection_name = 'musics'
        collection = self.db[collection_name]
        # save to mongodb
        print('Saving', obj, 'to mongodb...')
        if await collection.update_one({'id': obj.id}, {'$set': obj.json()}, upsert=True):
            print('Saved', obj, 'to mongodb successfully')
        else:
            print('Error occurred while saving', obj)

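As we can see, the class creates an AsyncIOMotorClient object and exposes the conn_uri connection string and the db database name, so the MongoDB address and database name can be specified when a MongoHandler is instantiated.

The process method is again a coroutine; here update_one is awaited, which completes the asynchronous MongoDB write.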
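The "update if present, insert otherwise" behaviour can be illustrated without a database. In the sketch below a plain dictionary stands in for a MongoDB collection; it only models the semantics of an upsert with $set and is not Motor's or PyMongo's API. The upsert helper and the sample records are made up.

```python
def upsert(collection, key, fields):
    """Update the record stored under `key`, or insert it if it is missing.

    Returns True when a new record was inserted.
    """
    inserted = key not in collection
    # Like $set, only the given fields are written; other fields are kept.
    collection.setdefault(key, {}).update(fields)
    return inserted


videos = {}
first = upsert(videos, "v1", {"id": "v1", "like_count": 10})
second = upsert(videos, "v1", {"like_count": 25})
print(first, second, videos)
```

Crawling the same video twice therefore leaves one record with the latest values instead of two copies.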

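That covers all the key parts of the douyin library. It should help in understanding how the core of the library is implemented, and may also be useful for design patterns, object-oriented thinking and the use of some practical libraries.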
