Scrapy Source Code Reading & Analysis (2): Startup Flow


From:https://blog.csdn.net/weixin_37947156/article/details/74436333


Open the downloaded Scrapy source code in PyCharm (GitHub: https://github.com/scrapy/scrapy).


The scrapy command


Once you have written a spider with Scrapy, you run it with the command scrapy crawl <spider_name>. But what actually happens during that process? And where does the scrapy command come from?

In fact, once Scrapy is installed, the following command shows where the scrapy executable lives:

$ which scrapy
/usr/local/bin/scrapy

Open it with vim or any other editor: $ vim /usr/local/bin/scrapy

It is really just a Python script, and a very short one at that:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())

Why is this the entry point after installing Scrapy? Because Scrapy's installation file, setup.py, declares the program's entry point:

from os.path import dirname, join
from pkg_resources import parse_version
from setuptools import setup, find_packages, __version__ as setuptools_version

with open(join(dirname(__file__), 'scrapy/VERSION'), 'rb') as f:
    version = f.read().decode('ascii').strip()


def has_environment_marker_platform_impl_support():
    """Code extracted from 'pytest/setup.py'
    https://github.com/pytest-dev/pytest/blob/7538680c/setup.py#L31

    The first known release to support environment marker with range operators
    it is 18.5, see:
    https://setuptools.readthedocs.io/en/latest/history.html#id235
    """
    return parse_version(setuptools_version) >= parse_version('18.5')


extras_require = {}

if has_environment_marker_platform_impl_support():
    extras_require[':platform_python_implementation == "PyPy"'] = [
        'PyPyDispatcher>=2.1.0',
    ]


setup(
    name='Scrapy',
    version=version,
    url='https://scrapy.org',
    description='A high-level Web Crawling and Web Scraping framework',
    long_description=open('README.rst').read(),
    author='Scrapy developers',
    maintainer='Pablo Hoffman',
    maintainer_email='pablo@pablohoffman.com',
    license='BSD',
    packages=find_packages(exclude=('tests', 'tests.*')),
    include_package_data=True,
    zip_safe=False,
    entry_points={
        'console_scripts': ['scrapy = scrapy.cmdline:execute']
    },
    classifiers=[
        'Framework :: Scrapy',
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: BSD License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: Implementation :: CPython',
        'Programming Language :: Python :: Implementation :: PyPy',
        'Topic :: Internet :: WWW/HTTP',
        'Topic :: Software Development :: Libraries :: Application Frameworks',
        'Topic :: Software Development :: Libraries :: Python Modules',
    ],
    python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*',
    install_requires=[
        'Twisted>=13.1.0',
        'w3lib>=1.17.0',
        'queuelib',
        'lxml',
        'pyOpenSSL',
        'cssselect>=0.9',
        'six>=1.5.2',
        'parsel>=1.5',
        'PyDispatcher>=2.0.5',
        'service_identity',
    ],
    extras_require=extras_require,
)

entry_points states that the entry point is the execute method in cmdline.py. During installation, setuptools (the packaging tool) generates the small script shown above and places it on the executable path.

It is also worth mentioning how to write an executable file in Python; it only takes a few steps:

  • Write a Python module with a main method (the first line must declare the Python interpreter path)
  • Remove the .py suffix
  • Make it executable: chmod +x <script>

After that you can run the script directly by its file name instead of invoking it with python <file.py>. Simple, isn't it? A minimal sketch follows.
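Here is a minimal sketch of such a script (the file name hello and its contents are a made-up example, not part of Scrapy): save it as hello, run chmod +x hello, then execute it with ./hello.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Hypothetical example file "hello" (no .py suffix), made executable with chmod +x
import sys


def main():
    # Greet each name passed on the command line, or "world" if none was given
    for name in sys.argv[1:] or ['world']:
        print('hello, %s' % name)
    return 0


if __name__ == '__main__':
    sys.exit(main())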


Entry point (cmdline.py: execute)


Now that we know Scrapy's entry point is the execute method in scrapy/cmdline.py, let's take a look at that method.

The main execution flow is annotated with comments in the original post; here is a summary of each stage of that process:


Flow analysis


Initializing the project settings

This step is fairly simple: it initializes the environment based on environment variables and scrapy.cfg, and finally produces a Settings instance. Here is the get_project_settings method (from scrapy.utils.project import inside_project, get_project_settings):

def get_project_settings():
    # Is SCRAPY_SETTINGS_MODULE present in the environment variables?
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        # Initialize the environment: locate the user's settings.py and
        # store it in the SCRAPY_SETTINGS_MODULE environment variable
        init_env(project)

    # Load the default configuration file default_settings.py and create a Settings instance
    settings = Settings()

    # Get the user's settings module
    settings_module_path = os.environ.get(ENVVAR)
    # Update the settings: user settings override the defaults
    if settings_module_path:
        settings.setmodule(settings_module_path, priority='project')

    # XXX: remove this hack
    # If the environment carries other scrapy-related settings, they override as well
    pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
    if pickled_settings:
        settings.setdict(pickle.loads(pickled_settings), priority='project')

    # XXX: deprecate and remove this functionality
    env_overrides = {k[7:]: v for k, v in os.environ.items() if
                     k.startswith('SCRAPY_')}
    if env_overrides:
        settings.setdict(env_overrides, priority='project')

    return settings

During this step the Settings configuration is initialized (from scrapy.settings import Settings):

class Settings(BaseSettings):
    """
    This object stores Scrapy settings for the configuration of internal
    components, and can be used for any further customization.

    It is a direct subclass and supports all methods of
    :class:`~scrapy.settings.BaseSettings`. Additionally, after instantiation
    of this class, the new object will have the global default settings
    described on :ref:`topics-settings-ref` already populated.
    """

    def __init__(self, values=None, priority='project'):
        # Do not pass kwarg values here. We don't want to promote user-defined
        # dicts, and we want to update, not replace, default dicts with the
        # values given by the user
        # Call the parent constructor
        super(Settings, self).__init__()
        # Set every option from default_settings.py onto this settings instance
        self.setmodule(default_settings, 'default')
        # Promote default dictionaries to BaseSettings instances for per-key
        # priorities
        for name, val in six.iteritems(self):
            if isinstance(val, dict):
                self.set(name, BaseSettings(val, 'default'), 'default')
        self.update(values, priority)

The program loads every option from the default configuration file default_settings.py into Settings, and each setting carries a priority.

The default configuration file default_settings.py is very important and well worth reading through: it contains every default setting, such as the scheduler class, spider middleware classes, downloader middleware classes, download handler classes, and so on.

Here you can already sense that Scrapy's architecture is very loosely coupled: every component is replaceable. What does replaceable mean?

For example, if the default scheduler doesn't do what you need, you can implement your own scheduler against the interface it defines, register your scheduler module in your own settings file, and Scrapy will use your new scheduler at runtime. (scrapy-redis implements distributed crawling precisely by replacing Scrapy modules this way.)

Any module configured in the default settings file can be replaced.
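For instance, a project-level settings.py can swap in a different scheduler simply by overriding the SCHEDULER setting (the module path myproject.scheduler.MyScheduler below is a hypothetical placeholder):

# myproject/settings.py
# Replace the default scheduler with a custom class (hypothetical path);
# the class must implement the same interface as scrapy.core.scheduler.Scheduler
SCHEDULER = 'myproject.scheduler.MyScheduler'

# scrapy-redis does exactly this kind of replacement, e.g.:
# SCHEDULER = 'scrapy_redis.scheduler.Scheduler'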


Checking whether we are inside a project


def inside_project():
    # Check whether the environment variable exists (set above)
    scrapy_module = os.environ.get('SCRAPY_SETTINGS_MODULE')
    if scrapy_module is not None:
        try:
            import_module(scrapy_module)
        except ImportError as exc:
            warnings.warn("Cannot import scrapy settings module %s: %s" % (scrapy_module, exc))
        else:
            return True

    # If the environment variable is not set, look for the nearest scrapy.cfg;
    # if one is found, assume we are inside a project
    return bool(closest_scrapy_cfg())

Some scrapy commands depend on a project to run, while others are global and work anywhere. Whether we are inside a project is decided mainly by looking for the nearest scrapy.cfg file, searching upward from the current directory.
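The nearest-file lookup is just a walk up the directory tree. A minimal sketch of the idea (an illustration of the approach, not Scrapy's actual closest_scrapy_cfg implementation):

import os


def find_nearest_cfg(path='.', prev_path=None):
    """Walk upward from `path` looking for scrapy.cfg; return '' if none is found."""
    if path == prev_path:
        # Reached the filesystem root without finding the file
        return ''
    path = os.path.abspath(path)
    cfg_file = os.path.join(path, 'scrapy.cfg')
    if os.path.exists(cfg_file):
        return cfg_file
    # Recurse into the parent directory
    return find_nearest_cfg(os.path.dirname(path), path)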


Collecting the available commands into a {name: instance} dict


def _get_commands_dict(settings, inproject):
    # Import every module under the commands package and build a {cmd_name: cmd} dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
    cmds.update(_get_commands_from_entry_points(inproject))
    # If the user's settings define COMMANDS_MODULE, load those custom command classes too
    cmds_module = settings['COMMANDS_MODULE']
    if cmds_module:
        cmds.update(_get_commands_from_module(cmds_module, inproject))
    return cmds


def _get_commands_from_module(module, inproject):
    d = {}
    # Find every command class (subclass of ScrapyCommand) in this module
    for cmd in _iter_command_classes(module):
        if inproject or not cmd.requires_project:
            # Build the {cmd_name: cmd} dict
            cmdname = cmd.__module__.split('.')[-1]
            d[cmdname] = cmd()
    return d


def _iter_command_classes(module_name):
    # TODO: add `name` attribute to commands and merge this function with
    # scrapy.utils.spider.iter_spider_classes
    # Walk every module in this package and yield the ScrapyCommand subclasses
    for module in walk_modules(module_name):
        for obj in vars(module).values():
            if inspect.isclass(obj) and \
                    issubclass(obj, ScrapyCommand) and \
                    obj.__module__ == module.__name__ and \
                    not obj == ScrapyCommand:
                yield obj

This step imports every module under the commands folder and builds a {cmd_name: cmd} dict; if the user's settings configure custom command classes, those are added as well. In other words, you can write your own command class, register it through the settings, and then use your custom command; a sketch follows.
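As an illustration, a custom command might look roughly like this (the module path myproject.commands and the hello command are hypothetical; the command name comes from the module's file name):

# myproject/commands/hello.py
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    # This command can run outside a project as well
    requires_project = False

    def short_desc(self):
        return "Print a greeting (demo of a custom command)"

    def run(self, args, opts):
        print("hello from a custom scrapy command")

It is then registered in settings.py via COMMANDS_MODULE = 'myproject.commands', after which scrapy hello becomes available.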


Parsing the command and finding the matching command instance


def _pop_command_name(argv):
    i = 0
    for arg in argv[1:]:
        if not arg.startswith('-'):
            del argv[i]
            return arg
        i += 1

This step parses the command line. For scrapy crawl <spider_name>, for example, it extracts crawl, and with the command dict built above it locates the instance of the Command class defined in crawl.py under the commands module.


The command instance parses the command-line options


Once the matching command instance is found, its cmd.process_options method is called (for example, scrapy/commands/crawl.py):

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run a spider"

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        # First call the parent's process_options to parse the common, fixed options
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
        if opts.output:
            if opts.output == '-':
                self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
            else:
                self.settings.set('FEED_URI', opts.output, priority='cmdline')
            feed_exporters = without_none_values(
                self.settings.getwithbase('FEED_EXPORTERS'))
            valid_output_formats = feed_exporters.keys()
            if not opts.output_format:
                opts.output_format = os.path.splitext(opts.output)[1].replace(".", "")
            if opts.output_format not in valid_output_formats:
                raise UsageError("Unrecognized output format '%s', set one"
                                 " using the '-t' switch or as a file extension"
                                 " from the supported list %s" % (opts.output_format,
                                                                  tuple(valid_output_formats)))
            self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        self.crawler_process.crawl(spname, **opts.spargs)
        self.crawler_process.start()

        if self.crawler_process.bootstrap_failed:
            self.exitcode = 1

This step parses the remaining command-line options. The common, fixed options (such as the output location) are handled by the parent class; each command class parses its own specific options. For example, an invocation like scrapy crawl myspider -a category=books -o items.json exercises the -a and -o options shown above (the spider name and file name here are placeholders).


Initializing CrawlerProcess


Finally a CrawlerProcess instance is created, and the run method of the matching command instance is executed:

cmd.crawler_process = CrawlerProcess(settings)
_run_print_help(parser, _run_command, cmd, args, opts)

If the command being run is scrapy crawl <spider_name>, what runs is the run method of commands/crawl.py (see the run method in the code above).

The run method calls crawl and start on the CrawlerProcess instance, and with that the whole crawler gets going.
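The same pair of calls is what you would use to drive Scrapy from your own script instead of the scrapy command. A minimal sketch (MySpider and its import path are placeholders for your own spider class):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_spider import MySpider  # hypothetical spider class

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)   # schedule the spider, just as commands/crawl.py does
process.start()           # start the Twisted reactor; blocks until crawling finishes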

Let's first look at the CrawlerProcess initialization (scrapy/crawler.py):

class CrawlerProcess(CrawlerRunner):

    def __init__(self, settings=None, install_root_handler=True):
        # Call the parent class initializer
        super(CrawlerProcess, self).__init__(settings)
        # Install shutdown signal handlers and set up logging
        install_shutdown_handlers(self._signal_shutdown)
        configure_logging(self.settings, install_root_handler)
        log_scrapy_info(self.settings)

Its constructor calls the constructor of the parent class CrawlerRunner:

class CrawlerRunner(object):

    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        # Get the spider loader
        self.spider_loader = _get_spider_loader(settings)
        self._crawlers = set()
        self._active = set()
        self.bootstrap_failed = False

During initialization, the _get_spider_loader method is called:

def _get_spider_loader(settings):
    """ Get SpiderLoader instance from settings """
    # Read the SPIDER_MANAGER_CLASS option from the settings (deprecated name)
    if settings.get('SPIDER_MANAGER_CLASS'):
        warnings.warn(
            'SPIDER_MANAGER_CLASS option is deprecated. '
            'Please use SPIDER_LOADER_CLASS.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    cls_path = settings.get('SPIDER_MANAGER_CLASS',
                            settings.get('SPIDER_LOADER_CLASS'))
    loader_cls = load_object(cls_path)
    try:
        verifyClass(ISpiderLoader, loader_cls)
    except DoesNotImplement:
        warnings.warn(
            'SPIDER_LOADER_CLASS (previously named SPIDER_MANAGER_CLASS) does '
            'not fully implement scrapy.interfaces.ISpiderLoader interface. '
            'Please add all missing methods to avoid unexpected runtime errors.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    return loader_cls.from_settings(settings.frozencopy())

The spider loader configured in the default settings is spiderloader.SpiderLoader (scrapy/spiderloader.py):

@implementer(ISpiderLoader)
class SpiderLoader(object):
    """
    SpiderLoader is a class which locates and loads spiders
    in a Scrapy project.
    """
    def __init__(self, settings):
        # Read the paths holding the spider scripts from the settings
        self.spider_modules = settings.getlist('SPIDER_MODULES')
        self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')
        self._spiders = {}
        self._found = defaultdict(list)
        # Load all spiders
        self._load_all_spiders()

    def _check_name_duplicates(self):
        dupes = ["\n".join("  {cls} named {name!r} (in {module})".format(
                                module=mod, cls=cls, name=name)
                           for (mod, cls) in locations)
                 for name, locations in self._found.items()
                 if len(locations) > 1]
        if dupes:
            msg = ("There are several spiders with the same name:\n\n"
                   "{}\n\n  This can cause unexpected behavior.".format(
                        "\n\n".join(dupes)))
            warnings.warn(msg, UserWarning)

    def _load_spiders(self, module):
        for spcls in iter_spider_classes(module):
            self._found[spcls.name].append((module.__name__, spcls.__name__))
            self._spiders[spcls.name] = spcls

    def _load_all_spiders(self):
        # Build a {spider_name: spider_cls} dict
        for name in self.spider_modules:
            try:
                for module in walk_modules(name):
                    self._load_spiders(module)
            except ImportError as e:
                if self.warn_only:
                    msg = ("\n{tb}Could not load spiders from module '{modname}'. "
                           "See above traceback for details.".format(
                                modname=name, tb=traceback.format_exc()))
                    warnings.warn(msg, RuntimeWarning)
                else:
                    raise
        self._check_name_duplicates()

    @classmethod
    def from_settings(cls, settings):
        return cls(settings)

    def load(self, spider_name):
        """
        Return the Spider class for the given spider name. If the spider
        name is not found, raise a KeyError.
        """
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError("Spider not found: {}".format(spider_name))

    def find_by_request(self, request):
        """
        Return the list of spider names that can handle the given request.
        """
        return [name for name, cls in self._spiders.items()
                if cls.handles_request(request)]

    def list(self):
        """
        Return a list with the names of all spiders available in the project.
        """
        return list(self._spiders.keys())

The spider loader loads every spider script and finally builds a {spider_name: spider_cls} dict.
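For reference, the same loader can be queried directly. A small sketch, assuming it runs inside a Scrapy project so that get_project_settings can locate scrapy.cfg (the spider name 'myspider' is a placeholder):

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
loader = SpiderLoader.from_settings(settings)

print(loader.list())                  # names of all spiders found under SPIDER_MODULES
spider_cls = loader.load('myspider')  # look up a spider class by name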


Running the crawl and start methods


After CrawlerProcess is initialized, its crawl method is called:

class CrawlerRunner(object):

    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        self.spider_loader = _get_spider_loader(settings)
        self._crawlers = set()
        self._active = set()
        self.bootstrap_failed = False

    @property
    def spiders(self):
        warnings.warn("CrawlerRunner.spiders attribute is renamed to "
                      "CrawlerRunner.spider_loader.",
                      category=ScrapyDeprecationWarning, stacklevel=2)
        return self.spider_loader

    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # Create the crawler
        crawler = self.create_crawler(crawler_or_spidercls)
        return self._crawl(crawler, *args, **kwargs)

    def _crawl(self, crawler, *args, **kwargs):
        self.crawlers.add(crawler)
        # Call the Crawler's crawl method
        d = crawler.crawl(*args, **kwargs)
        self._active.add(d)

        def _done(result):
            self.crawlers.discard(crawler)
            self._active.discard(d)
            self.bootstrap_failed |= not getattr(crawler, 'spider', None)
            return result

        return d.addBoth(_done)

    def create_crawler(self, crawler_or_spidercls):
        # If a Crawler instance was passed in, use it as-is
        if isinstance(crawler_or_spidercls, Crawler):
            return crawler_or_spidercls
        # Otherwise create a Crawler
        return self._create_crawler(crawler_or_spidercls)

    def _create_crawler(self, spidercls):
        # If it is a string, load the spider class from the spider loader
        if isinstance(spidercls, six.string_types):
            spidercls = self.spider_loader.load(spidercls)
        return Crawler(spidercls, self.settings)

    def stop(self):
        """
        Stops simultaneously all the crawling jobs taking place.

        Returns a deferred that is fired when they all have ended.
        """
        return defer.DeferredList([c.stop() for c in list(self.crawlers)])

    @defer.inlineCallbacks
    def join(self):
        """
        join()

        Returns a deferred that is fired when all managed :attr:`crawlers` have
        completed their executions.
        """
        while self._active:
            yield defer.DeferredList(self._active)

This step creates a Crawler instance and then calls its crawl method (class Crawler in scrapy/crawler.py):

    @defer.inlineCallbacks
    def crawl(self, *args, **kwargs):
        assert not self.crawling, "Crawling already taking place"
        self.crawling = True

        try:
            # Only now is the spider instance actually created
            self.spider = self._create_spider(*args, **kwargs)
            # Create the execution engine
            self.engine = self._create_engine()
            # Call the spider class's start_requests method to get the initial requests
            start_requests = iter(self.spider.start_requests())
            # Open the spider on the engine, passing in the spider instance and initial requests
            yield self.engine.open_spider(self.spider, start_requests)
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            # In Python 2 reraising an exception after yield discards
            # the original traceback (see https://bugs.python.org/issue7563),
            # so sys.exc_info() workaround is used.
            # This workaround also works in Python 3, but it is not needed,
            # and it is slower, so in Python 3 we use native `raise`.
            if six.PY2:
                exc_info = sys.exc_info()

            self.crawling = False
            if self.engine is not None:
                yield self.engine.close()

            if six.PY2:
                six.reraise(*exc_info)
            raise

Finally the start method is called:

    def start(self, stop_after_crawl=True):
        """
        This method starts a Twisted `reactor`_, adjusts its pool size to
        :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS cache based
        on :setting:`DNSCACHE_ENABLED` and :setting:`DNSCACHE_SIZE`.

        If `stop_after_crawl` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.

        :param boolean stop_after_crawl: stop or not the reactor when all
            crawlers have finished
        """
        if stop_after_crawl:
            d = self.join()
            # Don't start the reactor if the deferreds are already fired
            if d.called:
                return
            d.addBoth(self._stop_reactor)

        reactor.installResolver(self._get_dns_resolver())
        # Configure the reactor's thread pool size (adjustable via REACTOR_THREADPOOL_MAXSIZE)
        tp = reactor.getThreadPool()
        tp.adjustPoolsize(maxthreads=self.settings.getint('REACTOR_THREADPOOL_MAXSIZE'))
        reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
        # Start running
        reactor.run(installSignalHandlers=False)  # blocking call

So what is the reactor? It is Twisted's event manager (event loop). You register the callbacks you want to run with the reactor and then call its run method; the reactor runs the registered callbacks for you, and whenever one of them is waiting on network I/O it automatically switches to another runnable callback, which is very efficient.

You don't need to worry about how the reactor works internally; think of it as a thread pool that executes work through registered callbacks.
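A tiny standalone illustration of this register-then-run model (plain Twisted, unrelated to Scrapy's own code):

from twisted.internet import reactor


def say(msg):
    print(msg)


def stop():
    # Stop the event loop so the script can exit
    reactor.stop()


# Register two timed callbacks, then hand control over to the reactor
reactor.callLater(1, say, "hello from the reactor")
reactor.callLater(2, stop)
reactor.run()  # blocking call, just like CrawlerProcess.start()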

At this point, scheduling of the crawl is handed over to the execution engine, ExecutionEngine.

So every run of a scrapy command goes through environment and settings initialization, loading of the command classes and spider modules, and finally instantiation of the execution engine, which takes over scheduling. The next article explains how the execution engine schedules and coordinates the individual components.
