Scrapy Source Code Analysis: The Robots Protocol (Part 1)

The purpose of this article is to record my own process of reading, analyzing, and rewriting part of the Scrapy source code. If you spot any mistakes, corrections from experienced readers are very welcome! ☀️


Special note: the articles on this public account are for research and learning purposes only and must not be used for anything else.


Table of Contents

① Questions to Consider

② Source Code Analysis

③ Rewriting the Source Code

④ Takeaways


Part 1: Questions to Consider


1. What is the Robots protocol?

The Robots protocol (also known as the crawler protocol or robots protocol), formally the Robots Exclusion Protocol, is the mechanism by which a website tells search engines and other crawlers which pages may be crawled and which may not.



2. Where is the Robots protocol configured in Scrapy?

It is configured in the settings.py file.

Setting: ROBOTSTXT_OBEY = True

Explanation: obeying the Robots protocol is enabled by default. A site administrator places a robots.txt text file in the root of the site's domain; it specifies which pages different crawlers may visit and which are off limits, using path rules (optionally with wildcards) rather than full regular expressions. Before collecting data from the site, a well-behaved crawler first fetches this file, parses the rules in it, and only then crawls the site according to those rules.
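
To make this concrete, here is a minimal sketch (using Python's standard urllib.robotparser and a made-up robots.txt, not Scrapy itself) of how such rules are interpreted:

# A minimal sketch: parse a hypothetical robots.txt and check which URLs
# a crawler named "MyBot" is allowed to fetch.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyBot", "https://example.com/admin/login"))  # False: path is disallowed
print(rp.can_fetch("MyBot", "https://example.com/public/page"))  # True: path is allowed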



3. How do we stop Scrapy from obeying the Robots protocol? What options are there?

1) Change ROBOTSTXT_OBEY from True to False (a sketch of this follows below the list).

2) ? (we will come back to this in Part 3.)
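
For the first option, a small sketch; DemoSpider is a hypothetical spider, shown only to illustrate that custom_settings can scope the change to a single spider instead of the whole project:

import scrapy

# Option A - settings.py: stop obeying robots.txt for the whole project
ROBOTSTXT_OBEY = False


# Option B - scope the change to a single spider via custom_settings
# (DemoSpider is a made-up example)
class DemoSpider(scrapy.Spider):
    name = "demo"
    custom_settings = {"ROBOTSTXT_OBEY": False}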



4. How does the source code implement the robots check?

We don't know yet, so with these questions in mind, let's read the source code!



Part 2: Reading the Source Code


Note: this analysis is based on Scrapy 2.6.1, so make sure you are looking at the same version!
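
If you want to confirm which version is installed locally, a quick check from Python (the printed value is what I would expect on a 2.6.1 install):

# Print the installed Scrapy version.
import scrapy
print(scrapy.__version__)  # e.g. 2.6.1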

1. First, let's build a small demo. The code is as follows:

# spiders directory
# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        print(response.text)


The settings.py file:

# settings directory
# -*- coding: utf-8 -*-
BOT_NAME = 'scrapy_demo'

SPIDER_MODULES = ['scrapy_demo.spiders']
NEWSPIDER_MODULE = 'scrapy_demo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True  # keep the default True for now

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32


2. After running the code, the output is as follows:

2022-03-16 18:50:01 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapy_demo)
2022-03-16 18:50:01 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.7.9 (v3.7.9:13c94747c7, Aug 15 2020, 01:31:08) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Darwin-21.3.0-x86_64-i386-64bit
2022-03-16 18:50:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_demo',
'NEWSPIDER_MODULE': 'scrapy_demo.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_demo.spiders']}
2022-03-16 18:50:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-03-16 18:50:01 [scrapy.extensions.telnet] INFO: Telnet Password: 90a5b28dfa1da593
2022-03-16 18:50:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-16 18:50:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-16 18:50:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-16 18:50:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-16 18:50:02 [scrapy.core.engine] INFO: Spider opened
2022-03-16 18:50:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-16 18:50:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-16 18:50:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://baidu.com/robots.txt> (referer: None)
2022-03-16 18:50:02 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://baidu.com/>
2022-03-16 18:50:02 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-16 18:50:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 219,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2702,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.261867,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 16, 10, 50, 2, 381648),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'memusage/max': 64581632,


From the log above we can see that the Robots protocol has already taken effect: the request to http://baidu.com/ was dropped with "Forbidden by robots.txt".




3. The log shows that it was the robots downloader middleware that intercepted the request. To make the whole flow easier to follow, here is the Scrapy architecture diagram:

[Figure: Scrapy architecture / data-flow diagram]



Summary: reading this diagram together with the demo we just ran, the flow becomes clear. The Engine takes the initial Requests from the Spider and hands them to the Scheduler; the Scheduler enqueues them according to its rules and, when the Engine asks for the next request, returns them so the Engine can pass them on to the Downloader. On this path there is an extra layer, the Downloader Middlewares: every request travelling from the Engine to the Downloader is processed by these middlewares first, so they can apply additional rules and logic before the request is actually downloaded.

With this in mind, we can be fairly confident that the robots check is carried out inside the robots downloader middleware: that is where some operation must decide whether a request complies with the Robots protocol.
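
Before reading the real thing, it may help to see a bare-bones sketch of the downloader middleware interface, just to show where the hooks sit. The class name here is made up; only the method names come from Scrapy's middleware contract:

# Skeleton of a downloader middleware (illustrative only).
class SketchDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy constructs the middleware from the crawler, which exposes settings and stats.
        return cls()

    def process_request(self, request, spider):
        # Called for every request on its way to the downloader.
        # Return None to let it continue, return a Response/Request to short-circuit,
        # or raise IgnoreRequest to drop the request.
        return None

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the engine.
        return response

    def process_exception(self, request, exception, spider):
        # Called when downloading the request raised an exception.
        return None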


4. Here is the RobotsTxtMiddleware source code:

"""
This is a middleware to respect robots.txt policies. To activate it you must
enable this middleware and enable the ROBOTSTXT_OBEY setting.


"""


import logging


from twisted.internet.defer import Deferred, maybeDeferred
from scrapy.exceptions import NotConfigured, IgnoreRequest
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached
from scrapy.utils.log import failure_to_exc_info
from scrapy.utils.misc import load_object


logger = logging.getLogger(__name__)




class RobotsTxtMiddleware:
DOWNLOAD_PRIORITY = 1000


def __init__(self, crawler):
if not crawler.settings.getbool('ROBOTSTXT_OBEY'):
raise NotConfigured
self._default_useragent = crawler.settings.get('USER_AGENT', 'Scrapy')
self._robotstxt_useragent = crawler.settings.get('ROBOTSTXT_USER_AGENT', None)
self.crawler = crawler
self._parsers = {}
self._parserimpl = load_object(crawler.settings.get('ROBOTSTXT_PARSER'))


# check if parser dependencies are met, this should throw an error otherwise.
self._parserimpl.from_crawler(self.crawler, b'')


@classmethod
def from_crawler(cls, crawler):
return cls(crawler)


def process_request(self, request, spider):
if request.meta.get('dont_obey_robotstxt'):
return
d = maybeDeferred(self.robot_parser, request, spider)
d.addCallback(self.process_request_2, request, spider)
return d


def process_request_2(self, rp, request, spider):
if rp is None:
return


useragent = self._robotstxt_useragent
if not useragent:
useragent = request.headers.get(b'User-Agent', self._default_useragent)
if not rp.allowed(request.url, useragent):
logger.debug("Forbidden by robots.txt: %(request)s",
{'request': request}, extra={'spider': spider})
self.crawler.stats.inc_value('robotstxt/forbidden')
raise IgnoreRequest("Forbidden by robots.txt")


def robot_parser(self, request, spider):
url = urlparse_cached(request)
netloc = url.netloc


if netloc not in self._parsers:
self._parsers[netloc] = Deferred()
robotsurl = f"{url.scheme}://{url.netloc}/robots.txt"
robotsreq = Request(
robotsurl,
priority=self.DOWNLOAD_PRIORITY,
meta={'dont_obey_robotstxt': True}
)
dfd = self.crawler.engine.download(robotsreq)
dfd.addCallback(self._parse_robots, netloc, spider)
dfd.addErrback(self._logerror, robotsreq, spider)
dfd.addErrback(self._robots_error, netloc)
self.crawler.stats.inc_value('robotstxt/request_count')


if isinstance(self._parsers[netloc], Deferred):
d = Deferred()


def cb(result):
d.callback(result)
return result
self._parsers[netloc].addCallback(cb)
return d
else:
return self._parsers[netloc]


def _logerror(self, failure, request, spider):
if failure.type is not IgnoreRequest:
logger.error("Error downloading %(request)s: %(f_exception)s",
{'request': request, 'f_exception': failure.value},
exc_info=failure_to_exc_info(failure),
extra={'spider': spider})
return failure


def _parse_robots(self, response, netloc, spider):
self.crawler.stats.inc_value('robotstxt/response_count')
self.crawler.stats.inc_value(f'robotstxt/response_status_count/{response.status}')
rp = self._parserimpl.from_crawler(self.crawler, response.body)
rp_dfd = self._parsers[netloc]
self._parsers[netloc] = rp
rp_dfd.callback(rp)


def _robots_error(self, failure, netloc):
if failure.type is not IgnoreRequest:
key = f'robotstxt/exception_count/{failure.type}'
self.crawler.stats.inc_value(key)
rp_dfd = self._parsers[netloc]
self._parsers[netloc] = None
rp_dfd.callback(None)


5. Walking through the source code

1. Looking at the __init__ method, we can see that when ROBOTSTXT_OBEY is set to False, the middleware immediately raises NotConfigured and goes no further. The code is as follows:

def __init__(self, crawler):
    if not crawler.settings.getbool('ROBOTSTXT_OBEY'):
        raise NotConfigured
    self._default_useragent = crawler.settings.get('USER_AGENT', 'Scrapy')
    self._robotstxt_useragent = crawler.settings.get('ROBOTSTXT_USER_AGENT', None)
    self.crawler = crawler
    self._parsers = {}
    self._parserimpl = load_object(crawler.settings.get('ROBOTSTXT_PARSER'))

    # check if parser dependencies are met, this should throw an error otherwise.
    self._parserimpl.from_crawler(self.crawler, b'')


class NotConfigured(Exception):
    """Indicates a missing configuration situation"""
    pass


Note: raising this exception does not produce any error in the output. Scrapy catches NotConfigured while loading its middlewares and simply skips the one that raised it, which is why we never see an error message!
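
As an illustration only (a simplified paraphrase, not the actual Scrapy middleware-manager code), the loading step behaves roughly like this:

from scrapy.exceptions import NotConfigured


def build_enabled_middlewares(crawler, middleware_classes):
    """Roughly how NotConfigured is treated while loading middlewares (simplified)."""
    enabled = []
    for cls in middleware_classes:
        try:
            mw = cls.from_crawler(crawler)
        except NotConfigured:
            # The middleware opted out: it is skipped silently instead of crashing the crawl.
            continue
        enabled.append(mw)
    return enabled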



2. Let's keep going. When ROBOTSTXT_OBEY is set to True, execution continues and the middleware checks whether the requested URL is allowed by the site's robots rules. The code is as follows:

def process_request(self, request, spider):
    if request.meta.get('dont_obey_robotstxt'):
        return
    d = maybeDeferred(self.robot_parser, request, spider)
    d.addCallback(self.process_request_2, request, spider)
    return d


def robot_parser(self, request, spider):
    url = urlparse_cached(request)
    netloc = url.netloc

    if netloc not in self._parsers:
        self._parsers[netloc] = Deferred()
        robotsurl = f"{url.scheme}://{url.netloc}/robots.txt"
        robotsreq = Request(
            robotsurl,
            priority=self.DOWNLOAD_PRIORITY,
            meta={'dont_obey_robotstxt': True}
        )
        dfd = self.crawler.engine.download(robotsreq)
        dfd.addCallback(self._parse_robots, netloc, spider)
        dfd.addErrback(self._logerror, robotsreq, spider)
        dfd.addErrback(self._robots_error, netloc)
        self.crawler.stats.inc_value('robotstxt/request_count')


When a request passes through this downloader middleware, it first hits process_request. Since we did not set dont_obey_robotstxt in the request's meta, execution continues: the call to robot_parser is wrapped in Twisted's maybeDeferred so it can be resolved asynchronously. Inside robot_parser the request gets special treatment: the middleware checks whether it already has a robots parser cached for this domain, and if not, it builds a request for the site's robots.txt (the file in the server's root directory that we discussed earlier) and sends it off via the engine. That robots.txt request carries:

meta={'dont_obey_robotstxt': True}

so that the middleware does not apply the robots check to the robots.txt request itself (which would otherwise recurse); fetching robots.txt only once per domain is guaranteed by the per-domain cache in self._parsers. The engine finally hands the request to the downloader, and the middleware records the outcome in the crawl stats:

the robotstxt/request_count value is incremented by 1 when the robots.txt request is sent;

the robotstxt/response_count value is incremented by 1 on a successful download;

the robotstxt/response_status_count/{response.status} value is incremented by 1 as well.

If the request instead ends in an exception or error, the following stat is recorded:

the robotstxt/exception_count/{failure.type} value is incremented by 1.
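
The per-domain caching and the waiting logic are easier to see in isolation. Below is a stripped-down sketch of the same pattern using plain Twisted Deferreds; the function names and the _parsers dict here are invented for illustration:

from twisted.internet.defer import Deferred

_parsers = {}  # netloc -> Deferred (download in progress) or a parser object


def get_parser(netloc):
    """Return the cached parser, or a Deferred that fires once it is available."""
    if netloc not in _parsers:
        _parsers[netloc] = Deferred()   # placeholder: the robots.txt request goes out here
    cached = _parsers[netloc]
    if isinstance(cached, Deferred):
        d = Deferred()

        def cb(result):
            d.callback(result)          # hand the parser to this waiter
            return result               # and keep passing it down the chain
        cached.addCallback(cb)
        return d
    return cached                       # the parser is already available


def on_robots_downloaded(netloc, parser):
    """Callback fired once robots.txt has been fetched and parsed."""
    dfd = _parsers[netloc]
    _parsers[netloc] = parser           # replace the placeholder with the real parser
    dfd.callback(parser)                # wake up everyone who was waiting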


Finally, process_request_2 runs:

def process_request_2(self, rp, request, spider):
    if rp is None:
        return
    useragent = self._robotstxt_useragent
    if not useragent:
        useragent = request.headers.get(b'User-Agent', self._default_useragent)
    if not rp.allowed(request.url, useragent):
        logger.debug("Forbidden by robots.txt: %(request)s",
                     {'request': request}, extra={'spider': spider})
        self.crawler.stats.inc_value('robotstxt/forbidden')
        raise IgnoreRequest("Forbidden by robots.txt")




If the parsed rules do not allow the URL for our User-Agent, the middleware logs "Forbidden by robots.txt", increments the robotstxt/forbidden stat, and raises IgnoreRequest. This matches the log we got when running the demo almost line for line, and that is the entire execution flow of this middleware!
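
Two of the settings read in __init__ can be tuned in settings.py if needed. A sketch with example values; ProtegoRobotParser is the documented default parser in this Scrapy version:

# settings.py

# User-Agent string matched against the robots.txt rules; if unset, the middleware
# falls back to the request's own User-Agent header and then to USER_AGENT.
ROBOTSTXT_USER_AGENT = 'MyCrawler'  # example value

# robots.txt parser implementation, loaded via load_object() in __init__.
ROBOTSTXT_PARSER = 'scrapy.robotstxt.ProtegoRobotParser'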



Part 3: Rewriting the Source Code


Now that we have finished reading the Scrapy authors' code, we can start tweaking it for our own purposes. Here are a few approaches:


Approach 1:

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ['http://baidu.com/']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse,
                             meta={'dont_obey_robotstxt': True})

    def parse(self, response):
        print(response.text)


"""
With this approach the robots middleware check is skipped even while ROBOTSTXT_OBEY = True,
because the request sets dont_obey_robotstxt = True in its meta.
"""


Approach 2: disable the middleware entirely in settings.py (this is the second way to stop obeying robots.txt that we left open in Part 1):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}



Approach 3: rewrite the RobotsTxtMiddleware class itself so that __init__ always raises NotConfigured, which means the middleware is never enabled:

from scrapy.exceptions import NotConfigured


class RobotsTxtMiddleware:
    DOWNLOAD_PRIORITY = 1000

    def __init__(self, crawler):
        raise NotConfigured

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)
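
One way to apply this rewrite without touching the installed Scrapy package is to keep the class in your own project and register it in place of the built-in middleware. A sketch, assuming the class lives in scrapy_demo/middlewares.py (that module path is just an example):

# settings.py: swap the built-in robots middleware for the Approach 3 rewrite
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
    'scrapy_demo.middlewares.RobotsTxtMiddleware': 100,
}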



Part 4: Takeaways


This kind of exercise requires some understanding of how Scrapy works under the hood. Thanks for reading, and feel free to reach out with corrections or just to chat ☀️



I am TheWeiJun. I pursue what I care about with persistence, believe in lifelong growth, refuse to put myself in a box, love technology without being confined by it, and enjoy sharing, reading, and making new friends. Feel free to add me on WeChat.

I share day-to-day notes on crawling, reverse engineering, and analysis. If there are any mistakes in this article, corrections and discussion are always welcome ☀️





