Python web scraping: crawling dynamic pages with Scrapy and Splash


Dependencies:

pip install scrapy-splash
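Note that scrapy-splash only talks to a running Splash instance; it does not start one for you. Assuming Docker is available, the usual way to launch the official Splash image (it listens on port 8050 by default) is:

docker run -p 8050:8050 scrapinghub/splash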

Configure settings.py:

# Splash server address
SPLASH_URL = 'http://localhost:8050'

# Spider middleware that enables cache_args support (optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Splash-aware duplicate request filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Code example

Simply replace the original Request with SplashRequest:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import cmdline
from scrapy_splash import SplashRequest


class ToscrapeJsSpider(scrapy.Spider):
    name = "toscrape_js"
    allowed_domains = ["toscrape.com"]
    start_urls = (
        'http://quotes.toscrape.com/js/',
    )

    def start_requests(self):
        for url in self.start_urls:
            # "wait" gives the page's JavaScript time to finish;
            # "timeout" must exceed "wait", or Splash may abort the render.
            # "images": 0 tells Splash not to download images (note the plural key).
            yield SplashRequest(url, args={"timeout": 10, "images": 0, "wait": 5})

    def parse(self, response):
        # The response contains the JS-rendered HTML, so the quotes are present.
        quotes = response.css(".quote .text::text").extract()
        for quote in quotes:
            print(quote)


if __name__ == '__main__':
    cmdline.execute("scrapy crawl toscrape_js".split())

SplashRequest parameter reference

url: the page to crawl

headers: same as in Request

cookies: same as in Request

args: {dict} arguments passed to Splash; wait matters most, since the page's JavaScript needs enough time to execute (see the sketch after this list)

cache_args: {list} argument names that Splash should cache

endpoint: the Splash endpoint service, defaults to render.html

splash_url: the Splash server address, defaults to SPLASH_URL from settings.py
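To make endpoint and cache_args concrete, here is a minimal sketch, not from the original post, following the pattern in the scrapy-splash README: with lua_script defined at module level and start_requests replacing the one in the spider above, each page is rendered through the execute endpoint with a small Lua script, and cache_args lets Splash cache the script source instead of receiving it with every request.

lua_script = """
function main(splash, args)
    splash:go(args.url)   -- SplashRequest passes the request URL as args.url
    splash:wait(5)        -- give the page's JavaScript time to run
    return splash:html()  -- return the rendered HTML to Scrapy
end
"""

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(
            url,
            endpoint='execute',          # run the Lua script instead of render.html
            args={'lua_source': lua_script, 'timeout': 10},
            cache_args=['lua_source'],   # upload the script once, then reference it by hash
        )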

Setting a proxy:

args={
    'proxy': 'http://proxy_ip:proxy_port'
}
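The proxy goes into the same args dict as wait. A minimal hedged sketch (proxy_ip and proxy_port stay as placeholders; Splash also accepts user:password@host:port credentials in the proxy URL):

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, args={
            'wait': 5,
            'proxy': 'http://proxy_ip:proxy_port',  # placeholder address; Splash fetches through this proxy
        })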


Reference:
using proxy with scrapy-splash



