1. Reading and parsing the next page in a loop:
spidertest\spidertest\spiders\bt.py:
import scrapy
from urllib import parse
from scrapy.http import Request

# Inherits from scrapy.Spider
class JobboleSpider(scrapy.Spider):
    # Name used to run this spider
    name = "jobbole"
    # Allowed domains
    allowed_domains = ["python.jobbole.com"]
    # Starting URLs to crawl
    start_urls = ['http://python.jobbole.com/all-posts/']

    def parse(self, response):
        # 1. Extract every article URL on the list page, hand them to
        #    Scrapy for download, then parse each one in parse_detail
        post_nodes = response.css("#archive .floated-thumb .post-thumb a")
        for post_node in post_nodes:
            image_url = post_node.css("img::attr(src)").extract_first("")
            # Get the href value of the node itself
            post_url = post_node.css("::attr(href)").extract_first("")
            yield Request(url=parse.urljoin(response.url, post_url),
                          meta={"front_image_url": image_url},
                          callback=self.parse_detail)
        # 2. Extract the next-page URL and hand it to Scrapy for download;
        #    once downloaded, the response is passed back to parse
        # This tag carries two class names at once: .next and .page-numbers
        next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail(self, response):
        # Field extraction goes here
        pass
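parse_detail is only a stub above; a minimal sketch of how it could read back the value stored under meta= when the Request was yielded (the .entry-header selector is an assumption for illustration, not verified against the page):

    def parse_detail(self, response):
        # Read the value stored via meta= in the Request yielded by parse
        front_image_url = response.meta.get("front_image_url", "")
        # Hypothetical field extraction; this selector is an assumption
        title = response.css(".entry-header h1::text").extract_first("")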
Notes:
①. There are several kinds of response; here it is an HtmlResponse (a subclass of TextResponse):
 a. _DEFAULT_ENCODING is "ascii", but the effective result is encoding="utf-8"
 b. body is the content of the entire HTML source file (a shell session sketch at the end of these notes demonstrates both points)
②. parse.urljoin():
 a. Intelligently joins a base address and a relative address into one absolute address.
③. Examples (in a Python interpreter):
>>> from urllib.parse import urljoin
>>> urljoin("http://xx.com/1/aaa.html", "bbbb.html")
'http://xx.com/1/bbbb.html'
>>> urljoin("http://xx.com/1/aaa.html", "2/bbbb.html")
'http://xx.com/1/2/bbbb.html'
>>> urljoin("http://xx.com/1/aaa.html", "http://xx.com/3/ccc.html")
'http://xx.com/3/ccc.html'
>>> urljoin("http://xx.com/1/aaa.html", "javascript:void(0)")
'javascript:void(0)'
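The last example shows that urljoin passes "javascript:void(0)" through unchanged, so a spider yielding its result blindly would request a pseudo-URL. A minimal sketch of a guard (the helper name absolutize is hypothetical):

from urllib.parse import urljoin, urlparse

def absolutize(base, href):
    # Join as in the examples above, then keep only real HTTP(S) URLs;
    # pseudo-URLs like "javascript:void(0)" have a non-HTTP scheme and are dropped
    url = urljoin(base, href)
    return url if urlparse(url).scheme in ("http", "https") else None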
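To verify note ① above, one can open the Scrapy shell; a sketch of such a session (the output values shown are typical, not guaranteed):

$ scrapy shell http://python.jobbole.com/all-posts/
>>> type(response)
<class 'scrapy.http.response.html.HtmlResponse'>
>>> response.encoding      # detected at runtime, typically "utf-8"
'utf-8'
>>> response.body[:60]     # raw bytes of the full HTML source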