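Background

The spring hiring season, the busiest of the year, is here again. Veterans across the IT world are getting ready to move, while beginners like me look for room to survive in the gaps. To make it easier to collect job postings in a specific field, I spent three hours of a late night and got a Scrapy crawler to fetch the postings I wanted and save them as JSON files, which serve as the raw data for later analysis and lookup.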

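How Scrapy works

engine: the core intermediary that controls the data flow between all components of the system; think of it as the "messenger"
spider: parses responses and extracts Items and follow-up URLs
downloader: downloads pages and hands the responses back to the engine
scheduler: queues the URLs it receives from the engine and filters out duplicate URLs by default
item pipelines: process the Items that the spider extracts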

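1) The engine takes the initial URLs from the spider, to be handed over to the scheduler.
2) The engine gives the initial URLs to the scheduler, which puts them in its queue.
3) The scheduler reports that the URL is queued and returns the next URL to the engine, to be passed to the downloader.
4) The engine passes the URL to the downloader, which downloads the page source.
5) The downloader tells the engine the download is finished and hands the page source back as a response.
6) The engine passes the response to the spider, which parses it and extracts the data.
7) The spider returns the extracted data to the engine: new URLs go to the scheduler's queue, and Items go to the Item Pipelines for storage.
8) The Item Pipelines store the extracted data and then tell the engine it can move on to the next URL.
9) Steps 3-8 repeat until the scheduler has no URLs left, at which point the engine closes the spider.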
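
The steps above can be sketched as a toy loop in plain Python. This is not Scrapy's code: the names FAKE_SITE, Scheduler, download, parse and run_engine are made up for illustration, and the "website" is just an in-memory dict. It only shows how URLs and items move between the components and how the scheduler drops duplicate URLs.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, links found on the page)
FAKE_SITE = {
    "page/1": ("first page", ["page/2", "page/1"]),
    "page/2": ("second page", []),
}


class Scheduler:
    """Queues URLs and drops the ones it has already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None


def download(url):
    """Downloader: returns the 'response' for a URL."""
    return FAKE_SITE[url]


def parse(url, response):
    """Spider: extracts one item and the follow-up URLs from a response."""
    text, links = response
    return {"url": url, "text": text}, links


def run_engine(start_urls):
    """Engine: moves data between scheduler, downloader, spider and pipeline."""
    scheduler, stored = Scheduler(), []
    for url in start_urls:                      # steps 1-2: queue the initial URLs
        scheduler.enqueue(url)
    while (url := scheduler.next_url()) is not None:   # step 3
        response = download(url)                # steps 4-5
        item, links = parse(url, response)      # step 6
        for link in links:                      # step 7: new URLs go back to the scheduler
            scheduler.enqueue(link)
        stored.append(item)                     # step 8: the "pipeline" stores the item
    return stored                               # step 9: queue is empty, stop


items = run_engine(["page/1"])
print([item["url"] for item in items])
```

Although page/1 links back to itself, it is downloaded only once, because the scheduler has already seen it.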

Creating the project

Create a crawler project named test_spider: open cmd and cd into the directory where the project should live.

scrapy startproject test_spider

Example:

(base) D:\MyProjects\Code600\com\spider\scrapy>scrapy startproject TXmovies # project name
New Scrapy project 'TXmovies', using template directory 'd:\anaconda\lib\site-packages\scrapy\templates\project', created in:
    D:\MyProjects\Code600\com\spider\scrapy\TXmovies

You can start your first spider with:
    cd TXmovies
    scrapy genspider example example.com

(base) D:\MyProjects\Code600\com\spider\scrapy>cd TXmovies # enter the project directory

(base) D:\MyProjects\Code600\com\spider\scrapy\TXmovies>scrapy genspider txms v.qq.com # spider name, allowed domain
Created spider 'txms' using template 'basic' in module:
  TXmovies.spiders.txms

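Project directory structure:

__init__.py: package initialization file
items.py: defines the fields to scrape. An Item is the container that holds the scraped data.
middlewares.py: middlewares
pipelines.py: pipeline file that processes the Items extracted by the spider, e.g. data persistence (saving the results to a file or database)
settings.py: configuration file
spiders: directory holding the core spider code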

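Creating the spider

Create the spider file with: scrapy genspider <spider name> <allowed domain>. Scrapy refuses to create a spider whose name equals the project name, so pick a name other than test_spider.

scrapy genspider example example.com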

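Planning the crawl

First decide which site to crawl and which information to extract (e.g. post type, title, URL, author, publication date).
Check whether the page source matches the code shown in the browser's Elements panel. The response a spider receives is the page source, so if the two differ, work from the page source rather than the Elements panel.
Study the pagination pattern: can the next-page URL be extracted directly, and how should the detail pages be parsed?
Look at the markup around the target information to locate the elements and extract the data.
Decide how to store the extracted data; here it is saved as JSON.
Once all of the above is clear, start coding, beginning with items.py.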
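
One common way the page source and the Elements panel differ: browsers typically insert a <tbody> element into tables when they build the DOM, even if the raw HTML has none, so an XPath copied from the Elements panel may match nothing in the response. The sketch below uses only the standard library to list the tags present in a piece of raw HTML. The raw_html string is a made-up example, and whether this applies to any particular site has to be checked against its real source.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every opening tag found in raw HTML."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tags_in_source(html):
    """Return the set of tag names that literally appear in the source."""
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


# Hypothetical raw source: a table written without <tbody>
raw_html = '<table class="demo"><tr><th>title</th></tr></table>'
tags = tags_in_source(raw_html)
print("tbody" in tags)  # False: an XPath containing /tbody would not match this source
```

If the tag is missing from the source, drop it from the XPath or use // to skip over it.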

Defining the items: items.py

Pin down what to scrape by defining the fields to extract.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MuchongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # post type
    subject = scrapy.Field()
    # title
    title = scrapy.Field()
    # link
    link = scrapy.Field()
    # author of the post
    author = scrapy.Field()
    # date
    date = scrapy.Field()


class DetailsItem(scrapy.Item):
    title = scrapy.Field()
    detail = scrapy.Field()

Writing the spider: muchongrecruit.py

Parse the responses and extract both the data and the new URLs.

import scrapy

from com.spider.scrapy.muchong.muchong.items import MuchongItem, DetailsItem


class MuchongSpider(scrapy.Spider):
    name = 'muchongre'
    allowed_domains = ['www.muchong.com']

    # start_urls = ['http://www.muchong.com/f-430-{}-typeid-2303']
    def start_requests(self):
        base_url = 'http://www.muchong.com/f-430-{}-typeid-2306'
        for page in range(1, 5):
            print('Fetching page %s' % page)
            url = base_url.format(page)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        results_table = response.xpath('//table[@class = "xmc_bpt"]/tbody')
        for table in results_table:
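            subject = table.xpath('./tr/th[2]/span/a/text()').extract_first()
            title = table.xpath('./tr/th[2]/a/text()').extract_first()
            href = table.xpath('./tr/th[2]/a/@href').extract_first()
            if href is None:
                # a row without a post link cannot be followed, so skip it
                continue
            # urljoin resolves the href against the URL of the current page
            link = response.urljoin(href)
            author = table.xpath('./tr/th[3]/cite/a/text()').extract_first()
            date = table.xpath('./tr/th[3]/span//text()').extract()
            date = [x.strip() for x in date if x.strip() != '']
            # guard against rows with no date text instead of raising IndexError
            date = date[0] if date else None
            print(subject, ' ', title, ' ', link, ' ', author, ' ', date)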
            item = MuchongItem()
            item['subject'] = subject
            item['title'] = title
            item['link'] = link
            item['author'] = author
            item['date'] = date
            yield item
            yield scrapy.Request(url=item['link'], callback=self.details)

    def details(self, response):
        item = DetailsItem()
        title = response.xpath(
            '//div[@id="maincontent"]/table/tbody[2]//td[@class="plc_mind"]//div[@class="plc_Con"]/h1//text()').extract()
        title = [x.strip() for x in title if x.strip() != '']
        title = " ".join(title)
        detail = response.xpath(
            '//div[@id="maincontent"]/table/tbody[2]//td[@class="plc_mind"]//div[@class="t_fsz"]//text()').extract()
        detail = [x.strip() for x in detail if x.strip() != '']
        detail = " ".join(detail)
        item['title'] = title
        item['detail'] = detail
        yield item
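
The whitespace cleanup and link building used in the spider can be tried out in isolation, without Scrapy. The sketch below is a standalone illustration: clean_fragments and first_or_default are helper names invented here, and the sample fragments and the /t-123-1 path are made-up values, not data from the site.

```python
from urllib.parse import urljoin


def clean_fragments(fragments):
    """Strip whitespace from each text fragment and drop the empty ones."""
    return [text.strip() for text in fragments if text.strip()]


def first_or_default(fragments, default=""):
    """Return the first non-empty fragment, or a default when there is none."""
    cleaned = clean_fragments(fragments)
    return cleaned[0] if cleaned else default


# Text nodes as //text() might return them: padded with whitespace and newlines
raw_date = ["\n  ", " 2022-01-19 ", "\n"]
date = first_or_default(raw_date)

raw_detail = ["  Postdoc position ", "\r\n", " apply by email "]
detail = " ".join(clean_fragments(raw_detail))

# Resolve a relative link against the URL of the listing page
link = urljoin("http://www.muchong.com/f-430-1-typeid-2306", "/t-123-1")

print(date, "|", detail, "|", link)
```

Returning a default instead of indexing [0] directly avoids an IndexError when a row has no text at all.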

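Storing the data: pipelines.py

Design a pipeline to store the data. Each Item the spider yields (a dict-like object) is passed to the Item Pipeline one at a time, and the enabled pipeline components process it in the order given by their priority in the ITEM_PIPELINES setting.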

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json

from com.spider.scrapy.muchong.muchong.items import MuchongItem, DetailsItem


class MuchongPipeline:
    def __init__(self):
        self.info = open('abstract.json', 'w', encoding='utf-8')
        self.detail = open('detail.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False)

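        # write one JSON object per line (JSON Lines) so every line parses on its own;
        # a trailing comma after each object would make the file invalid JSON
        if isinstance(item, MuchongItem):
            self.info.write(content + '\n')
        elif isinstance(item, DetailsItem):
            self.detail.write(content + '\n')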

        return item

    def close_spider(self, spider):
        self.info.close()
        self.detail.close()
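
A pipeline only runs if it is enabled in settings.py, as the template comment above points out. A minimal fragment might look like the following. The dotted path is an assumption: it presumes the project package is named muchong, so adjust it to match your own layout.

```python
# settings.py (fragment)
# Assumption: the project package is named "muchong", so the pipeline class
# lives at muchong.pipelines.MuchongPipeline. Adjust the path to your layout.
ITEM_PIPELINES = {
    # the integer sets the order: pipelines with lower values run first
    # (values are customarily chosen in the 0-1000 range)
    "muchong.pipelines.MuchongPipeline": 300,
}
```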

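Debugging and running the spider: run.py

Create a run.py file in the same directory as settings.py; executing this file lets you debug and run the spider. The code is as follows: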

# -*- coding: utf-8 -*-

# -------------------------------------------------------------------------------
# Name:         run
# Description:
# Author:       PANG
# Date:         2022/1/19
# -------------------------------------------------------------------------------
from scrapy import cmdline


name = 'muchongre'  # spider name; must match the name attribute of MuchongSpider
cmd = 'scrapy crawl {0}'.format(name)
cmdline.execute(cmd.split())

Starting the spider

From the project root, run the following command to start the spider.

scrapy crawl muchongre

The result is as follows:
[Figure: execution result]

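Open issues

The scraped data is not filtered.
Storing the data in a database would be more appropriate.
The crawl speed also still needs to be improved.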
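
As a rough idea for the first two points, the sketch below stores items in SQLite using only the standard library and uses the link as primary key so that repeated posts are ignored. The table name posts, its columns and the sample item are assumptions made for illustration; this is not part of the project above, and a real setup would call something like save_post from a pipeline's process_item.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "link TEXT PRIMARY KEY, subject TEXT, title TEXT, author TEXT, date TEXT)"
    )
    return conn


def save_post(conn, item):
    """Insert one item; a repeated link is ignored, which filters duplicates."""
    conn.execute(
        "INSERT OR IGNORE INTO posts (link, subject, title, author, date) "
        "VALUES (:link, :subject, :title, :author, :date)",
        item,
    )
    conn.commit()


# Hypothetical item with the same fields as MuchongItem
post = {
    "link": "http://www.muchong.com/t-1-1",
    "subject": "recruit",
    "title": "Postdoc",
    "author": "alice",
    "date": "2022-01-19",
}

conn = open_store()
save_post(conn, post)
save_post(conn, post)  # same link again: silently skipped
count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(count)
```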

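Interview question: how do you store one copy of the scraped data locally and another in a database?
a. Each pipeline class in the pipelines file is responsible for storing the data on one platform.
b. An item yielded by the spider is received only by the first pipeline class to run.
c. The return item in process_item passes the item to the next pipeline class to run. In other words, every pipeline must return the item for later pipelines to receive it, so keep this in mind when coding.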
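
Points a to c can be illustrated with a small simulation in plain Python. It is only a sketch: run_pipelines imitates, in simplified form, how items are handed from one pipeline to the next, and JsonLinesPipeline and MemoryDbPipeline are made-up classes, with a list standing in for the database. In a real project both classes would live in pipelines.py and be registered in ITEM_PIPELINES with different priorities.

```python
import io
import json


class JsonLinesPipeline:
    """Writes each item as one JSON line to a file-like object (the local copy)."""

    def __init__(self, stream):
        self.stream = stream

    def process_item(self, item, spider):
        self.stream.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline


class MemoryDbPipeline:
    """Stands in for a database pipeline; rows are kept in a list."""

    def __init__(self):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append(dict(item))
        return item


def run_pipelines(item, pipelines_with_priority, spider=None):
    """Call the pipelines in ascending priority, chaining their return values."""
    for _, pipeline in sorted(pipelines_with_priority, key=lambda pair: pair[0]):
        item = pipeline.process_item(item, spider)
    return item


local_file = io.StringIO()
file_pipeline = JsonLinesPipeline(local_file)
db_pipeline = MemoryDbPipeline()

# Registration order does not matter; the priority numbers decide who runs first
run_pipelines({"title": "Postdoc"}, [(301, db_pipeline), (300, file_pipeline)])

print(local_file.getvalue(), db_pipeline.rows)
```

If the first pipeline forgot its return item, the second one would receive None instead of the item, which is exactly the pitfall point c warns about.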