Crawling a Simple Website (Tencent Recruitment, Part 1) - CFANZ Programming Community

Crawling a simple website

Create the project

scrapy startproject Tencent    # project name

**Project structure**

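```

.
├── Tencent                 project code
│   ├── __init__.py
│   ├── items.py            item file (models the data to scrape)
│   ├── middlewares.py      middleware file (custom functionality)
│   ├── pipelines.py        pipeline file (post-processing of scraped data)
│   ├── __pycache__
│   ├── settings.py         Scrapy project settings
│   └── spiders             holds the spider files you create
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg              project configuration (remote deployment settings)
```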

Create the spider

cd Tencent    # enter the project directory

scrapy genspider tencent tencent.com    # spider name, then the allowed domain (links to other sites are not followed)
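
To make the domain filter more concrete, here is a minimal sketch of the idea: a URL passes if its host is the allowed domain or one of its subdomains. The helper `is_allowed` is a hypothetical name used for illustration only, and this is not Scrapy's own implementation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


allowed = ['tencent.com']
print(is_allowed('https://hr.tencent.com/position.php', allowed))  # True
print(is_allowed('https://example.com/jobs', allowed))             # False
print(is_allowed('https://nottencent.com/', allowed))              # False
```

Matching on `'.' + d` rather than on the bare suffix keeps a look-alike host such as `nottencent.com` from slipping through.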

items: the project's item file (used for modeling), which declares the data to scrape

# -*- coding: utf-8 -*-
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Job title
    title = scrapy.Field()

    # Link to the job detail page
    link = scrapy.Field()

    # Job category
    category = scrapy.Field()

    # Number of openings
    num = scrapy.Field()

    # Location
    address = scrapy.Field()

    # Publication date
    pub_data = scrapy.Field()

spiders: where the spider files are stored

# -*- coding: utf-8 -*-
import scrapy

from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

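    def parse(self, response):

        # Select every data row (rows carry the class "even" or "odd")
        note_list = response.xpath('//*[@class="even"]|//*[@class="odd"]')

        # Sanity check
        # print(len(note_list))

        # Iterate over the rows and extract each field
        for note in note_list:

            # Create a new item for this row
            item = TencentItem()

            # extract_first() returns the first match, or None when nothing matches,
            # so an empty cell does not raise IndexError the way extract()[0] would
            item['title'] = note.xpath('./td[1]/a/text()').extract_first()

            # Only build the link when an href was found; str + None would raise TypeError
            href = note.xpath('./td[1]/a/@href').extract_first()
            item['link'] = 'https://hr.tencent.com/' + href if href else None

            item['category'] = note.xpath('./td[2]/text()').extract_first()

            item['num'] = note.xpath('./td[3]/text()').extract_first()

            # Assumes the table columns follow the order of the item fields:
            # td[4] is the location and td[5] the publication date
            item['address'] = note.xpath('./td[4]/text()').extract_first()

            item['pub_data'] = note.xpath('./td[5]/text()').extract_first()

            # print(item)

            yield item

        # href of the "next page" link; None when the element is missing
        next_link = response.xpath('//*[@id="next"]/@href').extract_first()

        if next_link is not None:

            next_link = 'https://hr.tencent.com/' + next_link

            # Request the next page; the callback names the method that handles its response
            yield scrapy.Request(url=next_link, callback=self.parse)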
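
The pagination step builds an absolute URL from the `href` of the "next" link. As a rough sketch of that step outside Scrapy, the standard library's `urljoin` can resolve the relative link against the page URL. The helper name `next_page_url` and the sample `href` values are made up for illustration, and the `javascript:` check assumes the site might use such a placeholder on the last page. Inside a spider, `response.urljoin(href)` plays the same role.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a 'next page' href against the current page URL, or return None."""
    # No link at all, or a placeholder that is not a real URL
    if not href or href.startswith('javascript:'):
        return None
    return urljoin(page_url, href)


page = 'https://hr.tencent.com/position.php'
print(next_page_url(page, 'position.php?start=10'))  # https://hr.tencent.com/position.php?start=10
print(next_page_url(page, None))                      # None
print(next_page_url(page, 'javascript:;'))            # None
```

Returning `None` for an unusable link lets the caller skip the request with a single `if` check instead of sending a request to a malformed URL.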

pipelines: the project's pipeline file (for post-processing scraped data, such as saving it)

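import json

from Tencent.items import TencentItem

class TencentPipeline(object):

    def open_spider(self, spider):  # called once when the spider starts

        # utf-8 so the non-ASCII text kept by ensure_ascii=False can be written
        self.file = open('tencent11.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):

        if isinstance(item, TencentItem):

            data = dict(item)  # convert the item to a plain dict

            # Serialize the dict to a JSON string; each record goes on its own
            # line, followed by a comma
            str_data = json.dumps(data, ensure_ascii=False) + ',\n'

            self.file.write(str_data)

        return item

    def close_spider(self, spider):  # called once when the spider closes

        self.file.close()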
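
Because every record is written on its own line with a trailing comma, the output file as a whole is not a single valid JSON document. One way to read it back, sketched here with a hypothetical helper `load_records` and made-up sample rows, is to parse it line by line and strip the comma first.

```python
import json
import os
import tempfile


def load_records(path):
    """Parse a file holding one JSON object per line, each followed by a comma."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().rstrip(',')
            if line:
                records.append(json.loads(line))
    return records


# Build a small sample file in the same shape the pipeline writes (invented data)
sample = [{'title': '后台开发工程师', 'num': '2'}, {'title': '产品经理', 'num': '1'}]
path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with open(path, 'w', encoding='utf-8') as f:
    for row in sample:
        f.write(json.dumps(row, ensure_ascii=False) + ',\n')

print(load_records(path) == sample)  # True
```

Dropping the trailing comma in the pipeline would turn the file into plain JSON Lines, which many tools can read directly.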

**settings**: the Scrapy project's settings file

ITEM_PIPELINES = {
   'Tencent.pipelines.TencentPipeline': 300,
}
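
The number next to each pipeline is its priority: items pass through the enabled pipelines from the lowest value to the highest, and the values are conventionally picked from the 0 to 1000 range. The sketch below only illustrates that ordering. `DedupPipeline` is a hypothetical second pipeline that does not exist in this project, and the sorting is not how Scrapy is implemented internally.

```python
# Hypothetical settings: DedupPipeline is invented for this illustration
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.DedupPipeline': 100,
}

# Lower numbers run first
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
# ['Tencent.pipelines.DedupPipeline', 'Tencent.pipelines.TencentPipeline']
```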