Creating Your First Scrapy Spider, Step by Step - CFANZ编程社区

1. Install Scrapy

2. Open a command prompt (CMD) and run: scrapy. It prints:

Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

If you see the output above, Scrapy is installed correctly.
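The same check can also be made from Python. The following is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); the helper name installed_version is made up for this example and is not part of Scrapy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether Scrapy is available in the current environment.
version = installed_version("Scrapy")
if version is None:
    print("Scrapy is not installed in this environment")
else:
    print(f"Scrapy {version} is installed")
```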

3. Create a folder to hold the spiders: d:\crapy

4. Change into d:\crapy

d:\crapy>

5. Create the crawler project: scrapy startproject cnblog

New Scrapy project 'cnblog', using template directory 'd:\python\python37\lib\site-packages\scrapy\templates\project', created in:
    D:\crapy\cnblog

You can start your first spider with:
    cd cnblog
    scrapy genspider example example.com

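The message above confirms that a crawler project named cnblog was created, and shows the template it was built from and where it was placed: a folder with the same name as the project, created under the current directory. To start crawling, change into the new folder (cnblog) and create a spider there.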
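To confirm that the project folder was generated as expected, one could check for the files that startproject normally creates. This is a rough sketch: the file list reflects the default project template in Scrapy 1.8, and both helper functions are hypothetical names introduced for this example.

```python
from pathlib import Path


def expected_project_files(project_name):
    """Relative paths that the default project template is expected to create."""
    pkg = Path(project_name)
    return [
        Path("scrapy.cfg"),
        pkg / "__init__.py",
        pkg / "items.py",
        pkg / "middlewares.py",
        pkg / "pipelines.py",
        pkg / "settings.py",
        pkg / "spiders" / "__init__.py",
    ]


def missing_project_files(project_dir, project_name):
    """Return the expected files that do not exist under project_dir."""
    root = Path(project_dir)
    return [p for p in expected_project_files(project_name)
            if not (root / p).is_file()]
```

For the project in this walkthrough, missing_project_files(r"D:\crapy\cnblog", "cnblog") should return an empty list if the project was created with the default template.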

6. Create the first spider

D:\crapy>cd cnblog

D:\crapy\cnblog>scrapy genspider cnblog cnblogs.com   # Naming the spider cnblog fails: a spider cannot have the same name as the current project

Cannot create a spider with the same name as your project

D:\crapy\cnblog>scrapy genspider cnbloga cnblogs.com

Created spider 'cnbloga' using template 'basic' in module
  cnblog.spiders.cnbloga

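# This creates the first spider, named "cnbloga", with the domain "cnblogs.com". Only pages inside that domain are crawled, which limits the scope of the crawl. The template applied is "basic", the default.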
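The domain restriction can be pictured with a small stand-alone function. This is only a simplified illustration of the rule (a host is in scope if it equals an allowed domain or is a subdomain of one), not Scrapy's own code; Scrapy applies the restriction itself, and the name is_allowed is invented for this example.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain)
               for domain in allowed_domains)


allowed = ["cnblogs.com"]
print(is_allowed("https://www.cnblogs.com/some/post", allowed))  # subdomain: in scope
print(is_allowed("https://example.com/", allowed))               # other site: out of scope
```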

7. Open the generated spider file: d:\crapy\cnblog\cnblog\spiders\cnbloga.py

# -*- coding: utf-8 -*-
import scrapy


class CnblogaSpider(scrapy.Spider):
    name = 'cnbloga'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        pass

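The file imports scrapy and then declares a class, CnblogaSpider, that inherits from scrapy.Spider. The spider's name is "cnbloga"; the crawl scope is 'cnblogs.com'; start_urls lists where crawling begins, which genspider fills in as http://cnblogs.com/.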

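parse is the default callback: each time the response for a requested URL has been downloaded, it is handed to this method for processing.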

8. Run the spider:

d:\crapy\cnblog>scrapy crawl cnbloga   # 'cnbloga' is the name of the spider to run
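The crawl command can also be launched from a Python script, which is convenient for scheduled runs. The sketch below assumes Scrapy is installed for the same interpreter that runs the script; build_crawl_command and run_spider are hypothetical helper names, and the optional -o argument asks Scrapy to export scraped items to a file.

```python
import subprocess
import sys


def build_crawl_command(spider_name, output_file=None):
    """Build the argument list for running 'scrapy crawl <spider_name>'."""
    cmd = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    if output_file is not None:
        cmd += ["-o", output_file]  # export scraped items to this file
    return cmd


def run_spider(project_dir, spider_name, output_file=None):
    """Run the spider from inside the project directory and return the exit code."""
    result = subprocess.run(build_crawl_command(spider_name, output_file),
                            cwd=project_dir)
    return result.returncode
```

For this walkthrough the call would be run_spider(r"d:\crapy\cnblog", "cnbloga"); the working directory matters because scrapy crawl only works inside a project folder.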