Python notes ---> Scrapy (notes from today's study) - CFANZ programming community

Personal Scrapy notes

Create a project:

scrapy startproject xxx

Create a spider:

scrapy genspider test01 xxx.com

On success, the output begins with: Created spider 'test01'

scrapy_spider

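Files generated in a Scrapy project:

items: declares in advance the fields you expect to scrape. Optional.

middlewares: defines middlewares, which can handle things such as an IP (proxy) pool, a User-Agent pool, or Selenium.

pipelines: the pipeline file, used to save data, for example to a database, a CSV file, or a spreadsheet.

settings: configuration values; pipelines and middlewares are enabled here.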

# Configuration values; pipelines and middlewares are enabled here
BOT_NAME = 'Text01'

SPIDER_MODULES = ['Text01.spiders']
NEWSPIDER_MODULE = 'Text01.spiders'

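The robots protocol (robots.txt) tells search engines and other crawlers which pages may be crawled and which may not. A newly generated project obeys it by default (ROBOTSTXT_OBEY = True in settings). Scrapers often switch it off, because obeying it can leave little to crawl.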
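To make "obeying robots.txt" concrete, here is a minimal sketch using the standard library's urllib.robotparser rather than Scrapy itself. The rules and the example.com URLs are invented for illustration. With ROBOTSTXT_OBEY = True, Scrapy performs an equivalent check before downloading a page.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content (an assumption for this example)
rules = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse the rules from memory instead of fetching them

# can_fetch(user_agent, url) answers: may this crawler request this URL?
allowed = parser.can_fetch("*", "https://example.com/teacher")
blocked = parser.can_fetch("*", "https://example.com/private/data")

print(allowed, blocked)
```

A crawler that obeys the protocol skips every URL for which can_fetch returns False.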

# Enable the pipeline
ITEM_PIPELINES = {
    'Text01.pipelines.Text01Pipeline': 300,
}

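scrapy_run

A Scrapy spider is not run by right-clicking the file. It is run through the framework, by typing a command in the console (scrapy crawl test01 for the spider named test01).

Code that is executed: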

def parse(self, response):
    """Extract data from the response handed over by the downloader.

    response.xpath(...) returns a SelectorList (a list-like object).
    .extract() / .getall() returns a list of strings.
    .extract_first() / .get() returns the first string, or None if nothing matched.
    """
    # Extract with XPath: use get() for a single value or getall() for every match
    name = response.xpath(r"/html/body/div[1]/section[7]/div/div[2]/div[*]/div/div[1]/h3/a/text()").getall()
    print(name)
    # Alternative: extract with re
    # name = response.xpath(r"/html/body/div[1]/section[7]/div/div[2]/div[*]/div/div[1]/h3/a/text()")
    # name2 = str(name)
    #
    # a = re.findall(r"' data='(.*?)'", name2)
    # print(a)  # print what was extracted
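The difference between get (first match, or None) and getall (every match as a list) can be tried without running a crawl. The sketch below is a stand-in built on the standard library's xml.etree.ElementTree, not on Scrapy's selectors. The helper names get and getall and the sample markup are assumptions chosen to mirror the Scrapy methods.

```python
import xml.etree.ElementTree as ET

# Tiny made-up page fragment (an assumption for this example)
page = ET.fromstring(
    "<div>"
    "<h3><a>Teacher A</a></h3>"
    "<h3><a>Teacher B</a></h3>"
    "</div>"
)


def getall(root, path):
    """Text of every matching node, similar in spirit to SelectorList.getall()."""
    return [node.text for node in root.findall(path)]


def get(root, path):
    """Text of the first matching node, or None, similar in spirit to SelectorList.get()."""
    node = root.find(path)
    return node.text if node is not None else None


all_names = getall(page, "./h3/a")   # every match, as a list
first_name = get(page, "./h3/a")     # only the first match
missing = get(page, "./h4/a")        # no match gives None, not an exception

print(all_names, first_name, missing)
```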

Incrementing

a = 0                    # module-level (global) variable
for i in range(10):
    a += 1               # same as a = a + 1; runs 10 times
    print(a)

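Enabling pipelines:

Pipelines are enabled in settings.

# Hand the extracted data to the pipeline
# print(names)
# Every loop iteration sends one item to the pipeline, because of yield item
for name in names:
    item = {}
    item['name'] = name
    print(name)
    # yield item: the pipeline receives the item; yield pauses parse() here,
    # and the loop resumes for the next name
    yield item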
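The pausing behaviour of yield can be observed in plain Python. This is a sketch with stand-in functions: parse and process_item below are simplified imitations, not Scrapy's real call signatures, and the names are sample data.

```python
def parse(names):
    """Stand-in for a spider's parse(): yields one dict per name."""
    for name in names:
        print("parse: yielding", name)
        yield {"name": name}


def process_item(item):
    """Stand-in for a pipeline step that receives each yielded item."""
    print("pipeline: got", item["name"])
    return item


gen = parse(["Alice", "Bob"])        # nothing runs yet; this only creates the generator
first = process_item(next(gen))      # runs parse() up to the first yield, then pauses
rest = [process_item(item) for item in gen]  # resumes the loop for the remaining names
```

The printed lines alternate between "parse" and "pipeline", which shows that each item is handed over before the next one is produced.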

# Enable the pipeline
ITEM_PIPELINES = {
   'Text01.pipelines.Text01Pipeline': 300,
}

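Enabling multiple pipelines

# Several tasks can run on each item: write several pipelines and add each one to ITEM_PIPELINES in settings
# This is the first task
class Text01Pipeline:
    def process_item(self, item, spider):
        # 'a' appends; 'w' writes and overwrites existing data; 'r' is read-only
        with open('text.txt', 'a', encoding='utf-8') as f:
            f.write(str(item) + '\n')
        return item


# This is the second task; enable one more pipeline in settings for it
class Text01Pipeline2:
    def process_item(self, item, spider):
        # 'a' appends; 'w' writes and overwrites existing data; 'r' is read-only
        with open('text.txt', 'a', encoding='utf-8') as f:
            f.write(str(item) + '\n')
        return item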
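The pipelines above reopen text.txt for every item. A common variation is to open the file once per crawl with the open_spider and close_spider hooks that Scrapy calls on a pipeline. The sketch below assumes a hypothetical class name, a hypothetical file name, and a JSON-lines output format, and it drives the pipeline by hand so it can run without a crawl.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes each item as one JSON line, keeping the file open for the whole crawl."""

    def __init__(self, path="items.jl"):  # hypothetical default file name
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()


# Drive the pipeline by hand, in the order Scrapy would call the hooks
out_path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "Alice"}, spider=None)
pipeline.process_item({"name": "Bob"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
print(lines)
```

Writing JSON instead of str(item) makes the file easy to load again later.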

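Using items (declaring fields in advance)

An item works like a plan of what you intend to collect.

Declare the fields you need in items
Import the item class in the spider file
Instantiate it

The item class may declare more fields than the spider fills in, but not fewer.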

On line 175 (the line in the spider code that instantiates the item),

write item = Text01Item()

and not item = Text01Item, which raises

TypeError: 'ItemMeta' object does not support item assignment
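The error comes from assigning into the class itself instead of an instance. It can be reproduced without Scrapy. In this sketch, ItemMeta and Text01Item are small stand-ins, not Scrapy's real classes. The metaclass is given the same name only so that the message matches the one quoted above.

```python
class ItemMeta(type):
    """Stand-in metaclass; Scrapy's item classes have a metaclass of this name."""


class Text01Item(dict, metaclass=ItemMeta):
    """Stand-in item class that behaves like a dict."""


# Wrong: no parentheses, so `item` is the class, not an instance
item = Text01Item
try:
    item["name"] = "Alice"
    message = None
except TypeError as exc:
    message = str(exc)
print(message)

# Right: parentheses create an instance, which supports item assignment
item = Text01Item()
item["name"] = "Alice"
print(item)
```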

import re  # only needed for the commented-out regex variant below

import scrapy
# scrapy startproject xxx              creates the project
# scrapy genspider test01 xxx.com      creates a spider

# Import the item class from items
# from Text01.items import Text01Item
from ..items import Text01Item

class Test01Spider(scrapy.Spider):
    name = 'test01'
    # Domains the spider may crawl; requests to URLs outside them are filtered out
    allowed_domains = ['sixstaredu.com']
    # URLs to start from
    start_urls = ['https://www.sixstaredu.com/teacher']

    def parse(self, response):
        """Extract data from the response handed over by the downloader.

        response.xpath(...) returns a SelectorList (a list-like object).
        .extract() / .getall() returns a list of strings.
        .extract_first() / .get() returns the first string, or None if nothing matched.
        """

        # Alternative: extract with re
        # name = response.xpath(r"/html/body/div[1]/section[7]/div/div[2]/div[*]/div/div[1]/h3/a/text()")
        # name2 = str(name)
        #
        # a = re.findall(r"' data='(.*?)'", name2)
        # print(a)  # print what was extracted
        # Extract with XPath: use get() for a single value or getall() for every match
        position = response.xpath('//*[@id="content-container"]/div/div[*]/div/div[1]/div/text()').getall()
        names = response.xpath(r'//*[@id="content-container"]/div/div[*]/div/div[1]/h3/a/text()').getall()
        # Hand the extracted data to the pipeline
        # Every loop iteration sends one item to the pipeline, because of yield item
        # for name in names:
        for i in range(len(names)):
            item = Text01Item()   # instantiate the item (line 175)
            item['name'] = names[i]
            item['position'] = position[i].strip()
            print(item)
            # yield item: the pipeline receives the item; yield pauses parse() here,
            # and the loop resumes for the next index
            yield item
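The loop in parse pairs names[i] with position[i] by index. An alternative is zip, which walks both lists together and stops at the shorter one, so a missing position cannot raise IndexError. A minimal sketch with made-up sample values in place of the XPath results:

```python
# Sample values standing in for the two getall() results (assumptions)
names = ["Alice", "Bob", "Carol"]
position = ["  Lecturer \n", " Professor "]   # one entry short on purpose

items = []
for name, pos in zip(names, position):
    # strip() removes the surrounding whitespace and newlines
    items.append({"name": name, "position": pos.strip()})

print(items)
```

Whether silently dropping the unmatched name is acceptable depends on the page. Comparing len(names) with len(position) first makes a mismatch visible.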

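A KeyError is raised when the spider assigns a field that the item class does not declare.

Here it happens because position is not enabled; the fix is to uncomment position = scrapy.Field()

# Items: declare in advance the fields you expect to scrape. Optional.
import scrapy


class Text01Item(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    # position = scrapy.Field()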
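The "more is fine, fewer is not" rule can be imitated in a few lines. StrictItem below is a hypothetical stand-in that only mimics the field check; it is not Scrapy's implementation, and the wording of the error message is an assumption.

```python
class StrictItem(dict):
    """Stand-in for an item class: only declared fields may be assigned."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class Text01Item(StrictItem):
    fields = ("name",)                 # position is not declared


class WideItem(StrictItem):
    fields = ("name", "position")      # declares more than is filled in below


item = Text01Item()
item["name"] = "Alice"
try:
    item["position"] = "Lecturer"      # undeclared field
    error = None
except KeyError as exc:
    error = exc.args[0]
print(error)

wide = WideItem()
wide["name"] = "Alice"                 # leaving position unset is fine
print(wide)
```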

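Enabling multiple pipelines

After writing a new pipeline, don't forget to enable it in settings.

# Just add one more entry: Text01.pipelines.xx (the pipeline you created)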
ITEM_PIPELINES = {
   'Text01.pipelines.Text01Pipeline': 300,
   'Text01.pipelines.MySqlPipline': 300,
}
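The number attached to each pipeline is its priority: Scrapy passes every item through the pipelines in ascending order of that value, and each process_item must return the item for the next one to receive it. The sketch below imitates this with a hypothetical run_pipelines helper and classes as dictionary keys. In real settings the keys are import-path strings, as shown above.

```python
class FirstPipeline:
    def process_item(self, item, spider):
        item.setdefault("trail", []).append("first")
        return item  # without this, the next pipeline would receive None


class SecondPipeline:
    def process_item(self, item, spider):
        item.setdefault("trail", []).append("second")
        return item


# Listed out of order on purpose; the numbers decide the order, not the position
ITEM_PIPELINES = {SecondPipeline: 400, FirstPipeline: 300}


def run_pipelines(item, pipelines):
    """Hypothetical helper: apply pipelines from the lowest number to the highest."""
    for cls in sorted(pipelines, key=pipelines.get):
        item = cls().process_item(item, spider=None)
    return item


result = run_pipelines({"name": "Alice"}, ITEM_PIPELINES)
print(result)
```

Giving each pipeline a distinct number makes the order explicit.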

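When connecting to the database, check the values you supply (host, port, user, password, database name); wrong values cause errors.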

import pymysql


class Text01Pipeline:
    """Save items to a file."""
    def process_item(self, item, spider):
        # 'a' appends; 'w' writes and overwrites existing data; 'r' is read-only
        with open('text.txt', 'a', encoding='utf-8') as f:
            f.write(str(item) + '\n')
        # print(item)
        return item


class MySqlPipline(object):
    """Save items to MySQL."""
    def process_item(self, item, spider):
        # Connection settings
        link = {
            'host': 'localhost',          # host
            'port': 3306,                 # port
            'user': 'root',               # your account
            'password': 'chen20010911',   # your password
            'database': 'text'            # database to use
        }
        # 1. Open the connection
        con = pymysql.connect(**link)   # ** unpacks the dict into keyword arguments
        # 2. Create a cursor; all queries and changes go through it
        cursor = con.cursor()
        # 3. Execute the SQL; the %s placeholders let the driver escape the values
        cursor.execute('INSERT INTO scrapy VALUES (%s, %s)', (item['name'], item['position']))
        # 4. Commit
        con.commit()
        # 5. Close the cursor, then the connection
        cursor.close()
        con.close()
        # Return the item so that later pipelines receive it
        return item
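The same insert-and-commit flow can be tried without a MySQL server. This sketch swaps in the standard library's sqlite3 with an in-memory database, so the placeholder is ? rather than pymysql's %s. The class name SqlitePipeline and the table layout are assumptions. It also opens the connection once per crawl instead of once per item.

```python
import sqlite3


class SqlitePipeline:
    """Stand-in for the MySQL pipeline, using sqlite3 so it runs anywhere."""

    def open_spider(self, spider):
        # Connect once when the spider starts
        self.con = sqlite3.connect(":memory:")
        self.con.execute("CREATE TABLE scrapy (name TEXT, position TEXT)")

    def process_item(self, item, spider):
        # Placeholders keep quotes in the data from breaking the SQL
        self.con.execute(
            "INSERT INTO scrapy VALUES (?, ?)",
            (item["name"], item["position"]),
        )
        self.con.commit()
        return item

    def close_spider(self, spider):
        # Close once when the spider finishes
        self.con.close()


pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"name": 'O"Brien', "position": "Lecturer"}, spider=None)
rows = pipeline.con.execute("SELECT name, position FROM scrapy").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

The name containing a double quote is stored intact, which is the case that string formatting with "%s" would break.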

If MySQL is not running

Open This PC
Click Services
Find MySQL and click Start

(End)