Overall crawler workflow
Hands-on implementation
- 58.com (58同城) scraping workflow
- Open the Chengdu residential-community listing page (https://cd.58.com/xiaoqu/) and decide what to scrape
- Inspect the page and collect the link of each administrative district (see the sketch after the next list for doing this programmatically)
- Crawl the URL of every community, district by district
- Visit each community's detail page and scrape its name, reference price, address, year built and other details
- Scrape the prices on the first page of the community's second-hand listings and compute the community's average price in a pipeline
- Scrape the listing URLs on the first page of the community's rental listings, then open each detail page and scrape the title, rent, layout and other details
The data to be scraped:
- the URL of every community in each district's community list
- the community's name, reference price, address, year built and other details from its detail page
- the per-square-meter prices on each community's second-hand listings page, from which the community's actual average price is computed in a pipeline
- the URL of every rental listing on each community's rental page
- the title, rent, layout and other details from each rental listing's detail page
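Collecting the district links by hand is quickest, but they can also be pulled out programmatically. The sketch below is not part of the original project: it fetches the landing page with requests and prints every link pointing under /xiaoqu/; the selector is an assumption and should be checked against the live page, since 58.com's markup changes. The spider further down simply hard-codes the district codes (103-117 plus 21611) gathered this way.

# Hypothetical helper: list the /xiaoqu/ links found on the Chengdu landing page.
import requests
from pyquery import PyQuery

r = requests.get('https://cd.58.com/xiaoqu/')
jpy = PyQuery(r.text)
# assumed selector: every anchor whose href contains /xiaoqu/ (district list pages look like /xiaoqu/<code>/)
links = {a.attr('href') for a in jpy('a[href*="/xiaoqu/"]').items() if a.attr('href')}
for link in sorted(links):
    print(link)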
- Module diagram of the 58.com crawler code
- Code implementation
- Create a Scrapy project named city_58 and generate the spider file spider_city_58 (the resulting project layout is sketched after the commands):
>>> scrapy startproject city_58
>>> cd city_58
>>> scrapy genspider spider_city_58 58.com
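For orientation, once the utils package described below has been added, the project layout looks roughly like this (a sketch; only the files discussed in this post are shown, and an empty __init__.py is added under utils so the relative import in the spider resolves):

city_58/
├── scrapy.cfg
└── city_58/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── utils/
    │   ├── __init__.py
    │   └── parse.py
    └── spiders/
        └── spider_city_58.py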
- Write items.py, which defines the fields of the items the crawler produces (a short usage sketch follows the class definitions):
import scrapy


class City58Item(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    price = scrapy.Field()
    last_updated = scrapy.Field()


class City58ItemXiaoQu(scrapy.Item):
    # residential community (小区) details
    id = scrapy.Field()
    name = scrapy.Field()
    reference_price = scrapy.Field()
    address = scrapy.Field()
    times = scrapy.Field()


class City58ItemXiaoChuZuQuInfo(scrapy.Item):
    # rental listing (出租房) details
    id = scrapy.Field()
    name = scrapy.Field()
    zu_price = scrapy.Field()          # rent
    type = scrapy.Field()              # layout
    mianji = scrapy.Field()            # floor area
    chuzu_price_pre = scrapy.Field()   # rent per square meter
    url = scrapy.Field()
    price_pre = scrapy.Field()         # community's average second-hand price per square meter
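scrapy.Item behaves like a dict, which is exactly how the spider below fills these items: item.update(parsed_dict) for the bulk of the fields plus plain key assignment for the rest. A minimal illustration (the sample values are made up; run it from the project root so city_58 is importable):

from city_58.items import City58ItemXiaoQu

# a dict shaped like what utils/parse.py returns for a community detail page
data = {'name': '神仙树大院', 'reference_price': '12000', 'address': '紫荆西路6号', 'times': '2004'}

item = City58ItemXiaoQu()
item.update(data)                  # bulk-fill the declared fields from a dict
item['id'] = 'shenxianshudayuan'   # individual assignment also works
print(dict(item))                  # items convert cleanly to plain dicts for storage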
- Under the city_58 package, create a utils folder (add an empty __init__.py) and, inside it, a parse.py file. parse.py holds five functions that respectively extract: all community URLs on a list page, the community detail-page information, the second-hand price information, the rental list page's detail URLs, and the rental detail-page information (an offline testing sketch follows the code):
# coding: utf-8
from pyquery import PyQuery


def parse(response):
    """
    Parse a community list page, e.g. http://cd.58.com/xiaoqu/11487/
    Return the set of community URLs found on the page.
    :param response:
    :return:
    """
    jpy = PyQuery(response.text)
    tr_list = jpy('#infolist > div.listwrap > table > tbody > tr').items()
    result = set()  # a set, so duplicate URLs are discarded
    for tr in tr_list:
        url = tr('td.info > ul > li.tli1 > a').attr('href')  # link to the community's detail page
        result.add(url)
    return result


def xiaoqu_parse(response):
    """
    Parse a community detail page, e.g. http://cd.58.com/xiaoqu/shenxianshudayuan/
    Return a dict with the community's name, reference price, address and construction year.
    :param response:
    :return:
    """
    result = dict()
    jpy = PyQuery(response.text)
    result['name'] = jpy('body > div.bodyItem.bheader > div > h1 > span').text()
    result['reference_price'] = jpy('body > div.bodyItem.bheader > div > dl > dd:nth-child(1) > span.moneyColor').text()
    # the address looks like "紫荆西路6号 查看地图"; strip the trailing "查看地图" (view map) link text
    result['address'] = jpy('body > div.bodyItem.bheader > div > dl > dd:nth-child(3) > span.ddinfo')\
        .text().replace('查看地图', '')
    result['times'] = jpy('body > div.bodyItem.bheader > div > dl > dd:nth-child(5)').text().split()
    result['times'] = result['times'][2]  # the third token is the construction year
    return result


def get_ershou_price_list(response):
    """
    Parse a second-hand listings page, e.g. http://cd.58.com/xiaoqu/shenxianshudayuan/ershoufang/
    Return a list of all prices found on the page.
    :param response:
    :return:
    """
    jpy = PyQuery(response.text)
    price_tag = jpy('td.tc > span:nth-child(3)').text().split()
    price_tag = [i[:-3] for i in price_tag]  # drop the last three characters (the price unit), keeping the number
    return price_tag


def chuzu_list_pag_get_detail_url(response):
    """
    Parse a rental listings page, e.g. http://cd.58.com/xiaoqu/shenxianshudayuan/chuzu/
    Return a list of all rental detail-page URLs.
    :param response:
    :return:
    """
    jpy = PyQuery(response.text)
    a_list = jpy('tr > td.t > a.t').items()
    url_list = [a.attr('href') for a in a_list]  # the href of every listing link
    return url_list


def get_chuzu_house_info(response):
    """
    Parse a rental detail page.
    Return a dict with the listing title, rent, floor area and layout (rooms/halls).
    :param response:
    :return:
    """
    jpy = PyQuery(response.text)
    result = dict()
    result['name'] = jpy('body > div.main-wrap > div.house-title > h1').text()
    result['zu_price'] = jpy('body > div.main-wrap > div.house-basic-info > div.house-basic-right.fr > '
                             'div.house-basic-desc > div.house-desc-item.fl.c_333 > div > span.c_ff552e > b').text()
    result['type'] = jpy('body > div.main-wrap > div.house-basic-info > div.house-basic-right.fr > div.house-basic-desc'
                         ' > div.house-desc-item.fl.c_333 > ul > li:nth-child(2) > span:nth-child(2)').text()
    result['type'], result['mianji'], *_ = result['type'].split()  # first token is the layout, second the floor area
    return result


if __name__ == '__main__':
    # quick manual smoke test against a live rental detail page
    import requests
    r = requests.get('http://cd.58.com/zufang/31995551807162x.shtml')
    print(get_chuzu_house_info(r))
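Because every helper only reads response.text, the parsers can also be exercised offline against saved HTML by wrapping it in a scrapy HtmlResponse (the __main__ block above instead passes a requests response, which exposes .text as well). A minimal sketch, assuming a list page has been saved locally as xiaoqu_list.html:

# Hypothetical offline test: feed saved HTML to parse() through a fake response.
from scrapy.http import HtmlResponse, Request
from city_58.utils.parse import parse

with open('xiaoqu_list.html', 'rb') as f:   # assumed local snapshot of a community list page
    body = f.read()

response = HtmlResponse(url='http://cd.58.com/xiaoqu/11487/',
                        body=body,
                        encoding='utf-8',
                        request=Request('http://cd.58.com/xiaoqu/11487/'))
print(parse(response))   # the set of community URLs found in the snapshot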
- Write spider_city_58.py, which drives the crawl and the hops from page to page. Each later stage is requested with a lower priority value (community detail pages at 4, down to rental detail pages at 1), which nudges Scrapy to fetch a community's second-hand prices before its rental detail pages, so the average price is usually already available when a rental item reaches the storage pipeline. (A sketch of running the spider from a script follows the code.)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from traceback import format_exc

# parsing helpers from city_58/utils/parse.py
from ..utils.parse import (parse, xiaoqu_parse, get_ershou_price_list,
                           chuzu_list_pag_get_detail_url, get_chuzu_house_info)
# item classes from items.py
from ..items import City58ItemXiaoQu, City58ItemXiaoChuZuQuInfo


class SpiderCity58Spider(scrapy.Spider):
    name = 'spider_city_58'
    allowed_domains = ['58.com']
    host = 'cd.58.com'
    xianqu_url_format = 'http://{}/xiaoqu/{}/'
    # district codes observed on https://cd.58.com/xiaoqu/: 103-117 plus 21611
    xianqu_code = list(range(103, 118))
    xianqu_code.append(21611)

    def start_requests(self):
        # override start_requests to issue one list-page request per district
        start_urls = ['http://{}/xiaoqu/{}/'.format(self.host, code) for code in self.xianqu_code]
        for url in start_urls:
            yield Request(url)

    def parse(self, response):
        """
        Step 1: collect every community URL on a district list page,
        e.g. http://cd.58.com/xiaoqu/21611/
        """
        url_list = parse(response)  # utils.parse.parse() returns the set of community URLs
        for url in url_list:
            yield Request(url,
                          callback=self.xiaoqu_detail_pag,  # community detail page
                          errback=self.error_back,
                          priority=4)

    def xiaoqu_detail_pag(self, response):
        """
        Step 2: scrape the community detail page,
        e.g. http://cd.58.com/xiaoqu/shenxianshudayuan/
        """
        _ = self
        data = xiaoqu_parse(response)
        item = City58ItemXiaoQu()
        item.update(data)
        item['id'] = response.url.split('/')[4]  # the community slug in the URL
        yield item
        # second-hand listings of this community
        url = 'http://{}/xiaoqu/{}/ershoufang/'.format(self.host, item['id'])
        yield Request(url,
                      callback=self.ershoufang_list_pag,
                      errback=self.error_back,
                      meta={'id': item['id']},  # carry the community id along
                      priority=3)
        # rental listings of this community
        url_ = 'http://{}/xiaoqu/{}/chuzu/'.format(self.host, item['id'])
        yield Request(url_,
                      callback=self.chuzu_list_pag,
                      errback=self.error_back,
                      meta={'id': item['id']},
                      priority=2)

    def ershoufang_list_pag(self, response):
        """
        Step 3: scrape the prices on the second-hand listings page,
        e.g. http://cd.58.com/xiaoqu/shenxianshudayuan/ershoufang/
        """
        _ = self
        price_list = get_ershou_price_list(response)
        # a plain dict, consumed by HandleFangjiaPipline and never stored directly
        yield {'id': response.meta['id'], 'price_list': price_list}

    def chuzu_list_pag(self, response):
        """
        Step 4: collect the rental detail-page URLs,
        e.g. http://cd.58.com/xiaoqu/shenxianshudayuan/chuzu/
        """
        _ = self
        url_list = chuzu_list_pag_get_detail_url(response)
        for url in url_list:
            # reuse the incoming request (and its meta) with a new url and callback
            yield response.request.replace(url=url, callback=self.chuzu_detail_pag, priority=1)

    def chuzu_detail_pag(self, response):
        """
        Step 5: scrape the rental detail page.
        """
        _ = self
        data = get_chuzu_house_info(response)
        item = City58ItemXiaoChuZuQuInfo()
        item.update(data)
        item['id'] = response.meta['id']
        item['url'] = response.url
        yield item

    def error_back(self, e):
        _ = e
        self.logger.error(format_exc())  # log the traceback of failed requests
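Besides launching the crawl with scrapy crawl spider_city_58 from the project directory, the spider can be driven from a plain Python script, which is convenient for debugging in an IDE. A minimal sketch using Scrapy's CrawlerProcess (the file name run_city_58.py is just a suggestion):

# run_city_58.py -- place next to scrapy.cfg and run with: python run_city_58.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl('spider_city_58')                   # the spider's name attribute
process.start()                                   # blocks until the crawl finishes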
- Write middlewares.py (the module referenced as city_58.middlewares in settings.py), which attaches a proxy to each request so the crawler's IP does not get banned.
You can collect free proxy IPs online or buy higher-quality paid ones; a minimal in-memory proxy-pool sketch follows the code.
from scrapy import signals

# from .utils.proxy_swift import proxy_pool   # plug in your own proxy source here


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # attach a proxy server to the request; replace proxy_pool with your own proxy source
        request.meta['proxy'] = 'http://{}'.format(proxy_pool.pop())

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
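proxy_pool above is only a placeholder for whatever proxy source you actually use. Purely as an illustration, a tiny in-memory pool that cycles through a hard-coded list of addresses could look like the sketch below; the module path matches the commented import, but the ProxyPool class and the addresses are made up and must be replaced with real proxies:

# city_58/utils/proxy_swift.py -- hypothetical minimal proxy pool, for illustration only
from itertools import cycle


class ProxyPool(object):
    """Cycle endlessly over a fixed list of 'ip:port' proxy addresses."""

    def __init__(self, proxies):
        self._proxies = cycle(proxies)

    def pop(self):
        # same interface the middleware expects: return the next 'ip:port' string
        return next(self._proxies)


# placeholder addresses -- replace with proxies you control
proxy_pool = ProxyPool([
    '127.0.0.1:8888',
    '127.0.0.1:8889',
])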
- Write pipelines.py, which handles the data processing (per-square-meter rents and community average prices) and stores the results in MongoDB (a quick way to inspect the stored data follows the code):
from traceback import format_exc

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError
from scrapy.exceptions import DropItem

from .items import City58ItemXiaoQu, City58ItemXiaoChuZuQuInfo


class City58Pipeline(object):
    """Store community and rental items into MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB configuration from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGODB_URI'),
            mongo_db=crawler.settings.get('MONGODB_DATABASE', 'items')
        )

    def open_spider(self, spider):
        _ = spider
        self.client = MongoClient(self.mongo_uri)  # connect to MongoDB
        self.db = self.client[self.mongo_db]
        self.db['city58_info'].ensure_index('id', unique=True)          # unique index on community id
        self.db['city58_chuzu_info'].ensure_index('url', unique=True)   # unique index on listing url

    def close_spider(self, spider):
        _ = spider
        self.client.close()  # disconnect from MongoDB

    def process_item(self, item, spider):
        try:
            if isinstance(item, City58ItemXiaoQu):  # community item
                # upsert by community id: update if present, insert otherwise
                self.db['city58_info'].update({'id': item['id']}, {'$set': item}, upsert=True)
            elif isinstance(item, City58ItemXiaoChuZuQuInfo):  # rental item
                try:
                    # average second-hand price of this community, computed by HandleFangjiaPipline
                    fangjia = HandleFangjiaPipline.price_per_square_meter_dict[item['id']]
                    # del HandleFangjiaPipline.price_per_square_meter_dict[item['id']]
                    item['price_pre'] = fangjia
                    # upsert by listing url: update if present, insert otherwise
                    self.db['city58_chuzu_info'].update({'url': item['url']}, {'$set': item}, upsert=True)
                except Exception as e:
                    print(e)
        except DuplicateKeyError:
            spider.logger.debug('duplicate key error collection')  # unique-index conflict
        except Exception as e:
            _ = e
            spider.logger.error(format_exc())
        return item


class HandleZuFangPipline(object):
    """Compute the per-square-meter rent of each rental listing."""

    def process_item(self, item, spider):
        _ = spider, self
        # only rental items that carry a floor area are processed
        if isinstance(item, City58ItemXiaoChuZuQuInfo) and 'mianji' in item:
            item['chuzu_price_pre'] = int(item['zu_price']) / int(item['mianji'])  # rent divided by area
        return item  # pass the item on


class HandleFangjiaPipline(object):
    """Average the second-hand prices of each community and cache them in a class-level dict."""

    price_per_square_meter_dict = dict()

    def process_item(self, item, spider):
        _ = spider
        # the plain dicts yielded by ershoufang_list_pag carry a 'price_list'
        if isinstance(item, dict) and 'price_list' in item:
            item['price_list'] = [int(i) for i in item['price_list']]
            if item['price_list']:
                # average second-hand price per square meter of this community
                self.price_per_square_meter_dict[item['id']] = sum(item['price_list']) / len(item['price_list'])
            else:
                self.price_per_square_meter_dict[item['id']] = 0
            raise DropItem()  # the dict has served its purpose; do not store it
        return item  # pass every other item on
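After a crawl, the two collections can be inspected directly with pymongo to confirm that the pipelines did their work. A small sketch (collection and field names as defined above; the database name is the MONGODB_DATABASE value from settings.py; count_documents needs pymongo 3.7+):

from pymongo import MongoClient

client = MongoClient('mongodb://127.0.0.1:27017')
db = client['test']   # MONGODB_DATABASE in settings.py

print(db['city58_info'].count_documents({}))         # communities stored
print(db['city58_chuzu_info'].count_documents({}))   # rental listings stored

# one sample rental listing with its derived prices
print(db['city58_chuzu_info'].find_one({}, {'_id': 0, 'name': 1, 'zu_price': 1,
                                            'chuzu_price_pre': 1, 'price_pre': 1}))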
- Write settings.py, which enables the pipelines and the middleware and fixes the order of the pipelines (the crawl can then be started as shown after the file):
# -*- coding: utf-8 -*-
# Scrapy settings for city_58 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'city_58'
SPIDER_MODULES = ['city_58.spiders']
NEWSPIDER_MODULE = 'city_58.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'city_58 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.3  # delay between requests to the same site, in seconds
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'city_58.middlewares.City58SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'city_58.middlewares.ProxyMiddleware': 543,  # proxy middleware
}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# Process the data first, then store it
ITEM_PIPELINES = {
    'city_58.pipelines.HandleZuFangPipline': 300,   # per-square-meter rent of each rental listing
    'city_58.pipelines.HandleFangjiaPipline': 310,  # average second-hand price of each community
    'city_58.pipelines.City58Pipeline': 320,        # storage into MongoDB
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
MONGODB_HOST = '127.0.0.1'  # local MongoDB instance
MONGODB_PORT = '27017'      # MongoDB port
MONGODB_URI = 'mongodb://{}:{}'.format(MONGODB_HOST, MONGODB_PORT)
MONGODB_DATABASE = 'test'   # database name
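With everything in place, start the crawl from the directory that contains scrapy.cfg:

>>> scrapy crawl spider_city_58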