python-xpath: scraping 猪八戒网 (work in progress) - CFANZ编程社区

The scraped data has not been cleaned yet.

xpath

/ steps down one level in the hierarchy (parent to child)

text() grabs the text of a node
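
To make the two notes above concrete, here is a minimal sketch using lxml. The HTML snippet and its class names are made up for illustration; they are not taken from any real page.

```python
from lxml import etree

# Made-up HTML snippet, only for illustration
html = """
<html><body>
  <div id="book">
    <p class="name">XPath notes</p>
    <p class="price">42</p>
  </div>
</body></html>
"""

tree = etree.HTML(html)

# "/" walks down one level at a time: html -> body -> div -> p
names = tree.xpath('/html/body/div/p[@class="name"]/text()')

# Without text() the result is a list of Element objects, not strings
elements = tree.xpath('/html/body/div/p')

print(names)
print(len(elements))
```

Each "/" only descends one level, so a path has to spell out every step unless "//" is used to search at any depth.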

https://blog.csdn.net/KELLENSHAW/article/details/127877476

Page to scrape

https://task.zbj.com/hall/list-all-0-p1?kw=HTML

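First locate the div of the small box (a single item in the list).

Then, using Inspect in the browser, copy its

xpath: //*[@id="hall-list-wrap"]/div[4]/div[1]/div[1]/div[1]/div[1]

The big box (all of the items) is the same path without the final index: //*[@id="hall-list-wrap"]/div[4]/div[1]/div[1]/div[1]/div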
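
The difference between the two paths is only the trailing [1]. A small sketch on made-up HTML (the id "wrap" and the class "item" are placeholders, not the real page structure):

```python
from lxml import etree

# Made-up list of three items, standing in for the real task list
html = """
<div id="wrap">
  <div class="item">task A</div>
  <div class="item">task B</div>
  <div class="item">task C</div>
</div>
"""
tree = etree.HTML(html)

small_box = tree.xpath('//*[@id="wrap"]/div[1]')  # only the first child div
big_box = tree.xpath('//*[@id="wrap"]/div')       # every child div

print(len(small_box), len(big_box))
```

Dropping the index is what turns "one box" into "the list of boxes" that can be looped over.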

Then loop over the boxes and, for each one, find:

the label = the rest of the path below the small box

the price

the info
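
The loop can be sketched like this. The HTML, class names, and field paths below are invented so the example runs offline; on the real page the relative paths would be whatever continues below the small box.

```python
from lxml import etree

# Made-up markup: two items, each with a title, a price and a description
html = """
<div id="wrap">
  <div class="item">
    <span class="title">Build a landing page</span>
    <span class="price">500</span>
    <p class="desc">HTML and CSS only</p>
  </div>
  <div class="item">
    <span class="title">Fix a form</span>
    <span class="price">200</span>
    <p class="desc">Small bug in an HTML form</p>
  </div>
</div>
"""
tree = etree.HTML(html)

rows = []
for item in tree.xpath('//*[@id="wrap"]/div'):
    # A leading "./" makes the path relative to the current item,
    # so each query only looks inside this one box
    title = item.xpath('./span[@class="title"]/text()')
    price = item.xpath('./span[@class="price"]/text()')
    desc = item.xpath('./p/text()')
    rows.append((title, price, desc))

for row in rows:
    print(row)
```

Note that every field comes back as a list of strings, even when there is only one match.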

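The difficulty I ran into was that the tags I wrote in the path did not match.

Why does the scraper return an empty list, or a list of length 0?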

https://blog.csdn.net/lzz781699880/article/details/81133398
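
As far as I know, one common cause of an empty list is that the path copied from the browser does not match the HTML that was actually downloaded (the browser may repair or generate markup that is not in the raw source). This is not necessarily what the linked post describes. A minimal sketch with a made-up table, where the browser-style path contains a tbody that the raw HTML does not have:

```python
from lxml import etree

# Made-up HTML as a server might send it: no <tbody>,
# although browsers add one when they display the table
html = '<html><body><table id="t"><tr><td>1</td></tr></table></body></html>'
tree = etree.HTML(html)

# Path as it would look when copied from the browser's inspector
copied_from_browser = tree.xpath('//*[@id="t"]/tbody/tr/td/text()')

# A looser path that does not depend on every intermediate level
looser_path = tree.xpath('//*[@id="t"]//td/text()')

print(copied_from_browser)
print(looser_path)
```

When a long positional path returns nothing, shortening it step by step and checking where the result first becomes empty is a simple way to find the level that does not match.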

[<Element div at 0x18c0fa23d00>]: I feel like I run into this "error" all the time.

I forgot how I solved it afterwards.
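
For what it is worth, that output is not an exception: it is how Python prints a list of lxml Element objects, which is what xpath() returns when the path ends at an element instead of at text(). A small sketch on made-up HTML:

```python
from lxml import etree

tree = etree.HTML('<html><body><div>hello</div></body></html>')

elements = tree.xpath('//div')       # list of Element objects
texts = tree.xpath('//div/text()')   # list of strings
also_text = elements[0].text         # text of one Element

print(elements)   # prints something like [<Element div at 0x...>]
print(texts)
print(also_text)
```

So either add /text() to the end of the path, or read .text from each element in the list.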

When using lxml, I got the error ValueError: can only parse strings

https://blog.csdn.net/weixin_42994523/article/details/107748670
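
One way to trigger this error, as far as I can tell, is to hand etree.HTML() the response object itself instead of its text. The sketch below uses a hypothetical FakeResponse class in place of requests.Response so that it runs without a network connection.

```python
from lxml import etree


class FakeResponse:
    """Hypothetical stand-in for requests.Response, only for this sketch."""
    text = "<html><body><div id='a'>hello</div></body></html>"


resp = FakeResponse()

try:
    etree.HTML(resp)           # passing the response object itself
    error_message = None
except ValueError as exc:
    error_message = str(exc)

tree = etree.HTML(resp.text)   # passing the HTML string works
texts = tree.xpath('//div[@id="a"]/text()')

print(error_message)
print(texts)
```

With requests, the string to parse is response.text (or response.content for bytes).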

Damn, nothing got scraped...

I don't know how to clean the data... no idea what to do with this list.
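
One simple way to tidy the lists that text() returns is to join the pieces and collapse the whitespace. This is only a sketch: clean_text is a hypothetical helper name, and the sample lists are made up to look like typical text() output.

```python
def clean_text(parts):
    """Join the strings from a text() query and normalise whitespace."""
    return " ".join(" ".join(parts).split())


# Made-up examples of what text() tends to return
raw_price = ['\n      ', '500', '\n    ']
raw_desc = ['  Need a simple ', 'HTML page\n']

print(clean_text(raw_price))
print(clean_text(raw_desc))
print(repr(clean_text([])))
```

An empty list becomes an empty string, so a missing field does not crash the loop.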

# -*- coding: utf-8 -*-
# @Time : 2023/4/7 17:28
# @Author : 路人甲
# @File : 爬猪八戒.py
# @Software: PyCharm
import requests
from lxml import etree

url = 'https://task.zbj.com/hall/list-all-0-p1?kw=HTML'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.7 Safari/537.36'}

# Send the request to the site
page_test = requests.get(url=url, headers=headers)

# Uncomment to print the full page source
# print(page_test.text)

# Load the HTML source fetched from the site into an lxml tree
tree = etree.HTML(page_test.text)

# The "big box": every item div in the list
divs = tree.xpath('//*[@id="hall-list-wrap"]/div[4]/div[1]/div[1]/div[1]/div')

for div in divs:
    # Paths starting with "./" are relative to the current item (the "small box")
    title = div.xpath('./a/div/div[1]/text()')
    price = div.xpath('./a/div/div[3]/text()')
    desc = div.xpath('./a/div/p/text()')

    print(price)
