XPath reference: https://zhuanlan.zhihu.com/p/135422455
JSON reference: https://zhuanlan.zhihu.com/p/265602471 (not used in this article)
First, use the following two methods to check whether the resource you want to scrape is present in the returned HTML.
I.
1)
2)
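The same check can also be done from Python: take a piece of text you can see on the rendered page and look for it in the raw response body. The following is only a minimal sketch. `appears_in_html` is a hypothetical helper name, and the sample HTML is made up to stand in for the text of a real response. If the text is missing from the raw HTML, the data is probably loaded by JavaScript after the page loads, and an XPath run against the returned HTML will not find it.

```python
import html
import re


def appears_in_html(html_text, sample):
    """Return True if `sample` (text seen in the browser) occurs in the raw HTML."""

    def normalize(s):
        # Decode entities such as &amp; and collapse runs of whitespace,
        # so small formatting differences do not hide a real match.
        return re.sub(r"\s+", " ", html.unescape(s)).strip()

    return normalize(sample) in normalize(html_text)


# Made-up response body standing in for requests.get(url, headers=headers).text
raw = "<table><tr><td>1</td><td><a href='/x'>Tom &amp; Jerry\n reunion</a></td></tr></table>"

print(appears_in_html(raw, "Tom & Jerry reunion"))
print(appears_in_html(raw, "Loaded later by JavaScript"))
```

The first call finds the text even though the raw HTML contains an entity and a line break; the second does not, which is the signal to look for a JSON endpoint instead of using XPath.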
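II.
Copy the XPath of two of the target elements and compare them to find the pattern.
III.
Derive a single, unified XPath that matches all of them.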
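The comparison described above can also be done mechanically. This is a minimal sketch, assuming the two copied paths have the same depth and differ only in positional indexes such as `tr[1]` versus `tr[2]`. `generalize_xpath` is a hypothetical helper, and the two example paths are shortened illustrations rather than the real paths from the page.

```python
import re

# Matches a trailing positional index such as "[3]" at the end of a path step.
_INDEX = re.compile(r"\[\d+\]$")


def generalize_xpath(first, second):
    """Merge two XPaths into one by dropping the indexes where they differ."""
    steps_a = first.split("/")
    steps_b = second.split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("the two XPaths have different depths")

    merged = []
    for a, b in zip(steps_a, steps_b):
        if a == b:
            # Identical step: keep it, including any index it carries.
            merged.append(a)
        elif _INDEX.sub("", a) == _INDEX.sub("", b):
            # Same tag, different index: drop the index to match every sibling.
            merged.append(_INDEX.sub("", a))
        else:
            raise ValueError(f"steps differ in more than the index: {a!r} vs {b!r}")
    return "/".join(merged)


# Illustrative paths for the first and second rows of a table.
row1 = '//*[@id="page"]/div[2]/table/tbody/tr[1]/td[2]/a'
row2 = '//*[@id="page"]/div[2]/table/tbody/tr[2]/td[2]/a'
unified = generalize_xpath(row1, row2)
print(unified)
```

Only the `tr` step loses its index, because that is the only place the two paths disagree; `div[2]` and `td[2]` are kept because they are the same in both.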
IV.
Write the code that extracts the data.
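Note: the Flask framework is used here to display the scraped results in the browser; the scraping code lives inside the Flask view function.
The Weibo hot-search list is used as the example.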
# Import the libraries
from lxml import etree
import requests
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    # Page that lists the Weibo hot-search entries
    url = "https://tophub.today/n/KqndgxeLl9"
    # Headers copied from the browser so the request looks like a normal page visit
    headers = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36", "cookie": "UM_distinctid=17da3319278185-027c9de93a6d7d-3d72065b-1fa400-17da33192791dc; CNZZDATA1276310587=1711630976-1639114410-https%253A%252F%252Fwww.baidu.com%252F%7C1639114410; Hm_lvt_3b1e939f6e789219d8629de8a519eab9=1639120475,1639120902; Hm_lpvt_3b1e939f6e789219d8629de8a519eab9=1639120902", "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"}
    resp = requests.get(url, headers=headers)
    html = resp.text
    # Parse the HTML and apply the unified XPath to get the text of every title link
    tree = etree.HTML(html)
    res = tree.xpath('//*[@id="page"]/div[2]/div[2]/div[1]/div[2]/div/div[1]/table/tbody/tr/td[2]/a/text()')
    # Wrap each title in an <h2> tag so the browser shows one entry per line
    t = ""
    for i in res:
        t += "<h2>" + i + "</h2>"
    return t


# Start the development server only when this file is run directly
if __name__ == "__main__":
    app.run()
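One pitfall with XPaths copied from the browser's developer tools: browsers insert a `<tbody>` element into every table, even when the raw HTML has none, while lxml parses the HTML as it was sent. If the copied path returns an empty list, trying the same path without `/tbody` is a reasonable next step. The sketch below assumes that situation. `extract_titles` is a hypothetical helper and the sample HTML is made up; it is not the markup of the real page.

```python
from lxml import etree


def extract_titles(html_text, xpath):
    """Run `xpath` on the HTML; if nothing matches, retry without /tbody."""
    tree = etree.HTML(html_text)
    found = tree.xpath(xpath)
    if not found and "/tbody" in xpath:
        # The browser-added <tbody> may not exist in the raw HTML.
        found = tree.xpath(xpath.replace("/tbody", ""))
    return [text.strip() for text in found]


# Made-up HTML with a table that has no <tbody>, as a server might send it.
sample = """
<div id="page"><table>
  <tr><td>1</td><td><a href="/a">First topic</a></td></tr>
  <tr><td>2</td><td><a href="/b">Second topic</a></td></tr>
</table></div>
"""
titles = extract_titles(sample, '//*[@id="page"]/table/tbody/tr/td[2]/a/text()')
print(titles)
```

Keeping the extraction in its own function also makes it possible to test the XPath against saved HTML without starting Flask or hitting the site.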