☘ Preface ☘
What will you learn from reading this post?
- How to read and write Excel files with pandas (a short sketch follows this list)
- How to find the site's API and call it directly
- Using multiple threads to make the crawler faster
- Deploying a web service on a cloud server so the results update in real time
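The roster in this project lives in an Excel file, so the script further down leans on pandas for the spreadsheet work. As a warm-up, here is a minimal read/write sketch; the file name and the placeholder column are made up for illustration only:

```python
import pandas as pd

# Read one worksheet into a DataFrame (pandas reads .xlsx files via openpyxl).
df = pd.read_excel("members.xlsx")       # hypothetical file name

# Add or update a column, e.g. a placeholder for the solve count.
df["力扣题数"] = 0

# Write the result back out, dropping the numeric index column.
df.to_excel("members_out.xlsx", index=False)
```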
In this post I build on the previous article: first I dig out the site's API to speed up the requests, then push further with multithreading for an even bigger speed-up, and finally deploy everything to a web server so the leaderboard can be viewed with real-time updates. (A minimal sketch of the thread-pool pattern used later is shown below.)
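The multithreaded speed-up comes straight from the standard-library thread pool. Here is a minimal sketch of the pattern, with a purely illustrative URL list:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_status(url):
    # One blocking HTTP request; the pool runs many of these concurrently.
    return requests.get(url, timeout=10).status_code

urls = ["https://leetcode-cn.com"] * 5   # illustrative input only

with ThreadPoolExecutor(max_workers=10) as executor:
    # map() preserves input order, which matters later when the results are
    # written back into a DataFrame column.
    results = list(executor.map(fetch_status, urls))

print(results)
```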
If you have not read the previous article and are missing the background, start with 【从零开始的python生活①】手撕爬虫扒一扒力扣的用户刷题数据 (a hand-rolled crawler that scrapes LeetCode users' solve counts).
Estimated reading time: about 20 minutes.
Contents
1. Why the improvements
2. Getting the API endpoint
3. Reading the data in and writing query results back
4. Requesting information with multiple threads
5. Writing the web front end
6. Other notes
7. Closing remarks
For reference, here is the complete script:
"""
兴磊的代码
CSDN主页:https://blog.csdn.net/qq_17593855
"""
__author__ = '兴磊'
__time__ = '2022/1/27'
import pandas as pd
import re
import time
from urllib.parse import urlencode
import requests
import json
from concurrent.futures import ThreadPoolExecutor
headers = {
    "x-csrftoken": '',                      # filled in by init_csrf() at startup
    "Referer": "https://leetcode-cn.com",
}
# HTML <head> snippet: UTF-8 charset declaration plus a Baidu Analytics tag,
# prepended to the generated page in get_html().
utf = '''
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<script>
var _hmt = _hmt || [];
(function() {
var hm = document.createElement("script");
hm.src = "https://hm.baidu.com/hm.js?f114c8d036eda9fc450e6cbc06a31ebc";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
</script>
'''
payload = {"operation_name": "userPublicProfile",
"query": '''query userPublicProfile($userSlug: String!) {
userProfilePublicProfile(userSlug: $userSlug) {
submissionProgress {
acTotal
}
}
}
''',
"variables": '{"userSlug":"kingley"}'
}
def init_csrf():
    # A HEAD request makes the server set a csrftoken cookie, which is then
    # copied into the shared request headers.
    sess = requests.session()
    sess.head("https://leetcode-cn.com/graphql/")
    headers['x-csrftoken'] = sess.cookies["csrftoken"]
def chaxun(username):
    # Query one user's accepted-problem count. Building a per-call copy of the
    # payload keeps concurrent threads from overwriting each other's "variables".
    params = dict(payload, variables=json.dumps({"userSlug": username}))
    res = requests.post("https://leetcode-cn.com/graphql/?" + urlencode(params),
                        headers=headers)
    if res.status_code != 200:
        return -1
    return res.json()['data']['userProfilePublicProfile']['submissionProgress']['acTotal']
def get_html(df, cmap="Set3"):
    # Sort by solved count, drop the link columns that should not appear on
    # the page, and render the rest as a colour-graded HTML table.
    df.sort_values("力扣题数", ascending=False, inplace=True)
    del df['力扣主页']
    del df['CSDN主页']
    del df['''B站主页
(主要用于发奖励的时候找得到对应的人)''']
    r = (
        df.style.hide_index()
        .background_gradient(cmap=cmap, subset=["力扣题数"])
    )
    html = '<div>' + r.render() + '</div>'
    # Replace the per-table id selectors pandas generates with plain row
    # classes so the rules in style.css can match them.
    html = re.sub(r"#T_.+?(row\d+)_col\d+", r".\1", html)
    with open("style.css") as f:
        css = "<style>" + f.read() + "</style>"
    css = css.format(fontsize=28, justify="center")
    html = utf + css + html
    return html
if __name__ == '__main__':
    init_csrf()
    df = pd.read_excel('111.xlsx')     # the roster spreadsheet
    start = time.time()
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Extract the user slug from each profile URL in the 力扣主页 column
        # and query all users concurrently.
        nums = executor.map(chaxun, df.力扣主页.str.extract(
            r"leetcode-cn.com/u/([^/]+)(?:/|$)", expand=False))
    df['力扣题数'] = list(nums)
    with open("/www/xxxx/score.html", 'w', encoding="utf-8") as f:
        f.write(get_html(df))
    print("Elapsed:", time.time() - start)
```