The gossip has been coming in waves lately: first Wang Leehom, then Viya. This morning the hashtag #收入几十亿是种什么体验# ("what is it like to earn billions") hit Weibo's hot search, and reading the comments honestly made me tear up.
1. A first attempt at scraping Weibo comments
Open the browser's developer tools (F12), grab the request URL, and inspect the response: everything we need is in the returned JSON, so we just pick out the fields we want.
import json
import requests


def fun():
    # the buildComments API behind the post's comment list
    url = "https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4716527267873175&is_show_bulletin=2&is_mix=0&count=10&uid=2504747281"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    resp = response.json()
    # pretty-print the raw JSON to see what fields are available
    print(json.dumps(resp, indent=2, ensure_ascii=False))
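What we care about in this JSON is the list of comment objects under the top-level data key, plus the max_id field in the same response, which the next section uses to page through the results.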
2. Paging through the comments
Next, look at how paging works. When I scraped Zhihu before, Zhihu paged by adjusting the count parameter; let's see how Weibo controls paging here.
Watching the request URLs in F12, the first request carries no max_id parameter, but every later page adds a max_id to the URL, and the value differs from page to page. The interesting part is that each page's response contains the max_id of the next page, so appending that value to the URL is all it takes to move on to the next page.
We'll grab the first 1000 pages of comments.
import requests
from bs4 import BeautifulSoup
from datetime import datetime


# fetch one page of comments; max_id == 0 means the first page
def request(max_id):
    if max_id == 0:
        url = 'https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4716527267873175&is_show_bulletin=2&is_mix=0&count=10&uid=2504747281'
    else:
        url = 'https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4716527267873175&is_show_bulletin=2&is_mix=0&count=10&uid=2504747281&max_id={0}'.format(
            max_id)
    headers = {
        # depending on Weibo's current anti-scraping rules, a logged-in Cookie header may also be needed here
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
    }

    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    resp = response.json()
    dataparse(resp)
    # the response carries the max_id to use for the next page
    return resp['max_id']


# turn Weibo's created_at string into 'YYYY-MM-DD HH:MM:SS'
# (the 'Sat Dec 18 15:27:35 +0800 2021' format assumed below)
def timeformat(created_at):
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y').strftime('%Y-%m-%d %H:%M:%S')


# parse one page of results
def dataparse(response):
    for item in response['data']:
        userid = item['user']['id']
        author = item['user']['name']
        time = timeformat(item['created_at'])
        text = item['text']
        # the comment text is HTML (emoji come back as <img> tags), so strip the tags
        soup = BeautifulSoup(text, 'html.parser')
        content = soup.get_text()
        like_counts = item['like_counts']
        print(str(userid) + " " + author + " " + time + " " + content + " " + str(like_counts))


# main loop: page through up to 1000 pages
def main():
    max_id = 0
    count = 0
    while count < 1000:
        count += 1
        max_id = request(max_id)
        # Weibo returns max_id == 0 when there are no more pages
        if max_id == 0:
            break


if __name__ == '__main__':
    main()
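request() returns each response's max_id, which is fed back in as the next request's parameter; in the version above the loop also stops early once Weibo hands back max_id == 0, i.e. when there are no more pages to fetch.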
3. Saving the data to CSV
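The code for this step isn't shown here, so below is a minimal sketch of how each parsed comment could be appended to data4.csv with Python's csv module. The file name and the column names (userid, author, time, content, like) are assumptions chosen to match what the analysis code below reads.

import csv


# sketch: append one parsed comment per row to data4.csv
# (file name and column order are assumptions based on the analysis step)
def save_row(row, filename='./data4.csv', write_header=False):
    with open(filename, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=['userid', 'author', 'time', 'content', 'like'])
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# e.g. called from dataparse() for each comment:
# save_row({'userid': userid, 'author': author, 'time': time, 'content': content, 'like': like_counts})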
4. Data analysis
First, a look at the cleaned data.
import pandas as pd

pd.set_option('display.max_columns', None)         # show all columns
pd.set_option('display.max_rows', None)            # show all rows
pd.set_option('display.expand_frame_repr', False)  # don't wrap wide frames


def etl2():
    # drop duplicate rows and rows with missing values
    df = pd.read_csv('./data4.csv').drop_duplicates().dropna()
    print(df.head(10))
We use the pkuseg Chinese word-segmentation toolkit to tokenize the comments, filter them against the Sichuan University Machine Intelligence Lab stopword list, and show the 30 most frequent words.
import pandas as pd
import pkuseg
from collections import Counter


def etl2():
    # drop duplicate rows and rows with missing values
    df = pd.read_csv('./data4.csv').drop_duplicates().dropna()
    # print(df.head(10))

    # stopword list
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords.txt', encoding='utf-8')])

    # pkuseg word segmentation (using the pretrained 'web' model)
    content = df['content'].values.tolist()
    seg = pkuseg.pkuseg(model_name='web')
    result = seg.cut(" ".join(content))

    # keep words longer than one character that are not stopwords
    all_words = [word for word in result if len(word) > 1 and word not in stopwords]

    # Top-30 word frequencies
    wordcount = Counter(all_words).most_common(30)
    x1_data, y1_data = list(zip(*wordcount))

    print(x1_data)
    print(y1_data)

Output:

('不用', '评论', '东西', '会员', '喜欢', '随便', '体验', '外卖', '上班', '几十亿', '酸奶', '单车', '担心', '心疼', '价格', '奶茶', '犹豫', '做梦', '共享', '房子', '自由', '图片', '心酸', '肯定', '视频', '演员', '公交', '家人', '包养', '块钱')
(1200, 221, 206, 158, 158, 127, 114, 95, 94, 91, 90, 87, 85, 84, 84, 82, 78, 75, 75, 68, 66, 63, 63, 60, 56, 54, 54, 53, 52, 52)
Sure enough, money means never having to think twice: the runaway top word is "不用" ("no need to").
5. Data visualization
The ten most-liked comments:
def etl3():
    # drop duplicate rows and rows with missing values
    df = pd.read_csv('./data4.csv').drop_duplicates().dropna()
    # sort by like count and show the top ten comments
    print(df.sort_values(by="like", ascending=False).head(10))
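Besides printing the Top-30 counts, they can also be drawn as a simple bar chart. This is just a sketch with matplotlib, assuming the x1_data / y1_data tuples computed in etl2 above; the SimHei font setting is an assumption so the Chinese labels render.

import matplotlib.pyplot as plt


def plot_top30(x1_data, y1_data):
    # point matplotlib at a Chinese font so the labels render
    # (SimHei is an assumption; any installed Chinese font works)
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.figure(figsize=(12, 6))
    plt.bar(x1_data, y1_data)
    plt.xticks(rotation=60)
    plt.title('Top 30 words in the comments')
    plt.tight_layout()
    plt.show()


# e.g. plot_top30(x1_data, y1_data) at the end of etl2()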
And the keyword word cloud:
import pandas as pd
import pkuseg
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt


def etl2():
    # drop duplicate rows and rows with missing values
    df = pd.read_csv('./data4.csv').drop_duplicates().dropna()
    # print(df.head(10))

    # stopword list
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords.txt', encoding='utf-8')])

    # pkuseg word segmentation (using the pretrained 'web' model)
    content = df['content'].values.tolist()
    seg = pkuseg.pkuseg(model_name='web')
    result = seg.cut(" ".join(content))

    # keep words longer than one character that are not stopwords
    all_words = [word for word in result if len(word) > 1 and word not in stopwords]

    # Top-30 word frequencies
    wordcount = Counter(all_words).most_common(30)
    x1_data, y1_data = list(zip(*wordcount))

    print(x1_data)
    print(y1_data)

    # build the word cloud from the filtered words
    cloud = WordCloud(scale=4,
                      font_path='./simfang.ttf',
                      background_color='black',
                      max_words=100,
                      max_font_size=60,
                      random_state=20)

    my_wordcloud = cloud.generate(" ".join(all_words))

    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()
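If you want to keep the image, the generated my_wordcloud object can also be written straight to disk with my_wordcloud.to_file('wordcloud.png').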
"Some people laugh and laugh, and then they cry."
Of course, all of this is just an ordinary wage earner's daydream; the fun the rich have is beyond our imagination.
Maybe spending money is the most boring thing in the world to them.