The gossip has been coming in waves lately: first Wang Leehom, then Viya. This morning the hashtag #收入几十亿是种什么体验# ("what is it like to earn billions") hit Weibo's hot search, and reading the comments honestly made me tear up.
1. A first attempt at scraping Weibo comments
Open the browser's developer tools (F12), grab the request URL, and inspect the response: everything we need is in the returned JSON, so we just pick out the fields we want.
import json
import requests


def fun():
    # the buildComments API behind the post's comment list
    url = "https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4716527267873175&is_show_bulletin=2&is_mix=0&count=10&uid=2504747281"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    resp = response.json()
    # pretty-print the raw JSON to see what fields are available
    print(json.dumps(resp, indent=2, ensure_ascii=False))
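What we care about in this JSON is the list of comment objects under the top-level data key, plus the max_id field in the same response, which the next section uses to page through the results.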
2. Paging through the comments
Next, look at how paging works. When I scraped Zhihu before, Zhihu paged by adjusting the count parameter; let's see how Weibo controls paging here.
Watching the request URLs in F12, the first request carries no max_id parameter, but every later page adds a max_id to the URL, and the value differs from page to page. The interesting part is that each page's response contains the max_id of the next page, so appending that value to the URL is all it takes to move on to the next page.
We'll grab the first 1000 pages of comments.
import requests
from bs4 import BeautifulSoup
from datetime import datetime


# fetch one page of comments; max_id == 0 means the first page
def request(max_id):
    if max_id == 0:
        url = 'https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4716527267873175&is_show_bulletin=2&is_mix=0&count=10&uid=2504747281'
    else:
        url = 'https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4716527267873175&is_show_bulletin=2&is_mix=0&count=10&uid=2504747281&max_id={0}'.format(
            max_id)
    headers = {
        # depending on Weibo's current anti-scraping rules, a logged-in Cookie header may also be needed here
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
    }

    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    resp = response.json()
    dataparse(resp)
    # the response carries the max_id to use for the next page
    return resp['max_id']


# turn Weibo's created_at string into 'YYYY-MM-DD HH:MM:SS'
# (the 'Sat Dec 18 15:27:35 +0800 2021' format assumed below)
def timeformat(created_at):
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y').strftime('%Y-%m-%d %H:%M:%S')


# parse one page of results
def dataparse(response):
    for item in response['data']:
        userid = item['user']['id']
        author = item['user']['name']
        time = timeformat(item['created_at'])
        text = item['text']
        # the comment text is HTML (emoji come back as <img> tags), so strip the tags
        soup = BeautifulSoup(text, 'html.parser')
        content = soup.get_text()
        like_counts = item['like_counts']
        print(str(userid) + " " + author + " " + time + " " + content + " " + str(like_counts))


# main loop: page through up to 1000 pages
def main():
    max_id = 0
    count = 0
    while count < 1000:
        count += 1
        max_id = request(max_id)
        # Weibo returns max_id == 0 when there are no more pages
        if max_id == 0:
            break


if __name__ == '__main__':
    main()
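request() returns each response's max_id, which is fed back in as the next request's parameter; in the version above the loop also stops early once Weibo hands back max_id == 0, i.e. when there are no more pages to fetch.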
3. Saving the data to CSV
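The code for this step isn't shown here, so below is a minimal sketch of how each parsed comment could be appended to data4.csv with Python's csv module. The file name and the column names (userid, author, time, content, like) are assumptions chosen to match what the analysis code below reads.

import csv


# sketch: append one parsed comment per row to data4.csv
# (file name and column order are assumptions based on the analysis step)
def save_row(row, filename='./data4.csv', write_header=False):
    with open(filename, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=['userid', 'author', 'time', 'content', 'like'])
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# e.g. called from dataparse() for each comment:
# save_row({'userid': userid, 'author': author, 'time': time, 'content': content, 'like': like_counts})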
4. Data analysis
First, a look at the cleaned data.
import pandas as pd

pd.set_option('display.max_columns', None)         # show all columns
pd.set_option('display.max_rows', None)            # show all rows
pd.set_option('display.expand_frame_repr', False)  # don't wrap wide frames


def etl2():
    # drop duplicate rows and rows with missing values
    df = pd.read_csv('./data4.csv').drop_duplicates().dropna()
    print(df.head(10))
We use the pkuseg Chinese word-segmentation toolkit to tokenize the comments, filter them against the Sichuan University Machine Intelligence Lab stopword list, and show the 30 most frequent words.
import pandas as pd
import pkuseg
from collections import Counter


def etl2():
    # drop duplicate rows and rows with missing values
    df = pd.read_csv('./data4.csv').drop_duplicates().dropna()
    # print(df.head(10))

    # stopword list
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords.txt', encoding='utf-8')])

    # pkuseg word segmentation (using the pretrained 'web' model)
    content = df['content'].values.tolist()
    seg = pkuseg.pkuseg(model_name='web')
    result = seg.cut(" ".join(content))

    # keep words longer than one character that are not stopwords
    all_words = [word for word in result if len(word) > 1 and word not in stopwords]

    # Top-30 word frequencies
    wordcount = Counter(all_words).most_common(30)
    x1_data, y1_data = list(zip(*wordcount))

    print(x1_data)
    print(y1_data)

Output:

('不用', '评论', '东西', '会员', '喜欢', '随便', '体验', '外卖', '上班', '几十亿', '酸奶', '单车', '担心', '心疼', '价格', '奶茶', '犹豫', '做梦', '共享', '房子', '自由', '图片', '心酸', '肯定', '视频', '演员', '公交', '家人', '包养', '块钱')
(1200, 221, 206, 158, 158, 127, 114, 95, 94, 91, 90, 87, 85, 84, 84, 82, 78, 75, 75, 68, 66, 63, 63, 60, 56, 54, 54, 53, 52, 52)
Sure enough, money means never having to think twice: the runaway top word is "不用" ("no need to").
5. Data visualization
The ten most-liked comments:
def etl3():
    # drop duplicate rows and rows with missing values
    df = pd.read_csv('./data4.csv').drop_duplicates().dropna()
    # sort by like count and show the top ten comments
    print(df.sort_values(by="like", ascending=False).head(10))
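Besides printing the Top-30 counts, they can also be drawn as a simple bar chart. This is just a sketch with matplotlib, assuming the x1_data / y1_data tuples computed in etl2 above; the SimHei font setting is an assumption so the Chinese labels render.

import matplotlib.pyplot as plt


def plot_top30(x1_data, y1_data):
    # point matplotlib at a Chinese font so the labels render
    # (SimHei is an assumption; any installed Chinese font works)
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.figure(figsize=(12, 6))
    plt.bar(x1_data, y1_data)
    plt.xticks(rotation=60)
    plt.title('Top 30 words in the comments')
    plt.tight_layout()
    plt.show()


# e.g. plot_top30(x1_data, y1_data) at the end of etl2()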
And the keyword word cloud:
import pandas as pd
import pkuseg
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt


def etl2():
    # drop duplicate rows and rows with missing values
    df = pd.read_csv('./data4.csv').drop_duplicates().dropna()
    # print(df.head(10))

    # stopword list
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords.txt', encoding='utf-8')])

    # pkuseg word segmentation (using the pretrained 'web' model)
    content = df['content'].values.tolist()
    seg = pkuseg.pkuseg(model_name='web')
    result = seg.cut(" ".join(content))

    # keep words longer than one character that are not stopwords
    all_words = [word for word in result if len(word) > 1 and word not in stopwords]

    # Top-30 word frequencies
    wordcount = Counter(all_words).most_common(30)
    x1_data, y1_data = list(zip(*wordcount))

    print(x1_data)
    print(y1_data)

    # build the word cloud from the filtered words
    cloud = WordCloud(scale=4,
                      font_path='./simfang.ttf',
                      background_color='black',
                      max_words=100,
                      max_font_size=60,
                      random_state=20)

    my_wordcloud = cloud.generate(" ".join(all_words))

    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()
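If you want to keep the image, the generated my_wordcloud object can also be written straight to disk with my_wordcloud.to_file('wordcloud.png').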
"Some people laugh and laugh, and then they cry."
Of course, all of this is just an ordinary wage earner's daydream; the fun the rich have is beyond our imagination.
Maybe spending money is the most boring thing in the world to them.