Scraping Bilibili User Comments with playwright

彩虹_bd07 · 2024-01-15 · 26 reads

1. Introduction

Use the automation tool playwright to fetch the usernames, user genders, comment contents, and IP locations from the comment section below a Bilibili video.


2. Approach

Open the video page. In the Network tab of the browser's developer tools, you can see that the comments are returned by requests whose URL contains "main?oid=XXXX", and that new requests keep firing as you scroll down.


So all we need to do is simulate the user's scrolling, register a response listener that keeps capturing and saving the comment payloads while the page scrolls down, continue until we reach the bottom of the comment section, and then parse the collected JSON.

def monitor_response(self, res):
        # Only handle responses from the comment-list API spotted in the Network tab
        if "api.bilibili.com/x/v2/reply/wbi/main?oid=" in res.url:
            data = res.json()["data"]["replies"]
            # The last 10 characters of the URL vary per page request; use them as a key
            index = res.url[-10:]
            self.comments_message[index] = data
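
For reference, the parsing step later only reads a few fields from each element of data.replies. A trimmed sketch of one reply object, limited to those fields; the values here are illustrative placeholders, not real data:

# Shape of one entry in res.json()["data"]["replies"], trimmed to the
# fields used by the parsing script; values are illustrative placeholders.
reply = {
    "member": {
        "uname": "某用户",           # username
        "sex": "保密",               # gender: 男 / 女 / 保密
    },
    "content": {
        "message": "评论内容",        # the comment text
    },
    "reply_control": {
        "location": "IP属地：广东",   # IP location string with the "IP属地：" prefix
    },
}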

3. Complete Code

First, open the Bilibili homepage and log in manually.


(A similar walkthrough of this step appears in section 二(二)②Ⅱ of the earlier post "python+playwright爬取招聘网站" on the 进击no猪排 51CTO blog.)
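
The script below attaches to an already-running browser over the Chrome DevTools Protocol, so the browser has to be started with remote debugging enabled on the same port (12345). A minimal sketch of that prerequisite step; the Chrome executable path and profile directory below are assumptions you should adjust for your machine:

# launch_chrome.py (sketch)
import subprocess

# Hypothetical Chrome path -- adjust for your system.
CHROME_PATH = r"C:\Program Files\Google\Chrome\Application\chrome.exe"

subprocess.Popen([
    CHROME_PATH,
    "--remote-debugging-port=12345",     # must match connect_over_cdp("http://localhost:12345/")
    r"--user-data-dir=C:\chrome-debug",  # separate profile so the debugging flag takes effect
])
# Once Chrome opens, log in to bilibili.com by hand, then run bilibi_comments.py.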

Fetching the comments JSON file

# bilibi_comments.py

from playwright.sync_api import sync_playwright
import json

class comments_links_scrapy():
    def __init__(self):
        self.url = input("Paste the video URL here: ")
        self.comments_message = {}

    def monitor_response(self, res):
        # Capture responses from the comment-list API observed in the Network tab
        if "api.bilibili.com/x/v2/reply/wbi/main?oid=" in res.url:
            data = res.json()["data"]["replies"]
            # The final page can return null replies; skip those
            if data:
                # The last 10 characters of the URL vary per page request; use them as a key
                index = res.url[-10:]
                print(res.url)
                self.comments_message[index] = data

    def scrapy(self):
        try:
            with sync_playwright() as p:
                # Attach to the already-logged-in browser started with
                # --remote-debugging-port=12345
                browser = p.chromium.connect_over_cdp('http://localhost:12345/')
                page = browser.contexts[0].pages[0]
                page.on("response", self.monitor_response)
                page.goto(self.url)
                page.wait_for_timeout(3000)

                # Keep scrolling until the page height stops growing,
                # i.e. the bottom of the comment section has been reached
                previous_height = 0
                while True:
                    current_height = page.evaluate('document.documentElement.scrollHeight')
                    if previous_height < current_height:
                        previous_height = current_height
                        page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
                        page.wait_for_timeout(1000)
                    else:
                        break

                # Write once in "w" mode: appending ("a") on reruns would
                # leave several concatenated JSON objects in one file
                with open("comments.json", mode="w", encoding="utf-8") as fp:
                    json.dump(self.comments_message, fp, ensure_ascii=False)

                browser.close()

        except Exception as e:
            # Don't swallow errors silently -- at least report them
            print(f"Scraping failed: {e}")

if __name__ == '__main__':
    test = comments_links_scrapy()
    test.scrapy()
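
A quick way to check what a run captured (a small sketch, assuming comments.json was written by the script above):

# count captured response pages and total replies
import json

with open("comments.json", encoding="utf-8") as f:
    pages = json.load(f)

print(f"{len(pages)} response pages captured")
print(f"{sum(len(replies) for replies in pages.values())} replies in total")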

Parsing the JSON file and saving it as CSV

# get_comments.py

import json
import csv

with open("comments.json", "r", encoding="utf-8") as f:
    data = json.loads(f.read())

# utf-8-sig adds a BOM so Excel displays the Chinese text correctly
with open('comments.csv', 'w', newline='', encoding='utf-8-sig') as f2:
    colnames = ['用户名', '用户性别', '评论内容', 'IP属地']  # username, gender, comment, IP location
    writer = csv.DictWriter(f2, fieldnames=colnames)
    writer.writeheader()
    for item in data:          # item: one captured response page
        for i in data[item]:   # i: one reply object
            writer.writerow({
                '用户名': i["member"]["uname"],
                '用户性别': i["member"]["sex"],
                '评论内容': i["content"]["message"],
                # location looks like "IP属地：广东"; [5:] drops the "IP属地：" prefix.
                # Guard with .get in case a reply lacks the location field.
                'IP属地': i.get("reply_control", {}).get("location", "")[5:]
            })
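
To spot-check the result, you can print the first few rows back (a small sketch):

import csv

with open("comments.csv", encoding="utf-8-sig") as f:
    for row in list(csv.reader(f))[:5]:  # header plus the first four comments
        print(row)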

(This was just a practice exercise; the scraping process seems a bit unstable.)
