Regex Practice for Crawler Beginners: Scraping Music


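Hello, I'm 悦创.

While teaching a private lesson on crawler basics today, I spent 1 min writing and deploying a small website to use as a regular-expression exercise. If you're interested, give it a try.

If you run into problems, leave a comment. I reply to every one!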

1. Requirements

  1. Target site: https://bornforthis.cn/web_runing/crawler/regex/index.html
  2. Technical constraints:
     1. requests
     2. re
     3. Basic Python syntax
  3. Scrape the target music files
  4. Save them to the specified path: /data/music/
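
To make the `re` constraint concrete, here is a minimal sketch of pulling `href` values out of anchor tags with `re.findall`. The HTML snippet, the file names, and the helper name `extract_links` are made up for illustration; the markup of the real page may differ.

```python
import re

# Hypothetical HTML, invented for this example; the real page may look different.
SAMPLE_HTML = """
<ul>
  <li><a class="song" href="music/first.mp3">First</a></li>
  <li><a href="music/second.mp3">Second</a></li>
</ul>
"""

# Non-greedy groups stop at the first closing quote, so each match stays inside one tag.
HREF_PATTERN = re.compile(r'<a.*?href="(.*?)".*?</a>')


def extract_links(html):
    """Return every href value found in <a ...>...</a> tags."""
    # With a single capture group, findall returns just the captured strings.
    return HREF_PATTERN.findall(html)


links = extract_links(SAMPLE_HTML)
print(links)  # ['music/first.mp3', 'music/second.mp3']
```

Because `.` does not match a newline by default, this pattern only works when each `<a>` tag sits on a single line. If the real page splits tags across lines, passing `re.S` would be one way to handle it.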

2. Import the required libraries

import requests
import re

3. Write the code

import os
import re

import requests
from requests.exceptions import RequestException
from urllib.parse import urljoin

BASE = "https://bornforthis.cn/web_runing/crawler/regex/"


def requests_fun(url, binary=False):
    """Fetch url; return bytes if binary, else text. Return None on failure."""
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            if binary:
                return response.content
            else:
                return response.text
        return None
    except RequestException:
        return None


def save_music(path, binary):
    """Write the downloaded bytes to path."""
    with open(path, "wb") as f:
        f.write(binary)


def parse(html):
    """Extract the href value of every <a> tag on the page."""
    pattern = '<a.*?href="(.*?)".*?</a>'
    result = re.findall(pattern, html)
    return result


def joint(url_lst):
    """Turn relative links into absolute URLs."""
    url_list = []
    for url in url_lst:
        url = urljoin(BASE, url)
        url_list.append(url)
    return url_list


def postfix(url):
    """Use the last path segment of the URL as the file name."""
    music_name = url.split("/")[-1]
    return music_name


def main():
    url = "https://bornforthis.cn/web_runing/crawler/regex/index.html"
    html = requests_fun(url)
    if html is None:
        print("Failed to fetch the index page.")
        return
    # Create the output folder first; open() does not create missing folders.
    os.makedirs("data/music", exist_ok=True)
    url_list = joint(parse(html))
    for url in url_list:
        binary_content = requests_fun(url, binary=True)
        if binary_content is None:
            # Skip files that could not be downloaded.
            continue
        save_music(f"data/music/{postfix(url)}", binary_content)


if __name__ == '__main__':
    main()
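
The `joint` and `postfix` steps depend on how `urljoin` resolves links against `BASE` and on taking the last path segment as the file name. The sketch below shows what to expect. The link values and the helper names `to_absolute` and `file_name` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

BASE = "https://bornforthis.cn/web_runing/crawler/regex/"

# Hypothetical href values, invented for this example.
RELATIVE_LINK = "music/demo.mp3"
ABSOLUTE_LINK = "https://example.com/audio/other.mp3"


def to_absolute(link):
    """Resolve a link against BASE; absolute links come back unchanged."""
    return urljoin(BASE, link)


def file_name(url):
    """Take everything after the last slash as the file name."""
    return url.split("/")[-1]


full_url = to_absolute(RELATIVE_LINK)
print(full_url)                    # https://bornforthis.cn/web_runing/crawler/regex/music/demo.mp3
print(to_absolute(ABSOLUTE_LINK))  # https://example.com/audio/other.mp3
print(file_name(full_url))         # demo.mp3
```

The trailing slash on `BASE` matters: without it, `urljoin` would treat `regex` as a file name and replace it instead of appending to it.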

