Practical Machine Learning (Chinese Edition) 1.3: Web Scraping

沪钢木子, 2022-03-11

Tags: pytorch, deep learning, cnn

Contents

1. Web scraping
2. Web scraping tools

1. Web scraping

The goal is to extract data from websites.
(1) The data is noisy, the labels are weak, and the content can be spammy.
(2) It is available at scale.
(3) A typical application is a price comparison or price tracking website.
Many ML datasets are obtained by web scraping.
(1) Examples: ImageNet, Kinetics.
Web crawling vs. web scraping
(1) Crawling: indexing whole pages across the internet.
(2) Scraping: extracting particular data from the web pages of a website.
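
The difference between crawling and scraping can be illustrated with a minimal sketch. It uses only Python's standard `html.parser` module and a hard-coded page, so nothing is fetched from the network. The page contents, the class names `LinkCollector` and `PriceScraper`, and the `price` CSS class are all hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical product page, hard-coded so the sketch runs offline.
SAMPLE_PAGE = """
<html><body>
  <a href="/item/1">Item 1</a>
  <a href="/item/2">Item 2</a>
  <div class="product"><span class="name">Widget</span><span class="price">$19.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">$5.00</span></div>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawling-style step: collect every link so more pages could be visited."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceScraper(HTMLParser):
    """Scraping-style step: keep only the text inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            # Strip the currency symbol and convert to a number.
            self.prices.append(float(data.strip().lstrip("$")))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


crawler = LinkCollector()
crawler.feed(SAMPLE_PAGE)

scraper = PriceScraper()
scraper.feed(SAMPLE_PAGE)

print(crawler.links)
print(scraper.prices)
```

A crawler would go on to visit each collected link, while a scraper stops once it has the fields it cares about. Real pages are usually messier than this sample, so a production scraper would likely need a more tolerant parser.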

2. Web scraping tools

"curl" often doesn't work, because website owners use various ways to stop bots.
Use a headless browser: a web browser without a GUI.
You need a lot of new IPs, which are easy to get through public clouds.
Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, and GCP 0.25%.
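
To get a feel for the scale behind these percentages, the short calculation below converts each share into an approximate address count. It assumes the percentages are taken against the full 32-bit IPv4 space (2^32 addresses, reserved ranges included), which the notes do not state. The helper name `approx_addresses` is made up for this sketch.

```python
# Total IPv4 address space: 32-bit addresses, reserved ranges included (assumption).
TOTAL_IPV4 = 2 ** 32

# Shares quoted in the notes.
CLOUD_SHARE_PERCENT = {"AWS": 1.75, "Azure": 0.55, "GCP": 0.25}


def approx_addresses(share_percent: float) -> int:
    """Convert a percentage of the IPv4 space into an approximate address count."""
    return round(TOTAL_IPV4 * share_percent / 100)


for provider, share in CLOUD_SHARE_PERCENT.items():
    millions = approx_addresses(share) / 1e6
    print(f"{provider}: ~{millions:.1f} million addresses")
```

Under that assumption the three providers hold roughly 75 million, 24 million, and 11 million addresses respectively, which is why public clouds are a practical source of fresh IPs.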
