1. Web scraping
- The goal is to extract data from websites
(1) Noisy, with weak labels, and can be spammy
(2) Available at scale
(3) Price comparison/tracking websites
- Many ML datasets are obtained by web scraping
(1) ImageNet, Kinetics
- Web crawling vs. web scraping
(1) Crawling: indexing whole pages across the internet
(2) Scraping: extracting particular data from the web pages of a website
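
To make the scraping side of that distinction concrete, here is a minimal sketch that pulls only the product names and prices out of a page and ignores everything else. It uses Python's standard `html.parser` module. The sample HTML, the `name` and `price` class names, and the `PriceParser` and `scrape_prices` helpers are all made up for illustration; a real site will use different markup.

```python
from html.parser import HTMLParser

# Hypothetical product page; in practice this string would come from an HTTP response.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">$19.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">$5.49</span></div>
</body></html>
"""


class PriceParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self._field = None   # which field the current <span> holds, if any
        self._current = {}   # fields gathered so far for the product being parsed
        self.products = []   # finished (name, price) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            cls = dict(attrs).get("class")
            if cls in ("name", "price"):
                self._field = cls

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
            # Once both fields are present, emit the product and start a new one.
            if "name" in self._current and "price" in self._current:
                price = float(self._current["price"].lstrip("$"))
                self.products.append((self._current["name"], price))
                self._current = {}


def scrape_prices(html):
    """Return a list of (name, price) tuples found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.products


print(scrape_prices(SAMPLE_HTML))
```

A crawler, by contrast, would follow links and store whole pages; the parser above keeps only the two fields it was written for.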
2. Web scraping tools
- "curl" often doesn't work; website owners use various ways to stop bots
- Use a headless browser: a web browser without a GUI
- You need a lot of new IPs, which are easy to obtain through public clouds
- Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, and GCP 0.25%
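
To give a rough sense of scale for those percentages, the sketch below converts them into approximate address counts. It assumes the shares are measured against the full 32-bit IPv4 space (2^32 addresses, reserved ranges included), which the notes do not state explicitly, so treat the results as order-of-magnitude estimates. The names `CLOUD_SHARE` and `approx_addresses` are illustrative.

```python
TOTAL_IPV4 = 2 ** 32  # size of the 32-bit IPv4 address space (assumed denominator)

# Shares quoted in the notes, expressed as fractions of all IPv4 addresses.
CLOUD_SHARE = {"AWS": 0.0175, "Azure": 0.0055, "GCP": 0.0025}


def approx_addresses(share):
    """Approximate number of IPv4 addresses corresponding to a fractional share."""
    return round(TOTAL_IPV4 * share)


for provider, share in CLOUD_SHARE.items():
    millions = approx_addresses(share) / 1e6
    print(f"{provider}: roughly {millions:.0f} million addresses")
```

Under that assumption each provider holds on the order of tens of millions of addresses, which is why public clouds are a convenient source of fresh IPs.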