[Stanford Fall 2021: Practical Machine Learning] 1.3 Web Scraping

Bilibili: https://www.bilibili.com/video/BV1JM4y137kK/?spm_id_from=333.788.videocard.0
Course homepage: https://c.d2l.ai/stanford-cs329p/

1.3 Web Data Scraping

Web scraping
Web scraping tools
Case study - house price prediction
Legal considerations

Web scraping

The goal is to extract data from websites
- Noisy, weakly labeled, and can be spammy (useless)
- Available at scale
- E.g. price comparison/tracking websites
Many ML datasets are obtained by web scraping
- ImageNet
- Kinetics
Web crawling vs. scraping
- Crawling: indexing whole pages on the Internet
- Scraping: extracting particular data from the web pages of a website

Web scraping tools

curl often doesn't work
Website owners use various ways to stop bots
Use a headless browser: a web browser without a GUI
Selenium + webdriver

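from selenium import webdriver

# Placeholder address; replace with the page you want to scrape.
url = "https://example.com"

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")  # run Chrome without a GUI
chrome = webdriver.Chrome(options=chrome_options)

chrome.get(url)            # loads the page; get() itself returns None
page = chrome.page_source  # the rendered HTML as a string
chrome.quit()              # close the browser and free its memory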

Scraping at scale needs a lot of new IPs, which are easy to get through public clouds

Case study - house price prediction

Query houses sold near Stanford
Crawl individual pages
- Get the house IDs from the index pages (BeautifulSoup)
- Fetch the house detail page by ID
- Extract data
  Identify the HTML elements through the browser's Inspect tool
  Repeat the previous process to extract the other fields
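The index-page step above can be sketched as follows. This is a minimal illustration, not the course's code: it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages, and the HTML snippet, the `/house/<id>/` link pattern, and the `example.com` base URL are all hypothetical. A real site needs its own pattern, found through Inspect.

```python
from html.parser import HTMLParser

# Hypothetical index-page snippet; real listing sites use their own markup.
INDEX_HTML = """
<ul>
  <li><a href="/house/1001/">House A</a></li>
  <li><a href="/house/1002/">House B</a></li>
  <li><a href="/about/">About</a></li>
</ul>
"""


class HouseIdParser(HTMLParser):
    """Collect IDs from links shaped like /house/<digits>/ (assumed pattern)."""

    def __init__(self):
        super().__init__()
        self.house_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = href.strip("/").split("/")
        # Keep only links with exactly two path segments: "house" and a number.
        if len(parts) == 2 and parts[0] == "house" and parts[1].isdigit():
            self.house_ids.append(parts[1])


def extract_house_ids(html):
    """Return the house IDs found in one index page, in document order."""
    parser = HouseIdParser()
    parser.feed(html)
    return parser.house_ids


def detail_url(house_id, base="https://example.com"):
    """Build the detail-page URL for one house ID (hypothetical URL scheme)."""
    return f"{base}/house/{house_id}/"


house_ids = extract_house_ids(INDEX_HTML)
detail_urls = [detail_url(h) for h in house_ids]
print(detail_urls)
```

Each URL in `detail_urls` would then be loaded with the headless browser, and the same parse-and-select approach repeated for every field of interest.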
Cost
- Use AWS EC2 t3.small (2GB memory, 2 vCPUs)
- 2GB is necessary as the browser needs a lot of memory
- CPU and bandwidth are usually not an issue
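A back-of-the-envelope estimate makes the cost concrete. The sketch below is plain arithmetic; the page count, seconds per page, hourly price, and instance count are illustrative assumptions, not figures from these notes, so substitute current pricing and your own measured speed.

```python
def crawl_cost(num_pages, seconds_per_page, usd_per_hour, num_instances=1):
    """Return (wall-clock hours, total USD) for crawling num_pages pages.

    Assumes each instance crawls one page at a time and that the work
    splits evenly across instances.
    """
    instance_hours = num_pages * seconds_per_page / 3600
    wall_clock_hours = instance_hours / num_instances
    total_usd = instance_hours * usd_per_hour
    return wall_clock_hours, total_usd


# Illustrative assumptions: 1M pages, 3 s per page, $0.02 per instance-hour,
# spread over 10 instances.
hours, cost = crawl_cost(1_000_000, 3.0, 0.02, num_instances=10)
print(f"{hours:.1f} h wall clock, ${cost:.2f} total")
```

Adding instances shortens the wall-clock time but leaves the total cost unchanged, since the number of instance-hours is the same.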
Crawl images
- Get all image URLs
- The crawling cost is still reasonable
- Storing these images is expensive
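A minimal sketch of the image step, using only the standard library: collect the `src` of every `<img>` tag on a detail page, resolve it against the page URL, and estimate the storage needed. The HTML snippet, the URLs, and the image count and average size in the estimate are hypothetical values chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageUrlParser(HTMLParser):
    """Collect absolute URLs for every <img> tag that has a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page's own URL.
                self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return all image URLs found in one page, in document order."""
    parser = ImageUrlParser(base_url)
    parser.feed(html)
    return parser.image_urls


def storage_gb(num_images, avg_kb_per_image):
    """Estimate storage in decimal GB (1 GB = 1e6 KB)."""
    return num_images * avg_kb_per_image / 1e6


# Hypothetical detail-page snippet and page URL.
DETAIL_HTML = (
    '<div><img src="/photos/1.jpg">'
    '<img src="https://cdn.example.com/2.jpg">'
    '<img alt="no src"></div>'
)
urls = extract_image_urls(DETAIL_HTML, "https://example.com/house/1001/")
print(urls)

# Assumed workload: 20 million images at about 200 KB each.
print(storage_gb(20_000_000, 200), "GB")
```

Under these assumed numbers the images alone take about 4 TB, which illustrates why storage rather than crawling tends to dominate the cost.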