Bilibili video: https://www.bilibili.com/video/BV1JM4y137kK/?spm_id_from=333.788.videocard.0
Course homepage: https://c.d2l.ai/stanford-cs329p/
1.3 Web scraping
- Web scraping
- Web scraping tools
- Case study - house price prediction
- Legal considerations
Web scraping
- The goal is to extract data from websites
  - Noisy, weak labels, can be spammy
  - Available at scale
  - E.g. price comparison / price tracking websites
- Many ML datasets are obtained by web scraping
  - ImageNet
  - Kinetics
- Web crawling vs. scraping
  - Crawling: indexing whole pages on the Internet
  - Scraping: extracting particular data from the web pages of a website
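
To make the crawling/scraping distinction concrete, here is a minimal sketch that runs offline. The three-page `SITE` dictionary, the `price` class name, and the helper names are all assumptions made up for illustration; a real site would be fetched over the network and would have different markup.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page site held in memory so the sketch runs offline.
SITE = {
    "/": '<a href="/item/1">Item 1</a> <a href="/item/2">Item 2</a>',
    "/item/1": '<span class="price">$10</span> <a href="/">Home</a>',
    "/item/2": '<span class="price">$25</span> <a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceParser(HTMLParser):
    """Collect the text inside <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def crawl(start):
    """Crawling: follow links breadth-first and index every reachable page."""
    seen, queue = {start}, deque([start])
    while queue:
        path = queue.popleft()
        parser = LinkParser()
        parser.feed(SITE[path])
        for link in parser.links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)


def scrape_price(path):
    """Scraping: pull one particular field out of one page."""
    parser = PriceParser()
    parser.feed(SITE[path])
    return parser.prices[0] if parser.prices else None


pages = crawl("/")                           # every page reachable from "/"
prices = {p: scrape_price(p) for p in pages}  # one field per page, or None
```

Here `crawl` only cares about which pages exist, while `scrape_price` only cares about one field on a given page. The home page has no price element, so its entry in `prices` is `None`.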
Web scraping tools
- Plain curl often doesn't work: website owners use various ways to stop bots
- Use a headless browser: a web browser without a GUI
  - E.g. Selenium + WebDriver
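from selenium import webdriver

url = "https://example.com"  # placeholder: the page to scrape
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")  # run Chrome without a GUI
chrome = webdriver.Chrome(options=chrome_options)  # Selenium 4 keyword
chrome.get(url)  # get() returns None; it only loads the page
page = chrome.page_source  # rendered HTML as a string
chrome.quit()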
- You need a lot of new IPs, which are easy to get through public clouds
Case study - house price prediction
- Query houses sold near Stanford
- Crawl individual pages
  - Get the house IDs from the index pages (BeautifulSoup)
  - Fetch the house detail page by ID
  - Extract data
    - Identify the HTML elements through the browser's Inspect tool
    - Repeat the previous process to extract the other fields
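
The index-page-to-detail-page steps above can be sketched as follows. To keep the example runnable without extra installs, it uses the standard library's `html.parser` in place of BeautifulSoup. The markup, the `/homedetails/` URL pattern, the class names `sold-price` and `beds`, and all addresses and prices are made-up assumptions; on a real site you would find the actual element names through Inspect.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a real index page and detail pages.
INDEX_HTML = """
<ul>
  <li><a href="/homedetails/1001">12 Oak St</a></li>
  <li><a href="/homedetails/1002">34 Elm Ave</a></li>
  <li><a href="/about">About us</a></li>
</ul>
"""
DETAIL_HTML = {
    "1001": '<span class="sold-price">$2,000,000</span><span class="beds">3</span>',
    "1002": '<span class="sold-price">$3,500,000</span><span class="beds">4</span>',
}
ID_PREFIX = "/homedetails/"       # assumed URL pattern of detail pages
FIELDS = ("sold-price", "beds")   # assumed class names, as found via Inspect


class IndexParser(HTMLParser):
    """Collect house IDs from links that point at detail pages."""

    def __init__(self):
        super().__init__()
        self.house_ids = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith(ID_PREFIX):
            self.house_ids.append(href[len(ID_PREFIX):])


class DetailParser(HTMLParser):
    """Record the text of each element whose class is listed in FIELDS."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in FIELDS:
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.record[self.current] = data.strip()
            self.current = None


def parse_index(html):
    parser = IndexParser()
    parser.feed(html)
    return parser.house_ids


def parse_detail(html):
    parser = DetailParser()
    parser.feed(html)
    return parser.record


house_ids = parse_index(INDEX_HTML)
# In a real crawl, each detail page would be fetched by ID with the
# headless browser; here the HTML is looked up in the in-memory dict.
records = {hid: parse_detail(DETAIL_HTML[hid]) for hid in house_ids}
```

Adding another field is a matter of inspecting the page, finding the element that holds it, and appending its class name to `FIELDS`, which is the "repeat the previous process" step.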
- Cost
  - Use AWS EC2 t3.small (2 GB memory, 2 vCPUs)
  - 2 GB is necessary because the browser needs a lot of memory
  - CPU and bandwidth are usually not an issue
- Crawl images
  - Get all image URLs
  - The crawling cost is still reasonable
  - Storing these images is expensive
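
A back-of-the-envelope calculation helps when comparing crawling cost with storage cost. The sketch below is only arithmetic. Every number passed in (page count, seconds per page, instance count, hourly price, image count, image size, storage price) is an illustrative assumption, not a figure from the notes or a quoted cloud price.

```python
def crawl_cost(num_pages, seconds_per_page, num_instances, price_per_hour):
    """Return (wall-clock hours, total dollars) for a parallel crawl."""
    instance_hours = num_pages * seconds_per_page / 3600
    wall_clock_hours = instance_hours / num_instances
    dollars = instance_hours * price_per_hour
    return wall_clock_hours, dollars


def storage_cost_per_month(num_images, mb_per_image, price_per_gb_month):
    """Return monthly dollars to keep the images, using 1 GB = 1000 MB."""
    total_gb = num_images * mb_per_image / 1000
    return total_gb * price_per_gb_month


# All inputs below are made-up assumptions for illustration only.
hours, crawl_dollars = crawl_cost(
    num_pages=1_000_000,
    seconds_per_page=3,
    num_instances=100,
    price_per_hour=0.02,
)
storage_dollars = storage_cost_per_month(
    num_images=20_000_000,
    mb_per_image=0.5,
    price_per_gb_month=0.023,
)
```

With these assumed inputs the crawl takes about 8.3 hours of wall-clock time and costs about $16.67 once, while storage comes to about $230 every month. The crawl is paid for one time, whereas storage is billed for as long as the images are kept.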
Legal considerations
- Do NOT scrape data that contains sensitive information
- Do NOT scrape copyrighted data
- Follow Terms of Service that explicitly prohibit web scraping
- Consult a lawyer if you are doing it for profit