[Stanford Fall 2021: Practical Machine Learning] 1.3 Web Scraping

在觉 2022-01-10

Bilibili: https://www.bilibili.com/video/BV1JM4y137kK/?spm_id_from=333.788.videocard.0
Course homepage: https://c.d2l.ai/stanford-cs329p/


1.3 Web Scraping


Web scraping

  • The goal is to extract data from websites

    • Noisy, weak labels, can be spammy (useless)
    • Available at scale
    • E.g. price comparison/tracking websites
  • Many ML datasets are obtained by web scraping

    • ImageNet
    • Kinetics
  • Web crawling vs. scraping

    • Crawling: indexing whole pages on the Internet
    • Scraping: extracting particular data from the web pages of a website

Web scraping tools

  • curl often doesn't work
    Website owners use various ways to stop bots
  • Use a headless browser: a web browser without a GUI
    Selenium + webdriver
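from selenium import webdriver

url = "https://example.com"  # placeholder: the page to scrape

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")  # run Chrome without a GUI
chrome = webdriver.Chrome(options=chrome_options)

chrome.get(url)            # get() only loads the page and returns None
page = chrome.page_source  # the rendered HTML as a string
chrome.quit()              # close the browser and free its memory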
  • You need a lot of new IPs, which are easy to get through public clouds

Case study: house price prediction

  • Query houses sold near Stanford

  • Crawl individual pages

    • Get the house IDs from the index pages (BeautifulSoup)
    • Fetch the house detail page by ID
    • Extract data
      Identify the HTML elements through the browser's Inspect tool
      Repeat the previous process to extract the other fields
  • Cost

    • Use AWS EC2 t3.small (2GB memory, 2 vCPUs)
    • 2GB is necessary as the browser needs a lot of memory
    • CPU and bandwidth are usually not an issue
  • Crawl images

    • Get all image URLs
    • The crawling cost is still reasonable
    • Storing these images is expensive
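
The extraction steps in this case study can be sketched roughly as below. This is only a minimal illustration, not the course's code: the HTML snippets, the `/homedetails/<id>` link pattern, and the element ids `sold-price` and `beds` are all made up, and a real site will use different markup that you would find through Inspect. The notes use BeautifulSoup for this step; the sketch swaps in Python's standard `html.parser` so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical index-page markup; real sites use different tags and attributes.
INDEX_HTML = """
<ul>
  <li><a class="house" href="/homedetails/1001">123 Main St</a></li>
  <li><a class="house" href="/homedetails/1002">456 Oak Ave</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

# Hypothetical detail-page markup for one house.
DETAIL_HTML = """
<div>
  <span id="sold-price">$1,250,000</span>
  <span id="beds">3</span>
  <img src="/photos/1001/a.jpg">
  <img src="/photos/1001/b.jpg">
</div>
"""


class PageParser(HTMLParser):
    """Collects house IDs, text of elements that have an id, and image URLs."""

    def __init__(self):
        super().__init__()
        self.house_ids = []    # IDs taken from /homedetails/<id> links
        self.fields = {}       # element id -> text content
        self.image_urls = []   # src of every <img>
        self._current_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("/homedetails/"):
            self.house_ids.append(href.rsplit("/", 1)[-1])
        if tag == "img" and attrs.get("src"):
            self.image_urls.append(attrs["src"])
        # Remember the id (if any) so the following text can be stored under it.
        self._current_id = attrs.get("id")

    def handle_data(self, data):
        if self._current_id and data.strip():
            self.fields[self._current_id] = data.strip()

    def handle_endtag(self, tag):
        self._current_id = None


def parse(html):
    """Run the parser over one page and return it with the collected data."""
    parser = PageParser()
    parser.feed(html)
    return parser


index = parse(INDEX_HTML)
detail = parse(DETAIL_HTML)
print(index.house_ids)     # house IDs found on the index page
print(detail.fields)       # raw text of each identified field
print(detail.image_urls)   # image URLs to download (or just store) later
```

In practice the HTML strings would come from the headless browser's page source, and text fields such as the price would still need cleaning (for example stripping "$" and ",") before being used as features or labels.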

Legal considerations

  • Do NOT scrape data that contains sensitive information
  • Do NOT scrape copyrighted data
  • Follow Terms of Service that explicitly prohibit web scraping
  • Consult a lawyer if you are doing it for profit