Scraping data from the web: web scraping in R - CFANZ programming community

Tool: rvest

Core rvest functions used below: read_html(), html_nodes() and html_text().

URL: https://nethouseprices.com/house-prices/Lanarkshire/GLASGOW
Install the packages

install.packages("robotstxt")
install.packages("tidyverse")
install.packages("rvest")
install.packages("stringr")

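library(robotstxt)
library(tidyverse)  ## attaches stringr, tibble and the %>% pipe
library(rvest)
## Check that robots.txt allows scraping the target page
paths_allowed("https://nethouseprices.com/house-prices/Lanarkshire/GLASGOW")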

page <- read_html("https://nethouseprices.com/house-prices/Lanarkshire/GLASGOW")
page
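
The selector steps that follow can be tried offline first. This is a minimal sketch, assuming an invented HTML fragment: the addresses and prices are made up, and only the selectors "strong a" and ".street-details-price-row" come from this post. It shows how html_nodes() picks elements by CSS selector and how html_text() turns them into a character vector.

```r
library(rvest)

## Hypothetical HTML fragment, NOT copied from the real page
html <- paste0(
  "<div><strong><a href='/a'>1 Example Street</a></strong>",
  "<span class='street-details-price-row'>100,000</span></div>",
  "<div><strong><a href='/b'>2 Sample Road</a></strong>",
  "<span class='street-details-price-row'>85,500</span></div>"
)
doc <- read_html(html)

## Select the links inside <strong> and keep only their text
addresses <- doc %>%
  html_nodes("strong a") %>%
  html_text()

## Select elements by class name and keep only their text
prices <- doc %>%
  html_nodes(".street-details-price-row") %>%
  html_text()

addresses
prices
```

Without the html_text() step the result is a set of HTML nodes rather than plain strings.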

## Type and class of the parsed page
typeof(page)

class(page)

name <- page %>%
  html_nodes("strong a") %>%
  html_text()  ## extract the text so that name is a character vector

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()  ## extract the text and store it in titles
titles

type <- page %>%
  html_nodes(".street-details-row") %>%
  html_text()  ## extract the text and store it in type
type

price <- page %>%
  html_nodes(".street-details-price-row") %>%
  html_text()  ## extract the text and store it in price
price

Commas can normally be removed with str_remove(), but the pound sign here is harder to get rid of: neither the str_remove() family nor gsub() removes it.

x = c("£3","4£")
gsub("£","",x)

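Comparing the two shows that the pound sign on the page and the one typed in the pattern are different characters.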
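
One way to confirm such a mismatch is to compare Unicode code points. This is a minimal sketch, assuming the two look-alike symbols are the ordinary pound sign (U+00A3) and the fullwidth pound sign (U+FFE1). That pairing is an assumption chosen for illustration, because the post does not say which characters the page actually used.

```r
## Two look-alike currency symbols (assumed for illustration)
signs <- c("\u00a3", "\uffe1")

## utf8ToInt() returns the Unicode code point of a single character
code_points <- vapply(signs, utf8ToInt, integer(1))
unname(code_points)
## → 163 65505

## A pattern written with one symbol does not match the other
gsub("\u00a3", "", "\uffe13") == "\uffe13"
## → TRUE
```

Running utf8ToInt() on the first character of a scraped price shows which symbol the page really contains.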

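I did not find a quicker fix, so I simply brute-force it with a loop, using str_sub() to drop the first character of every string (thankfully the data here is very regularly structured).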

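## Drop the first character (the pound sign) of every price
for (i in seq_along(price)) {
  price[i] <- str_sub(price[i], 2)
}
## Remove the thousands separators so that as.numeric() can parse the values
price <- str_remove_all(price, ",")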
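
A loop-free alternative is to keep only the digits and the decimal point, which does not depend on which currency symbol the page uses or where it appears. This is a minimal sketch in base R. The vector sample_price is hypothetical and stands in for the scraped price strings, which may be formatted differently.

```r
## Hypothetical sample values standing in for the scraped prices;
## the first uses U+00A3, the second the fullwidth U+FFE1
sample_price <- c("\u00a3125,000", "\uffe189,950")

## Delete every character that is not a digit or a decimal point,
## then convert the remaining text to numbers
clean_price <- as.numeric(gsub("[^0-9.]", "", sample_price))
clean_price
## → 125000  89950
```

Because gsub() is vectorised, the same call works on the whole price vector at once and also strips the commas.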

Combine everything into a tibble

glasgow_house_price3 <- tibble(
  address = name,
  prices = as.numeric(price),
  types = type
)
