Data Collection Techniques: Installing the lxml Module in Python and Using XPath - CFANZ Programming Community

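To use XPath to extract data (such as the title and body text) from web pages fetched by a crawler, install the libxml2 module on Windows. What you actually import after installation is the lxml module, a Python binding built on libxml2 that includes XPath support.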

Preparation

Required packages:

Python 2.7
lxml-2.3.4.win32-py2.7.exe: it is best to install from this prebuilt exe, which installs lxml automatically so that it is ready to use

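Installation

Installing Python 2.7 is not covered here.

To install lxml, run the exe directly. The installer locates the Python 2.7 directory automatically and installs into it.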
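After the installer finishes, a quick way to check that lxml can be imported is to print its version tuples. The following is a minimal sketch, not part of the original article. The helper name `lxml_versions` is made up for illustration, while `etree.LXML_VERSION` and `etree.LIBXML_VERSION` are attributes that lxml provides.

```python
from __future__ import print_function

from lxml import etree


def lxml_versions():
    """Return the lxml version and the libxml2 version it is running against."""
    return {
        "lxml": etree.LXML_VERSION,       # tuple of ints
        "libxml2": etree.LIBXML_VERSION,  # tuple of ints
    }


versions = lxml_versions()
for name in sorted(versions):
    # Join the integer parts into a dotted version string for display.
    print(name, ".".join(str(part) for part in versions[name]))
```

If the import fails with an ImportError, the installer most likely targeted a different Python installation than the one running the script.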

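Extracting with XPath

The following example verifies the setup. The program comes from this article on redice's Blog:

Installing the libxml2 library and using XPath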

# coding: utf-8

import codecs
import sys
# Without the following line, Unicode characters cannot be printed
# and a UnicodeEncodeError is raised (Python 2).
sys.stdout = codecs.getwriter('iso8859-1')(sys.stdout)
 
from lxml import etree

html = r'''<div>
    <div>redice</div>
    <div id="email">redice@163.com</div>
    <div name="address">中国</div>
    <div>http://www.redicecn.com</div>
</div>'''
 
tree = etree.HTML(html)

# Get the email. The div containing it has id="email".
nodes = tree.xpath("//div[@id='email']")
print nodes[0].text

# Get the address. The div containing it has name="address".
nodes = tree.xpath("//div[@name='address']")
print nodes[0].text

# Get the blog URL. It is in the second sibling div after the email div.
nodes = tree.xpath("//div[@id='email']/following-sibling::div[2]")
print nodes[0].text
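
The script above indexes `nodes[0]` directly, which raises an IndexError when an expression matches nothing. The following is a minimal sketch of a more defensive variant, not taken from the original article. The helper `first_text` and the `//div[@id='phone']` query are hypothetical, added only for illustration. It also passes a Unicode string to `etree.HTML` and should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

from lxml import etree

# Same sample markup as above, given as a Unicode string.
html = u'''<div>
    <div>redice</div>
    <div id="email">redice@163.com</div>
    <div name="address">中国</div>
    <div>http://www.redicecn.com</div>
</div>'''

tree = etree.HTML(html)


def first_text(root, expression, default=None):
    """Return the text of the first node matched by an XPath expression.

    Hypothetical helper: returns `default` when nothing matches,
    instead of raising IndexError.
    """
    nodes = root.xpath(expression)
    if not nodes:
        return default
    return nodes[0].text


email = first_text(tree, "//div[@id='email']")
address = first_text(tree, "//div[@name='address']")
blog = first_text(tree, "//div[@id='email']/following-sibling::div[2]")
phone = first_text(tree, "//div[@id='phone']", default="")  # no such div

print(email)
print(blog)
```

Printing `address` is left out on purpose, because whether a console can display non-ASCII text depends on its encoding, which is the issue the `sys.stdout` wrapper in the original script works around.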