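(Learning Web Scraping Step by Step (3): Parsing Web Pages with pyquery)

3.3 Learning Web Scraping Step by Step (3): Parsing Web Pages with pyquery

I was not going to copy this part out, but after seeing how many important features the library has, I decided to write it down here for my own reference; the book is too thick and really is not as convenient as an app. The BeautifulSoup approach from the previous post is awkward in several ways, so this post covers the more powerful features of pyquery, especially its CSS selectors.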

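3.3.1 Preparation

As before, install it with pip3: pip3 install pyquery

3.3.2 Initialization

To parse HTML text with the pyquery library, the page first has to be initialized as a PyQuery object.

Initializing from a string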

# -*- coding: UTF-8 -*-
html = '''
<html>
  <body>
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </body>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

The code above passes the CSS selector li to the object, which selects every li node.

<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>

Initializing from a URL

from pyquery import PyQuery as pq
doc = pq(url='https://cuiqingcai.com')
print(doc('title'))

Output:

<title>静觅丨崔庆才的个人站点 - Python爬虫教程</title>

Initializing from a file

Besides the two cases above, you can also pass the name of a local file to initialize the object.

doc = pq(filename='demo.html')
print(doc('li'))

3.3.3 Basic CSS selectors

html = '''
<html>
  <body>
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </body>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))

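This code uses a CSS selector to pick the node whose id is container, then the node with class list inside it, and finally all li nodes inside that.
Building on the code above, you can go on to call the text method to get the content of each node.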
```
for item in doc('#container .list li').items():
    print(item.text())
```
This gives the following result.
```
first item
second item
third item
fourth item
fifth item
```
- Clearly, this takes even less effort than regular expressions.

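3.3.4 Finding nodes

(1) Child nodes

Continuing with the HTML above, call the find method with a CSS selector as its argument to look up nodes inside the current selection.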

from pyquery import PyQuery as pq
doc = pq(html)

items = doc('.list')
print(type(items))
print(items)

lis = items.find('li')
print(type(lis))
print(lis)

The output is as follows:

<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>

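The find method above selects every matching node; it searches all descendants, not just direct children, and the result is a PyQuery object.

To look up direct children only, use the children method: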

items = doc('.list')
lis = items.children()
print(type(lis))
print(lis)

(2) Parent nodes

To get the parent node, simply replace children with parent.

items = doc('.list')
lis = items.parent()
print(type(lis))
print(lis)

The result is the enclosing div.

<class 'pyquery.pyquery.PyQuery'>
<div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>

To get ancestor nodes, use the parents method.

items = doc('.list')
lis = items.parents()
print(type(lis))
print(lis)

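This method selects the parent and every other ancestor. To narrow the result down to one particular ancestor, pass a CSS selector as well. For example: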

# -*- coding: UTF-8 -*-
html = '''
<html>
  <div class="wrap">
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </div>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)

items = doc('.list')
lis = items.parents('.wrap')
print(type(lis))
print(lis)

This way the other ancestor nodes no longer appear in the result.

(3) Sibling nodes

Using the same example, sibling nodes are retrieved with the siblings method.

# -*- coding: UTF-8 -*-
html = '''
<html>
  <div class="wrap">
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </div>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)

li = doc('.list .item-0.active')
lis = li.siblings()
print(type(lis))
print(lis)

As expected, every sibling except the third node itself is selected.

<class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
       <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
       <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
       <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>

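To pick out one particular sibling, pass a CSS selector to the method, just as when narrowing down the ancestors.
```
print(type(li.siblings('.active')))
print(li.siblings('.active'))
```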

The result is as follows:

<class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>

3.3.5 Traversing nodes

Continuing with the example above:

from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(lis)
for li in lis:
    print(li, type(li))

The items method returns a generator; iterating over it yields each node as a PyQuery object.

<generator object PyQuery.items at 0x0000021F4CB48190>
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
         <class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
         <class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
         <class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
         <class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
       <class 'pyquery.pyquery.PyQuery'>

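(1) Getting information

Scraping a page mostly comes down to extracting attributes, text, and similar information.

(2) Getting attributes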

```python
# -*- coding: UTF-8 -*-
html = '''
<html>
  <div class="wrap">
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </div>
</html>
'''
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.item-0.active a')
print(a, type(a))
print(a.attr('href'))
print(a.attr.href)
```

The output is as follows:

<a rel="nofollow" href="link3.html">third item</a> <class 'pyquery.pyquery.PyQuery'>
link3.html
link3.html

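Both forms of the attr call above return the attribute we want. However, when the selection contains several nodes, attr returns the attribute of the first node only, so you have to iterate over them. The code is as follows: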
```
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('a')
for item in a.items():
    print(item.attr.href)
```

Result:

link1.html
link2.html
link3.html
link4.html
link5.html

(3) Getting text

There are two methods: text() and html().

from pyquery import PyQuery as pq

doc = pq(html)
third = doc('.item-0.active')
print(third)
print(third.text())
print(third.html())
all_items = doc('.list')  # renamed from all to avoid shadowing the built-in
print(all_items.text())

This produces the following four results.

<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        
third item
<a rel="nofollow" href="link3.html">third item</a>
first item
second item
third item
fourth item
fifth item

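The first result is the complete li node.
The second result is the text content inside the li node.
The third result is the HTML content inside the li node.
The fourth result is all of the text under the list node.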

3.3.6 Manipulating nodes

(1) addClass and removeClass

Add addClass and removeClass calls to the code above to see their effect.

doc = pq(html)
third = doc('.item-0.active')
print(third)
third.removeClass('active')
print(third)
third.addClass('active')
print(third)

The result speaks for itself.

<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        
<li class="item-0"><a rel="nofollow" href="link3.html">third item</a></li>
        
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>

(2) attr, text, and html

```python
# -*- coding: UTF-8 -*-
html = '''
<html>
  <div class="wrap">
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </div>
</html>
'''
from pyquery import PyQuery as pq

doc = pq(html)
third = doc('.item-0.active')
print(third)
third.attr('name','link')
print(third)
third.text('changed item')
print(third)
third.html('aaaaaaaaaa没有third了')
print(third)
```

Output:

<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        
<li class="item-0 active" name="link"><a rel="nofollow" href="link3.html">third item</a></li>
        
<li class="item-0 active" name="link">changed item</li>
        
<li class="item-0 active" name="link">aaaaaaaaaa没有third了</li>

(3) remove

Here is the code:

html = '''

  <div class="wrap">
    Hello,World
    <p>This is a paragraph.</p>
  </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())

The result is as follows:
```
Hello,World
This is a paragraph.
```
- Suppose we only want Hello,World. What can we do?
- We can use the remove method.
```
wrap.find('p').remove()
print(wrap.text())
```
- That solves the problem right away.

3.3.7 Pseudo-class selectors

# -*- coding: UTF-8 -*-
html = '''
<html>
  <div class="wrap">
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </div>
</html>
'''
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
li = doc('li:contains(second)')
print(li)