Learning Web Scraping Step by Step (3): Parsing Web Pages with pyquery

3.3 Learning Web Scraping Step by Step (3): Parsing Web Pages with pyquery

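I originally didn't plan to copy this material out again, but after seeing how many important features this library has, I'm writing it down here for my own reference; the book is too thick and really isn't as convenient as an app. The BeautifulSoup approach from the previous post is awkward in several ways, so here we move on to pyquery's more powerful features, especially its CSS selectors.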

3.3.1 Preparation

  • Install it as usual: pip3 install pyquery

3.3.2 Initialization

  • Before pyquery can parse HTML text, the page has to be initialized as a PyQuery object.

Initializing from a string

# -*- coding: UTF-8 -*-
html = '''
<html>
  <body>
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </body>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))
  • The code above uses a CSS selector: passing li selects all the li nodes.
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>

Initializing from a URL

from pyquery import PyQuery as pq
doc = pq(url='https://cuiqingcai.com')
print(doc('title'))
  • Output:
<title>静觅丨崔庆才的个人站点 - Python爬虫教程</title>

Initializing from a file

  • Besides the two cases above, you can also pass a local file name to initialize the object.
doc = pq(filename='demo.html')
print(doc('li'))

3.3.3 Basic CSS Selectors

html = '''
<html>
  <body>
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </body>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))
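  • This code uses a CSS selector to pick the node whose id is container, then the node with class list inside it, and finally all li nodes inside that.

  • Building on the code above, you can go on to call the text method to get the text content.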

    for item in doc('#container .list li').items():
        print(item.text())
    
  • This gives the following result.

    first item
    second item
    third item
    fourth item
    fifth item
    
    • Clearly, this is far less work than using regular expressions.

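3.3.4 Finding Nodes

(1) Child nodes

  • Continuing with the HTML above, call the find method with a CSS selector as its argument to look up nodes beneath the current selection.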

    from pyquery import PyQuery as pq
    doc = pq(html)
    
    items = doc('.list')
    print(type(items))
    print(items)
    
    lis = items.find('li')
    print(type(lis))
    print(lis)
    
  • The output is as follows:

    <class 'pyquery.pyquery.PyQuery'>
    <ul class="list">
            <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
            <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
            <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
            <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
            <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
          </ul>
        
    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
            <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
            <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
            <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
            <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
    
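  • The find method selects every node that matches; it searches all descendants, not just direct children, and the result is a PyQuery object.

  • To look up direct children only, use the children method: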

    items = doc('.list')
    lis = items.children()
    print(type(lis))
    print(lis)
    

(2) Parent nodes

  • To get the parent node, simply replace children with parent.

    items = doc('.list')
    lis = items.parent()
    print(type(lis))
    print(lis)
    
  • The result is the content of the enclosing div.

    <class 'pyquery.pyquery.PyQuery'>
    <div id="container">
          <ul class="list">
            <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
            <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
            <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
            <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
            <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
          </ul>
        </div>
    
  • To get ancestor nodes, use the parents method.

    items = doc('.list')
    lis = items.parents()
    print(type(lis))
    print(lis)
    
  • This method selects the parent and every ancestor node. To target one particular ancestor, pass it a CSS selector. For example:

    # -*- coding: UTF-8 -*-
    html = '''
    <html>
      <div class="wrap">
        <div id="container">
          <ul class="list">
            <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
            <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
            <li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
            <li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
            <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
          </ul>
        </div>
      </div>
    </html>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    
    items = doc('.list')
    lis = items.parents('.wrap')
    print(type(lis))
    print(lis)
    
    • This way the other ancestor nodes no longer show up.

(3) Sibling nodes

  • Using a similar example again, sibling nodes are obtained with the siblings method.

    # -*- coding: UTF-8 -*-
    html = '''
    <html>
      <div class="wrap">
        <div id="container">
          <ul class="list">
            <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
            <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
            <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
            <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
            <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
          </ul>
        </div>
      </div>
    </html>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    
    li = doc('.list .item-0.active')
    lis = li.siblings()
    print(type(lis))
    print(lis)
    
  • Clearly, every sibling except the third li itself has been selected.

    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
           <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
           <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
           <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
    
  • To pick out a particular one, use a CSS selector, just as when targeting an ancestor node.

    print(type(lis('.active')))
    print(lis('.active'))
    
  • The result:

    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
    

3.3.5 Traversing Nodes

  • Continuing with the example above:

    from pyquery import PyQuery as pq
    doc = pq(html)
    lis = doc('li').items()
    print(lis)
    for li in lis:
        print(li, type(li))
    
  • items() returns a generator; iterating over it yields each node as a PyQuery object.

    <generator object PyQuery.items at 0x0000021F4CB48190>
    <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
             <class 'pyquery.pyquery.PyQuery'>
    <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
             <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
             <class 'pyquery.pyquery.PyQuery'>
    <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
             <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
           <class 'pyquery.pyquery.PyQuery'>
    

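(1) Getting information

  • Scraping a page mostly comes down to extracting attributes, text, and similar information.

(2) Getting attributes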

```python
# -*- coding: UTF-8 -*-
html = '''
<html>
  <div class="wrap">
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </div>
</html>
'''
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.item-0.active a')
print(a, type(a))
print(a.attr('href'))
print(a.attr.href)
```
  • The output is as follows:

    <a rel="nofollow" href="link3.html">third item</a> <class 'pyquery.pyquery.PyQuery'>
    link3.html
    link3.html
    
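  • Both forms of the attr call above return the attribute we want. When the selection contains several nodes, however, attr only returns the attribute of the first one, so you have to iterate. The code is as follows: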

    from pyquery import PyQuery as pq
    
    doc = pq(html)
    a = doc('a')
    for item in a.items():
        print(item.attr.href)
    
  • Result:

    link1.html
    link2.html
    link3.html
    link4.html
    link5.html
    

(3) Getting text

  • There are two methods: text() and html().

    from pyquery import PyQuery as pq
    
    doc = pq(html)
    third = doc('.item-0.active')
    print(third)
    print(third.text())
    print(third.html())
    all = doc('.list')
    print(all.text())
    
  • This produces the four different results below.

    <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
            
    third item
    <a rel="nofollow" href="link3.html">third item</a>
    first item
    second item
    third item
    fourth item
    fifth item
    
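  • The first result is the whole of the current li node.

  • The second result is the text content of the current li node.

  • The third result is the HTML content inside the current li node.

  • The fourth result is all the text content under the list node.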

3.3.6 Node Operations

(1) addClass and removeClass

  • Add addClass and removeClass calls to the code above and look at the effect.

    doc = pq(html)
    third = doc('.item-0.active')
    print(third)
    third.removeClass('active')
    print(third)
    third.addClass('active')
    print(third)
    
  • The result speaks for itself.

    <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
            
    <li class="item-0"><a rel="nofollow" href="link3.html">third item</a></li>
            
    <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
    

(2) attr, text, and html

```python
# -*- coding: UTF-8 -*-
html = '''
<html>
  <div class="wrap">
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </div>
</html>
'''
from pyquery import PyQuery as pq

doc = pq(html)
third = doc('.item-0.active')
print(third)
third.attr('name','link')
print(third)
third.text('changed item')
print(third)
third.html('aaaaaaaaaa, no more third')
print(third)
```
  • Output:

    <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
            
    <li class="item-0 active" name="link"><a rel="nofollow" href="link3.html">third item</a></li>
            
    <li class="item-0 active" name="link">changed item</li>
            
    <li class="item-0 active" name="link">aaaaaaaaaa, no more third</li>
    

(3) remove

  • Here is the code:

    html = '''
    
      <div class="wrap">
        Hello,World
        <p>This is a paragraph.</p>
      </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    
  • The result is as follows:

    Hello,World
    This is a paragraph.
    
    • What if we only want Hello,World?
    • We can use the remove method:
    wrap.find('p').remove()
    print(wrap.text())
    
    • That solves it right away.

3.3.7 Pseudo-class Selectors

# -*- coding: UTF-8 -*-
html = '''
<html>
  <div class="wrap">
    <div id="container">
      <ul class="list">
        <li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
        <li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
        <li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
        <li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
        <li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
      </ul>
    </div>
  </div>
</html>
'''
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
li = doc('li:contains(second)')
print(li)
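  • The code above selects, in order: the first li node, the last li node, the second li node, the li nodes after the third one (index greater than 2), the li nodes at even positions, and the li nodes containing the text second.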

3.3.8 Summary

  • pyquery has many more powerful features; see the official documentation for details.
  • http://pyquery.readthedocs.io