pip3 install lxml
pip3 install beautifulsoup4
Import
from bs4 import BeautifulSoup
Instantiate
- From a local file
fp=open('./test.html','r',encoding='utf-8')
html=BeautifulSoup(fp,'lxml')
- From a network response (page_text is the response text obtained with requests)
html=BeautifulSoup(page_text,'lxml')
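A minimal sketch of both ways to build the object. The sample markup and the temporary file are made up for illustration, and the built-in "html.parser" is used here only so the sketch runs without lxml; swap in 'lxml' to match the notes.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a real page.
sample = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.html"
    path.write_text(sample, encoding="utf-8")
    # Local file: pass the open file handle; "with" closes it afterwards.
    with open(path, "r", encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

# Network case: pass the text of the response (here the same sample string).
soup_from_text = BeautifulSoup(sample, "html.parser")

file_title = soup_from_file.title.string
text_title = soup_from_text.title.string
print(file_title, text_title)
```

Opening the file in a with block avoids leaving the handle open, which the bare open() call does.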
Example (parse it first: html=BeautifulSoup(html,'lxml'))
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
1. Selecting nodes
1.1 Get a node
result=html.head
# Output
<head><title>The Dormouse's story</title></head>
1.2 When several nodes match, only the first one is returned
result=html.p
# Output
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
1.3 Nested selection
result=html.head.title
# Output
<title>The Dormouse's story</title>
2. Extracting information
2.1 Get the node name
result=html.title.name
# Output
title
2.2 Get a node attribute
result=html.p['name']
# Output
dromouse
2.3 Get the node content
result=html.title.string
# Output
The Dormouse's story
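The attribute accessors above can be sketched together as follows. The one-line markup is a trimmed, slightly altered copy of the example (a second class value is added as an assumption, to show that class comes back as a list), and "html.parser" is used so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup

# Trimmed example markup; the extra "big" class is added for illustration.
doc = '<p class="title big" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

tag_name = p.name          # name of the node
all_attrs = p.attrs        # every attribute as a dict
name_attr = p["name"]      # one attribute; raises KeyError if absent
missing_id = p.get("id")   # .get() returns None instead of raising
classes = p["class"]       # class is multi-valued, so this is a list
text = p.string            # works because <p> has a single child chain

print(tag_name, all_attrs, name_attr, missing_id, classes, text)
```

Using get() is the safer choice when an attribute may be missing from some nodes.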
Example (parse it first: html=BeautifulSoup(html,'lxml'))
html='''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
HELLO
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
3. Child nodes
3.1 Direct children
- contents: returns a list
result=html.p.contents
# Output
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n HELLO\n ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
- children: returns an iterator
result=html.p.children
for i,child in enumerate(result):
    print(i,child)
# Output
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
HELLO
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
3.2 All descendants
- descendants: returns an iterator
result=html.p.descendants
for i,child in enumerate(result):
    print(i,child)
# Output
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <span>Elsie</span>
4 Elsie
5
6
HELLO
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.
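As the outputs above show, children and descendants also yield the text nodes between tags, including whitespace-only ones. A small sketch of keeping only the tags, assuming a shortened version of the example markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup, Tag

# Shortened, hypothetical variant of the example paragraph.
doc = (
    '<p>\n one\n <a id="link1"><span>Elsie</span></a>'
    '\n two\n <a id="link2">Lacie</a>\n</p>'
)
soup = BeautifulSoup(doc, "html.parser")

# Direct children that are tags (text nodes are NavigableString, not Tag).
child_tags = [child for child in soup.p.children if isinstance(child, Tag)]
child_ids = [tag["id"] for tag in child_tags]

# All descendant tags, in document order.
descendant_names = [
    node.name for node in soup.p.descendants if isinstance(node, Tag)
]

print(child_ids, descendant_names)
```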
4. Sibling nodes
4.1 Previous sibling
result=html.a.previous_sibling
# Output
Once upon a time there were three little sisters; and their names were
4.2 Next sibling
result=html.a.next_sibling
# Output
HELLO
4.3 All preceding siblings
result=list(enumerate(html.a.previous_siblings))
# Output
[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
4.4 All following siblings
result=list(enumerate(html.a.next_siblings))
# Output
[(0, '\n HELLO\n '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
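The sibling attributes return whatever node comes next, which is often a text node rather than a tag. A sketch of the difference between next_sibling and find_next_sibling, assuming a compact variant of the example markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Compact, hypothetical variant of the example paragraph.
doc = '<p>intro\n<a id="link1">Elsie</a>\nHELLO\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(doc, "html.parser")
first_link = soup.a

raw_previous = first_link.previous_sibling     # text node before the tag
raw_next = first_link.next_sibling             # text node after the tag
next_link = first_link.find_next_sibling("a")  # skips text, returns a Tag

print(repr(raw_previous), repr(raw_next), next_link["id"])
```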
Example (parse it first: html=BeautifulSoup(html,'lxml'))
html='''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
'''
5. Parent nodes
5.1 Direct parent
result=html.a.parent
# Output
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
5.2 All ancestor nodes
result=list(enumerate(html.a.parents))
# Output
[(0, <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>)]
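The last two entries in the output above look identical because the final ancestor is the BeautifulSoup object itself, which prints the same as the html tag. Printing only the names makes this visible; the sketch below assumes a shortened version of the example and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the example document.
doc = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(doc, "html.parser")

# parents walks upward: direct parent first, document object last.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```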
Example (parse it first: html=BeautifulSoup(html,'lxml'))
html='''
<html>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
'''
6. Extracting information
6.1 Use attributes such as string and attrs
result=html.a.next_sibling.string
# Output
Lacie
6.2 Convert a generator that yields several nodes to a list, pick an element, then read its attributes
print(list(html.a.parents)[0].attrs['class'])
# Output
['story']
Example (parse it first: html=BeautifulSoup(html,'lxml'))
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
7. Method selectors
7.1 find
- Returns only the first matching element
result=html.find(name='ul')
# Output
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
- Related methods (first match / all matches); they take the same arguments as find and find_all below
find_parent / find_parents
find_next_sibling / find_next_siblings
find_previous_sibling / find_previous_siblings
find_next / find_all_next
find_previous / find_all_previous
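A short sketch of a few of these related methods, starting from the first li node. The markup is a cut-down version of the example with whitespace removed, and "html.parser" is used so lxml is not required.

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical variant of the example panel.
doc = (
    '<div class="panel-body"><ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li>'
    '</ul></div>'
)
soup = BeautifulSoup(doc, "html.parser")
first_li = soup.find("li")

parent_ul = first_li.find_parent("ul")              # nearest matching ancestor
next_li = first_li.find_next_sibling("li")          # next matching sibling
previous_li = next_li.find_previous_sibling("li")   # previous matching sibling
later_items = first_li.find_all_next("li")          # all matches later in the document

print(parent_ul["id"], next_li.string, previous_li.string, len(later_items))
```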
7.2 find_all: find every match
- Passing name='ul' queries all ul nodes; the result is a list
result=html.find_all(name='ul')
# Output
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
- Each item is a Tag, so queries can be nested
result=html.find_all(name='ul')
for ul in result:
    print(ul.find_all(name='li'))
# Output
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
- Iterate over every li node and get its text
result=html.find_all(name='ul')
for ul in result:
    a=ul.find_all(name='li')
    for li in a:
        print(li.string)
# Output
Foo
Bar
Jay
Foo
Bar
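The nested loop above can also be written as a comprehension that groups the text by list. This is only a sketch: the markup is a compact copy of the example, "html.parser" replaces lxml so it runs without extra packages, and the limit argument is shown as an optional extra.

```python
from bs4 import BeautifulSoup

# Compact copy of the example lists.
doc = (
    '<ul class="list" id="list-1"><li class="element">Foo</li>'
    '<li class="element">Bar</li><li class="element">Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
    '<li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

# Map each ul id to the text of its li children.
items_by_list = {
    ul["id"]: [str(li.string) for li in ul.find_all("li")]
    for ul in soup.find_all("ul")
}

# limit stops the search after the given number of matches.
first_two = [str(li.string) for li in soup.find_all("li", limit=2)]

print(items_by_list, first_two)
```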
7.3 Find nodes by attribute
result=html.find_all(id='list-1')
# Output
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
- Special case: class is a reserved word in Python, so the keyword argument is written class_
result=html.find_all(class_='list')
# Output
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
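Because class is a multi-valued attribute, class_='list' matches any node that has list among its classes, which is why both ul nodes are returned above. A sketch of narrowing the match, assuming a compact copy of the example and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Compact copy of the example lists.
doc = (
    '<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

# Matches every node whose class list contains "list".
both = [ul["id"] for ul in soup.find_all(class_="list")]

# Matches only nodes that carry the "list-small" class.
small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]

# attrs takes a dict, which avoids the class_ spelling altogether.
by_dict = [ul["id"] for ul in soup.find_all(attrs={"class": "list", "id": "list-1"})]

print(both, small, by_dict)
```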
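7.4 text: match text
- Accepts either a regular expression or a string (Beautiful Soup 4.4.0 and later accept the same argument under the name string)
import re
result=html.find_all(text=re.compile('oo'))
# Output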
['Foo', 'Foo']
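Text matching returns the matching strings, not the tags that contain them. The sketch below uses the string argument (the newer name for text, which assumes Beautiful Soup 4.4.0 or later) on a compact copy of the example, with "html.parser" in place of lxml.

```python
import re

from bs4 import BeautifulSoup

# Compact copy of the example lists.
doc = (
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Foo</li><li>Bar</li></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

# A regular expression is applied with search(), so "oo" matches "Foo".
regex_matches = soup.find_all(string=re.compile("oo"))

# Each result is a text node; .parent gives the enclosing tag.
parent_names = [match.parent.name for match in regex_matches]

# A plain string must equal the whole text of the node.
exact_matches = soup.find_all(string="Bar")

# Combined with a tag name, the result is the tags whose text matches.
jay_tags = soup.find_all("li", string="Jay")

print(regex_matches, parent_names, exact_matches, jay_tags)
```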
8. CSS selectors
8.1 Nested selection
result=html.select('ul')
for ul in result:
    print(ul.select('li'))
8.2 Get attributes
- Attributes are read the same way as before
result=html.select('ul')
for ul in result:
    print(ul['id'])
8.3 Get text
- The string attribute still works as before
result=html.select('li')
for li in result:
    print(li.string)
- Or use get_text()
result=html.select('li')
for li in result:
    print(li.get_text())
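select() accepts full CSS selectors, so the nested loops above can often be collapsed into one expression. A sketch under the same assumptions as before (compact copy of the example, "html.parser" instead of lxml); select_one is shown as the single-result counterpart.

```python
from bs4 import BeautifulSoup

# Compact copy of the example lists.
doc = (
    '<ul class="list" id="list-1"><li class="element">Foo</li>'
    '<li class="element">Bar</li><li class="element">Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
    '<li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

# Descendant selector: .element nodes inside the node with id="list-1".
list1_texts = [li.get_text() for li in soup.select("#list-1 .element")]

# select_one returns the first match, or None when nothing matches.
first_small_item = soup.select_one("ul.list-small li")

# Attributes are read from the selected tags as usual.
ul_ids = [ul["id"] for ul in soup.select("ul")]

print(list1_texts, first_small_item.get_text(), ul_ids)
```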