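The usual way to modify the attributes of HTML tags is regular-expression replacement, but it has a drawback: cases are easily missed, and a usable regular expression only emerges after thorough testing. Python's lxml module spares you from writing the regex at all, so why not use a good wheel that already exists? There is also the bs4 module, which can use lxml as its parser. lxml offers two ways to remove tag attributes. Sample code for each is given in the sections below.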
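To make the "easily missed" point concrete, here is a small sketch. The pattern `NAIVE_STYLE` is a hypothetical first attempt, not something from the original post. It strips a double-quoted `style` attribute but leaves three equally valid spellings untouched.

```python
import re

# Hypothetical naive pattern: only handles lower-case style="..." with double quotes
NAIVE_STYLE = re.compile(r'\s+style="[^"]*"')

samples = [
    '<p style="color:red">a</p>',   # double-quoted value: handled
    "<p style='color:red'>b</p>",   # single-quoted value: missed
    '<p STYLE="color:red">c</p>',   # upper-case attribute name: missed
    '<p style=color:red>d</p>',     # unquoted value: missed
]

cleaned = [NAIVE_STYLE.sub('', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(before, '->', after)
```

Only the first sample loses its attribute. Each of the other spellings would need another branch in the pattern, which is the kind of testing burden a real HTML parser avoids.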
xpath
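import lxml.html
from html import unescape  # Python 3 replacement for HTMLParser().unescape

html_string = u'''
<img src="http://abc.com/1.jpg" width="1" height="2" style="width:1px;height:23px;"/>
'''
html = lxml.html.fromstring(html_string)
# Select every element that carries the attribute, then drop the attribute
for tag in html.xpath(u'//*[@style]'):
    tag.attrib.pop(u'style')
for tag in html.xpath(u'//*[@height]'):
    tag.attrib.pop(u'height')
for tag in html.xpath(u'//*[@width]'):
    tag.attrib.pop(u'width')
# tostring() returns ASCII bytes with non-ASCII text as character references;
# unescape() turns those references back into characters
print(unescape(lxml.html.tostring(html).decode('ascii')))
# If you would rather not use unescape, serialize as UTF-8 instead:
print(lxml.html.tostring(html, encoding="utf-8").decode('utf-8'))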
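This approach uses the xpath method of the parsed element to select every element that carries a given attribute, then pops that attribute to remove it.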
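The three loops can be folded into one helper. The following is a minimal sketch. The function name `strip_attributes` and the sample markup are made up for illustration, and it assumes `names` contains plain attribute names that are safe to place in an XPath expression.

```python
import lxml.html

def strip_attributes(html_string, names):
    """Return html_string with every attribute listed in names removed."""
    root = lxml.html.fromstring(html_string)
    # Match any element that carries at least one of the named attributes
    query = '//*[' + ' or '.join('@' + name for name in names) + ']'
    for tag in root.xpath(query):
        for name in names:
            # pop with a default avoids KeyError when this tag lacks the attribute
            tag.attrib.pop(name, None)
    return lxml.html.tostring(root, encoding='unicode')

result = strip_attributes(
    '<div style="color:red"><img src="http://abc.com/1.jpg" width="1" height="2"></div>',
    ['style', 'width', 'height'],
)
print(result)
```

Building a single `or` query means the tree is searched once instead of once per attribute.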
clean
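# On lxml 5.2 and later this module is provided by the separate
# lxml_html_clean package (pip install lxml_html_clean)
import lxml.html.clean as clean

# Attributes to keep; everything else is stripped
safe_attrs = {'src', 'alt', 'href', 'title'}
cleaner = clean.Cleaner(safe_attrs=safe_attrs)
html_string = u'''
<img src="http://abc.com/1.jpg" width="1" height="2" style="width:1px;height:23px;"/>
'''
cleaned_html = cleaner.clean_html(html_string)
print(cleaned_html)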
Both approaches produce the same result:
<img src="http://abc.com/1.jpg">
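If the clean module is not available, the allowlist idea can be approximated with plain lxml. This is only a sketch of the attribute filtering. It is not how Cleaner is implemented, and it does none of Cleaner's other sanitizing (scripts, comments, and so on). The helper name `keep_only_safe_attrs` is hypothetical.

```python
import lxml.html

SAFE_ATTRS = {'src', 'alt', 'href', 'title'}  # same allowlist as in the Cleaner example

def keep_only_safe_attrs(html_string, safe_attrs=SAFE_ATTRS):
    """Drop every attribute whose name is not in safe_attrs."""
    root = lxml.html.fromstring(html_string)
    for tag in root.iter():
        # Comments and processing instructions have a non-string tag; skip them
        if not isinstance(tag.tag, str):
            continue
        # Copy the keys first because attrib is mutated inside the loop
        for name in list(tag.attrib.keys()):
            if name not in safe_attrs:
                del tag.attrib[name]
    return lxml.html.tostring(root, encoding='unicode')

out = keep_only_safe_attrs(
    '<a href="/x" class="c" onclick="evil()"><img src="1.jpg" width="1" alt="pic"></a>'
)
print(out)
```

An allowlist is the safer default when the input is untrusted, since attributes you did not anticipate are removed rather than kept.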
BeautifulSoup
from bs4 import BeautifulSoup  # BeautifulSoup 4; the old BeautifulSoup 3 package is Python 2 only

def _remove_attrs(soup):
    # find_all(True) matches every tag in the document
    for tag in soup.find_all(True):
        tag.attrs = {}  # an empty dict drops all attributes
    return soup

def example():
    doc = '<html><head><title>test</title></head><body id="foo" onload="whatever"><p class="whatever">junk</p><div style="background: yellow;" id="foo" class="blah">blah</div></body></html>'
    print('Before:\n%s' % doc)
    soup = BeautifulSoup(doc, 'html.parser')
    clean_soup = _remove_attrs(soup)
    print('\nAfter:\n%s' % clean_soup)

example()
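The function above removes every attribute. When only some should go, a variation like the following may be closer to what is needed. It is a sketch that assumes bs4 is installed and uses the standard-library `html.parser` backend. The name `remove_selected_attrs` is made up for illustration.

```python
from bs4 import BeautifulSoup

def remove_selected_attrs(html_string, names):
    """Remove only the attributes listed in names from every tag."""
    soup = BeautifulSoup(html_string, 'html.parser')
    for tag in soup.find_all(True):
        for name in names:
            # tag.attrs is a plain dict; pop with a default skips missing keys
            tag.attrs.pop(name, None)
    return str(soup)

result = remove_selected_attrs(
    '<div style="background: yellow;" id="foo" class="blah">blah</div>',
    ['style', 'class'],
)
print(result)
```

Here the `id` attribute survives while `style` and `class` are dropped.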
References:
https://stackoverflow.com/questions/7470333/remove-certain-attributes-from-html-tags
https://stackoverflow.com/questions/10037289/remove-class-attribute-from-html-using-python-and-lxml