Removing the width, height, and style Attributes from HTML Tags in Python

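Attributes on HTML tags are usually modified with regular-expression substitution, but this has a drawback: it is easy to miss cases, and a usable pattern only emerges after thorough testing. Python's lxml module removes the need to write a regex at all. It is a good wheel, so why not use it? The bs4 module (BeautifulSoup) can also use lxml as its parser. lxml offers two ways to remove tag attributes; sample code for each follows.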
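
To make the "easy to miss cases" point concrete, here is a small sketch. The pattern `NAIVE_WIDTH` and the sample strings are hypothetical, invented only for illustration; the pattern handles double-quoted values and nothing else.

```python
import re

# Hypothetical naive pattern: matches only a lower-case, double-quoted width attribute
NAIVE_WIDTH = re.compile(r'\swidth="[^"]*"')

samples = [
    '<img src="a.jpg" width="1">',   # double quotes: matched and removed
    "<img src='a.jpg' width='1'>",   # single quotes: missed
    '<img src="a.jpg" width=1>',     # unquoted value: missed
    '<img src="a.jpg" WIDTH="1">',   # upper-case attribute name: missed
]

results = [NAIVE_WIDTH.sub("", s) for s in samples]
for before, after in zip(samples, results):
    print(before, "->", after)
```

Only the first sample is changed; the other three pass through untouched, which is the kind of gap an HTML parser avoids.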

xpath

import lxml.html

html_string = '''
<img src="http://abc.com/1.jpg" width="1" height="2" style="width:1px;height:23px;"/>
'''
html = lxml.html.fromstring(html_string)
# Select every element that carries the attribute, then drop it
for tag in html.xpath('//*[@style]'):
    tag.attrib.pop('style')
for tag in html.xpath('//*[@height]'):
    tag.attrib.pop('height')
for tag in html.xpath('//*[@width]'):
    tag.attrib.pop('width')
# encoding="unicode" returns a str, so non-ASCII text is not turned into character references
print(lxml.html.tostring(html, encoding="unicode"))
# Alternatively, serialize to UTF-8 bytes and decode them:
print(lxml.html.tostring(html, encoding="utf-8").decode('utf-8'))

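This approach calls xpath on the parsed html element to select every element that has the attribute, then calls pop on that element's attrib mapping to remove it.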
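
The three loops above can be folded into one helper. This is a minimal sketch, not part of the original post: the function name `strip_attributes` and its default attribute list are assumptions made for illustration.

```python
import lxml.html


def strip_attributes(html_string, names=("style", "width", "height")):
    """Return html_string with the named attributes removed from every element.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    root = lxml.html.fromstring(html_string)
    # Build a single XPath such as //*[@style or @width or @height]
    condition = " or ".join("@" + name for name in names)
    for tag in root.xpath("//*[" + condition + "]"):
        for name in names:
            # A default of None avoids KeyError when an element has only some of the attributes
            tag.attrib.pop(name, None)
    return lxml.html.tostring(root, encoding="unicode")


result = strip_attributes(
    '<div style="color:red"><img src="a.jpg" width="1" height="2"></div>'
)
print(result)
```

Passing a default to pop matters here because the combined XPath matches elements that have any one of the attributes, not necessarily all of them.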

clean

# Note: since lxml 5.2 this module requires the separate lxml_html_clean package
import lxml.html.clean as clean

# Allow-list: any attribute not named here is removed
safe_attrs = {'src', 'alt', 'href', 'title'}
cleaner = clean.Cleaner(safe_attrs=safe_attrs)
html_string = '''
<img src="http://abc.com/1.jpg" width="1" height="2" style="width:1px;height:23px;"/>
'''
cleaned_html = cleaner.clean_html(html_string)
print(cleaned_html)

Both approaches produce:

<img src="http://abc.com/1.jpg">

BeautifulSoup

from bs4 import BeautifulSoup


def _remove_attrs(soup):
    # find_all(True) matches every tag; clearing attrs removes all attributes,
    # not only width, height, and style
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup


def example():
    doc = '<html><head><title>test</title></head><body id="foo" onload="whatever"><p class="whatever">junk</p><div style="background: yellow;" id="foo" class="blah">blah</div></body></html>'
    print('Before:\n%s' % doc)
    soup = BeautifulSoup(doc, 'html.parser')
    clean_soup = _remove_attrs(soup)
    print('\nAfter:\n%s' % clean_soup)


example()
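
The BeautifulSoup function above clears every attribute. To match the lxml examples and drop only selected ones, something like the following sketch may work. The function name `remove_selected_attrs` and its defaults are assumptions for illustration, and it uses the standard-library `html.parser` backend rather than lxml.

```python
from bs4 import BeautifulSoup


def remove_selected_attrs(html_string, names=("style", "width", "height")):
    """Return html_string with only the named attributes removed.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    soup = BeautifulSoup(html_string, "html.parser")
    for tag in soup.find_all(True):
        for name in names:
            # tag.attrs is a plain dict, so pop with a default is safe for missing keys
            tag.attrs.pop(name, None)
    return str(soup)


cleaned = remove_selected_attrs(
    '<img src="http://abc.com/1.jpg" width="1" height="2" style="width:1px;"/>'
)
print(cleaned)
```

Attributes not listed in `names`, such as src, are left as they were.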

References:
https://stackoverflow.com/questions/7470333/remove-certain-attributes-from-html-tags
https://stackoverflow.com/questions/10037289/remove-class-attribute-from-html-using-python-and-lxml