说明
每天10分钟,重构你的前端知识体系专栏笔记。
一、介绍
由于阅读标准有一定门槛,需要了解一些机制,这一篇介绍怎么用 JavaScript
代码去抽取标准中我们需要的信息。
二、HTML 标准
2.1、标准是如何描述一个标签的
1、采用 WHATWG
的 living standard
标准
WHATWG
:超文本应用技术工作组;排版引擎比较;网页超文本技术工作小组;网络超文本应用技术工作组;(来自有道的翻译)
Categories:
Flow content.
Phrasing content.
Embedded content.
If the element has a controls attribute: Interactive content.
Palpable content.
Contexts in which this element can be used:
Where embedded content is expected.
Content model:
If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants.
If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants.
Tag omission in text/html:
Neither tag is omissible.
Content attributes:
Global attributes
src — Address of the resource
crossorigin — How the element handles crossorigin requests
poster — Poster frame to show prior to video playback
preload — Hints how much buffering the media resource will likely need
autoplay — Hint that the media resource can be started automatically when the page is loaded
playsinline — Encourage the user agent to display video content within the element's playback area
loop — Whether to loop the media resource
muted — Whether to mute the media resource by default
controls — Show user agent controls
width — Horizontal dimension
height — Vertical dimension
DOM interface:
[Exposed=Window, HTMLConstructor]
interface HTMLVideoElement : HTMLMediaElement {
[CEReactions] attribute unsigned long width;
[CEReactions] attribute unsigned long height;
readonly attribute unsigned long videoWidth;
readonly attribute unsigned long videoHeight;
[CEReactions] attribute USVString poster;
[CEReactions] attribute boolean playsInline;
};
2、上面代码描述分为六个部分
-
Categories
:标签所属的分类。 -
Contexts in which this element can be used
:标签能够用在哪里。 -
Content model
:标签的内容模型。 -
Tag omission in text/html
:标签是否可以省略。 -
Content attributes
:内容属性。 -
DOM interface
:用WebIDL
定义的元素类型接口。
三、用代码分析 HTML 标准
HTML 标准描述用词非常的严谨,这给我们抓取数据带来了巨大的方便。
3.1、第1步:打开HTML标准网站
打开单页面版 HTML 标准 https://html.spec.whatwg.org/
3.2、第2步:获取元素文本
打开控制台执行下面代码:
Array.prototype.map.call(document.querySelectorAll(".element"), e=>e.innerText);
输出结果是107个元素:
// 大概就是这样子的,我截取了一部分
(107) ["Categories:↵None.↵Contexts in which this element c…: HTMLElement {↵ // also has obsolete members↵};", "Categories:↵None.↵Contexts in which this element c…ctor]↵interface HTMLHeadElement : HTMLElement {};", "Categories:↵Metadata content.↵Contexts in which th…nt {↵ [CEReactions] attribute DOMString text;↵};", …]
[0 … 99]
[100 … 106]
length: 107
__proto__: Array(0)
3.3、第3步:组装元素名
在第2步的基础上可以发现,返回的元素是没有元素名的,这一步主要是通过id属性获取,然后组装起来
var elementDefinations = Array.prototype.map.call(document.querySelectorAll(".element"), e => ({
text:e.innerText,
name:e.childNodes[0].childNodes[0].id.match(/the\-([\s\S]+)\-element:/)?RegExp.$1:null}));
console.log(elementDefinations);
输出结果:
// 大概就是这样子的,我截取了一部分
(107) [{…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, …]
[0 … 99]
[100 … 106]
length: 107
__proto__: Array(0)
// 但是{}里面就是类似这样的
{
name: "html"
text: "Categories:↵None.↵Contexts in which this element can be ....."
}
3.4、第4步:拆分文本
1、categories
把这些不带任何条件的 category 先保存起来,然后打印出来其它的描述看看:
for(let defination of elementDefinations) {
//console.log(defination.name + ":")
let categories = defination.text.match(/Categories:\n([\s\S]+)\nContexts in which this element can be used:/)[1].split("\n");
defination.categories = [];
for(let category of categories) {
if(category.match(/^([^ ]+) content./))
defination.categories.push(RegExp.$1);
else
console.log(category)
}
}
输出结果:
2 None.
If the element is allowed in the body: flow content.
If the element is allowed in the body: phrasing content.
If the itemprop attribute is present: flow content.
If the itemprop attribute is present: phrasing content.
2 Sectioning root.
3 If the element's children include at least one li element: Palpable content.
None.
If the element's children include at least one name-value group: Palpable content.
2 None.
Sectioning root.
None.
If the element has an href attribute: Interactive content.
...
2、Content Model
先处理掉最简单点的部分,就是带分类的内容模型,再过滤掉带 If 的情况、Text 和 Transparent。
for(let defination of elementDefinations) {
//console.log(defination.name + ":")
let categories = defination.text.match(/Categories:\n([\s\S]+)\nContexts in which this element can be used:/)[1].split("\n");
defination.contentModel = [];
let contentModel = defination.text.match(/Content model:\n([\s\S]+)\nTag omission in text\/html:/)[1].split("\n");
for(let line of contentModel)
if(line.match(/([^ ]+) content./))
defination.contentModel.push(RegExp.$1);
else if(line.match(/Nothing.|Transparent./));
else if(line.match(/^Text[\s\S]*.$/));
else
console.log(line)
}
输出结果:
A head element followed by a body element.
One or more h1, h2, h3, h4, h5, h6 elements, optionally intermixed with script-supporting elements.
3 Zero or more li and script-supporting elements.
Either: Zero or more groups each consisting of one or more dt elements followed by one or more dd elements, optionally intermixed with script-supporting elements.
......
3.5、第5步:编写 check 函数
有了 contentModel 和 category,要检查某一元素是否可以作为另一元素的子元素,就可以判断一下两边是否匹配。
1、设置索引
var dictionary = Object.create(null);
for(let defination of elementDefinations) {
dictionary[defination.name] = defination;
}
2、check函数
function check(parent, child) {
for(let category of child.categories)
if(parent.contentModel.categories.conatains(category))
return true;
if(parent.contentModel.names.conatains(child.name))
return true;
return false;
}
参考资料:https://html.spec.whatwg.org/