重学前端 35 # HTML标准分析-CFANZ编程社区

说明

每天10分钟，重构你的前端知识体系专栏笔记。

一、介绍

由于阅读标准有一定门槛，需要了解一些机制，这一篇介绍怎么用 JavaScript 代码去抽取标准中我们需要的信息。

二、HTML 标准

2.1、标准是如何描述一个标签的

1、采用 WHATWG 的 living standard 标准

WHATWG：超文本应用技术工作组；排版引擎比较；网页超文本技术工作小组；网络超文本应用技术工作组；（来自有道的翻译）

Categories:
Flow content.
Phrasing content.
Embedded content.
If the element has a controls attribute: Interactive content.
Palpable content.
Contexts in which this element can be used:
Where embedded content is expected.
Content model:
If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants.
If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants.
Tag omission in text/html:
Neither tag is omissible.
Content attributes:
Global attributes
src — Address of the resource
crossorigin — How the element handles crossorigin requests
poster — Poster frame to show prior to video playback
preload — Hints how much buffering the media resource will likely need
autoplay — Hint that the media resource can be started automatically when the page is loaded
playsinline — Encourage the user agent to display video content within the element's playback area
loop — Whether to loop the media resource
muted — Whether to mute the media resource by default
controls — Show user agent controls
width — Horizontal dimension
height — Vertical dimension
DOM interface:
[Exposed=Window, HTMLConstructor]
interface HTMLVideoElement : HTMLMediaElement {
[CEReactions] attribute unsigned long width;
[CEReactions] attribute unsigned long height;
readonly attribute unsigned long videoWidth;
readonly attribute unsigned long videoHeight;
[CEReactions] attribute USVString poster;
[CEReactions] attribute boolean playsInline;
};

2、上面代码描述分为六个部分

Categories：标签所属的分类。
Contexts in which this element can be used：标签能够用在哪里。
Content model：标签的内容模型。
Tag omission in text/html：标签是否可以省略。
Content attributes：内容属性。
DOM interface：用WebIDL 定义的元素类型接口。

三、用代码分析 HTML 标准

HTML 标准描述用词非常的严谨，这给我们抓取数据带来了巨大的方便。

3.1、第1步：打开HTML标准网站

打开单页面版 HTML 标准 https://html.spec.whatwg.org/

3.2、第2步：获取元素文本

打开控制台执行下面代码：

Array.prototype.map.call(document.querySelectorAll(".element"), e=>e.innerText);

输出结果是107个元素：

// 大概就是这样子的，我截取了一部分
(107) ["Categories:↵None.↵Contexts in which this element c…: HTMLElement {↵  // also has obsolete members↵};", "Categories:↵None.↵Contexts in which this element c…ctor]↵interface HTMLHeadElement : HTMLElement {};", "Categories:↵Metadata content.↵Contexts in which th…nt {↵  [CEReactions] attribute DOMString text;↵};", …]
[0 … 99]
[100 … 106]
length: 107
__proto__: Array(0)

3.3、第3步：组装元素名

在第2步的基础上可以发现，返回的元素是没有元素名的，这一步主要是通过id属性获取，然后组装起来

var elementDefinations = Array.prototype.map.call(document.querySelectorAll(".element"), e => ({
  text:e.innerText,
  name:e.childNodes[0].childNodes[0].id.match(/the\-([\s\S]+)\-element:/)?RegExp.$1:null}));
console.log(elementDefinations);

输出结果:

// 大概就是这样子的，我截取了一部分
(107) [{…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, …]
[0 … 99]
[100 … 106]
length: 107
__proto__: Array(0)

// 但是{}里面就是类似这样的
{
    name: "html"
    text: "Categories:↵None.↵Contexts in which this element can be ....."
}

3.4、第4步：拆分文本

1、categories

把这些不带任何条件的 category 先保存起来，然后打印出来其它的描述看看：

for(let defination of elementDefinations) {
  //console.log(defination.name + ":")
  let categories = defination.text.match(/Categories:\n([\s\S]+)\nContexts in which this element can be used:/)[1].split("\n");
  defination.categories = [];
  for(let category of categories) {
    if(category.match(/^([^ ]+) content./))
      defination.categories.push(RegExp.$1);
    else
      console.log(category)  
  }
}

输出结果：

2 None.
 If the element is allowed in the body: flow content.
 If the element is allowed in the body: phrasing content.
 If the itemprop attribute is present: flow content.
 If the itemprop attribute is present: phrasing content.
2 Sectioning root.
3 If the element's children include at least one li element: Palpable content.
 None.
 If the element's children include at least one name-value group: Palpable content.
2 None.
 Sectioning root.
 None.
 If the element has an href attribute: Interactive content.
 ...

2、Content Model

先处理掉最简单点的部分，就是带分类的内容模型，再过滤掉带 If 的情况、Text 和 Transparent。

for(let defination of elementDefinations) {
  //console.log(defination.name + ":")
  let categories = defination.text.match(/Categories:\n([\s\S]+)\nContexts in which this element can be used:/)[1].split("\n");
  defination.contentModel = [];
  let contentModel = defination.text.match(/Content model:\n([\s\S]+)\nTag omission in text\/html:/)[1].split("\n");
  for(let line of contentModel)
    if(line.match(/([^ ]+) content./))
      defination.contentModel.push(RegExp.$1);
    else if(line.match(/Nothing.|Transparent./));
    else if(line.match(/^Text[\s\S]*.$/));
    else
      console.log(line)
}

输出结果：

A head element followed by a body element.
One or more h1, h2, h3, h4, h5, h6 elements, optionally intermixed with script-supporting elements.
3 Zero or more li and script-supporting elements.
Either: Zero or more groups each consisting of one or more dt elements followed by one or more dd elements, optionally intermixed with script-supporting elements.
......

3.5、第5步：编写 check 函数

有了 contentModel 和 category，要检查某一元素是否可以作为另一元素的子元素，就可以判断一下两边是否匹配。

1、设置索引

var dictionary = Object.create(null);

for(let defination of elementDefinations) {
  dictionary[defination.name] = defination;
}

2、check函数

function check(parent, child) {
  for(let category of child.categories)
    if(parent.contentModel.categories.conatains(category))
      return true;
  if(parent.contentModel.names.conatains(child.name))
      return true;
  return false;
}

参考资料：https://html.spec.whatwg.org/