0
点赞
收藏
分享

微信扫一扫

Substrait: Cross-Language Serialization for Relational Algebra


What is Substrait?

Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems. Substrait是一种用于描述结构化数据上的计算操作的格式。它是为跨不同语言和系统的互操作性而设计的。

How does it work?

Substrait provides a well-defined, cross-language specification for data compute operations. This includes a consistent declaration of common operations, custom operations and one or more serialized representations of this specification. The spec focuses on the semantics of each operation. In addition to the specification the Substrait ecosystem also includes a number of libraries and useful tools. Substrait为数据计算操作提供了一个定义良好的跨语言规范。这包括通用操作、自定义操作以及本规范的一个或多个序列化表示的一致声明。规范的重点是每个操作的语义。除了规范之外,Substrait生态系统还包括许多库和有用的工具。

We highly recommend the tutorial to learn how a Substrait plan is constructed.我们强烈推荐使用本教程来学习如何构建Substrait计划。

Benefits

  • Avoids every system needing to create a communication method between every other system – each system merely supports ingesting and producing Substrait and it instantly becomes a part of the greater ecosystem. 避免了每个系统都需要在每个其他系统之间创建一种通信方法——每个系统只支持摄入和产生Substrait,它立即成为更大生态系统的一部分。
  • Makes every part of the system upgradable. There’s a new query engine that’s ten times faster? Just plug it in! 使系统的每个部分都可以升级。有一个新的查询引擎,它的速度是原来的十倍?只要插上电源!
  • Enables heterogeneous environments – run on a cluster of an unknown set of execution engines! 启用异构环境—在一组未知执行引擎的集群上运行!
  • The text version of the Substrait plan allows you to quickly see how a plan functions without needing a visualizer (although there are Substrait visualizers as well!). Substrait计划的文本版本允许您在不需要可视化工具的情况下快速查看计划的功能(尽管也有Substrait可视化工具!)。

Example Use Cases

  • Communicate a compute plan between a SQL parser and an execution engine (e.g. Calcite SQL parsing to Arrow C++ compute kernel) 在SQL解析器和执行引擎之间通信计算计划(例如,Calcite SQL解析到Arrow C++计算内核)
  • Serialize a plan that represents a SQL view for consistent use in multiple systems (e.g. Iceberg views in Spark and Trino) 序列化表示SQL视图的计划,以便在多个系统中一致使用(例如Spark和Trino中的Iceberg视图)
  • Submit a plan to different execution engines (e.g. Datafusion and Postgres) and get a consistent interpretation of the semantics. 向不同的执行引擎(例如Datafusion和Postgres)提交一个计划,并获得对语义的一致解释。
  • Create an alternative plan generation implementation that can connect an existing end-user compute expression system to an existing end-user processing engine (e.g. Pandas operations executed inside SingleStore) 创建一个可将现有最终用户计算表达式系统连接到现有最终用户处理引擎的替代计划生成实现(例如,在SingleStore内执行的Pandas操作)
  • Build a pluggable plan visualization tool (e.g. D3 based plan visualizer) 构建可插拔的计划可视化工具(例如,基于D3的计划可视化器)

SQL to Substrait tutorial

This is an introductory tutorial to learn the basics of Substrait for readers already familiar with SQL. We will look at how to construct a Substrait plan from an example query. 这是一个入门教程,为已经熟悉SQL的读者学习Subslait的基础知识。我们将研究如何从一个示例查询中构造一个Subslait计划。

We’ll present the Substrait in JSON form to make it relatively readable to newcomers. Typically Substrait is exchanged as a protobuf message, but for debugging purposes it is often helpful to look at a serialized form. Plus, it’s not uncommon for unit tests to represent plans as JSON strings. So if you are developing with Substrait, it’s useful to have experience reading them. 我们将以JSON形式呈现Subslait,使其对新用户具有相对的可读性。通常,Substrait是作为protobuf消息交换的,但出于调试目的,查看序列化的表单通常很有帮助。此外,单元测试将计划表示为JSON字符串并不罕见。因此,如果你正在使用Substrait进行开发,那么有阅读它们的经验是很有用的。

Substrait is currently only defined with Protobuf. The JSON provided here is the Protobuf JSON output, but it is not the official Substrait text format. Eventually, Substrait will define it’s own human-readable text format, but for now this tutorial will make due with what Protobuf provides. Substrait当前仅使用Protobuf定义。这里提供的JSON是Protobuf JSON输出,但它不是官方的Substrait文本格式。最终,Substrait将定义自己的可读文本格式,但目前本教程将使用Protobuf提供的内容。

Substrait is designed to communicate plans (mostly logical plans). Those plans contain types, schemas, expressions, extensions, and relations. We’ll look at them in that order, going from simplest to most complex until we can construct full plans. Substrait用于传达计划(主要是逻辑计划)。这些计划包含类型、模式、表达式、扩展和关系。我们将按照这个顺序来看待它们,从最简单到最复杂,直到我们能够构建完整的计划。

Substrait: Cross-Language Serialization for Relational Algebra_JSON


This tutorial won’t cover all the details of each piece, but it will give you an idea of how they connect together. For a detailed reference of each individual field, the best place to look is reading the protobuf definitions. They represent the source-of-truth of the spec and are well-commented to address ambiguities. 本教程不会涵盖每件作品的所有细节,但它会让你了解它们是如何连接在一起的。对于每个单独字段的详细参考,最好的地方是阅读protobuf定义。它们代表了规范的真实性来源,并得到了很好的评论,以解决歧义。

Problem Set up

To learn Substrait, we’ll build up to a specific query. We’ll be using the tables:

CREATE TABLE orders (
  product_id: i64 NOT NULL,
  quantity: i32 NOT NULL,
  order_date: date NOT NULL,
  price: decimal(10, 2)
);
CREATE TABLE products (
  product_id: i64 NOT NULL,
  categories: list<string NOT NULL> NOT NULL,
  details: struct<manufacturer: string, year_created: int32>,
  product_name: string
);

This orders table represents events where products were sold, recording how many (quantity) and at what price (price). The products table provides details for each product, with product_id as the primary key. And we’ll try to create the query:

SELECT
  product_name,
  product_id,
  sum(quantity * price) as sales
FROM
  orders
INNER JOIN
  products
ON
  orders.product_id = products.product_id
WHERE
  -- categories does not contain "Computers"
  INDEX_IN("Computers", categories) IS NULL
GROUP BY
  product_name,
  product_id

The query asked the question: For products that aren’t in the “Computer” category, how much has each product generated in sales?
However, Substrait doesn’t correspond to SQL as much as it does to logical plans. So to be less ambiguous, the plan we are aiming for looks like: 然而,Subslait与SQL的对应程度不如与逻辑计划的对应程度。因此,为了减少歧义,我们的目标计划看起来像:

|-+ Aggregate({sales = sum(quantity_price)}, group_by=(product_name, product_id))
  |-+ InnerJoin(on=orders.product_id = products.product_id)
    |- ReadTable(orders)
    |-+ Filter(INDEX_IN("Computers", categories) IS NULL)
      |- ReadTable(products)

Types and Schemas

As part of the Substrait plan, we’ll need to embed the data types of the input tables. In Substrait, each type is a distinct message, which at a minimum contains a field for nullability. For example, a string field looks like: 作为Substrait计划的一部分,我们需要嵌入输入表的数据类型。在Substrait中,每个类型都是一个不同的消息,其中至少包含一个可为null的字段。例如,字符串字段如下所示:

{
  "string": {
    "nullability": "NULLABILITY_NULLABLE"
  }
}

Nullability is an enum not a boolean, since Substrait allows NULLABILITY_UNSPECIFIED as an option, in addition to NULLABILITY_NULLABLE (nullable) and NULLABILITY_REQUIRED (not nullable). Nullability是一个枚举,而不是布尔值,因为Substrait除了允许Nullability_NULLABLE(可为null)和Nullability_REQUIRED(不能为null)之外,还允许NULLIBILITY_UNSPECIFIED作为选项。

Other types such as VarChar and Decimal have other parameters. For example, our orders.price column will be represented as: 其他类型(如VarChar和Decimal)具有其他参数。例如,我们的orders.price列将表示为:

{
  "decimal": {
    "precision": 10,
    "scale": 2,
    "nullability": "NULLABILITY_NULLABLE"
  }
}

Finally, there are nested compound types such as structs and list types that have other types as parameters. For example, the products.categories column is a list of strings, so can be represented as: 最后,还有嵌套的复合类型,如结构和列表类型,它们有其他类型作为参数。例如,products.categories列是字符串列表,因此可以表示为:

{
  "list": {
    "type": {
      "string": {
        "nullability": "NULLABILITY_REQUIRED"
      }
    },
    "nullability": "NULLABILITY_REQUIRED"
  }
}

To know what parameters each type can take, refer to the Protobuf definitions in type.proto.

Schemas of tables can be represented with a NamedStruct message, which is the combination of a struct type containing all the columns and a list of column names. For the orders table, this will look like: 表的架构可以用NamedStruct消息表示,该消息是包含所有列的结构类型和列名列表的组合。对于订单表,如下所示:

{
  "names": [
    "product_id",
    "quantity",
    "order_date",
    "price"
  ],
  "struct": {
    "types": [
      {
        "i64": {
          "nullability": "NULLABILITY_REQUIRED"
        }
      },
      {
        "i32": {
          "nullability": "NULLABILITY_REQUIRED"
        }
      },
      {
        "date": {
          "nullability": "NULLABILITY_REQUIRED"
        }
      },
      {
        "decimal": {
          "precision": 10,
          "scale": 2,
          "nullability": "NULLABILITY_NULLABLE"
        }
      }
    ],
    "nullability": "NULLABILITY_REQUIRED"
  }
}

Here, names is the names of all fields. In nested schemas, this includes the names of subfields in depth-first order. So for the products table, the details struct field will be included as well as the two subfields (manufacturer and year_created) right after. And because it’s depth first, these subfields appear before product_name. The full schema looks like: 这里,name是所有字段的名称。在嵌套模式中,这包括按深度优先顺序排列的子字段的名称。因此,对于products表,将包括details结构字段以及后面的两个子字段(制造商和创建年份)。由于深度优先,这些子字段会出现在product_name之前。完整的架构如下所示:

{
  "names": [
    "product_id",
    "categories",
    "details",
    "manufacturer",
    "year_created",
    "product_name"
  ],
  "struct": {
    "types": [
      {
        "i64": {
          "nullability": "NULLABILITY_REQUIRED"
        }
      },
      {
        "list": {
          "type": {
            "string": {
              "nullability": "NULLABILITY_REQUIRED"
            }
          },
          "nullability": "NULLABILITY_REQUIRED"
        }
      },
      {
        "struct": {
          "types": [
            {
              "string": {
                "nullability": "NULLABILITY_NULLABLE"
              },
              "i32": {
                "nullability": "NULLABILITY_NULLABLE"
              }
            }
          ],
          "nullability": "NULLABILITY_NULLABLE"
        }
      },
      {
        "string": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      }
    ],
    "nullability": "NULLABILITY_REQUIRED"
  }
}

Expressions

The next basic building block we will need is expressions. Expressions can be one of several things, including: 我们需要的下一个基本构建块是表达式。表达式可以是以下内容之一:Field references、Literal values、Functions、Subqueries、Window Functions
Since some expressions such as functions can contain other expressions, expressions can be represented as a tree. Literal values and field references typically are the leaf nodes. 由于某些表达式(如函数)可以包含其他表达式,因此表达式可以表示为树。文本值和字段引用通常是叶节点。

For the expression INDEX_IN(categories, “Computers”) IS NULL, we have a field reference categories, a literal string “Computers”, and two functions— INDEX_IN and IS NULL.

Substrait: Cross-Language Serialization for Relational Algebra_字段_02


The field reference for categories is represented by:

{
  "selection": {
    "directReference": {
      "structField": {
        "field": 1
      }
    },
    "rootReference": {}
  }
}

Whereas SQL references field by names, Substrait always references fields numerically. This means that a Substrait expression only makes sense relative to a certain schema. As we’ll see later when we discuss relations, for a filter relation this will be relative to the input schema, so the 1 here is referring to the second field of products. SQL按名称引用字段,而Substrait总是用数字引用字段。这意味着一个Subslait表达式只对某个模式有意义。正如我们稍后讨论关系时所看到的,对于过滤器关系,这将是相对于输入模式的,因此这里的1指的是产品的第二个字段。
Protobuf may not serialize fields with integer type and value 0, since 0 is the default. So if you instead saw “structField”: {}, know that is is equivalent to “structField”: { “field”: 0 }.
“Computers” will be translated to a literal expression: “计算机”将被翻译成文字表达:

{
  "literal": {
    "string": "Computers"
  }
}

Both IS NULL and INDEX_IN will be scalar function expressions. Available functions in Substrait are defined in extension YAML files contained in https://github.com/substrait-io/substrait/tree/main/extensions. Additional extensions may be created elsewhere. IS NULL is defined as a is_null function in functions_comparison.yaml and INDEX_IN is defined as index_in function in functions_set.yaml.
First, the expression for INDEX_IN(“Computers”, categories) is: 首先,INDEX_IN(“计算机”,类别)的表达式为:

{
  "scalarFunction": {
    "functionReference": 1,
    "outputType": {
      "i64": {
        "nullability": "NULLABILITY_NULLABLE"
      }
    },
    "arguments": [
      {
        "value": {
          "literal": {
            "string": "Computers"
          }
        }
      },
      {
        "value": {
          "selection": {
            "directReference": {
              "structField": {
                "field": 1
              }
            },
            "rootReference": {}
          }
        }
      }
    ]
  }
}

functionReference will be explained later in the plans section. For now, understand that it’s a ID that corresponds to an entry in a list of function definitions that we will create later. functionReference将在稍后的平面图部分中进行解释。现在,请理解它是一个ID,它对应于我们稍后将创建的函数定义列表中的一个条目。

outputType defines the type the function outputs. We know this is a nullable i64 type since that is what the function definition declares in the YAML file. outputType定义函数输出的类型。我们知道这是一个可以为null的i64类型,因为这是函数定义在YAML文件中声明的。

arguments defines the arguments being passed into the function, which are all done positionally based on the function definition in the YAML file. The two arguments will be familiar as the literal and the field reference we constructed earlier. arguments定义了传递到函数中的参数,这些参数都是基于YAML文件中的函数定义在位置上完成的。这两个参数将作为我们前面构建的文本和字段引用而为人所熟悉。

To create the final expression, we just need to wrap this in another scalar function expression for IS NULL.要创建最终表达式,我们只需要将其封装在ISNULL的另一个标量函数表达式中。

{
  "scalarFunction": {
    "functionReference": 2,
    "outputType": {
      "bool": {
        "nullability": "NULLABILITY_REQUIRED"
      }
    },
    "arguments": [
      {
        "value": {
          "scalarFunction": {
            "functionReference": 1,
            "outputType": {
              "i64": {
                "nullability": "NULLABILITY_NULLABLE"
              }
            },
            "arguments": [
              {
                "value": {
                  "literal": {
                    "string": "Computers"
                  }
                }
              },
              {
                "value": {
                  "selection": {
                    "directReference": {
                      "structField": {
                        "field": 1
                      }
                    },
                    "rootReference": {}
                  }
                }
              }
            ]
          }
        }
      }
    ]
  }
}

To see what other types of expressions are available and what fields they take, see the Expression proto definition in algebra.proto.

Relations

In most SQL engines, a logical or physical plan is represented as a tree of nodes, such as filter, project, scan, or join. The left diagram below may be a familiar representation of our plan, where nodes feed data into each other moving from left to right. In Substrait, each of these nodes is a Relation. 在大多数SQL引擎中,逻辑或物理计划表示为节点树,如筛选器、项目、扫描或联接。下面的左图可能是我们计划的常见表示,其中节点从左到右相互输入数据。在Substrait中,这些节点中的每一个都是一个关系。

Substrait: Cross-Language Serialization for Relational Algebra_JSON_03


A relation that takes another relation as input will contain (or refer to) that relation. This is usually a field called input, but sometimes different names are used in relations that take multiple inputs. For example, join relations take two inputs, with field names left and right. In JSON, the rough layout for the relations in our plan will look like: 将另一个关系作为输入的关系将包含(或引用)该关系。这通常是一个称为input的字段,但有时在接受多个输入的关系中使用不同的名称。例如,联接关系采用两个输入,字段名分别为左和右。在JSON中,我们计划中关系的大致布局如下:

{
    "aggregate": {
        "input": {
            "join": {
                "left": {
                    "filter": {
                        "input": {
                            "read": {
                                ...
                            }
                        },
                        ...
                    }
                },
                "right": {
                    "read": {
                        ...
                    }
                },
                ...
            }
        },
        ...
    }
}

For our plan, we need to define the read relations for each table, a filter relation to exclude the “Computer” category from the products table, a join relation to perform the inner join, and finally an aggregate relation to compute the total sales.
对于我们的计划,我们需要定义每个表的读取关系、从产品表中排除“计算机”类别的筛选关系、执行内部联接的联接关系,以及最后计算总销售额的聚合关系。

The read relations are composed of a baseSchema and a namedTable field. The type of read is a named table, so the namedTable field is present with names containing the list of name segments (my_database.my_table). Other types of reads include virtual tables (a table of literal values embedded in the plan) and a list of files. See Read Definition Types for more details. The baseSchema is the schemas we defined earlier and namedTable are just the names of the tables. So for reading the orders table, the relation looks like:
读取关系由一个baseSchema和一个namedTable字段组成。读取的类型是命名表,因此namedTable字段的名称包含名称段列表(my_database.my_table)。其他类型的读取包括虚拟表(嵌入计划中的文字值表)和文件列表。有关详细信息,请参阅读取定义类型。baseSchema是我们之前定义的模式,namedTable只是表的名称。因此,对于读取订单表,关系如下:

{
  "read": {
    "namedTable": {
      "names": [
        "orders"
      ]
    },
    "baseSchema": {
      "names": [
        "product_id",
        "quantity",
        "order_date",
        "price"
      ],
      "struct": {
        "types": [
          {
            "i64": {
              "nullability": "NULLABILITY_REQUIRED"
            }
          },
          {
            "i32": {
              "nullability": "NULLABILITY_REQUIRED"
            }
          },
          {
            "date": {
              "nullability": "NULLABILITY_REQUIRED"
            }
          },
          {
            "decimal": {
              "scale": 10,
              "precision": 2,
              "nullability": "NULLABILITY_NULLABLE"
            }
          }
        ],
        "nullability": "NULLABILITY_REQUIRED"
      }
    }
  }
}

Read relations are leaf nodes. Leaf nodes don’t depend on any other node for data and usually represent a source of data in our plan. Leaf nodes are then typically used as input for other nodes that manipulate the data. For example, our filter node will take the products read relation as an input. 读取关系是叶节点。叶节点不依赖于任何其他节点获取数据,通常代表我们计划中的数据源。然后,叶节点通常被用作操作数据的其他节点的输入。例如,我们的过滤器节点将产品读取关系作为输入。
The filter node will also take a condition field, which will just be the expression we constructed earlier.过滤器节点还将采用一个条件字段,该字段将是我们之前构建的表达式。

{
  "filter": {
    "input": {
      "read": { ... }
    },
    "condition": {
      "scalarFunction": {
        "functionReference": 2,
        "outputType": {
          "bool": {
            "nullability": "NULLABILITY_REQUIRED"
          }
        },
        "arguments": [
          {
            "value": {
              "scalarFunction": {
                "functionReference": 1,
                "outputType": {
                  "i64": {
                    "nullability": "NULLABILITY_NULLABLE"
                  }
                },
                "arguments": [
                  {
                    "value": {
                      "literal": {
                        "string": "Computers"
                      }
                    }
                  },
                  {
                    "value": {
                      "selection": {
                        "directReference": {
                          "structField": {
                            "field": 1
                          }
                        },
                        "rootReference": {}
                      }
                    }
                  }
                ]
              }
            }
          }
        ]
      }
    }
  }
}

The join relation will take two inputs. In the left field will be the read relation for orders and in the right field will be the filter relation (from products). The type field is an enum that allows us to specify we want an inner join. Finally, the expression field contains the expression to use in the join. Since we haven’t used the equals() function yet, we use the reference number 3 here. (Again, we’ll see at the end with plans how these functions are resolved.) The arguments refer to fields 0 and 4, which are indices into the combined schema formed from the left and right inputs. We’ll discuss later in Field Indices where these come from. 联接关系将接受两个输入。左边的字段是订单的读取关系,右边的字段是过滤关系(来自产品)。类型字段是一个枚举,它允许我们指定我们想要一个内部联接。最后,表达式字段包含要在联接中使用的表达式。由于我们还没有使用equals()函数,所以我们在这里使用参考数字3。(同样,我们将在计划的最后看到这些函数是如何解析的。)参数指的是字段0和4,它们是由左右输入形成的组合模式的索引。我们稍后将在字段索引中讨论这些数据的来源。

{
  "join": {
    "left": { ... },
    "right": { ... },
    "type": "JOIN_TYPE_INNER",
    "expression": {
      "scalarFunction": {
        "functionReference": 3,
        "outputType": {
          "bool": {
            "nullability": "NULLABILITY_NULLABLE"
          }
        },
        "arguments": [
          {
            "value": {
              "selection": {
                "directReference": {
                  "structField": {
                    "field": 0
                  }
                },
                "rootReference": {}
              }
            }
          },
          {
            "value": {
              "selection": {
                "directReference": {
                  "structField": {
                    "field": 4
                  }
                },
                "rootReference": {}
              }
            }
          }
        ]
      }
    }
  }
}

The final aggregation requires two things, other than the input. First is the groupings. We’ll use a single grouping expression containing the references to the fields product_name and product_id. (Multiple grouping expressions can be used to do cube aggregations.)
最后的聚合需要两件事,而不是输入。首先是分组。我们将使用一个分组表达式,其中包含对字段product_name和product_id的引用。(可以使用多个分组表达式进行多维数据集聚合。)

For measures, we’ll need to define sum(quantity * price) as sales. Substrait is stricter about data types, and quantity is an integer while price is a decimal. So we’ll first need to cast quantity to a decimal, making the Substrait expression more like sum(multiply(cast(decimal(10, 2), quantity), price)). Both sum() and multiply() are functions, defined in functions_arithmetic_demical.yaml. However cast() is a special expression type in Substrait, rather than a function.
对于度量,我们需要将总和(数量*价格)定义为销售额。Substrait对数据类型更严格,数量是整数,价格是小数。因此,我们首先需要将数量强制转换为十进制,使Substrait表达式更像sum(multiply(cast(decimal(10,2),quantity),price))。sum()和multiply()都是函数,在functions_almetic_demical.yaml中定义。但是,cast()是Substrait中的一种特殊表达式类型,而不是函数。

Finally, the naming with as sales will be handled at the end as part of the plan, so that’s not part of the relation. Since we are always using field indices to refer to fields, Substrait doesn’t record any intermediate field names.
最后,作为销售的命名最终将作为计划的一部分进行处理,因此这不是关系的一部分。由于我们总是使用字段索引来引用字段,因此Substrait不记录任何中间字段名称。

{
  "aggregate": {
    "input": { ... },
    "groupings": [
      {
        "groupingExpressions": [
          {
            "value": {
              "selection": {
                "directReference": {
                  "structField": {
                    "field": 0
                  }
                },
                "rootReference": {}
              }
            }
          },
          {
            "value": {
              "selection": {
                "directReference": {
                  "structField": {
                    "field": 7
                  }
                },
                "rootReference": {}
              }
            }
          },
        ]
      }
    ],
    "measures": [
      {
        "measure": {
          "functionReference": 4,
          "outputType": {
            "decimal": {
              "precision": 38,
              "scale": 2,
              "nullability": "NULLABILITY_NULLABLE"
            }
          },
          "arguments": [
            {
              "value": {
                "scalarFunction": {
                  "functionReference": 5,
                  "outputType": {
                    "decimal": {
                      "precision": 38,
                      "scale": 2,
                      "nullability": "NULLABILITY_NULLABLE"
                    }
                  },
                  "arguments": [
                    {
                      "value": {
                        "cast": {
                          "type": {
                            "decimal": {
                              "precision": 10,
                              "scale": 2,
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          },
                          "input": {
                            "selection": {
                              "directReference": {
                                "structField": {
                                  "field": 1
                                }
                              },
                              "rootReference": {}
                            }
                          }
                        }
                      }
                    },
                    {
                      "value": {
                        "selection": {
                          "directReference": {
                            "structField": {
                              "field": 3
                            }
                          },
                          "rootReference": {}
                        }
                      }
                    }
                  ]
                }
              }
            }
          ]
        }
      }
    ]
  }
}

Field indices

So far, we have glossed over the field indices. Now that we’ve built up each of the relations, it will be a bit easier to explain them.Throughout the plan, data always has some implicit schema, which is modified by each relation. Often, the schema can change within a relation–we’ll discuss an example in the next section. Each relation has it’s own rules in how schemas are modified, called the output order or emit order. For the purposes of our query, the relevant rules are:
到目前为止,我们已经掩盖了字段索引。既然我们已经建立了每一种关系,那么解释它们就会容易一些。在整个计划中,数据总是有一些隐含的模式,每个关系都会对其进行修改。通常,模式可以在关系中发生变化——我们将在下一节中讨论一个示例。在如何修改模式方面,每个关系都有自己的规则,称为输出顺序或发出顺序。就我们的查询而言,相关规则是:
For Read relations, their output schema is the schema of the table. 对于Read关系,它们的输出模式是表的模式。
For Filter relations, the output schema is the same as in the input schema. 对于Filter关系,输出模式与输入模式相同。
For Joins relations, the input schema is the concatenation of the left and then the right schemas. The output schema is the same. 对于Joins关系,输入模式是左模式和右模式的串联。输出模式是相同的。
For Aggregate relations, the output schema is the group by fields followed by the measures. 对于聚合关系,输出模式是由字段和度量值组成的组。

Sometimes it can be hard to tell what the implicit schema is. For help determining that, consider using the substrait-validator tool, described in Next Steps.

The diagram below shows the mapping of field indices within each relation and how each of the field references show up in each relations properties. 下图显示了每个关系中字段索引的映射,以及每个字段引用如何显示在每个关系属性中。

Substrait: Cross-Language Serialization for Relational Algebra_SQL_04

Column selection and emit

As written, the aggregate output schema will be:

0: product_id: i64
1: product_name: string
2: sales: decimal(32, 8)

But we want product_name to come before product_id in our output. How do we reorder those columns? You might be tempted to add a Project relation at the end. However, the project relation only adds columns; it is not responsible for subsetting or reordering columns. Instead, any relation can reorder or subset columns through the emit property. By default, it is set to direct, which outputs all columns “as is”. But it can also be specified as a sequence of field indices.
但我们希望product_name在输出中位于product_id之前。我们如何重新排列这些列?您可能会想在最后添加一个Project关系。但是,项目关系仅添加列;它不负责对列进行子集设置或重新排序。相反,任何关系都可以通过emit属性对列进行重新排序或子集设置。默认情况下,它设置为direct,从而“按原样”输出所有列。但它也可以指定为字段索引的序列。
For simplicity, we will add this to the final aggregate relation. We could also add it to all relations, only selecting the fields we strictly need in later relations. Indeed, a good optimizer would probably do that to our plan. And for some engines, the emit property is only valid within a project relation, so in those cases we would need to add that relation in combination with emit. But to keep things simple, we’ll limit the columns at the end within the aggregation relation.For our final column selection, we’ll modify the top-level relation to be:
为了简单起见,我们将把它添加到最终的聚合关系中。我们也可以将其添加到所有关系中,只选择我们在以后的关系中严格需要的领域。事实上,一个好的优化器可能会对我们的计划做这件事。对于某些引擎,emit属性仅在项目关系中有效,因此在这些情况下,我们需要将该关系与emit一起添加。但为了简单起见,我们将限制聚合关系中末尾的列。
对于我们的最终列选择,我们将把顶级关系修改为:

{
  "aggregate": {
    "input": { ... },
    "groupings": [ ... ],
    "measures": [ ... ],
    "common": {
      "emit": {
        "outputMapping": [1, 0, 2]
      }
    }
}

Plans

Now that we’ve constructed our relations, we can put it all into a plan. Substrait plans are the only messages that can be sent and received on their own. Recall that earlier, we had function references to those YAML files, but so far there’s been no place to tell a consumer what those function reference IDs mean or which extensions we are using. That information belongs at the plan level.
既然我们已经建立了关系,我们就可以把这一切都纳入一个计划。子延迟计划是唯一可以单独发送和接收的消息。回想一下,早些时候,我们有对那些YAML文件的函数引用,但到目前为止,还没有地方告诉消费者这些函数引用ID的含义或我们正在使用的扩展。该信息属于计划级别。
平面图的总体布局为 The overall layout for a plan is

{
  "extensionUris": [ ... ],
  "extensions": [ ... ],
  "relations": [
    {
      "root": {
        "names": [
          "product_name",
          "product_id",
          "sales"
        ],
        "input": { ... }
      }
    }
  ]
}

The relations field is a list of Root relations. Most queries only have one root relation, but the spec allows for multiple so a common plan could be referenced by other plans, sort of like a CTE (Common Table Expression) from SQL. The root relation provides the final column names for our query. The input to this relation is our aggregate relation (which contains all the other relations as children). 关系字段是根关系的列表。大多数查询只有一个根关系,但规范允许多个根关系。因此,一个公共计划可以被其他计划引用,有点像SQL中的CTE(common Table Expression)。根关系为我们的查询提供了最终的列名。这个关系的输入是我们的聚合关系(它包含作为子关系的所有其他关系)。
For extensions, we need to provide extensionUris with the locations of the YAML files we used and extensions with the list of functions we used and which extension they come from. 对于扩展,我们需要为extensionUris提供我们使用的YAML文件的位置,以及我们使用的函数列表和它们来自哪个扩展的扩展。
In our query, we used:

  • index_in (1), from functions_set.yaml,
  • is_null (2), from functions_comparison.yaml,
  • equal (3), from functions_comparison.yaml,
  • sum (4), from functions_arithmetic_decimal.yaml,
  • multiply (5), from functions_arithmetic_decimal.yaml.

So first we can create the three extension uris:

[
  {
    "extensionUriAnchor": 1,
    "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_set.yaml"
  },
  {
    "extensionUriAnchor": 2,
    "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
  },
  {
    "extensionUriAnchor": 3,
    "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic_decimal.yaml"
  }
]

Then we can create the extensions:

[
  {
    "extensionFunction": {
      "extensionUriReference": 1,
      "functionAnchor": 1,
      "name": "index_in"
    }
  },
  {
    "extensionFunction": {
      "extensionUriReference": 2,
      "functionAnchor": 2,
      "name": "is_null"
    }
  },
  {
    "extensionFunction": {
      "extensionUriReference": 2,
      "functionAnchor": 3,
      "name": "equal"
    }
  },
  {
    "extensionFunction": {
      "extensionUriReference": 3,
      "functionAnchor": 4,
      "name": "sum"
    }
  },
  {
    "extensionFunction": {
      "extensionUriReference": 3,
      "functionAnchor": 5,
      "name": "multiply"
    }
  }
]

Once we’ve added our extensions, the plan is complete. Our plan outputted in full is: final_plan.json.

{
  "version": {
    "minorNumber": 20,
    "producer": "validator-test"
  },
  "extensionUris": [
    {
      "extensionUriAnchor": 1,
      "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_set.yaml"
    },
    {
      "extensionUriAnchor": 2,
      "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
    },
    {
      "extensionUriAnchor": 3,
      "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic_decimal.yaml"
    }
  ],
  "extensions": [
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 1,
        "name": "index_in"
      }
    },
    {
      "extensionFunction": {
        "extensionUriReference": 2,
        "functionAnchor": 2,
        "name": "is_null"
      }
    },
    {
      "extensionFunction": {
        "extensionUriReference": 2,
        "functionAnchor": 3,
        "name": "equal"
      }
    },
    {
      "extensionFunction": {
        "extensionUriReference": 3,
        "functionAnchor": 4,
        "name": "sum"
      }
    },
    {
      "extensionFunction": {
        "extensionUriReference": 3,
        "functionAnchor": 5,
        "name": "multiply"
      }
    }
  ],
  "relations": [
    {
      "root": {
        "names": [
          "product_name",
          "product_id",
          "sales"
        ],
        "input": {
          "aggregate": {
            "input": {
              "join": {
                "left": {
                  "read": {
                    "namedTable": {
                      "names": [
                        "orders"
                      ]
                    },
                    "baseSchema": {
                      "names": [
                        "product_id",
                        "quantity",
                        "order_date",
                        "price"
                      ],
                      "struct": {
                        "types": [
                          {
                            "i64": {
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          },
                          {
                            "i32": {
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          },
                          {
                            "date": {
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          },
                          {
                            "decimal": {
                              "scale": 2,
                              "precision": 10,
                              "nullability": "NULLABILITY_NULLABLE"
                            }
                          }
                        ],
                        "nullability": "NULLABILITY_REQUIRED"
                      }
                    }
                  }
                },
                "right": {
                  "filter": {
                    "input": {
                      "read": {
                        "namedTable": {
                          "names": [
                            "products"
                          ]
                        },
                        "baseSchema": {
                          "names": [
                            "product_id",
                            "categories",
                            "details",
                            "manufacturer",
                            "year_created",
                            "product_name"
                          ],
                          "struct": {
                            "types": [
                              {
                                "i64": {
                                  "nullability": "NULLABILITY_REQUIRED"
                                }
                              },
                              {
                                "list": {
                                  "type": {
                                    "string": {
                                      "nullability": "NULLABILITY_REQUIRED"
                                    }
                                  },
                                  "nullability": "NULLABILITY_REQUIRED"
                                }
                              },
                              {
                                "struct": {
                                  "types": [
                                    {
                                      "string": {
                                        "nullability": "NULLABILITY_NULLABLE"
                                      }
                                    },
                                    {
                                      "i32": {
                                        "nullability": "NULLABILITY_NULLABLE"
                                      }
                                    }
                                  ],
                                  "nullability": "NULLABILITY_NULLABLE"
                                }
                              },
                              {
                                "string": {
                                  "nullability": "NULLABILITY_NULLABLE"
                                }
                              }
                            ],
                            "nullability": "NULLABILITY_REQUIRED"
                          }
                        }
                      }
                    },
                    "condition": {
                      "scalarFunction": {
                        "functionReference": 2,
                        "outputType": {
                          "bool": {
                            "nullability": "NULLABILITY_REQUIRED"
                          }
                        },
                        "arguments": [
                          {
                            "value": {
                              "scalarFunction": {
                                "functionReference": 1,
                                "outputType": {
                                  "i64": {
                                    "nullability": "NULLABILITY_NULLABLE"
                                  }
                                },
                                "arguments": [
                                  {
                                    "value": {
                                      "literal": {
                                        "string": "Computers"
                                      }
                                    }
                                  },
                                  {
                                    "value": {
                                      "selection": {
                                        "directReference": {
                                          "structField": {
                                            "field": 1
                                          }
                                        },
                                        "rootReference": {

                                        }
                                      }
                                    }
                                  }
                                ]
                              }
                            }
                          }
                        ]
                      }
                    }
                  }
                },
                "type": "JOIN_TYPE_INNER",
                "expression": {
                  "scalarFunction": {
                    "functionReference": 3,
                    "outputType": {
                      "bool": {
                        "nullability": "NULLABILITY_NULLABLE"
                      }
                    },
                    "arguments": [
                      {
                        "value": {
                          "selection": {
                            "directReference": {
                              "structField": {
                                "field": 0
                              }
                            },
                            "rootReference": {

                            }
                          }
                        }
                      },
                      {
                        "value": {
                          "selection": {
                            "directReference": {
                              "structField": {
                                "field": 4
                              }
                            },
                            "rootReference": {

                            }
                          }
                        }
                      }
                    ]
                  }
                }
              }
            },
            "groupings": [
              {
                "groupingExpressions": [
                  {
                    "selection": {
                      "directReference": {
                        "structField": {
                          "field": 0
                        }
                      },
                      "rootReference": {

                      }
                    }
                  },
                  {
                    "selection": {
                      "directReference": {
                        "structField": {
                          "field": 7
                        }
                      },
                      "rootReference": {

                      }
                    }
                  }
                ]
              }
            ],
            "measures": [
              {
                "measure": {
                  "functionReference": 4,
                  "outputType": {
                    "decimal": {
                      "scale": 2,
                      "precision": 38,
                      "nullability": "NULLABILITY_NULLABLE"
                    }
                  },
                  "arguments": [
                    {
                      "value": {
                        "scalarFunction": {
                          "functionReference": 5,
                          "outputType": {
                            "decimal": {
                              "scale": 2,
                              "precision": 38,
                              "nullability": "NULLABILITY_NULLABLE"
                            }
                          },
                          "arguments": [
                            {
                              "value": {
                                "cast": {
                                  "type": {
                                    "decimal": {
                                      "scale": 2,
                                      "precision": 10,
                                      "nullability": "NULLABILITY_REQUIRED"
                                    }
                                  },
                                  "input": {
                                    "selection": {
                                      "directReference": {
                                        "structField": {
                                          "field": 1
                                        }
                                      },
                                      "rootReference": {

                                      }
                                    }
                                  }
                                }
                              }
                            },
                            {
                              "value": {
                                "selection": {
                                  "directReference": {
                                    "structField": {
                                      "field": 3
                                    }
                                  },
                                  "rootReference": {

                                  }
                                }
                              }
                            }
                          ]
                        }
                      }
                    }
                  ]
                }
              }
            ],
            "common": {
              "emit": {
                "outputMapping": [1, 0, 2]
              }
            }
          }
        }
      }
    }
  ]
}

Next steps

Validate and introspect plans using substrait-validator. Amongst other things, this tool can show what the current schema and column indices are at each point in the plan. Try downloading the final plan JSON above and generating an HTML report on the plan with: substrait-validator final_plan.json --out-file output.html 使用substrait-validator验证和内省计划。除其他外,该工具还可以显示计划中每个点的当前模式和列索引。尝试下载上面的最终计划JSON,并使用以下内容生成计划的HTML报告。


举报

相关推荐

0 条评论