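PointPillars is a model that came out of industry. Its core idea borrows the image-processing pipeline: seen from a bird's-eye view, the point cloud is partitioned into a grid of pillars (vertical columns), which yields image-like data. A 2D detection framework then extracts features and predicts dense boxes to produce the detections. This lets the model strike a good balance between speed and accuracy.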

PointPillars network architecture overview:

Speed versus accuracy comparison:

Note: PP stands for PointPillars, M for MV3D, A for AVOD, C for ContFuse, V for VoxelNet, F for Frustum PointNet, S for SECOND, and P+ for PIXOR++.

This article walks through the PointPillars implementation in detail, using the OpenPCDet code as its basis.

The paper is available at:

https://arxiv.org/pdf/1812.05784.pdf

Reference code for this walkthrough:

https://github.com/open-mmlab/OpenPCDet

1. Overview

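3D detection algorithms usually take one of the following forms:

(1) Assign the point cloud to voxels, forming a regular, densely laid-out voxel grid, as in VoxelNet and SECOND.

(2) Project the point cloud onto the front view and the bird's-eye view to obtain pseudo-images. Common models are MV3D and AVOD.

(3) Map the point cloud directly onto the bird's-eye view, then apply a 2D detection pipeline for feature extraction and the RPN to achieve 3D detection, as in PIXOR and the subject of this article, PointPillars.

(4) Use PointNet to extract features directly from the raw points and generate proposals, then refine those proposals, as in PointRCNN.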

2. Point cloud processing in PointPillars

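Here the 3D point cloud is taken directly in bird's-eye-view form. Suppose the cloud holds N points, i.e. N*3 coordinate values, all expressed in the KITTI lidar coordinate frame xyz (in metres; x points forward, y to the left, z up). Every point is assigned to a cell of a uniform grid on the x-y plane; each cell, extended vertically, is a column called a pillar, as the figures below show.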

Left camera front view

Bird's-eye view of the point cloud, with the points distributed into uniform pillars

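Each KITTI point is 4-dimensional, (x, y, z, r), where x, y, z are the point's coordinates and r is its reflectance intensity (which depends on the surface material, the laser's angle of incidence, and so on). When points are placed into pillars, height does not have to be considered the way it is for voxels: a pillar can be thought of as all the voxels along the z axis merged into one.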

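PointPillars then augments the features of the points inside every pillar: each point gains five extra dimensions, x_c, y_c, z_c, x_p and y_p. The subscript c denotes the offset from the point to the arithmetic mean of all points in its pillar, and the subscript p denotes the x, y offset from the point to the centre of its pillar. After this step every point has 9 dimensions: x, y, z, r, x_c, y_c, z_c, x_p and y_p. (Note: the OpenPCDet implementation uses 10 dimensions; it adds z_p, the offset along z between the point and the z centre of its pillar.)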

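After these operations the raw point cloud can be turned into a tensor of shape (D, P, N), where D is the feature dimension of each point (9 here), P is the number of non-empty pillars, and N is the maximum number of points per pillar. Note that N here no longer denotes the total number of points in the cloud.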
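The decoration step described above can be sketched for a single pillar in plain Python. This is only an illustration of the 9-feature version from the paper, not the OpenPCDet code; the function name `decorate_pillar` and the sample points are made up for the example.

```python
def decorate_pillar(points, pillar_center_xy):
    """Append (x_c, y_c, z_c, x_p, y_p) to each (x, y, z, r) point of one pillar."""
    n = len(points)
    # arithmetic mean of x, y, z over the points in this pillar
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cx, cy = pillar_center_xy
    return [
        [x, y, z, r, x - mean[0], y - mean[1], z - mean[2], x - cx, y - cy]
        for x, y, z, r in points
    ]


# Two hypothetical points that fall into the same 0.16 m pillar centred at (10.0, 2.0)
pillar_points = [[10.0, 2.0, -1.0, 0.3], [10.05, 2.05, -0.5, 0.5]]
decorated = decorate_pillar(pillar_points, pillar_center_xy=(10.0, 2.0))
print(len(decorated), len(decorated[0]))  # 2 9
```

Stacking the decorated points of all P non-empty pillars, each padded to N points, gives the (D, P, N) tensor.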

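Notes:

1. In the implementation each pillar measures 0.16 m by 0.16 m in x and y. OpenPCDet trains only on the part of the cloud in front of the vehicle: the KITTI annotations were made with respect to camera 2, so the negative x direction (behind the car) has no labels and those points are cropped. Points that are too far away are also cropped, because the cloud becomes too sparse there for reliable detection. The point cloud range used in OpenPCDet therefore has minimum xyz = [0, -39.68, -3] and maximum xyz = [69.12, 39.68, 1].

2. Each pillar holds at most 32 points. If a pillar contains more than 32 points, 32 of them are randomly sampled; if it contains fewer than 32, it is padded with zeros.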
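The two notes above amount to a little arithmetic plus a fixed-size buffer per pillar. The sketch below uses the range and pillar size quoted in the notes; the helper names `compute_grid_size` and `fix_pillar_size` are hypothetical and the random sampling follows the description here rather than any particular library routine.

```python
import random

POINT_CLOUD_RANGE = [0.0, -39.68, -3.0, 69.12, 39.68, 1.0]
VOXEL_SIZE = [0.16, 0.16, 4.0]
MAX_POINTS_PER_PILLAR = 32


def compute_grid_size(pc_range, voxel_size):
    """Number of pillars along x, y, z: (max - min) / pillar size."""
    return [round((pc_range[i + 3] - pc_range[i]) / voxel_size[i]) for i in range(3)]


def fix_pillar_size(points, max_points=MAX_POINTS_PER_PILLAR, rng=None):
    """Randomly subsample or zero-pad one pillar to exactly max_points points."""
    rng = rng or random.Random(0)
    num_valid = min(len(points), max_points)
    if len(points) > max_points:
        kept = rng.sample(points, max_points)
    else:
        kept = list(points)
    padding = [[0.0] * 4 for _ in range(max_points - len(kept))]
    return kept + padding, num_valid


print(compute_grid_size(POINT_CLOUD_RANGE, VOXEL_SIZE))  # [432, 496, 1]
```

The returned `num_valid` plays the role of the per-pillar point count that is later needed to tell real points from padding.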

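This mapping yields a (D, P, N) tensor. A simplified PointNet then extracts features from it: a linear layer (MLP) lifts each point to C channels, followed by a BatchNorm layer and a ReLU activation, giving a tensor of shape (C, P, N). A max-pool over the points of each pillar then keeps, for every channel, the strongest response in that pillar, so the output goes from (C, P, N) to (C, P). Finally the encoded pillar features are placed back at the x, y locations of their pillars to form a pseudo-image.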
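To make the shapes concrete, here is a minimal pure-Python sketch of the simplified PointNet step: a per-point linear layer, ReLU, then a channel-wise max over the points of each pillar. BatchNorm is left out for brevity, the tensors are laid out as (P, N, D) rather than (D, P, N), and the function name `pfn_forward` and the tiny weight matrix are assumptions for illustration only.

```python
def pfn_forward(pillars, weight):
    """pillars: P x N x D nested lists; weight: C x D. Returns P x C."""
    out = []
    for pillar in pillars:
        # per-point linear layer followed by ReLU: (N, D) -> (N, C)
        hidden = [
            [max(0.0, sum(w * f for w, f in zip(row, point))) for row in weight]
            for point in pillar
        ]
        # channel-wise max over the N points: (N, C) -> (C,)
        out.append([max(channel) for channel in zip(*hidden)])
    return out


pillars = [[[1.0, 2.0], [3.0, -1.0]]]              # P=1, N=2, D=2
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # C=3
print(pfn_forward(pillars, weight))  # [[3.0, 2.0, 0.0]]
```

Note that the maximum is taken per channel, so the resulting pillar feature can combine responses from different points.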

Now for the implementation of this part.

Preprocessing code: pcdet/datasets/processor/data_processor.py

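    def transform_points_to_voxels(self, data_dict=None, config=None):
        """
        Convert the point cloud into pillars with spconv's VoxelGeneratorV2.
        A pillar can be seen as all the voxels along the z axis merged together, so it is
        enough to set the voxel height to the full height of the cropped KITTI point cloud.
        """

        # First call (data_dict is None): initialise the parameters needed for the conversion
        if data_dict is None:
            # The KITTI point cloud is cropped to [0, -39.68, -3, 69.12, 39.68, 1], which gives
            # [69.12, 79.36, 4] / [0.16, 0.16, 4] = [432, 496, 1]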
            grid_size = (self.point_cloud_range[3:6] - self.point_cloud_range[0:3]) / np.array(config.VOXEL_SIZE)
            self.grid_size = np.round(grid_size).astype(np.int64)
            self.voxel_size = config.VOXEL_SIZE
            # just bind the config, we will create the VoxelGeneratorWrapper later,
            # to avoid pickling issues in multiprocess spawn
            return partial(self.transform_points_to_voxels, config=config)

        if self.voxel_generator is None:
            self.voxel_generator = VoxelGeneratorWrapper(
                # size of each pillar: [0.16, 0.16, 4]
                vsize_xyz=config.VOXEL_SIZE,
                # point cloud range: [0, -39.68, -3, 69.12, 39.68, 1]
                coors_range_xyz=self.point_cloud_range,
                # number of features per point; here x, y, z, r, where r is the lidar reflectance intensity
                num_point_features=self.num_point_features,
                # maximum number of points per pillar: 32
                max_num_points_per_voxel=config.MAX_POINTS_PER_VOXEL,
                # maximum number of pillars to keep; many grid cells contain no points at all
                # (visible in the bird's-eye view above), so only the non-empty pillars are needed
                max_num_voxels=config.MAX_NUMBER_OF_VOXELS[self.mode],  # 16000 for training
            )
        
        points = data_dict['points']
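        # generate the pillars
        voxel_output = self.voxel_generator.generate(points)
        # For a point cloud of shape N*4, pillar generation returns three arrays:
        # voxels: the points of each generated pillar, shape [M, 32, 4]
        # coordinates: the grid index of each pillar in z, y, x order, shape [M, 3]; z is always 0
        # num_points: the number of valid points in each pillar, shape [M,]; pillars with fewer than 32 points are zero-padded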
        voxels, coordinates, num_points = voxel_output

        if not data_dict['use_lead_xyz']:
            voxels = voxels[..., 3:]  # remove xyz in voxels(N, 3)

        data_dict['voxels'] = voxels
        data_dict['voxel_coords'] = coordinates
        data_dict['voxel_num_points'] = num_points
        return data_dict

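# The wrapper below generates the pillars through spconv.
# Module-level imports used by the code in this file:
from functools import partial

import numpy as np

# `tv` is only needed for spconv 2.x, so its import is allowed to fail
tv = None
try:
    import cumm.tensorview as tv
except:
    pass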

class VoxelGeneratorWrapper():
    def __init__(self, vsize_xyz, coors_range_xyz, num_point_features, max_num_points_per_voxel, max_num_voxels):
        try:
            from spconv.utils import VoxelGeneratorV2 as VoxelGenerator
            self.spconv_ver = 1
        except:
            try:
                from spconv.utils import VoxelGenerator
                self.spconv_ver = 1
            except:
                from spconv.utils import Point2VoxelCPU3d as VoxelGenerator
                self.spconv_ver = 2

        if self.spconv_ver == 1:
            self._voxel_generator = VoxelGenerator(
                voxel_size=vsize_xyz,
                point_cloud_range=coors_range_xyz,
                max_num_points=max_num_points_per_voxel,
                max_voxels=max_num_voxels
            )
        else:
            self._voxel_generator = VoxelGenerator(
                vsize_xyz=vsize_xyz,
                coors_range_xyz=coors_range_xyz,
                num_point_features=num_point_features,
                max_num_points_per_voxel=max_num_points_per_voxel,
                max_num_voxels=max_num_voxels
            )

    def generate(self, points):
        if self.spconv_ver == 1:
            voxel_output = self._voxel_generator.generate(points)
            if isinstance(voxel_output, dict):
                voxels, coordinates, num_points = \
                    voxel_output['voxels'], voxel_output['coordinates'], voxel_output['num_points_per_voxel']
            else:
                voxels, coordinates, num_points = voxel_output
        else:
            assert tv is not None, f"Unexpected error, library: 'cumm' wasn't imported properly."
            voxel_output = self._voxel_generator.point_to_voxel(tv.from_numpy(points))
            tv_voxels, tv_coordinates, tv_num_points = voxel_output
            # make copy with numpy(), since numpy_view() will disappear as soon as the generator is deleted
            voxels = tv_voxels.numpy()
            coordinates = tv_coordinates.numpy()
            num_points = tv_num_points.numpy()
        return voxels, coordinates, num_points
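What the voxel generator computes can be approximated in a few lines of plain Python. The sketch below is not spconv's implementation: the function name `group_points_into_pillars` is hypothetical, it returns a dictionary instead of the three arrays, and it simply keeps the first `max_points` points it sees in each pillar.

```python
import math


def group_points_into_pillars(points, pc_range, voxel_size, max_points=32, max_pillars=16000):
    """Map (z_idx, y_idx, x_idx) -> list of points, in first-seen order."""
    pillars = {}
    for point in points:
        xyz = point[:3]
        # drop points outside the cropped range
        if any(not (pc_range[i] <= xyz[i] < pc_range[i + 3]) for i in range(3)):
            continue
        # grid index = floor((coordinate - range minimum) / pillar size)
        ix, iy, iz = (int(math.floor((xyz[i] - pc_range[i]) / voxel_size[i])) for i in range(3))
        key = (iz, iy, ix)
        if key not in pillars:
            if len(pillars) >= max_pillars:
                continue  # pillar budget exhausted, ignore new pillars
            pillars[key] = []
        if len(pillars[key]) < max_points:
            pillars[key].append(point)
    return pillars


pc_range = [0.0, -39.68, -3.0, 69.12, 39.68, 1.0]
voxel_size = [0.16, 0.16, 4.0]
points = [
    [10.0, 2.0, -1.0, 0.3],
    [10.05, 2.05, -0.5, 0.5],
    [-5.0, 0.0, 0.0, 0.1],    # behind the car, cropped
    [30.0, -10.0, 0.0, 0.2],
]
pillars = group_points_into_pillars(points, pc_range, voxel_size)
print(sorted(pillars))  # [(0, 185, 187), (0, 260, 62)]
```

Because the pillar height equals the full z extent of the range, the z index is always 0, which matches the comment on `coordinates` above.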

After the preprocessing above, the simplified PointNet is used to extract features from the data in each pillar.

The code is in pcdet/models/backbones_3d/vfe/pillar_vfe.py

import torch
import torch.nn as nn
import torch.nn.functional as F

from .vfe_template import VFETemplate


class PFNLayer(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 use_norm=True,
                 last_layer=False):
        super().__init__()

        self.last_vfe = last_layer
        self.use_norm = use_norm
        if not self.last_vfe:
            out_channels = out_channels // 2

        if self.use_norm:
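            # Following the paper, this initialises the simplified PointNet layer.
            # The paper formulates this per-point linear layer as a 1x1 convolution (which in theory computes faster)
            # Input channels: the decorated point features, 10 per point
            # Output channels: 64
            self.linear = nn.Linear(in_channels, out_channels, bias=False)
            # 1D batch norm layer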
            self.norm = nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.01)
        else:
            self.linear = nn.Linear(in_channels, out_channels, bias=True)

        self.part = 50000

    def forward(self, inputs):
        if inputs.shape[0] > self.part:
            # nn.Linear performs randomly when batch size is too large
            num_parts = inputs.shape[0] // self.part
            part_linear_out = [self.linear(inputs[num_part * self.part:(num_part + 1) * self.part])
                               for num_part in range(num_parts + 1)]
            x = torch.cat(part_linear_out, dim=0)
        else:
            # x goes from (M, 32, 10) to (M, 32, 64)
            x = self.linear(inputs)
        torch.backends.cudnn.enabled = False
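        # BatchNorm1d: permute (M, 32, 64) -> (M, 64, 32), normalise, then permute back to (M, 32, 64)
        # (pillars, num_points, channels) -> (pillars, channels, num_points)
        # The permute is needed because BatchNorm1d normalises over the channel dimension, which it expects in dim 1 of an (N, C, L) input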
        x = self.norm(x.permute(0, 2, 1)).permute(0, 2, 1) if self.use_norm else x
        torch.backends.cudnn.enabled = True
        x = F.relu(x)
        # PointNet max-pooling: take the channel-wise maximum over the points of each pillar
        # x_max shape: (M, 1, 64)
        x_max = torch.max(x, dim=1, keepdim=True)[0]

        if self.last_vfe:
            # return the per-pillar result of the simplified PointNet
            return x_max
        else:
            x_repeat = x_max.repeat(1, inputs.shape[1], 1)
            x_concatenated = torch.cat([x, x_repeat], dim=2)
            return x_concatenated


class PillarVFE(VFETemplate):
    """
    model_cfg:NAME: PillarVFE
                    WITH_DISTANCE: False
                    USE_ABSLOTE_XYZ: True
                    USE_NORM: True
                    NUM_FILTERS: [64]
    num_point_features:4
    voxel_size:[0.16 0.16 4]
    POINT_CLOUD_RANGE: [0, -39.68, -3, 69.12, 39.68, 1]
    """

    def __init__(self, model_cfg, num_point_features, voxel_size, point_cloud_range, **kwargs):
        super().__init__(model_cfg=model_cfg)

        self.use_norm = self.model_cfg.USE_NORM
        self.with_distance = self.model_cfg.WITH_DISTANCE
        self.use_absolute_xyz = self.model_cfg.USE_ABSLOTE_XYZ
        num_point_features += 6 if self.use_absolute_xyz else 3
        if self.with_distance:
            num_point_features += 1

        self.num_filters = self.model_cfg.NUM_FILTERS
        assert len(self.num_filters) > 0
        num_filters = [num_point_features] + list(self.num_filters)

        pfn_layers = []
        for i in range(len(num_filters) - 1):
            in_filters = num_filters[i]
            out_filters = num_filters[i + 1]
            pfn_layers.append(
                PFNLayer(in_filters, out_filters, self.use_norm, last_layer=(i >= len(num_filters) - 2))
            )
        # register the PFN (linear) layers that lift the 10-dim point features to 64 dims
        self.pfn_layers = nn.ModuleList(pfn_layers)

        self.voxel_x = voxel_size[0]
        self.voxel_y = voxel_size[1]
        self.voxel_z = voxel_size[2]
        self.x_offset = self.voxel_x / 2 + point_cloud_range[0]
        self.y_offset = self.voxel_y / 2 + point_cloud_range[1]
        self.z_offset = self.voxel_z / 2 + point_cloud_range[2]

    def get_output_feature_dim(self):
        return self.num_filters[-1]

    def get_paddings_indicator(self, actual_num, max_num, axis=0):
        """
        Compute the padding indicator (mask).
        Args:
            actual_num: number of real points in each pillar, shape (M,)
            max_num: maximum number of points per pillar, an int (32)
        Returns:
            paddings_indicator: boolean mask of shape (M, 32), True for real points and False for zero padding
        """
        # add a dimension: (M,) -> (M, 1)
        actual_num = torch.unsqueeze(actual_num, axis + 1)
        # [1, 1]
        max_num_shape = [1] * len(actual_num.shape)
        # [1, -1]
        max_num_shape[axis + 1] = -1
        # (1, 32)
        max_num = torch.arange(max_num, dtype=torch.int, device=actual_num.device).view(max_num_shape)
        # (M, 32)
        paddings_indicator = actual_num.int() > max_num
        return paddings_indicator

    def forward(self, batch_dict, **kwargs):
        """
        batch_dict:
            points:(N,5) --> (batch_index,x,y,z,r); batch_index is the index of the sample within the current batch
            frame_id:(4,) --> (003877,001908,006616,005355) frame IDs
            gt_boxes:(4,40,8)--> (x,y,z,dx,dy,dz,ry,class)
            use_lead_xyz:(4,) --> (1,1,1,1)
            voxels:(M,32,4) --> (x,y,z,r)
            voxel_coords:(M,4) --> (batch_index,z,y,x); batch_index is the index of the sample within the current batch
            voxel_num_points:(M,)
            image_shape:(4,2) resolution of the camera-2 image for each sample
            batch_size:4
        """
        voxel_features, voxel_num_points, coords = batch_dict['voxels'], batch_dict['voxel_num_points'], batch_dict[
            'voxel_coords']
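        # Sum the xyz of all points in each pillar: (M, 32, 3) -> (M, 1, 3); keepdim=True keeps the reduced dimension.
        # Dividing the sum by the number of valid points in each pillar gives the per-pillar mean. points_mean shape: (M, 1, 3)
        points_mean = voxel_features[:, :, :3].sum(dim=1, keepdim=True) / voxel_num_points.type_as(voxel_features).view(
            -1, 1, 1)
        # Subtract the pillar mean from every point to get the offsets xc, yc, zc
        f_cluster = voxel_features[:, :, :3] - points_mean

        # Allocate the offsets from each point to the centre of its pillar: xp, yp, zp
        f_center = torch.zeros_like(voxel_features[:, :, :3])
        # coords holds the grid index of each pillar (the grid is [432, 496, 1]); multiplying by the pillar size converts it to metres
        # Adding the offset (half the pillar size plus the range minimum) then gives the coordinates of the pillar centre
        # Subtracting that centre from each point's x, y, z gives the offset from the point to its pillar centre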
        f_center[:, :, 0] = voxel_features[:, :, 0] - (
                coords[:, 3].to(voxel_features.dtype).unsqueeze(1) * self.voxel_x + self.x_offset)
        f_center[:, :, 1] = voxel_features[:, :, 1] - (
                coords[:, 2].to(voxel_features.dtype).unsqueeze(1) * self.voxel_y + self.y_offset)
        # Extra z offset here; the paper has no z offset
        f_center[:, :, 2] = voxel_features[:, :, 2] - (
                coords[:, 1].to(voxel_features.dtype).unsqueeze(1) * self.voxel_z + self.z_offset)

        # With absolute coordinates, concatenate all features directly
        if self.use_absolute_xyz:
            features = [voxel_features, f_cluster, f_center]
        # Otherwise drop the first 3 dims (xyz) of voxel_features before concatenating
        else:
            features = [voxel_features[..., 3:], f_cluster, f_center]

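        # If the distance feature is enabled
        if self.with_distance:
            # In torch.norm, the first 2 selects the L2 norm and the second 2 is the dimension to reduce (dim=2)
            points_dist = torch.norm(voxel_features[:, :, :3], 2, 2, keepdim=True)
            features.append(points_dist)
        # Concatenate along the last dimension, giving a tensor of shape (M, 32, 10)
        features = torch.cat(features, dim=-1)
        # maximum number of points per pillar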
        voxel_count = features.shape[1]
        """
        Pillars with fewer than 32 points are zero-padded when they are generated,
        and the computations above give these padded entries
        non-zero xc, yc, zc and xp, yp, zp values,
        so those values must be reset to 0.
        get_paddings_indicator tells which entries of features are real data to keep and which are padding to zero out.
        """
        # mask has shape (M, 32)
        # it marks which entries of each pillar are real data to keep
        mask = self.get_paddings_indicator(voxel_num_points, voxel_count, axis=0)
        # (M, 32) -> (M, 32, 1)
        mask = torch.unsqueeze(mask, -1).type_as(voxel_features)
        # zero out all features of the padded entries
        features *= mask

        for pfn in self.pfn_layers:
            features = pfn(features)
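        # (M, 1, 64) -> (M, 64): each pillar is summarised by one 64-dim feature
        # (note that a bare squeeze() would also drop the first dimension if M were 1)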
        features = features.squeeze()
        batch_dict['pillar_features'] = features
        return batch_dict
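The padding mask is the least obvious part of the code above, so here is a small pure-Python sketch of the same idea: compare each slot index with the number of real points in the pillar, then zero the padded slots. The helper names `paddings_indicator` and `apply_mask` are hypothetical, and nested lists stand in for tensors.

```python
def paddings_indicator(actual_num, max_num):
    """True where slot index < number of real points, mirroring actual_num > arange(max_num)."""
    return [[slot < n for slot in range(max_num)] for n in actual_num]


def apply_mask(features, mask):
    """Zero every feature of the padded slots. features: M x N x D, mask: M x N."""
    return [
        [[value if keep else 0.0 for value in point] for point, keep in zip(pillar, row)]
        for pillar, row in zip(features, mask)
    ]


mask = paddings_indicator([2, 4], max_num=4)
print(mask)  # [[True, True, False, False], [True, True, True, True]]

# One pillar with 1 real point and 1 padded slot whose offsets became non-zero
features = [[[1.0, 0.5], [0.0, -0.7]]]
print(apply_mask(features, paddings_indicator([1], max_num=2)))  # [[[1.0, 0.5], [0.0, 0.0]]]
```

Without this step the padded slots would carry spurious offsets into the linear layer and could win the max-pool.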

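Once the simplified PointNet has extracted a feature vector for every pillar, all pillar features need to be placed back at their original grid coordinates to form the pseudo-image.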
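As a rough illustration of that scatter step, the sketch below writes each pillar's feature vector into a zero-initialised C x ny x nx canvas at the pillar's (y, x) grid index. It is a simplified single-sample version in plain Python; the function name `scatter_to_pseudo_image` is an assumption and the coordinates follow the (z, y, x) order used by `coordinates` earlier.

```python
def scatter_to_pseudo_image(pillar_features, coords, nx, ny):
    """pillar_features: M x C; coords: M x (z, y, x). Returns a C x ny x nx canvas."""
    num_channels = len(pillar_features[0])
    # empty cells stay at zero
    canvas = [[[0.0] * nx for _ in range(ny)] for _ in range(num_channels)]
    for feature, (_, y, x) in zip(pillar_features, coords):
        for c in range(num_channels):
            canvas[c][y][x] = feature[c]
    return canvas


pillar_features = [[1.0, 2.0], [3.0, 4.0]]   # M=2 pillars, C=2 channels
coords = [(0, 0, 1), (0, 2, 0)]              # (z, y, x) grid indices
canvas = scatter_to_pseudo_image(pillar_features, coords, nx=3, ny=3)
print(canvas[0])  # [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
```

With the KITTI settings above the canvas would be 64 x 496 x 432, which a 2D backbone can then process like an image.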