Reposted from the AI Studio project: https://aistudio.baidu.com/aistudio/projectdetail/3475736
INADE Explained (mostly my personal understanding and opinions)
1. Paper title: Diverse Semantic Image Synthesis via Probability Distribution Modeling
2. Original project: https://github.com/tzt101/INADE
Here is how the paper frames the semantic image synthesis task:
- Conditional normalization, whether spatially-adaptive [30] or class-adaptive [37], has proven helpful for semantic image synthesis. Semantic conditional modulation largely prevents the "wash-away" effect that repeated normalization has on semantic information.
- However, since the normalization is conditioned only on the semantic map, and only global randomness is used to diversify image styles [30], it remains challenging to achieve promising generation results with semantic-level, let alone instance-level, diversity.
- Semantic-level diversity was achieved in [51] via group convolution, but using that kind of convolution rules out extending the approach to instance-level diversity through instance maps.
- Recent work on instance-aware synthesis [41, 6] focuses mainly on sharper object boundaries rather than on the diversity and realism of each individual instance. Lacking proper instance conditioning, existing methods tend to make instances with the same semantic label converge to similar styles, which severely hurts generation diversity.
The paper then explains where INADE's idea of adding instance segmentation comes from:
- The key to instance-level diversity is the right combination of (a) a unified semantic-level distribution that deterministically decides the general characteristics of a given semantic label, and (b) instance-level randomness that introduces the diversity permitted by that semantic distribution model.

Given that the generator contains multiple conditional normalization layers, a unified sampling solution is still the key to coordinating all of them. A straightforward approach, i.e. sampling independently at each normalization layer, may introduce inconsistencies and largely neutralize the diversity. The paper therefore proposes an instance-adaptive modulation sampling method that achieves consistent instance sampling across normalization layers with unequal channel counts, as sketched below.
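A tiny sketch of that idea (my own illustration, not the repo's code): sample one per-instance noise tensor for the whole generator, then let each normalization layer project it to its own channel count, so every layer sees the same instance sample.

import paddle
import paddle.nn as nn

noise = paddle.randn([2, 8, 2, 108])                 # [B, inst_nc, 2(scale/bias), noise_nc], sampled once
fcs = [nn.Linear(108, c) for c in (1024, 512, 256)]  # one projection per norm layer (channel counts made up)
per_layer = [fc(noise) for fc in fcs]                # each [B, inst_nc, 2, norm_nc_l], all from the same sample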
Generator architecture
So what is INADE's main idea for improving diversity? Compared with SPADE, it takes one extra input, the instance segmentation map; see the generator architecture figure in the paper.
Mathematical formulation of INADE
Don't worry if the math doesn't click; just read the code. Papers tell stories, code doesn't.
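The formula image is not reproduced here, but reconstructing it from the code below (notation mine, so treat it as a sketch), each INADE layer computes

$$\hat{x}=\frac{x-\mu}{\sigma},\qquad \gamma = W^{\gamma}_{s}\odot f(z_{i})+B^{\gamma}_{s},\qquad \beta = W^{\beta}_{s}\odot f(z_{i})+B^{\beta}_{s},\qquad \mathrm{out}=\gamma\odot\hat{x}+\beta$$

where $s$ is the pixel's semantic class, $z_{i}$ is the noise vector of the pixel's instance (with separate slices for scale and bias), $f$ is the shared fully connected layer fc_noise, and $\odot$ is channel-wise multiplication.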
Here is the actual PyTorch code for INADE (the class is named ILADE in the repo):
import re
import torch
import torch.nn as nn
import torch.nn.functional as F
# SynchronizedBatchNorm2d comes from the sync_batchnorm package in the original repo

class ILADE(nn.Module):
    def __init__(self, config_text, norm_nc, label_nc, noise_nc):
        super().__init__()
        self.norm_nc = norm_nc
        assert config_text.startswith('spade')
        parsed = re.search(r'spade(\D+)(\d)x\d', config_text)
        param_free_norm_type = str(parsed.group(1))
        if param_free_norm_type == 'instance':
            self.param_free_norm = nn.InstanceNorm2d(norm_nc, affine=False)
        elif param_free_norm_type == 'syncbatch':
            self.param_free_norm = SynchronizedBatchNorm2d(norm_nc, affine=False)
        elif param_free_norm_type == 'batch':
            self.param_free_norm = nn.BatchNorm2d(norm_nc, affine=False)
        else:
            raise ValueError('%s is not a recognized param-free norm type in SPADE'
                             % param_free_norm_type)
        # weights and bias for each class
        self.weight = nn.Parameter(torch.Tensor(label_nc, norm_nc, 2))
        self.bias = nn.Parameter(torch.Tensor(label_nc, norm_nc, 2))
        self.reset_parameters()
        self.fc_noise = nn.Linear(noise_nc, norm_nc)

    def reset_parameters(self):
        nn.init.uniform_(self.weight)
        nn.init.zeros_(self.bias)

    def forward(self, x, segmap, input_instances=None, noise=None):
        # Part 1. generate parameter-free normalized activations
        # noise is [B, inst_nc, 2, noise_nc], 2 is for scale and bias
        normalized = self.param_free_norm(x)
        # Part 2. scale the segmentation mask and instance mask
        segmap = F.interpolate(segmap, size=x.size()[2:], mode='nearest')
        input_instances = F.interpolate(input_instances, size=x.size()[2:], mode='nearest')
        # the instance map is concatenated as the last channel of segmap; split it off
        inst_map = torch.unsqueeze(segmap[:, -1, :, :], 1)
        segmap = segmap[:, :-1, :, :]
        # Part 3. class affine with noise
        noise_size = noise.size()  # [B, inst_nc, 2, noise_nc]
        noise_reshape = noise.view(-1, noise_size[-1])  # reshape to [B*inst_nc*2, noise_nc]
        noise_fc = self.fc_noise(noise_reshape)  # [B*inst_nc*2, norm_nc]
        noise_fc = noise_fc.view(noise_size[0], noise_size[1], noise_size[2], -1)
        # create weighted instance noise for scale
        class_weight = torch.einsum('ic,nihw->nchw', self.weight[..., 0], segmap)
        class_bias = torch.einsum('ic,nihw->nchw', self.bias[..., 0], segmap)
        instance_noise = torch.einsum('nic,nihw->nchw', noise_fc[:, :, 0, :], input_instances)
        scale_instance_noise = class_weight * instance_noise + class_bias
        # create weighted instance noise for bias
        class_weight = torch.einsum('ic,nihw->nchw', self.weight[..., 1], segmap)
        class_bias = torch.einsum('ic,nihw->nchw', self.bias[..., 1], segmap)
        instance_noise = torch.einsum('nic,nihw->nchw', noise_fc[:, :, 1, :], input_instances)
        bias_instance_noise = class_weight * instance_noise + class_bias
        out = scale_instance_noise * normalized + bias_instance_noise
        return out
Below is my Paddle version.
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
'''
There's an einsum here that I suspect most people rarely use; at least I don't, haha.
Here is how it is used in this code:
instance_noise = paddle.einsum('nic,nihw->nchw', noise_fc[:,:,0,:], input_instances)
The two inputs, noise_fc[:,:,0,:] and input_instances, have shapes [B, instance_nc, norm_nc] and [B, instance_nc, h, w].
The einsum contracts the instance dimension and yields a tensor of shape [B, norm_nc, h, w];
it is effectively a bias-free linear map (think nn.Linear) applied along dim 1.
'''
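To make the contraction concrete, here is a quick sanity check (my own illustration, not from the repo) showing that this einsum equals a transpose plus a batched matmul:

import paddle
B, inst_nc, norm_nc, h, w = 2, 5, 4, 3, 3
a = paddle.randn([B, inst_nc, norm_nc])
m = paddle.randn([B, inst_nc, h, w])
out1 = paddle.einsum('nic,nihw->nchw', a, m)
# same thing as a batched matmul over the instance dimension
out2 = paddle.matmul(a.transpose([0, 2, 1]), m.reshape([B, inst_nc, h * w])).reshape([B, norm_nc, h, w])
print(float((out1 - out2).abs().max()))  # ~0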
class INADE(nn.Layer):
    def __init__(self, norm_nc=64, label_nc=46, noise_nc=108):
        super().__init__()
        self.param_free_norm = nn.InstanceNorm2D(norm_nc, weight_attr=False, bias_attr=False)
        # weights and bias for each class
        weight = self.create_parameter([label_nc, norm_nc, 2],
                                       default_initializer=paddle.nn.initializer.Uniform())  # uniform random init
        self.add_parameter("weight", weight)
        bias = self.create_parameter([label_nc, norm_nc, 2],
                                     default_initializer=paddle.nn.initializer.Constant())  # zero init
        self.add_parameter("bias", bias)
        self.fc_noise = nn.Linear(noise_nc, norm_nc)

    def forward(self, x, segmap, input_instances=None, noise=None):
        # Part 1. generate parameter-free normalized activations
        # noise is [B, inst_nc, 2, noise_nc], 2 is for scale and bias
        normalized = self.param_free_norm(x)
        # Part 2. scale the segmentation mask and instance mask
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        input_instances = F.interpolate(input_instances, size=x.shape[2:], mode='nearest')
        # the instance map is concatenated as the last channel of segmap; split it off
        inst_map = paddle.unsqueeze(segmap[:, -1, :, :], 1)  # not used below
        segmap = segmap[:, :-1, :, :]
        # Part 3. class affine with noise
        noise_size = noise.shape  # [B, inst_nc, 2, noise_nc]
        noise_reshape = noise.reshape([-1, noise_size[-1]])  # [B*inst_nc*2, noise_nc]
        noise_fc = self.fc_noise(noise_reshape)  # [B*inst_nc*2, norm_nc]
        noise_fc = noise_fc.reshape([noise_size[0], noise_size[1], noise_size[2], -1])  # [B, inst_nc, 2, norm_nc]
        print("noise_fc", noise_fc.shape)
        # create weighted instance noise for scale
        class_weight = paddle.einsum('ic,nihw->nchw', self.weight[..., 0], segmap)  # [label_nc, norm_nc] x [B, label_nc, h, w] -> [B, norm_nc, h, w]
        print("class_weight", class_weight.shape)
        class_bias = paddle.einsum('ic,nihw->nchw', self.bias[..., 0], segmap)
        instance_noise = paddle.einsum('nic,nihw->nchw', noise_fc[:, :, 0, :], input_instances)  # [B, inst_nc, norm_nc] x [B, inst_nc, h, w] -> [B, norm_nc, h, w]
        scale_instance_noise = class_weight * instance_noise + class_bias
        # create weighted instance noise for bias
        class_weight = paddle.einsum('ic,nihw->nchw', self.weight[..., 1], segmap)
        class_bias = paddle.einsum('ic,nihw->nchw', self.bias[..., 1], segmap)
        instance_noise = paddle.einsum('nic,nihw->nchw', noise_fc[:, :, 1, :], input_instances)
        bias_instance_noise = class_weight * instance_noise + class_bias
        out = scale_instance_noise * normalized + bias_instance_noise
        return out
x = paddle.randn([3, 64, 50, 50])
segmap = paddle.randn([3, 47, 66, 66])  # 46 semantic channels + 1 instance channel (split off inside forward)
inst = paddle.randn([3, 72, 50, 50])    # 72 instance channels
noise = paddle.randn([3, 72, 2, 108])   # [B, inst_nc, 2, noise_nc]
INADE()(x, segmap, inst, noise).shape
noise_fc [3, 72, 2, 64]
class_weight [3, 64, 50, 50]
[3, 64, 50, 50]
Also, as mentioned above, the entire generator uses one and the same noise.
Another detail: this noise needs to be related to the input image, i.e. the z carries information. That makes training easier, yes.
The code for this part is quite involved; if you're really interested, read the original project. Since I couldn't use it in my own project, I didn't pay it much attention. Good luck.
Just kidding: how this noise design is actually implemented in code matters a lot. First, the noise is what lets you increase and control the diversity of the generated results; second, at the start of training the noise must be tied to the information in the real image, which makes training tractable. A noise initialized from scratch, with no such link, would be very hard to train.
Next, I'll analyze the main practice directly from the perspective of the actual PyTorch code.
Some of the APIs used
nn.Unfold
import paddle
import paddle.nn as nn
x = paddle.randn((100,3,224,224))
unfold = nn.Unfold(kernel_sizes=[3, 3])
result = unfold(x) #result.shape = [100,3*3*3,(224-3+1)*(224-3+1)]
print(result.shape)
paddle.clip
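A quick illustration of my own (not from the repo): paddle.clip clamps every element into [min, max]. The code below uses it to turn absolute differences into a 0/1 mask.

import paddle
t = paddle.to_tensor([-2.0, 0.5, 3.0])
print(paddle.clip(t, 0, 1).numpy())  # [0.  0.5 1. ]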
'''
This block re-implements the convolution wrapper used inside the encoder;
the encoder exists to construct the noise at training time.
'''
import paddle
import paddle.nn.functional as F
import paddle.nn as nn

class InstanceAwareConv2d(nn.Layer):
    def __init__(self, fin=64, fout=128, kw=3, stride=1, padding=1):
        super().__init__()
        self.kw = kw
        self.stride = stride
        self.padding = padding
        self.fin = fin
        self.fout = fout
        self.unfold = nn.Unfold(kw, strides=stride, paddings=padding)
        weight = self.create_parameter([fout, fin, kw, kw],
                                       default_initializer=paddle.nn.initializer.Uniform())  # uniform random init
        self.add_parameter("weight", weight)
        bias = self.create_parameter([fout], default_initializer=paddle.nn.initializer.Constant())
        self.add_parameter("bias", bias)

    def forward(self, x, instances, check=False):
        N, C, H, W = x.shape
        # compute a binary mask from the instance map
        instances = F.interpolate(instances, x.shape[2:], mode='nearest')  # [n,1,h,w]
        inst_unf = self.unfold(instances)
        # subtract the center pixel; the instance map has a single channel,
        # so the center element of each k*k window sits at index self.kw*self.kw//2
        center = paddle.unsqueeze(inst_unf[:, self.kw * self.kw // 2, :], axis=1)
        mask_unf = inst_unf - center
        # clip the absolute difference to 0~1, then invert: 1 where a window pixel
        # belongs to the same instance as the center, 0 otherwise
        mask_unf = paddle.abs(mask_unf)
        mask_unf = paddle.clip(mask_unf, 0, 1)
        mask_unf = 1.0 - mask_unf  # [n,k*k,L]
        # multiply mask_unf and x
        x_unf = self.unfold(x)  # [n,c*k*k,L]
        x_unf = x_unf.reshape([N, C, -1, x_unf.shape[-1]])  # [n,c,k*k,L]
        mask = paddle.unsqueeze(mask_unf, 1)  # [n,1,k*k,L]
        mask_x = mask * x_unf  # [n,c,k*k,L]
        mask_x = mask_x.reshape([N, -1, mask_x.shape[-1]])  # [n,c*k*k,L]
        # the conv becomes a matrix multiply over the unfolded (masked) windows
        weight = self.weight.reshape([self.fout, -1])  # [fout, c*k*k]
        out = paddle.einsum('cm,nml->ncl', weight, mask_x)  # [n,fout,L]
        bias = paddle.unsqueeze(paddle.unsqueeze(self.bias, 0), -1)  # [1,fout,1]
        out = out + bias
        out = out.reshape([N, self.fout, H // self.stride, W // self.stride])
        if check:
            # compare against a plain conv2d (they match when the instance map is constant)
            out2 = F.conv2d(x, self.weight, self.bias, stride=self.stride, padding=self.padding)
            print((out - out2).abs().max())
        return out
x = paddle.randn([4,64,256,256])
y = paddle.randn([4,1,256,256])
InstanceAwareConv2d()(x,y).shape
[4, 128, 256, 256]
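A quick sanity check of my own: with a constant instance map every window mask is all ones (and the unfold's zero padding matches conv2d's zero padding), so the instance-aware conv should reduce to a plain conv2d.

x = paddle.randn([2, 64, 32, 32])
const_inst = paddle.ones([2, 1, 32, 32])
_ = InstanceAwareConv2d()(x, const_inst, check=True)  # printed max abs diff should be ~0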
# Encoder construction
import paddle
import paddle.nn as nn
import numpy as np

class Encoder_OPT:
    def __init__(self):
        super().__init__()
        self.ngf = 64
        self.semantic_nc = 46
        self.no_instance = True
        self.noise_nc = 108

opt = Encoder_OPT()
class instanceAdaptiveEncoder(nn.Layer):
    def __init__(self, opt):
        super().__init__()
        self.opt = opt
        kw = 3
        pw = int(np.ceil((kw - 1.0) / 2))
        ndf = opt.ngf
        conv_layer = InstanceAwareConv2d
        self.layer1 = conv_layer(3, ndf, kw, stride=2, padding=pw)
        self.norm1 = nn.InstanceNorm2D(ndf)
        self.layer2 = conv_layer(ndf * 1, ndf * 2, kw, stride=2, padding=pw)
        self.norm2 = nn.InstanceNorm2D(ndf * 2)
        self.layer3 = conv_layer(ndf * 2, ndf * 4, kw, stride=2, padding=pw)
        self.norm3 = nn.InstanceNorm2D(ndf * 4)
        self.layer4 = conv_layer(ndf * 4, ndf * 8, kw, stride=2, padding=pw)
        self.norm4 = nn.InstanceNorm2D(ndf * 8)
        self.middle = conv_layer(ndf * 8, ndf * 4, kw, stride=1, padding=pw)
        self.norm_middle = nn.InstanceNorm2D(ndf * 4)
        self.up1 = conv_layer(ndf * 8, ndf * 2, kw, stride=1, padding=pw)
        self.norm_up1 = nn.InstanceNorm2D(ndf * 2)
        self.up2 = conv_layer(ndf * 4, ndf * 1, kw, stride=1, padding=pw)
        self.norm_up2 = nn.InstanceNorm2D(ndf)
        self.up3 = conv_layer(ndf * 2, ndf, kw, stride=1, padding=pw)
        self.norm_up3 = nn.InstanceNorm2D(ndf)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear')
        self.class_nc = opt.semantic_nc if opt.no_instance else opt.semantic_nc - 1
        self.scale_conv_mu = conv_layer(ndf, opt.noise_nc, kw, stride=1, padding=pw)
        self.scale_conv_var = conv_layer(ndf, opt.noise_nc, kw, stride=1, padding=pw)
        self.bias_conv_mu = conv_layer(ndf, opt.noise_nc, kw, stride=1, padding=pw)
        self.bias_conv_var = conv_layer(ndf, opt.noise_nc, kw, stride=1, padding=pw)
        self.actvn = nn.LeakyReLU(0.2)
        self.opt = opt

    def instAvgPooling(self, x, instances):
        inst_num = instances.shape[1]
        for i in range(inst_num):
            inst_mask = paddle.unsqueeze(instances[:, i, :, :], 1)  # [n,1,h,w]
            pixel_num = paddle.sum(paddle.sum(inst_mask, axis=2, keepdim=True), axis=3, keepdim=True)
            pixel_num[pixel_num == 0] = 1  # guard against an absent instance label becoming a zero divisor
            feat = x * inst_mask  # keep only this instance's features; inst_mask is 0 or 1
            feat = paddle.sum(paddle.sum(feat, axis=2, keepdim=True), axis=3, keepdim=True) / pixel_num
            if i == 0:
                out = paddle.unsqueeze(feat[:, :, 0, 0], 1)  # [n,1,c]
            else:
                out = paddle.concat([out, paddle.unsqueeze(feat[:, :, 0, 0], 1)], 1)
        return out  # [n,inst_num,c]

    def forward(self, x, input_instances):
        # instances [n,1,h,w], input_instances [n,inst_nc,h,w]; note the shapes
        instances = paddle.argmax(input_instances, 1, keepdim=True).astype("float32")
        print("instance", instances.shape)
        x1 = self.actvn(self.norm1(self.layer1(x, instances)))
        x2 = self.actvn(self.norm2(self.layer2(x1, instances)))
        x3 = self.actvn(self.norm3(self.layer3(x2, instances)))
        x4 = self.actvn(self.norm4(self.layer4(x3, instances)))
        print("x1", x1.shape, "x2", x2.shape, "x3", x3.shape, "x4", x4.shape)
        y = self.up(self.actvn(self.norm_middle(self.middle(x4, instances))))
        y1 = self.up(self.actvn(self.norm_up1(self.up1(paddle.concat([y, x3], 1), instances))))
        y2 = self.up(self.actvn(self.norm_up2(self.up2(paddle.concat([y1, x2], 1), instances))))
        y3 = self.up(self.actvn(self.norm_up3(self.up3(paddle.concat([y2, x1], 1), instances))))
        print("y", y.shape, "y1", y1.shape, "y2", y2.shape, "y3", y3.shape)
        scale_mu = self.scale_conv_mu(y3, instances)
        scale_var = self.scale_conv_var(y3, instances)
        bias_mu = self.bias_conv_mu(y3, instances)
        bias_var = self.bias_conv_var(y3, instances)
        scale_mus = self.instAvgPooling(scale_mu, input_instances)
        scale_vars = self.instAvgPooling(scale_var, input_instances)
        bias_mus = self.instAvgPooling(bias_mu, input_instances)
        bias_vars = self.instAvgPooling(bias_var, input_instances)
        return scale_mus, scale_vars, bias_mus, bias_vars  # each of shape [batch_size, instance_nc, noise_nc]
encoder = instanceAdaptiveEncoder(opt)
x = paddle.randn([4,3,256,256])
input_instances = paddle.randn([4,72,256,256])
encoder(x,input_instances)
class Encoder_OPT:
    def __init__(self):
        super().__init__()
        self.ngf = 64
        self.semantic_nc = 2
        self.no_instance = True
        self.noise_nc = 108

opt = Encoder_OPT()
def instance_encode_z(real_image, input_instances):
    s_mus, s_logvars, b_mus, b_logvars = instanceAdaptiveEncoder(opt)(real_image, input_instances)
    z = [s_mus, paddle.exp(0.5 * s_logvars), b_mus, paddle.exp(0.5 * b_logvars)]
    return z, s_mus, s_logvars, b_mus, b_logvars

instance_nc = 2
real_image = paddle.randn([4, 3, 256, 256])
input_instances = paddle.randn([4, instance_nc, 256, 256])
z, s_mus, s_logvars, b_mus, b_logvars = instance_encode_z(real_image, input_instances)
# s_mus, s_logvars, b_mus, b_logvars are returned in order to compute the KLD loss
instance [4, 1, 256, 256]
x1 [4, 64, 128, 128] x2 [4, 128, 64, 64] x3 [4, 256, 32, 32] x4 [4, 512, 16, 16]
y [4, 256, 32, 32] y1 [4, 128, 64, 64] y2 [4, 64, 128, 128] y3 [4, 64, 256, 256]
# KLD_loss = (KLDLoss(s_mus, s_logvars) + KLDLoss(b_mus, b_logvars)) * opt.lambda_kld / 2
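KLDLoss is not defined in this notebook; in the original SPADE/INADE code it is the standard VAE KL term against a unit Gaussian. A minimal sketch, assuming that formulation:

import paddle

def KLDLoss(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over all elements
    return -0.5 * paddle.sum(1 + logvar - paddle.square(mu) - paddle.exp(logvar))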
instance_nc = 2
noise_nc = 108
noise = paddle.randn([x.shape[0], instance_nc, 2, noise_nc])

def pre_process_noise(noise, z):
    '''
    noise: [n,inst_nc,2,noise_nc], z_i [n,inst_nc,noise_nc]
    z: [s_mus, paddle.exp(0.5 * s_logvars), b_mus, paddle.exp(0.5 * b_logvars)]
    '''
    # reparameterization: noise * std + mu, separately for the scale and bias slices
    s_noise = paddle.unsqueeze(noise[:, :, 0, :].multiply(z[1]) + z[0], 2)
    b_noise = paddle.unsqueeze(noise[:, :, 1, :].multiply(z[3]) + z[2], 2)
    return paddle.concat([s_noise, b_noise], 2)

noise = pre_process_noise(noise, z)  # only now is this the noise that flows through the decoder as an input to every INADE layer
print(noise.shape)  # [4, instance_nc, 2, noise_nc]
[4, 2, 2, 108]
So at training time we can obtain this noise. But the decoder upsamples progressively from a very small feature map, and the author builds that feature map simply by randn plus a linear layer plus a reshape, so its initialization carries no information at all. In my view this is a good choice: at test time the decoder's input is exactly a standard normal sample, so the train and test inputs stay consistent, and the model is prevented from relying on structural information in this initial feature map. (I'm still experimenting with this feature-map treatment; it doesn't seem easy to train.)
# sketch of how the decoder's initial feature map is built, in Paddle
# (z_dim, sw, sh are placeholder values here; the original snippet mixed torch and paddle)
batch_size = 4
z_dim, sw, sh = 256, 8, 8
z = paddle.randn([batch_size, z_dim])
x = nn.Linear(z_dim, 16 * 64 * sw * sh)(z)
x = x.reshape([-1, 16 * 64, sh, sw])
Summary:
Essentially SPADE plus an extra instance segmentation input to increase diversity, together with a noise design that controls that diversity.
The training-time design of this noise is well worth learning from.
I'll try this trick of steering image features with noise in a concrete project of my own and share more then.