Metrics of GAN: IS, FID

Introduction

This post summarizes evaluation metrics for GANs used across different domains:

Inception Score (IS)
Fréchet Inception Distance (FID)

These two are the most widely used.

Inception Score [3]

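The Inception Score relies on a pretrained Inception model and uses the logits from its last layer (after softmax these can be read as the predicted probability, or confidence, of each class). IS judges a GAN from two angles:

The entropy of the conditional distribution ${p(y \vert x)}$ should be as small as possible (entropy measures uncertainty). That is, for a given image ${x}$ the prediction should be confident: one class gets a high probability and all others are close to 0.
The entropy of the marginal distribution ${p(y)}$ should be as large as possible. That is, across all generated images the classes should be roughly evenly represented.

These two principles give the following two quantities: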

{ \begin{aligned} H_{1} &= - \mathbb{E}_{y} \left[ \log p(y \vert x) \right] \\ H_{2} &= - \mathbb{E}_{y} \left[ \log p(y) \right] \end{aligned} }

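So the goal is to make ${H_{1}}$ as small as possible and ${H_{2}}$ as large as possible. Taking both expectations over ${y \sim p(y \vert x)}$ (for ${H_{2}}$ this turns into the entropy of ${p(y)}$ once we average over ${x}$), the two can be combined into: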

{ \begin{aligned} H_{2} - H_{1} &= \mathbb{E}_{y} \left[ \log p(y \vert x) - \log p(y) \right] \\ &= KL(p(y \vert x) \| p(y)) \end{aligned} }

Extending this to all generated images means taking the expectation of the expression above over ${x}$:

{ \mathbb{E}_{x \sim P_{g}} KL(p(y \vert x) \| p(y)) }
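As a sanity check on this step, the expectation over ${x}$ can be expanded using ${p(y) = \mathbb{E}_{x \sim P_{g}} p(y \vert x)}$. The derivation below is only a sketch in this post's notation; ${H(y)}$ and ${H(y \vert x)}$ are shorthand introduced here for the marginal entropy and the average conditional entropy, they do not appear in the original text.

```latex
\begin{aligned}
\mathbb{E}_{x \sim P_{g}} KL(p(y \vert x) \| p(y))
&= \mathbb{E}_{x \sim P_{g}} \sum_{y} p(y \vert x) \log p(y \vert x)
 - \mathbb{E}_{x \sim P_{g}} \sum_{y} p(y \vert x) \log p(y) \\
&= - H(y \vert x) - \sum_{y} p(y) \log p(y) \\
&= H(y) - H(y \vert x)
\end{aligned}
```

Read this way, the averaged KL term is exactly "marginal entropy minus conditional entropy", which is the quantity the two principles above ask to be large.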

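Because the expression involves ${\log}$, we finally apply ${\exp}$ to bring it back to a linear scale so that changes in the score are easier to compare. This gives the Inception Score: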

{ IS = \exp \left( \mathbb{E}_{x \sim P_{g}} KL(p(y \vert x) \| p(y)) \right) }

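The larger the IS, the better the generated images are judged to be. There are two ways to raise it: generate images that the Inception model classifies with high confidence, and generate images whose classes are as evenly balanced as possible.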
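To make the two effects concrete, here is a minimal NumPy sketch that evaluates the IS expression on hand-made class probabilities instead of real Inception outputs. The function name `inception_score` and the three toy probability matrices are assumptions made for illustration only.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an (N, M) array whose rows are p(y|x_i). Illustrative helper."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # estimate of p(y)
    # KL(p(y|x_i) || p(y)) for every image, then average and exponentiate
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

M = 4
confident_diverse = np.eye(M)                        # confident, one image per class
confident_collapsed = np.tile(np.eye(M)[0], (M, 1))  # confident, but always class 0
uncertain = np.full((M, M), 1.0 / M)                 # every prediction is uniform

score_diverse = inception_score(confident_diverse)      # close to M = 4
score_collapsed = inception_score(confident_collapsed)  # close to 1
score_uncertain = inception_score(uncertain)            # close to 1
print(score_diverse, score_collapsed, score_uncertain)
```

Only the first case, confident and balanced, reaches the upper bound of ${M}$; losing either property pushes the score back toward 1.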

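One drawback is already visible at this point:

The Inception Score depends on the Inception model, so it compares the generated images against ${D_{pretrain}}$, the dataset Inception was pretrained on, rather than against ${D_{GANtrain}}$, the dataset the GAN was trained on. In my view, replacing Inception with another classifier trained on ${D_{GANtrain}}$ would also work, but ${D_{GANtrain}}$ is not necessarily usable for classification (it may have no class labels). The score is therefore most suitable for GANs trained on ImageNet.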

Computing IS

Discretizing the expression gives:

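{ \ln (IS) = \frac{1}{N} \sum_{i = 1}^{N} \left[ \sum_{j = 1}^{M} p(y_{j} \vert x_{i}) \log \frac{p(y_{j} \vert x_{i})}{p(y_{j})} \right] }

where ${N}$ is the number of images, ${M}$ is the number of classes, and the marginal is estimated as ${p(y_{j}) = \frac{1}{N} \sum_{i = 1}^{N} p(y_{j} \vert x_{i})}$.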

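Since this is only a discrete approximation, the original paper takes ${N = 5000}$, repeats the computation 10 times, and reports the mean and standard deviation of the resulting scores.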
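The split-and-repeat procedure can be sketched as follows, assuming the class probabilities have already been computed. The helper name `inception_score_splits` and the random stand-in probabilities are hypothetical; in practice `probs` would hold the softmax outputs of Inception for the generated images.

```python
import numpy as np

def inception_score_splits(probs, n_splits=10, eps=1e-12):
    """Mean and std of IS over n_splits disjoint parts of an (N, M) array."""
    probs = np.asarray(probs, dtype=np.float64)
    scores = []
    for part in np.array_split(probs, n_splits):
        marginal = part.mean(axis=0, keepdims=True)  # p(y) estimated per split
        kl = np.sum(part * (np.log(part + eps) - np.log(marginal + eps)), axis=1)
        scores.append(np.exp(kl.mean()))
    scores = np.array(scores)
    return scores.mean(), scores.std()

# Stand-in for real Inception outputs: softmax of random logits, 10 classes.
rng = np.random.default_rng(0)
logits = 3.0 * rng.normal(size=(50000, 10))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

mean_is, std_is = inception_score_splits(probs, n_splits=10)  # 10 splits of 5000
print(mean_is, std_is)
```

Note that each split estimates ${p(y)}$ from its own 5000 images only, which is why the split size matters for the result.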

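Drawbacks of IS

It cannot detect mode collapse within a class. IS does not measure the diversity of the generated images inside each class; it only rewards an even spread across classes. That makes this aspect uninformative for a CGAN (which controls the generated class directly), although IS can still detect mode collapse across classes.
It is sensitive to the weights of the Inception model. Different pretrained Inception implementations may reach similar classification accuracy while producing noticeably different logits, which leads to noticeably different IS values.
With ${N = 5000}$ per computation, a dataset as large as ImageNet does not give an accurate estimate of the marginal distribution. It also introduces the number of repetitions ${n}$ as a hyperparameter. One remedy is to drop the ${\exp}$ from IS, so that averaging over several runs matches a single computation (as long as each run estimates ${p(y)}$ well).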

The second and third points come from [1].
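The sample-size drawback can be illustrated with a small simulation. This is a sketch under strong assumptions: an idealized generator that draws classes uniformly and is always classified with full confidence, and 100 classes as a scaled-down stand-in for ImageNet's 1000. The helper names are hypothetical.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an (N, M) array whose rows are p(y|x_i). Illustrative helper."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

def simulated_score(num_images, num_classes, rng):
    # Idealized generator: uniform classes, one-hot (fully confident) predictions.
    labels = rng.integers(0, num_classes, size=num_images)
    return inception_score(np.eye(num_classes)[labels])

rng = np.random.default_rng(0)
M = 100                                  # the ideal score for this generator is M
small = simulated_score(500, M, rng)     # 5 images per class, like N = 5000 with 1000 classes
large = simulated_score(20000, M, rng)   # 200 images per class
print(small, large)
```

With few images per class the empirical marginal is visibly non-uniform, so the small sample underestimates the ideal score even though the generator itself is perfectly balanced.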

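Common mistakes when using the Inception Score

Some care is needed when using IS. The following misuses are pointed out in [1]:

Using IS on datasets other than ImageNet. This one is easy to see: when an Inception model trained on ImageNet is applied to another dataset, ${p(y|x)}$ no longer faithfully reflects the distribution of features in that data.
Using IS as an optimization objective. This tends to produce adversarial examples.
Not reporting whether the generative model has overfit.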

Code implementation

Below is an implementation of IS without the ${\exp}$. The Inception model is the pretrained inception_v3 from torchvision.

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import PIL.Image as im
import torchvision.transforms as T

import numpy as np
from tqdm import tqdm

class LabelIgnoreDataset(Dataset):
    def __init__(self, origin_set):
        self.org = origin_set
    
    def __len__(self):
        return len(self.org)

    def __getitem__(self, idx):
        return self.org[idx][0]


class ISImageSet(Dataset):
    def __init__(self, root, transform = None):
        self.root = root
        self.images = os.listdir(self.root)
        self.transform = transform

    def __len__(self):
        return len(self.images)

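    def __getitem__(self, idx):
        assert 0 <= idx < len(self), "index out of range"
        # os.path.join works whether or not root ends with a path separator;
        # convert to RGB because Inception expects 3-channel input
        image = im.open(os.path.join(self.root, self.images[idx])).convert("RGB")
        return self.transform(image) if self.transform else image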

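def InceptionScore(inception, images, run_batch_size = 256, batch_size = 5000, resize = True, device = None):
    """Return mean, std and per-split values of ln(IS), i.e. IS without the exp.

    images is a Dataset that yields image tensors of shape (3, H, W).
    """
    data = DataLoader(images, run_batch_size)
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inception.to(device)
    inception.eval()

    def get_preds(x):
        if resize:
            # Inception v3 expects 299x299 inputs
            x = F.interpolate(x, size=(299,299))
        with torch.no_grad():
            y = inception(x)
        # softmax over the class dimension turns logits into p(y|x)
        return torch.softmax(y, dim = 1).cpu()

    preds = []
    for img in tqdm(data):
        preds.append(get_preds(img.to(device)))
    # one (N, M) tensor, so that splits can be taken by image index
    preds = torch.cat(preds, dim = 0)

    eps = 1e-12  # avoids log(0)
    scores = []
    num_batch = max(len(preds) // batch_size, 1)
    for i in range(num_batch):
        part = preds[i * batch_size : (i+1) * batch_size]
        # marginal p(y) estimated on this split
        yp = torch.mean(part, dim = 0, keepdim = True)
        # KL(p(y|x) || p(y)) per image, averaged over the split
        kl = torch.sum(part * (torch.log(part + eps) - torch.log(yp + eps)), dim = 1)
        scores.append(torch.mean(kl).item())

    scores = np.array(scores)

    return scores.mean(), scores.std(), scores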

FID (Frechet Inception Distance) [2]

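The Frechet Inception Distance also uses the outputs of the Inception model. It addresses the weakness that IS never compares the generated images with the training set.

Assume the features of the real and the generated images (the Inception features taken before the FC layer) each follow a Gaussian distribution, with means and covariances ${(m,C)}$ and ${(m_{w},C_{w})}$ respectively.

FID is then computed as: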

{ \| m - m_{w} \|^{2} + Tr(C + C_{w} - 2(CC_{w})^{\frac{1}{2}}) }

For FID, lower is better.
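The formula can be checked on small inputs before plugging in Inception features. Below is a minimal NumPy sketch; `frechet_distance` and `feature_stats` are names chosen here for illustration, and the random 8-dimensional "features" are stand-ins for real Inception features. The matrix square root is handled through eigenvalues, using the fact that the trace of ${(CC_{w})^{\frac{1}{2}}}$ is the sum of the square roots of the eigenvalues of ${CC_{w}}$.

```python
import numpy as np

def frechet_distance(m, C, m_w, C_w):
    """||m - m_w||^2 + Tr(C + C_w - 2 (C C_w)^(1/2)) for two Gaussians."""
    m, m_w = np.asarray(m, dtype=np.float64), np.asarray(m_w, dtype=np.float64)
    C, C_w = np.asarray(C, dtype=np.float64), np.asarray(C_w, dtype=np.float64)
    diff = m - m_w
    # Eigenvalues of C C_w are real and non-negative for covariance matrices;
    # clip tiny negative values caused by floating-point noise.
    eigvals = np.linalg.eigvals(C @ C_w)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(C) + np.trace(C_w) - 2.0 * tr_sqrt)

def feature_stats(features):
    """Mean vector and covariance matrix of an (N, D) feature array."""
    features = np.asarray(features, dtype=np.float64)
    return features.mean(axis=0), np.cov(features, rowvar=False)

# Hand-checkable case with diagonal covariances:
# 1 + (1-2)^2 + (2-3)^2 = 3
fid_diag = frechet_distance([0, 0], np.diag([1.0, 4.0]), [1, 0], np.diag([4.0, 9.0]))

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
fake = rng.normal(loc=0.5, size=(2000, 8))            # shifted mean
fid_same = frechet_distance(*feature_stats(real), *feature_stats(real))
fid_shift = frechet_distance(*feature_stats(real), *feature_stats(fake))
print(fid_diag, fid_same, fid_shift)
```

Identical statistics give a distance of (numerically) zero, and shifting the mean of the fake features increases it, as expected from the first term of the formula.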

A simple implementation

Computing FID looks even simpler.

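def FrechetID(fakeimages, images, device = None):
    """FID between generated images and real images (both Datasets of image tensors)."""

    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Inceptionv3Fe is not defined in this post. It is assumed to be an
    # Inception v3 feature extractor returning (N, D) features before the FC layer.
    model = Inceptionv3Fe()
    model.to(device)
    model.eval()

    fakeloader = DataLoader(fakeimages, batch_size=32, num_workers=4)
    imageloader = DataLoader(images, batch_size=32, num_workers=4)
    upsampler = nn.Upsample(size=299, mode="nearest")

    with torch.no_grad():
        fout = []
        for fake in fakeloader:
            fake = upsampler(fake.to(device))
            fout.append(model(fake))

        out = []
        for real in imageloader:
            real = upsampler(real.to(device))
            out.append(model(real))

        # float64 keeps the covariance and eigenvalue computations stable
        out = torch.cat(out, dim = 0).double()
        tout = torch.cat(fout, dim = 0).double()

        diff = out.mean(dim = 0) - tout.mean(dim = 0)

        def covariance(X):
            X = X - X.mean(dim = 0)
            return torch.matmul(X.t(), X) / (X.size(0) - 1)

        cov1 = covariance(out)
        cov2 = covariance(tout)

        # torch.sqrt is element-wise, not a matrix square root. Instead use
        # Tr((C1 C2)^(1/2)) = sum of square roots of the eigenvalues of C1 C2.
        eigvals = torch.linalg.eigvals(torch.matmul(cov1, cov2)).real.clamp(min = 0)
        tr_sqrt = torch.sqrt(eigvals).sum()

        fid = torch.dot(diff, diff) + torch.trace(cov1) + torch.trace(cov2) - 2 * tr_sqrt

    return fid.cpu()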

Other resources:

A survey on arXiv covering many GAN evaluation methods: https://arxiv.org/pdf/1802.03446.pdf
A TensorFlow implementation of IS, FID and KID: https://github.com/taki0112/GAN_Metrics-Tensorflow
A PyTorch implementation of FID: https://github.com/hukkelas/pytorch-frechet-inception-distance

Reference

[1] Shane Barratt and Rishi Sharma. A Note on the Inception Score. 2018.
[2] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, pages 6627–6638, 2017.
[3] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 2016.