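SinGAN: Learning a Generative Model from a Single Natural Image (Paper Summary)

I skimmed the SinGAN paper, so here are my notes.

I wrote this summary for my own use, so it may contain mistakes.

If you spot any errors, I'd appreciate it if you pointed them out.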

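What is it?
What makes it better than prior work?
What is the key to the technique?
How was it validated?
Is there any discussion?
Which papers should I read next?
References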

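What is it?

An unconditional generative model (that is, one that generates from noise) that captures the internal statistics of a single training image.

In other words, SinGAN can generate images that resemble a given image from just that one image.

The output resolution can be changed freely at inference time, and by manipulating the input image at the lower levels of the pyramid, the same model can be used for paint-to-image conversion, image editing, image harmonization, super-resolution, and animation, with no architectural changes or further tuning.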

f:id:hotcocoastudy:20191030011214p:plain — Figure 1: Image generation learned from a single training image. We propose SinGAN–a new unconditional generative model trained on a single natural image. Our model learns the image’s patch statistics across multiple scales, using a dedicated multi-scale adversarial training scheme; it can then be used to generate new realistic image samples that preserve the original patch distribution while creating new object configurations and structures.

What makes it better than prior work?

f:id:hotcocoastudy:20191030011520p:plain — Figure 2: Image manipulation. SinGAN can be used in various image manipulation tasks, including: transforming a paint (clipart) into a realistic photo, rearranging and editing objects in the image, harmonizing a new object into an image, image super-resolution and creating an animation from a single input. In all these cases, our model observes only the training image (first row) and is trained in the same manner for all applications, with no architectural changes or further tuning (see Sec. 4).

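GAN-based models for a single natural image have been proposed before, but they map an image to an image (a given input always produces the same fixed output), so they are not suited to generating random samples.

In fact, generative models trained on a single image have so far mostly been for texture generation (apparently).
As evidence, applying earlier single-image texture models to natural images gives the results shown in Figure 3.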

f:id:hotcocoastudy:20191030011600p:plain — Figure 3: SinGAN vs. Single Image Texture Generation. Single image models for texture generation [3, 16] are not designed to deal with natural images. Our model can produce realistic image samples that consist of complex textures and non-repetitive global structures.

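On non-texture images, the earlier methods fail to produce meaningful samples, whereas SinGAN renders the scene better than PSGAN and Deep Texture Synthesis. SinGAN is not limited to texture images: it can also generate from a single general natural (non-texture) image (Figure 1).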

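What is the key to the technique?

The model is a pyramid of GANs, built by stacking several lightweight fully convolutional networks (FCNs), each of which captures the patch distribution at a different scale. Once trained, SinGAN can generate diverse, high-quality image samples of arbitrary size.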
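To make the idea of "one GAN per scale" concrete, here is a minimal sketch of how the image size at each level of such a pyramid could be computed. The function name `pyramid_sizes`, the number of scales, and the scale factor are illustrative assumptions, not values taken from the paper or from the official implementation.

```python
def pyramid_sizes(height, width, num_scales, scale_factor=4 / 3):
    """Return (height, width) for every scale, ordered coarsest first.

    Hypothetical helper: each level is `scale_factor` times smaller than
    the next finer one, and the finest level has the original size.
    """
    sizes = []
    for n in range(num_scales - 1, -1, -1):  # n = 0 is the finest scale
        shrink = scale_factor ** n
        h = max(1, round(height / shrink))
        w = max(1, round(width / shrink))
        sizes.append((h, w))
    return sizes


# Example with a factor of 2 so the numbers are easy to check by hand.
sizes = pyramid_sizes(192, 256, num_scales=4, scale_factor=2.0)
for level, (h, w) in enumerate(sizes):
    print(f"level {level} (coarse -> fine): {h} x {w}")
```

Each of these sizes would get its own small generator and discriminator pair.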

Figure 4 below shows the overall SinGAN pipeline. Think of it as applying PatchGAN at each scale.

f:id:hotcocoastudy:20191030011753p:plain — Figure 4: SinGAN’s multi-scale pipeline. Our model consists of a pyramid of GANs, where both training and inference are done in a coarse-to-fine fashion. At each scale, $G_n$ learns to generate image samples in which all the overlapping patches cannot be distinguished from the patches in the down-sampled training image, $x_n$, by the discriminator $D_n$; the effective patch size decreases as we go up the pyramid (marked in yellow on the original image for illustration). The input to $G_n$ is a random noise image $z_n$, and the generated image from the previous scale $\tilde{x}_n$, upsampled to the current resolution (except for the coarsest level which is purely generative). The generation process at level $n$ involves all generators $\{G_N, \ldots, G_n\}$ and all noise maps $\{z_N, \ldots, z_n\}$ up to this level. See more details at Sec. 2.

With this pyramid of GANs, both training and inference proceed from coarse to fine.
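The coarse-to-fine inference loop can be sketched roughly as follows. This is a toy illustration in pure Python, not the actual SinGAN code: `toy_generator` is a hypothetical stand-in for a trained generator, images are plain nested lists with a single channel, and nearest-neighbor upsampling is an assumption made to keep the example short.

```python
import random


def make_noise(h, w, rng, sigma=1.0):
    """Gaussian noise map of size h x w."""
    return [[rng.gauss(0.0, sigma) for _ in range(w)] for _ in range(h)]


def upsample_nearest(img, new_h, new_w):
    """Nearest-neighbor resize of a nested-list image to new_h x new_w."""
    old_h, old_w = len(img), len(img[0])
    return [
        [img[i * old_h // new_h][j * old_w // new_w] for j in range(new_w)]
        for i in range(new_h)
    ]


def toy_generator(noise, prev):
    """Hypothetical stand-in for a trained generator at one scale:
    adds a small noise-driven residual to the upsampled previous output."""
    return [[p + 0.1 * z for p, z in zip(p_row, z_row)]
            for p_row, z_row in zip(prev, noise)]


def sample_coarse_to_fine(sizes, generators, seed=0):
    """Run the generators from the coarsest scale to the finest one."""
    rng = random.Random(seed)
    image = None
    for (h, w), generator in zip(sizes, generators):
        if image is None:
            # Coarsest scale: no previous image, so generation is driven by noise only.
            prev = [[0.0] * w for _ in range(h)]
        else:
            # Finer scales: start from the previous output, upsampled to this size.
            prev = upsample_nearest(image, h, w)
        image = generator(make_noise(h, w, rng), prev)
    return image


sizes = [(4, 6), (8, 12), (16, 24)]  # coarsest first
sample = sample_coarse_to_fine(sizes, [toy_generator] * len(sizes))
print(len(sample), len(sample[0]))  # 16 24
```

Because the output size at each level is determined only by the size of the noise map, passing different `sizes` gives a different output resolution without changing the generators, which matches the arbitrary-size generation mentioned earlier.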

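At each scale, the generator $G_n$ learns to produce image samples that fool the discriminator $D_n$: $D_n$ should be unable to tell any of the overlapping patches in a sample apart from the patches of the downsampled training image $x_n$.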
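For reference, the per-scale generation can be written compactly. The notation below follows my reading of Sec. 2 of the paper, assuming that $N$ is the coarsest scale, $n = 0$ is the finest, and $\uparrow^r$ denotes upsampling by the scale factor $r$:

$$\tilde{x}_N = G_N(z_N), \qquad \tilde{x}_n = G_n\left(z_n, (\tilde{x}_{n+1})\uparrow^r\right) \quad (n < N)$$

Only the coarsest generator $G_N$ works from noise alone. Every finer generator receives both a noise map and the upsampled output of the scale below it.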

The rough flow is shown in the figures below.

f:id:hotcocoastudy:20191104171753p:plain

f:id:hotcocoastudy:20191031121738p:plain

What is the patch size?

It comes from a GAN called PatchGAN, in which the discriminator judges real or fake separately for each patch of a given size.

From https://blog.shikoan.com/pytorch_pix2pix_colorization/

PatchGAN decides real or fake for each patch of a fixed size.

From https://blog.paperspace.com/unpaired-image-to-image-translation-with-cyclegan/
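In a convolutional discriminator, the "patch size" is the receptive field of one output unit: each value in the output map is a real/fake score for one input patch of that size. The sketch below computes this for a stack of convolutions. The helper names and both layer configurations are assumptions for illustration (five 3x3 stride-1 layers as a SinGAN-like discriminator, and the commonly cited 70x70 PatchGAN layout from pix2pix), and `output_size` assumes unpadded convolutions.

```python
def receptive_field(layers):
    """Side length of the input patch seen by a single output unit.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the patch
        jump *= stride             # stride spreads later layers further apart
    return rf


def output_size(input_size, layers):
    """Side length of the score map for a square input, assuming no padding."""
    size = input_size
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size


singan_like = [(3, 1)] * 5                                  # assumed configuration
pix2pix_like = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]     # assumed configuration

print(receptive_field(singan_like))   # 11
print(receptive_field(pix2pix_like))  # 70
print(output_size(25, singan_like))   # 15
```

With the SinGAN-like stack, a 25x25 input yields a 15x15 map of scores, each judging one 11x11 patch. Since the discriminator architecture is the same at every scale while the image gets larger, the same patch covers a smaller fraction of the image at finer scales, which is what the Figure 4 caption means by the effective patch size decreasing.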