An Introduction to
Generative Adversarial Networks

Mathematical Foundations and Evolution (2019-2025)
Bryan Paget | University of Ottawa
This thesis presents a comprehensive survey of the mathematical foundations underlying Generative Adversarial Networks (GANs). We examine GANs through three complementary theoretical lenses: game theory, information theory, and optimal transport theory, revealing fundamental insights about their behavior, limitations, and potential improvements.


1. Introduction

Generative Adversarial Networks (GANs) represent a groundbreaking approach in machine learning that has revolutionized generative modeling since their introduction in 2014. This thesis examines GANs through three complementary theoretical frameworks: game theory, information theory, and optimal transport theory.

At its core, the GAN algorithm trains two neural networks through an adversarial two-player, zero-sum game. The generator learns to produce synthetic data that resembles real samples from a target distribution, while the discriminator learns to distinguish between real data samples and those generated by the first network.

Definition 1 (Generator). Let $\Phi \subset \mathbb{R}^n$ be a parameter space and $(\mathcal{Z}, p_z)$ be a prior probability space. A generator $G$ is a function $G: \Phi \times \mathcal{Z} \to \mathcal{X}$ that maps a noise vector $z \sim p_z$ to the target space $(\mathcal{X}, p_{data})$. We denote the generator parameterized by $\phi \in \Phi$ as $G_\phi$.
Definition 2 (Discriminator). Let $\Theta \subset \mathbb{R}^m$ be a parameter space. A discriminator $D$ is a function $D: \Theta \times \mathcal{X} \to [0, 1]$ that computes the probability that a sample $x \in \mathcal{X}$ was drawn from the real data distribution $(\mathcal{X}, p_{data})$ rather than the generator's distribution. We denote the discriminator parameterized by $\theta \in \Theta$ as $D_\theta$.
Definition 3 (Generative Adversarial Network). The Generative Adversarial Network algorithm trains two neural networks in an adversarial manner: The discriminator maximizes and the generator minimizes the value function:
\[V(G_\phi, D_\theta) = \mathbb{E}_{x \sim p_{data}}[\log D_\theta(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))]\]
leading to the minimax optimization problem:
\[\min_{\phi} \max_{\theta} V(G_\phi, D_\theta)\]

This adversarial formulation creates a dynamic equilibrium where the generator improves its ability to create realistic samples while the discriminator becomes more adept at detecting fakes. Ideally, this process converges when the generator produces samples indistinguishable from real data, forcing the discriminator to output approximately 0.5 for all inputs.
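To make the minimax game concrete, the following is a minimal illustrative sketch of the training loop in PyTorch on toy one-dimensional Gaussian data; the architecture, hyperparameters, and data are illustrative choices, not part of the original formulation. The discriminator ascends $V$ while the generator descends it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
eps = 1e-8  # numerical guard for log

G = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))                 # G_phi: Z -> X
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # D_theta: X -> [0, 1]
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    x = 2.0 + 0.5 * torch.randn(64, 1)   # real samples, p_data = N(2, 0.5^2)
    z = torch.randn(64, 1)               # prior samples, p_z = N(0, 1)

    # Discriminator step: ascend V = E[log D(x)] + E[log(1 - D(G(z)))].
    d_loss = -(torch.log(D(x) + eps) + torch.log(1 - D(G(z).detach()) + eps)).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: descend E[log(1 - D(G(z)))].
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

# After training, generated samples should concentrate near the real mean of 2.0.
print(G(torch.randn(1000, 1)).mean().item())
```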

Applications

GANs have demonstrated remarkable success across numerous domains:

  • Computer Vision: Photo-realistic image generation
  • Medical Imaging: Data augmentation and anomaly detection
  • Domain Adaptation: Knowledge transfer between domains
  • Creative Applications: Art generation and design

Historical Context

The conceptual foundations of GANs draw from several historical threads:

  • Doyle's adversarial framework (1970s)
  • Schmidhuber's predictability minimization (1990s)
  • Goodfellow's unified GAN algorithm (2014)
  • Subsequent variants addressing limitations

2. Game Theory

The GAN framework is fundamentally rooted in game theory, which studies strategic interactions between rational decision-makers. In this section, we examine GANs as a two-player game between the discriminator ($D$) and generator ($G$).

Definition 4 (Value Function). A value function $V_i$ quantifies the reward (or loss if negative) for player $i$ in a game, expressed as a function of the player's actions.
Remark 1. In many machine learning contexts, the loss function represents the difference between target values $y$ and estimates $\hat{y}$. In GANs, however, the value function is more complex, serving as a measure of uncertainty in the adversarial game.
Definition 5 (Zero-Sum Game). A zero-sum game is one where one player's gain exactly equals the other player's loss. Formally, $V_D + V_G = 0$.
Remark 2. For this section, we focus on the zero-sum formulation, where the GAN algorithm represents a strategic competition between the generator and discriminator.

Minimax Strategy and Equilibrium

In zero-sum games, players seek optimal strategies to maximize their minimum possible reward. The minimax decision rule provides a framework for this optimization.

Definition 6 (Minimax Decision Rule). In a two-player game with players $D$ and $G$, the minimax decision rule for $D$ maximizes $D$'s expected reward after $G$ has minimized $D$'s maximum attainable reward. The minimax value $\overline{V_D}$ is:
\[\overline{V_D} = \min_{G} \max_{D} V_D(D, G)\]

The minimax strategy ensures the best possible outcome against an optimal opponent. If $G$ moves first to minimize $D$'s reward, $D$'s minimax rule maximizes the reduced reward.

Nash Equilibrium and Strategic Stability

The GAN algorithm seeks a Nash equilibrium in the parameter space of the discriminator and generator.

Definition 7 (Nash Equilibrium). A Nash equilibrium in an $n$-player game is a strategy profile $s^* = (s_1^*, s_2^*, \dots, s_n^*)$ where no player can benefit by unilaterally changing their strategy. Formally, for every player $i$ and every alternative strategy $s_i$:
\[f_i(s_i^*, s_{-i}^*) \geq f_i(s_i, s_{-i}^*)\]
where $f_i(s)$ is the payoff to player $i$ under strategy profile $s$ and $s_{-i}^*$ denotes the equilibrium strategies of all players other than $i$.
Remark 3. In a Nash equilibrium, each player's strategy is optimal given the others' strategies. For zero-sum games, the minimax solution corresponds to a Nash equilibrium. Note that Nash equilibria are not necessarily globally optimal, and multiple equilibria may exist.

Prisoner's Dilemma: A Case Study

The Prisoner's Dilemma illustrates key concepts of Nash equilibrium and strategic interaction. Originally formulated by Flood and Dresher in the 1950s and later reformulated by Tucker, it demonstrates how individual rationality can lead to suboptimal collective outcomes.

In this game, two players ($D$ and $G$) are held in separate custody with no communication. Each faces the following choices:

  1. If one defects (accuses the other) while the other denies, the defector receives 1 year (the reward for defecting) while the other receives 5 years.
  2. If both deny, each receives 2 years (sufficient evidence exists for the lesser crime).
  3. If both defect, each receives 3 years.

Analyzing all possible states:

  1. (Deny, Defect): If $D$ switches to defect while $G$ defects, $D$ improves from 5 to 3 years. This is not a Nash equilibrium (and symmetrically for (Defect, Deny)).
  2. (Deny, Deny): If $D$ switches to defect while $G$ denies, $D$ improves from 2 to 1 year. This is not a Nash equilibrium.
  3. (Defect, Defect): Neither player can improve by unilaterally changing strategy. If $D$ switches to deny while $G$ defects, $D$ worsens from 3 to 5 years. This is the Nash equilibrium.

Key Insight

The Prisoner's Dilemma shows that the globally optimal outcome (Deny, Deny) is unstable, while the Nash equilibrium (Defect, Defect) is stable but suboptimal. This has important implications for GANs, where the minimax formulation can lead to oscillatory dynamics rather than stable convergence.

Derivation of the GAN Value Function

We now derive the GAN value function from game-theoretic principles. Let $(\mathcal{X}, p_{data})$ be a probability space where $\mathcal{X}$ is a finite space (e.g., the space of all $H \times W$ 8-bit RGB images, $\mathcal{X} = \{0, 1, \dots, 255\}^{3 \times H \times W}$) and $p_{data}$ assigns mass to regions corresponding to meaningful images.

The GAN algorithm trains a generator $G_\phi$ (parameterized by $\phi \in \Phi \subset \mathbb{R}^n$) to map random samples $z$ from a prior space $(\mathcal{Z}, p_z)$ to $\mathcal{X}$ such that $G_\phi(z)$ lies in regions where $p_{data}$ assigns significant mass. Typically, $\mathcal{Z} \neq \mathcal{X}$ and $|\mathcal{Z}| < |\mathcal{X}|$.

Initially, $G_\phi$ maps $z$ to $\mathcal{X}$ randomly. Through adversarial training, both $D_\theta$ and $G_\phi$ learn $p_{data}$ from different perspectives: $D_\theta$ learns to distinguish real from generated samples, while $G_\phi$ learns to generate samples that fool $D_\theta$.

The value function $V$ captures this adversarial dynamic:

\[V(D_\theta, G_\phi) = \mathbb{E}_{x \sim p_{data}}[\log D_\theta(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))]\]

This value function resembles the noise-contrastive estimator. In the game between $G_\phi$ and $D_\theta$, each seeks to maximize their minimum reward by playing their minimax decision rule.

Discriminator's Objective

From $D_\theta$'s perspective, we want $D_\theta(x)$ to approximate $p_{data}(x)$. We find $\theta \in \Theta$ that maximizes the likelihood:

\[L_D^{(1)}(\theta) = \prod_{i=1}^n D_\theta(x_i), \quad x_i \sim p_{data}\]

For numerical optimization, we maximize the log-likelihood:

\[\ell_D^{(1)}(\theta) = \sum_{i=1}^n \log D_\theta(x_i)\]

Simultaneously, $D_\theta$ must assign low probability to generated samples $G_\phi(z) = \tilde{x}$. For fixed $G_\phi$, $\theta$ should minimize:

\[{}^{*}L_D^{(2)}(\theta) = \prod_{i=1}^n D_\theta(G_\phi(z_i)), \quad z_i \sim p_z\]

Equivalently, we minimize the log-likelihood:

\[{}^{*}\ell_D^{(2)}(\theta) = \sum_{i=1}^n \log D_\theta(G_\phi(z_i))\]

To combine both objectives, we maximize the complement:

\[\ell_D^{(2)}(\theta) = \sum_{i=1}^n \log(1 - D_\theta(G_\phi(z_i)))\]

since minimizing $D_\theta(G_\phi(z))$ is equivalent to maximizing $1 - D_\theta(G_\phi(z))$. Combining both objectives, $D_\theta$ maximizes the sample average:

\[\frac{1}{n}\sum_{i=1}^n \left( \log D_\theta(x_i) + \log(1 - D_\theta(G_\phi(z_i))) \right)\]

By the law of large numbers, this sample average converges to:

\[\mathbb{E}_{x \sim p_{data}}[\log D_\theta(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))]\]

for sufficiently large $n$.

Generator's Objective

From $G_\phi$'s perspective, we seek $\phi$ that maximizes the likelihood (as judged by a fixed $D_\theta$) that generated samples come from $p_{data}$. We maximize:

\[{}^{*}\ell_G(\phi) = \sum_{i=1}^n \log D_\theta(G_\phi(z_i)), \quad z_i \sim p_z\]

Equivalently, we minimize:

\[\ell_G(\phi) = \sum_{i=1}^n \log(1 - D_\theta(G_\phi(z_i)))\]

which corresponds to minimizing $\mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))]$.

Minimax Formulation

The training objectives lead to the minimax formulation:

\[\begin{aligned} \overline{V_D} & = \min_{\phi}\max_{\theta} V_D(\theta, \phi) \\ & = \min_{\phi}\max_{\theta} \left( \mathbb{E}_{x \sim p_{data}}[\log D_\theta(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))] \right) \end{aligned}\]

We solve this by finding the minimax decision rule and value for $D_\theta$, which corresponds to the Nash equilibrium. However, finding this equilibrium is often challenging in practice.

Challenges in Finding Nash Equilibrium

Finding Nash equilibria can be difficult, as demonstrated by the following example:

Example 1 (Oscillatory Dynamics).

Let $V(x, y) = xy$ be a value function with the game:

\[\min_x \max_y V(x, y) = xy\]

The gradients are $\frac{\partial V}{\partial x} = y$ and $\frac{\partial V}{\partial y} = x$, leading to updates:

\[\begin{aligned} x^{(t+1)} & \gets x^{(t)} - \eta \, y^{(t)}, \\ y^{(t+1)} & \gets y^{(t)} + \eta \, x^{(t)}, \end{aligned}\]

where $\eta > 0$ is the learning rate.

The unique Nash equilibrium is $s^* = (s_G^*, s_D^*) = (0, 0)$: we have $V(s_G^*, s_D) = 0$ for all $s_D$ and $V(s_G, s_D^*) = 0$ for all $s_G$, so neither player can improve unilaterally. Yet simultaneous gradient updates never reach it: each step multiplies the distance from the origin by $\sqrt{1 + \eta^2}$, so the iterates spiral outward.

This oscillatory behavior is frequently observed in GAN training, where alternating gradient updates can cause the system to oscillate around equilibrium rather than converge to it.
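A few lines of NumPy make the spiral explicit; this is a sketch of the simultaneous-update dynamics above, with an arbitrary learning rate and starting point:

```python
import numpy as np

eta = 0.1
x, y = 1.0, 1.0
radii = []
for t in range(100):
    x, y = x - eta * y, y + eta * x   # simultaneous gradient step on V(x, y) = x * y
    radii.append(np.hypot(x, y))

# Each step multiplies the distance from the equilibrium (0, 0) by sqrt(1 + eta^2),
# so the iterates spiral outward instead of converging.
print(radii[0], radii[-1], np.sqrt(1 + eta**2))
```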

Alternative Value Functions

The original value function often leads to vanishing gradients, especially early in training when $D_\theta$ easily distinguishes real from generated samples. To address this, an alternative formulation was introduced:

\[\begin{aligned} \max_{\theta} \left( \mathbb{E}_{x \sim p_{data}}[\log D_\theta(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))] \right) \\ \max_{\phi} \mathbb{E}_{z \sim p_z}[\log D_\theta(G_\phi(z))] \end{aligned}\]

These decoupled objectives share the same fixed points as the original formulation but provide stronger gradients for the generator, improving learning dynamics. With this formulation, GANs are no longer strictly zero-sum games.
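The small calculation below illustrates why the original (saturating) loss stalls early in training, when $D_\theta(G_\phi(z)) \approx 0$; the numbers are purely illustrative:

```python
import numpy as np

# D(G(z)) near 0 corresponds to an easily-detected generator (early training).
d = np.array([0.001, 0.01, 0.1, 0.5, 0.9])

# Saturating objective: generator minimizes log(1 - d); gradient w.r.t. d is -1 / (1 - d).
grad_saturating = -1.0 / (1.0 - d)      # ~ -1 when d ~ 0: almost no signal

# Non-saturating objective: generator maximizes log(d); gradient w.r.t. d is 1 / d.
grad_nonsaturating = 1.0 / d            # huge when d ~ 0: strong signal

for row in zip(d, grad_saturating, grad_nonsaturating):
    print("D(G(z)) = %.3f   saturating: %8.2f   non-saturating: %8.2f" % row)
```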

Strategic Interpretation and Discussion

The minimax strategy for $D_\theta$ has an important interpretation: if $D_\theta$ assumes $G_\phi$ has done its worst (i.e., generated perfect samples), then $D_\theta$ should assign probability $1/2$ to all inputs. This is the maximum entropy distribution over the two states (real or synthetic), representing maximum uncertainty.

When $D_\theta(x) = 1/2$ for all $x$, the generator receives no useful gradient signal and cannot improve. This represents a strategic equilibrium where neither player can benefit by unilaterally changing their strategy.

"The game-theoretic perspective reveals fundamental challenges in GAN training: the minimax formulation can lead to oscillatory dynamics rather than stable convergence, and the zero-sum assumption may not hold in practice."

3. Information Theory

Information theory emerged in 1948 with Claude Shannon's seminal work *A Mathematical Theory of Communication*. Shannon's framework, inspired by thermodynamic concepts from Boltzmann and Gibbs and by the communication theory of Hartley and Nyquist at Bell Labs, revolutionized our understanding of information quantification and transmission. While applications like data compression, error-correcting codes, and channel capacity are beyond this thesis's scope, we focus on information-theoretic quantities fundamental to machine learning and generative adversarial networks.

Remark 4. In information theory, we quantify information about probability distributions rather than the semantic content of outcomes. Probability distributions capture our uncertainty about outcomes, making them central to information measurement. The most valuable information lies in accurately characterizing these distributions.

Entropy: Quantifying Uncertainty

The cornerstone of information theory is entropy, which measures uncertainty in probability distributions. We derive entropy by defining uncertainty as a function $\eta$ that satisfies intuitive requirements.

Definition 8 (Uncertainty). Let $(\mathcal{X}, p)$ be a discrete probability space. We define uncertainty as a real-valued function $\eta(\cdot): \mathcal{X} \to \mathbb{R}^+$ that depends only on outcome probabilities and satisfies:
  1. If outcome $x$ is certain, $\eta(x) = 0$ (no uncertainty);
  2. For outcomes $x, x'$, $p(x) < p(x') \iff \eta(x) > \eta(x')$ (rarer events are more surprising);
  3. For independent outcomes $x, x'$, the uncertainty of observing both is additive: $\eta(x, x') = \eta(x) + \eta(x')$.
Remark 5. This definition adapts concepts from Martin (2011) for our context.

New information reduces uncertainty, with rare events providing more information than common ones. This suggests $\eta$ should be inversely proportional to probability:

\[\eta(x) \propto {1 \over p(x)}\]

The additivity requirement (property 3) implies a logarithmic relationship, since independent events' probabilities multiply while their information content should add:

\[\eta(x) = \log{1 \over p(x)}\]

For a probability distribution, we need an average uncertainty measure weighted by outcome probabilities. This leads to entropy, denoted by $H$ (resembling the Greek eta):

\[H(p) = \sum_{x \in \mathcal{X}} p(x) \log{1 \over p(x)}\]

Entropy represents the average surprise associated with outcomes from $(\mathcal{X}, p)$. It reaches maximum when all outcomes are equally likely (uniform distribution), reflecting maximum uncertainty:

\[H(p) \leq \log{|\mathcal{X}|}\]

We can also interpret entropy as the average information (in bits) needed to describe outcomes from $(\mathcal{X}, p)$. The logarithm base determines the units:

\[\log_2{1 \over p(x)} = n \iff 2^n = {1 \over p(x)}\]

While base 2 yields bits, base $e$ (nats) and base 10 (dits) are also common.

Definition 9 (Entropy). Let $(\mathcal{X}, p)$ be a discrete probability space. The entropy of distribution $p$ is:
\[H(p) = - \mathbb{E}_{x \sim p} \left[ \log{p(x)} \right]\]
Entropy quantifies the average uncertainty in outcomes from $(\mathcal{X}, p)$. The uniform distribution maximizes entropy, as all outcomes are equally surprising.
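As a quick illustration (a minimal NumPy sketch, using the convention $0 \log 0 = 0$), entropy behaves as Definition 9 predicts: a fair coin carries one bit, a biased coin less, and the uniform distribution attains the maximum $\log_2 |\mathcal{X}|$:

```python
import numpy as np

def entropy(p, base=2):
    """H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

print(entropy([0.5, 0.5]))     # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))     # ~0.47 bits: a biased coin is less surprising on average
print(entropy([0.25] * 4))     # 2.0 bits = log2(4): uniform attains the maximum
```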

Divergences and Distance Measures

To compare probability distributions, we need measures of dissimilarity. We first distinguish between metrics and divergences.

Definition 10 (Metric). A metric on a set $\mathcal{X}$ is a function $d(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$ satisfying $\forall x, y, z \in \mathcal{X}$:
  1. $d(x, y) \geq 0$, and $d(x, y) = 0 \iff x = y$ (non-negativity and identity)
  2. $d(x, y) = d(y, x)$ (symmetry)
  3. $d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality)
Remark 6. A divergence is a weaker notion than distance. It need not be symmetric nor satisfy the triangle inequality, but must be non-negative and zero only for identical distributions.
Definition 11 (Divergence). Let $\mathcal{P}$ be a space of probability distributions over finite set $\mathcal{X}$ with identical support. A divergence on $\mathcal{P}$ is a function $\text{D}(\cdot, \cdot): \mathcal{P} \times \mathcal{P} \to \mathbb{R}^+$ such that $\forall p, q \in \mathcal{P}$:
  1. $\text{D}(p, q) \geq 0$
  2. $\text{D}(p, q) = 0 \iff p = q$.

Kullback-Leibler Divergence

The most fundamental divergence in information theory is the Kullback-Leibler (KL) divergence.

Definition 12 (Kullback-Leibler Divergence). The Kullback-Leibler divergence (also called relative entropy, directed divergence, information gain, or discrimination information) between distributions $p$ and $q$ is:
\[\text{KL}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log{p(x) \over q(x)}\]
If $p$ and $q$ have the same support, $\text{KL}(p \| q) = 0$ if and only if $p = q$.
Remark 7. KL divergence requires $p$ to be absolutely continuous with respect to $q$ ($q(x) = 0 \implies p(x) = 0$). When $p(x) = 0$, the corresponding term contributes nothing to the sum, by the convention $0 \log 0 = 0$ (justified by $\lim_{t \to 0^+} t\log{t} = 0$).
Theorem 13 (Pythagorean Identity for KL Divergence). Let $E \subset \mathcal{P}$ be a closed convex set, let $q \not\in E$ be a distribution, and let $p^* = \mathop{\mathrm{arg\,min}}_{p \in E} \text{KL}(p \| q)$ be the projection of $q$ onto $E$. Then for every $p \in E$:
\[\text{KL}(p \| q) \geq \text{KL}(p \| p^*) + \text{KL}(p^* \| q)\]
See Theorem 11.6.1 in Cover and Thomas (2006) for details.
Remark 8. KL divergence relates to the log-likelihood ratio test for comparing statistical models. For models $p(x) = f(x|\theta)$ and $q(x) = f(x|\phi)$, the log-likelihood ratio of a sample $x_1, \dots, x_n$ is:
\[\lambda = \log{\prod_{i=1}^n p(x_i) \over \prod_{i=1}^n q(x_i)} = \sum_{i=1}^n \log {p(x_i) \over q(x_i)}\]
The KL divergence $\text{KL}(p \| q)$ is the expectation of the per-sample log-likelihood ratio $\lambda(x) = \log{p(x) \over q(x)}$ under $p$:
\[\mathbb{E}_{x \sim p}[\lambda(x)] = \sum_{x \in \mathcal{X}} p(x)\log {p(x) \over q(x)}\]

This connection reveals that GAN training can be viewed as optimizing a goodness-of-fit test. The GAN objective:

\[\min_\phi\max_\theta{1 \over n} \sum_{i=1}^n \log{D_\theta(x_i)} + {1 \over n} \sum_{i=1}^n\log{(1 - D_\theta(G_\phi(z_i)))}\]

has the same fixed point as:

\[\min_\phi\max_\theta{1 \over n} \sum_{i=1}^n \log{\left({D_\theta(x_i) \over D_\theta(G_\phi(z_i))}\right)}\]

which is an average log-likelihood ratio and, in expectation, a KL-divergence-like quantity. Since $D_\theta(x) \in [0, 1]$, an optimal $D$ assigns higher probability to real data than generated data.

Reverse Kullback-Leibler Divergence

Definition 13 (Reverse Kullback-Leibler Divergence). The reverse Kullback-Leibler divergence is:
\[\text{KL}(q \| p) = \sum_{x \in \mathcal{X}} q(x) \log{q(x) \over p(x)}\]
This represents the expectation of the log-likelihood ratio under $q$:
\[\mathbb{E}_{x \sim q}[\lambda(x)] = \sum_{x \in \mathcal{X}} q(x)\log {q(x) \over p(x)}\]

Unlike forward KL, minimizing reverse KL is not equivalent to maximum likelihood estimation.

Cross Entropy

Cross entropy measures the average uncertainty when using $q$ to encode events from $p$.

Definition 14 (Cross Entropy). The cross entropy between $p$ and $q$ is:
\[H(p, q) = - \sum_{x \in \mathcal{X}} p(x) \log q(x) = -\mathbb{E}_{x \sim p}\left[\log{q(x)}\right]\]
It represents the total uncertainty incurred by modeling data with $q$ instead of $p$.
Lemma 15. Cross entropy decomposes as $H(p, q) = H(p) + \text{KL}(p \| q)$.
Proof.
\[\begin{aligned} H(p, q) & = -\sum_{x \in \mathcal{X}} p(x) \log q(x) \\ & = -\sum_{x \in \mathcal{X}} p(x) \log p(x) + \sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} p(x) \log q(x) \\ & = -\sum_{x \in \mathcal{X}} p(x) \log p(x) + \sum_{x \in \mathcal{X}} p(x) \log {p(x) \over q(x)} \\ & = H(p) + \text{KL}(p \| q) \end{aligned}\]

This decomposition shows cross entropy is bounded below by $H(p)$, with the excess quantified by $\text{KL}(p \| q)$. Cross entropy is asymmetric since $H(q, p) = H(q) + \text{KL}(q \| p)$.

Jensen-Shannon Divergence

The Jensen-Shannon divergence (JSD) provides a symmetric, smoothed version of KL divergence.

Definition 15 (Jensen-Shannon Divergence). For distributions $p(x)$ and $q(x)$ over space $\mathcal{X}$, the Jensen-Shannon divergence is:
\[\text{JSD}(p \| q) = {1 \over 2} \text{KL}\left(p \| {p + q \over 2}\right) + {1 \over 2} \text{KL}\left(q \| {p + q \over 2}\right)\]
Remark 9. JSD is the average of forward and reverse KL divergences with respect to the midpoint distribution $(p+q)/2$.
Theorem 16. The square root of Jensen-Shannon divergence, $\sqrt{\text{JSD}(p \| q)}$, is a metric.
Proof. See Endres and Schindelin (2003). ◻
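The following NumPy sketch (with illustrative distributions) checks these properties numerically: KL is asymmetric, JSD is symmetric, and the cross-entropy decomposition of Lemma 15 holds:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def jsd(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

print(kl(p, q), kl(q, p))      # asymmetric: the two values differ
print(jsd(p, q), jsd(q, p))    # symmetric: identical values

# Lemma 15: H(p, q) = H(p) + KL(p || q)
H_p = -np.sum(p * np.log(p))
H_pq = -np.sum(p * np.log(q))
assert np.isclose(H_pq, H_p + kl(p, q))
```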

Mutual Information and Dependency Measures

Information theory provides tools to quantify dependencies between distributions.

Definition 16 (Mutual Information). Let $(\mathcal{X}, p)$ and $(\mathcal{Y}, q)$ be finite probability spaces with random variables $X \sim p$ and $Y \sim q$. Let $\gamma$ be their joint distribution, with marginals $\sum_{y \in \mathcal{Y}} \gamma(x, y) = p(x)$ and $\sum_{x \in \mathcal{X}} \gamma(x, y) = q(y)$. The mutual information $I(X;Y)$ is:
\[I(X; Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}}\gamma(x, y) \log{{\gamma(x, y) \over p(x)q(y)}}\]
Theorem 17. If $X$ and $Y$ are independent, then $\gamma(x,y) = p(x)q(y)$ and $I(X; Y) = 0$.
Remark 10. Mutual information quantifies the information shared between distributions, serving as a measure of statistical dependence. It can also be expressed as:
\[I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)\]
where $H(X | Y)$ is the conditional entropy of $X$ given $Y$. This form shows mutual information as the reduction in uncertainty about one variable given knowledge of the other.

Information-Theoretic Analysis of GANs

We now analyze the GAN value function $V = \mathbb{E}_{x \sim p_r}[\log D_\theta(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))]$ from an information-theoretic perspective. The following proposition examines the optimization dynamics at each training step, while the next subsection analyzes the limiting behavior.

GAN training aims to:

  1. Minimize the KL divergence between the data distribution $p_r$ and the discriminator's output distribution $D_\theta(x)$;
  2. Maximize the KL divergence between the generator's prior $p_z$ and the discriminator's assessment of generated data $D_\theta(G_\phi(z))$;
  3. Minimize the KL divergence between $p_z$ and $D_\theta(G_\phi(z))$ from the generator's perspective.

Remark 11. KL divergence in this context quantifies the additional surprise when using an approximate distribution $q$ instead of the true distribution $p$.
Proposition 18. The GAN objective:
\[\min_\phi \max_\theta \mathbb{E}_{x \sim p_r}[\log D_\theta(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))]\]
corresponds to the following information-theoretic objectives:
  1. Discriminator optimization: Find $\theta \in \Theta$ to minimize $\text{KL}(p_r \| D_\theta)$ and maximize $\text{KL}(p_z \| D_\theta \circ G_\phi)$
  2. Generator optimization: Find $\phi \in \Phi$ to minimize $\text{KL}(p_z \| D_\theta \circ G_\phi)$
Proof. The first term of $V$ is the negative cross entropy between $p_r$ and $D_\theta$:
\[- H(p_r, D_\theta) = \sum_{x \in \mathcal{X}} p_r(x)\log(D_\theta(x)) = \mathbb{E}_{x \sim p_r} [\log(D_\theta(x))]\]
Maximizing this term minimizes cross entropy $H(p_r, D_\theta)$. Since:
\[H(p_r, D_\theta) = H(p_r) + \text{KL}(p_r \| D_\theta),\]
and $H(p_r)$ is fixed, minimizing cross entropy requires minimizing $\text{KL}(p_r \| D_\theta)$, achieved when $D_\theta(x) = p_r(x)$. The second term is:
\[\mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))] = \sum_{z \in \mathcal{Z}} p_z(z) \log (1 - D_\theta(G_\phi(z)))\]
The discriminator maximizes this term; since both quantities increase as $D_\theta(G_\phi(z))$ decreases, this amounts to maximizing:
\[-\sum_{z \in \mathcal{Z}} p_z(z) \log (D_\theta(G_\phi(z))),\]
which is the negative cross entropy between $p_z$ and $D_\theta \circ G_\phi$. Thus, $D_\theta$ maximizes $\text{KL}(p_z \| D_\theta \circ G_\phi)$, making $D_\theta(G_\phi(z))$ as different as possible from $p_z$. Conversely, the generator minimizes the same term, equivalent to minimizing cross entropy between $p_z$ and $D_\theta \circ G_\phi$, which minimizes $\text{KL}(p_z \| D_\theta \circ G_\phi)$. ◻

This analysis reveals that the generator aims to make the discriminator's output on generated data as uninformative as random noise, while the discriminator tries to match its output distribution to the true data distribution while distinguishing generated data from noise.

Optimization Dynamics and Limiting Behavior

We now rigorously analyze GAN optimization dynamics. Let $\mathcal{X}$ be the data space with true distribution $p_r$, and $\tilde{\mathcal{X}}$ be generated data with distribution $p_g$ induced by generator $G_\phi$. Let $X \sim p_r$, $\tilde{X} \sim p_g$, and $Z \sim p_z$ (noise prior).

We consider three equivalent forms of the value function:

\[\begin{aligned} V(\phi, \theta)(X, Z) & = \mathbb{E}_{x \sim p_r}[\log D_\theta(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\theta(G_\phi(z)))] \\ \tilde{V}(\phi, \theta)(X, \tilde{X}) & = \mathbb{E}_{x \sim p_r}[\log D_\theta(x)] + \mathbb{E}_{\tilde{x} \sim p_g}[\log(1 - D_\theta(\tilde{x}))] \\ \mathcal{V}(\phi, \theta) & = \sum_{u \in \mathcal{X}} \left[ p_r(u) \log D_\theta(u) + p_g(u) \log(1 - D_\theta(u)) \right] \end{aligned}\]

The third form collects both expectations into a single sum over the common sample space $\mathcal{X}$, which is convenient for pointwise optimization.

Theorem 19 (Optimal Discriminator). For any fixed generator $G_\phi$, the optimal discriminator $D^*$ that maximizes $V$ is:
\[D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}\]
Proof. We compute the derivative of $\mathcal{V}$ with respect to $\theta$:
\[{\partial \mathcal{V} \over \partial \theta} = \sum_{u \in \mathcal{X}}\left[ p_r(u) { {\partial D_\theta(u) \over \partial \theta} \over D_\theta(u)} - p_g(u) {{\partial D_\theta(u) \over \partial \theta} \over 1 - D_\theta(u)}\right]\]
Setting the bracketed term to zero pointwise yields:
\[{p_r(u) \over D_\theta(u)} = {p_g(u) \over 1 - D_\theta(u)} \iff D^*(u) = {p_r(u) \over p_g(u) + p_r(u)}\]
Thus, for any fixed $G_\phi$, $\mathcal{V}$ is maximized when $D_\theta = D^*$. ◻

Substituting $D^*$ into $\tilde{V}$ gives:

\[\max_D \tilde{V}(\phi, \theta)(X, \tilde{X}) = \mathbb{E}_{x \sim p_r}\left[\log{p_r(x) \over p_g(x) + p_r(x)}\right] + \mathbb{E}_{\tilde{x} \sim p_g}\left[\log{p_g(\tilde{x}) \over p_g(\tilde{x}) + p_r(\tilde{x})} \right]\]

This expression relates to the Jensen-Shannon divergence between $p_r$ and $p_g$.

Lemma 20. The minimax value of the GAN game is $-\log 4$.
Proof. The optimal generator satisfies $p_g = p_r$. Substituting into the previous equation:
\[\begin{aligned} V(\phi^*, \theta^*)(X, \tilde{X}) & = \mathbb{E}_{x \sim p_r}\left[\log{p_r(x) \over 2p_r(x)}\right] + \mathbb{E}_{\tilde{x} \sim p_r}\left[\log{p_r(\tilde{x}) \over 2p_r(\tilde{x})}\right] \\ & = \mathbb{E}_{x \sim p_r}[\log(1/2)] + \mathbb{E}_{\tilde{x} \sim p_r}[\log(1/2)] \\ & = -\log 2 - \log 2 = -\log 4 \end{aligned}\]
Theorem 21 (GAN Optimization as JSD Minimization). With an optimal discriminator, minimizing $V$ with respect to $\phi$ is equivalent to minimizing the Jensen-Shannon divergence between $p_r$ and $p_g$.
Proof. Adding and subtracting $\log 4$ from the equation yields:
\[V(\phi, \theta^*) = 2 \cdot \text{JSD}(p_r \| p_g) - \log 4,\]
as shown in Appendix A. Minimizing $V$ thus minimizes $\text{JSD}(p_r \| p_g)$. ◻
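Theorem 21 is easy to verify numerically on a discrete space; the sketch below (with random, purely illustrative distributions) plugs $D^*$ into the value function and compares against $2 \cdot \text{JSD}(p_r \| p_g) - \log 4$:

```python
import numpy as np

rng = np.random.default_rng(0)
p_r = rng.random(10); p_r /= p_r.sum()   # "real" distribution (illustrative)
p_g = rng.random(10); p_g /= p_g.sum()   # "generated" distribution (illustrative)

# Optimal discriminator from Theorem 19.
d_star = p_r / (p_r + p_g)

# Value function evaluated at D = D*.
value = np.sum(p_r * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# 2 * JSD(p_r || p_g) - log 4, per Theorem 21.
m = 0.5 * (p_r + p_g)
jsd = 0.5 * np.sum(p_r * np.log(p_r / m)) + 0.5 * np.sum(p_g * np.log(p_g / m))
assert np.isclose(value, 2 * jsd - np.log(4))
print(value, 2 * jsd - np.log(4))
```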

Discussion

This section presented two complementary information-theoretic perspectives on GAN training. Theorem 21 characterizes the limiting behavior, showing that GAN optimization minimizes Jensen-Shannon divergence between real and generated distributions.

Proposition 18 examines the optimization dynamics at each training step, revealing how the discriminator and generator alternately minimize and maximize Kullback-Leibler divergences. The analysis shows that GANs avoid limitations of traditional maximum likelihood methods, which minimize the forward KL divergence $\text{KL}(p_r \| p_g)$. That divergence assigns unbounded penalty to missing modes ($p_r > 0$ but $p_g = 0$) while leaving spurious samples ($p_g > 0$ but $p_r = 0$) essentially unpunished, biasing maximum likelihood toward over-dispersed models. In contrast, JSD provides a more balanced measure of distributional similarity.

During training, the discriminator progressively forces the value function to better approximate JSD by performing the operations in Proposition 18. This stepwise optimization reveals the intricate interplay between discriminator and generator as they jointly minimize distributional divergence through adversarial dynamics.

4. Optimal Transport

In Section 3, we explored how the GAN generator minimizes an approximation of the Jensen-Shannon divergence between the real data distribution $p_{data}$ and the generated distribution $p_g$. However, this approach suffers from fundamental limitations rooted in the topology induced by the Jensen-Shannon divergence. In this section, we examine these limitations and introduce optimal transport theory as a more robust framework for comparing probability distributions, leading to the Wasserstein GAN (WGAN) variant.

The central insight is that the choice of distance metric has profound consequences for continuity and convergence properties of probability distributions. As we will see, the Wasserstein distance provides a more suitable topology for GAN training, addressing key limitations of the original formulation.

Topology and Convergence of Probability Distributions

To understand the limitations of the original GAN formulation, we first need to examine how different metrics induce different topologies on spaces of probability distributions.

Definition 17 (Topology). Let $\mathcal{X}$ be any set. A topology $\mathcal{T}$ on $\mathcal{X}$ is a collection of subsets of $\mathcal{X}$ (called open sets) satisfying:
  1. The empty set and $\mathcal{X}$ itself are open;
  2. The intersection of finitely many open sets is open;
  3. The union of any collection of open sets is open.
The pair $(\mathcal{X}, \mathcal{T})$ is called a topological space.

In metric spaces, the topology is generated by open balls, which form a basis for the topology.

Definition 18 (Open Ball). Let $(\mathcal{X}, d)$ be a metric space. An open ball of radius $r > 0$ centered at $x_0 \in \mathcal{X}$ is:
\[B_d(x_0, r) = \{ x \in \mathcal{X} : d(x, x_0) < r \}\]
Definition 19 (Topology Induced by Metric). Let $(\mathcal{X}, d)$ be a metric space. The topology induced by $d$ is the topology generated by the basis of open balls $\mathcal{B} = \{B_d(x, r) \mid x \in \mathcal{X}, r > 0\}$.

Different metrics can induce different topologies with varying degrees of "granularity" or "fineness."

Theorem 22. Let $d$ and $d'$ be metrics on $\mathcal{X}$, with induced topologies $\mathcal{T}$ and $\mathcal{T}'$. Then $\mathcal{T}'$ is finer than $\mathcal{T}$ if and only if for every $x \in \mathcal{X}$ and $\epsilon > 0$, there exists $\delta > 0$ such that $B_{d'}(x, \delta) \subset B_d(x, \epsilon)$.

For probability distributions, convergence depends on the chosen metric:

Definition 20 (Convergence in Metric Space). Let $\mathcal{P}$ be the space of probability distributions on $\mathcal{X}$ with equal support, and let $d: \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ be a metric. A sequence $\{P_n\}_{n \in \mathbb{N}}$ converges to $P$ if $d(P_n, P) \to 0$ as $n \to \infty$.

The Kullback-Leibler and Jensen-Shannon divergences induce a finer (stronger) topology than the Wasserstein distance: they have more open sets, so convergence under them is a stronger requirement that fewer sequences satisfy. This stronger topology leads to discontinuities that cause problems in GAN training.

Limitations of KL and Jensen-Shannon Divergences

The Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences, while useful in many contexts, have significant limitations when comparing distributions with disjoint supports---a common scenario in GAN training.

Example 2 (Learning Parallel Lines).

Consider learning to generate a vertical line at $x=0$ when starting from a line at $x=\phi$. Let $\mathcal{X} = \mathbb{R}^2$, let $p_0$ be the uniform distribution over $\{(0, z) : z \in [0, 1]\}$, and let $G_\phi(z) = (\phi, z)$, so that the generated distribution $p_\phi$ is uniform over $\{(\phi, z) : z \in [0, 1]\}$. We want to train $G_\phi$ to approximate $p_0$ by moving $\phi \to 0$.

(i) KL Divergence: For $\phi \neq 0$, the distributions have disjoint supports:

\[\text{KL}(p_0 \| p_\phi) = \mathbb{E}_{u \sim p_0} \left[ \log \frac{p_0(u)}{p_\phi(u)} \right] = \infty\]
since $p_\phi(u) = 0$ for all $u$ in the support of $p_0$. Similarly, $\text{KL}(p_\phi \| p_0) = \infty$. Only when $\phi = 0$ do we have $\text{KL}(p_0 \| p_\phi) = 0$.

(ii) Jensen-Shannon Divergence: The JS divergence also fails to provide a useful gradient:

\[\begin{aligned} \text{JSD}(p_0 \| p_\phi) &= \frac{1}{2} \mathbb{E}_{u \sim p_0} \left[ \log \frac{p_0(u)}{p_m(u)} \right] + \frac{1}{2} \mathbb{E}_{u \sim p_\phi} \left[ \log \frac{p_\phi(u)}{p_m(u)} \right] \\ &= \frac{1}{2} \mathbb{E}_{u \sim p_0} [\log 2] + \frac{1}{2} \mathbb{E}_{u \sim p_\phi} [\log 2] \\ &= \log 2 \end{aligned}\]
where $p_m = \frac{p_0 + p_\phi}{2}$. Again, this is constant ($\log 2$) for all $\phi \neq 0$ and only drops to 0 when $\phi = 0$.

This example illustrates a fundamental problem: both KL and JS divergences provide no useful gradient information when distributions have disjoint supports, which is common in high-dimensional spaces like those used in GANs.

The Manifold Hypothesis and Dimensionality

In practice, GANs often operate on high-dimensional data like images, where the data lies on low-dimensional manifolds embedded in the ambient space.

Definition 21 (Manifold). A manifold is a topological space that locally resembles Euclidean space near each point. Formally, an $n$-dimensional manifold is a topological space where each point has a neighborhood homeomorphic to $\mathbb{R}^n$.
Remark 12 (Manifold Hypothesis). The manifold hypothesis posits that real-world high-dimensional data (such as images) lie on low-dimensional manifolds embedded in the ambient space. For example, the space of all possible $H \times W$ RGB images is $[0, 255]^{3 \times H \times W}$, but natural images occupy only a tiny fraction of this space.

This phenomenon is related to the curse of dimensionality: as dimensionality increases, the volume of the space grows exponentially, making data increasingly sparse. For instance, covering a unit interval with points spaced 0.1 units apart requires 10 points; covering a unit square requires 100 points; a unit cube requires 1000 points, and so on.

In GANs, the generator and real data distributions typically lie on different low-dimensional manifolds within the high-dimensional ambient space. When these manifolds are disjoint (which is likely in high dimensions), the KL and JS divergences become constant or infinite, providing no useful gradient signal.

The Perfect Discriminator Problem

The limitations of KL and JS divergences lead to a critical issue in GAN training: the emergence of a "perfect discriminator" that halts learning.

Definition 22 (Perfect Discriminator). A perfect discriminator is a function $D: \mathcal{X} \to [0,1]$ that achieves perfect accuracy on both real and generated data:
\[\begin{aligned} \mathbb{P}_{x \sim p_{data}}(\{x : D(x) = 1\}) &= 1, \\ \mathbb{P}_{\tilde{x} \sim p_g}(\{\tilde{x} : D(\tilde{x}) = 0\}) &= 1. \end{aligned}\]
Theorem 23. If $D$ is a perfect discriminator, then $\nabla_\phi V = 0$, meaning the generator receives no gradient signal and learning stops.
Proof. The generator update is:
\[\begin{aligned} \phi^{(t+1)} &\gets \phi^{(t)} - \alpha \nabla_\phi \frac{1}{m} \sum_{i=1}^m \log(1 - D^{(t+1)}(G_{\phi^{(t)}}(z_i))) \\ &\gets \phi^{(t)} - \alpha \nabla_\phi \frac{1}{m} \sum_{i=1}^m \log(1 - 0) \\ &\gets \phi^{(t)} - \alpha \nabla_\phi \frac{1}{m} \sum_{i=1}^m \log(1) \\ &\gets \phi^{(t)} \end{aligned}\]
Thus, the generator parameters do not update. ◻
Theorem 24. If $D$ becomes a perfect discriminator early in training, then $D$ also stops learning, as its gradients vanish.
Proof. The discriminator update is:
\[\begin{aligned} \theta^{(t+1)} &\gets \theta^{(t)} - \alpha \nabla_\theta \frac{1}{m} \sum_{i=1}^m \left[ \log D^{(t)}(x_i) + \log(1 - D^{(t)}(G_{\phi^{(t)}}(z_i))) \right] \\ &\gets \theta^{(t)} - \alpha \nabla_\theta \frac{1}{m} \sum_{i=1}^m \left[ \log(1) + \log(1) \right] \\ &\gets \theta^{(t)} \end{aligned}\]
Thus, the discriminator parameters also do not update. ◻

This perfect discriminator problem illustrates a fundamental limitation of the original GAN formulation: once the discriminator becomes too good too quickly, training stalls completely. We need a "gentler" discriminator that provides meaningful gradients throughout training.
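A tiny PyTorch experiment (a toy one-dimensional setup with illustrative values) makes Theorem 23 tangible: once the discriminator saturates, the generator's gradient is numerically zero:

```python
import torch

torch.manual_seed(0)
phi = torch.tensor(3.0, requires_grad=True)   # toy generator g(z) = phi + 0.1 z
z = torch.randn(1000)
fake = phi + 0.1 * z

# A saturated, near-perfect discriminator: D ~ 1 near the real mode at 0, D ~ 0 elsewhere.
k = 50.0
d_fake = torch.sigmoid(k * (1.0 - fake.abs()))   # ~ 0 on the fakes around 3

loss = torch.log(1 - d_fake).mean()   # the generator's (saturating) objective term
loss.backward()
print(f"generator gradient: {phi.grad.item():.3e}")   # vanishingly small
```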

Optimal Transport Theory

Optimal transport theory provides a more robust framework for comparing probability distributions, addressing the limitations of KL and JS divergences. The theory originated with Gaspard Monge in 1781 and was later generalized by Leonid Kantorovich.

Definition 23 (Transport Plan). Let $\mathcal{X}$ and $\mathcal{Y}$ be spaces, and let $p$ and $q$ be probability distributions on $\mathcal{X}$ and $\mathcal{Y}$, respectively. A transport plan is a joint probability distribution $\gamma \in \Gamma(p, q)$, where $\Gamma(p, q) = \{\gamma : \gamma \text{ has marginals } p \text{ and } q\}$. The set $\Gamma(p, q)$ is called the set of couplings between $p$ and $q$.
Remark 13. Each transport plan $\gamma(x, y)$ represents the amount of "mass" to move from $x$ to $y$ to transform distribution $p$ into distribution $q$.
Definition 24 (Transport Cost). Given a cost function $c: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^+$, the transport cost of plan $\gamma$ is:
\[\mathbb{E}_{(x,y) \sim \gamma}[c(x, y)] = \int_{\mathcal{X}} \int_{\mathcal{Y}} c(x, y) \, d\gamma(x, y)\]
For the $p$-Wasserstein distance, we use $c(x, y) = \|x - y\|^p$.

The goal of optimal transport is to find the transport plan with minimal cost:

Definition 25 (Wasserstein Distance). The $p$-Wasserstein distance between distributions $p$ and $q$ is:
\[W_p(p, q) = \left( \inf_{\gamma \in \Gamma(p, q)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|^p] \right)^{1/p}\]
For $p = 1$, we call this the Kantorovich distance or earth mover's distance.

The Wasserstein distance provides a more robust metric for comparing distributions, especially when they have disjoint supports.

Example 3 (Learning Parallel Lines (Revisited)).

Using the Wasserstein-1 distance for our parallel lines example:

\[\begin{aligned} W_1(p_\phi, p_0) &= \inf_{\gamma \in \Gamma(p_\phi, p_0)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|] \\ &= \mathbb{E}_{z \sim U[0,1]}\left[\sqrt{(\phi - 0)^2 + (z - z)^2}\right] \\ &= |\phi| \end{aligned}\]
where the second line evaluates the optimal plan, which pairs points with equal $z$ coordinate (any other plan moves mass at least as far). Unlike KL and JS divergences, the Wasserstein distance is continuous in $\phi$ and provides a meaningful gradient signal ($\nabla_\phi W_1 = \text{sign}(\phi)$ for $\phi \neq 0$) even when the distributions have disjoint supports.
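A small numerical sketch ties Examples 2 and 3 together (discretizing each line into 100 atoms, an illustrative choice): JSD stays frozen at $\log 2$ for every $\phi \neq 0$, while $W_1 = |\phi|$ decays smoothly:

```python
import numpy as np

z = np.linspace(0.0, 1.0, 100)

def line(x):
    """Uniform distribution over 100 atoms {(x, z_i)}, as a dict atom -> mass."""
    return {(x, zi): 1.0 / len(z) for zi in z}

def jsd(p, q):
    support = set(p) | set(q)
    total = 0.0
    for u in support:
        pu, qu = p.get(u, 0.0), q.get(u, 0.0)
        mu = 0.5 * (pu + qu)
        if pu > 0: total += 0.5 * pu * np.log(pu / mu)
        if qu > 0: total += 0.5 * qu * np.log(qu / mu)
    return total

p0 = line(0.0)
for phi in [2.0, 1.0, 0.1, 0.0]:
    # W1 via the optimal plan, which pairs atoms with equal z: cost |phi| everywhere.
    print(f"phi = {phi:4.1f}   JSD = {jsd(line(phi), p0):.4f}   W1 = {abs(phi):.4f}")
```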

Kantorovich-Rubinstein Duality

Directly computing the Wasserstein distance is intractable due to the infimum over all possible transport plans. However, the Kantorovich-Rubinstein duality provides an alternative formulation:

Definition 26 (Lipschitz Function). A function $f: \mathcal{X} \to \mathbb{R}$ is $K$-Lipschitz if for all $x_1, x_2 \in \mathcal{X}$:
\[|f(x_1) - f(x_2)| \leq K \cdot d(x_1, x_2)\]
where $d$ is the metric on $\mathcal{X}$.
Theorem 25 (Kantorovich-Rubinstein Duality). The 1-Wasserstein distance can be expressed as:
\[W_1(p, q) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \right)\]
where $\mathcal{F}$ is the set of all 1-Lipschitz functions.

This dual formulation is computationally more tractable and forms the basis of the Wasserstein GAN. Instead of directly minimizing the Wasserstein distance, we can train a neural network to approximate the optimal 1-Lipschitz function.
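In one dimension the two sides of the duality can be compared directly. The sketch below uses SciPy's `scipy.stats.wasserstein_distance` (which computes $W_1$ from empirical CDFs) for the primal value, and evaluates the dual objective at the 1-Lipschitz function $f(u) = -u$, which attains the supremum for a pure shift:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)   # samples from p
y = x + 1.5                         # samples from q: p shifted right by 1.5

# Primal side: W1 from empirical CDFs.
print(wasserstein_distance(x, y))   # ~ 1.5

# Dual side: f(u) = -u is 1-Lipschitz and attains the supremum for a pure shift.
f = lambda u: -u
print(f(x).mean() - f(y).mean())    # E_p[f] - E_q[f] ~ 1.5
```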

Wasserstein GAN

The Wasserstein GAN (WGAN) leverages the Kantorovich-Rubinstein duality to create a more stable GAN variant. The key insight is to replace the discriminator with a "critic" that approximates the optimal 1-Lipschitz function.

Definition 27 (Wasserstein GAN Objective). The Wasserstein GAN objective is:
\[\min_G \max_{f \in \mathcal{F}} \mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{z \sim p_z}[f(G(z))]\]
where $\mathcal{F}$ is the set of 1-Lipschitz functions.

In practice, we enforce the Lipschitz constraint through weight clipping or gradient penalty, rather than explicitly constraining the function space.

WGAN Algorithm

  1. Initialize critic parameters $\theta$ and generator parameters $\phi$ randomly
  2. For each training iteration:
    1. For $k$ steps:
      1. Sample minibatch $\{x_1, \ldots, x_m\}$ from data distribution $p_{data}$
      2. Sample minibatch $\{z_1, \ldots, z_m\}$ from prior $p_z$
      3. Update critic by ascending its stochastic gradient:
        \[\nabla_\theta \frac{1}{m} \sum_{i=1}^m [f_\theta(x_i) - f_\theta(G_\phi(z_i))]\]
      4. Clip weights of $f_\theta$ to a fixed box $[-c, c]$
    2. Sample minibatch $\{z_1, \ldots, z_m\}$ from prior $p_z$
    3. Update generator by descending its stochastic gradient:
      \[\nabla_\phi \frac{1}{m} \sum_{i=1}^m [-f_\theta(G_\phi(z_i))]\]
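The sketch below is a minimal PyTorch rendering of this algorithm on toy one-dimensional data; the architectures, the RMSprop optimizer, and the hyperparameters follow common WGAN practice but are illustrative choices:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # critic f_theta (no sigmoid)
G = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # generator G_phi
opt_f = torch.optim.RMSprop(f.parameters(), lr=5e-5)
opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
c, n_critic = 0.01, 5

for step in range(1000):
    # Inner loop: train the critic n_critic times per generator step.
    for _ in range(n_critic):
        x = 2.0 + 0.5 * torch.randn(64, 1)    # real minibatch
        z = torch.randn(64, 1)
        f_loss = -(f(x).mean() - f(G(z).detach()).mean())   # ascend f(x) - f(G(z))
        opt_f.zero_grad(); f_loss.backward(); opt_f.step()
        with torch.no_grad():
            for w in f.parameters():
                w.clamp_(-c, c)               # weight clipping: crude Lipschitz control

    # Generator step: descend -f(G(z)).
    z = torch.randn(64, 1)
    g_loss = -f(G(z)).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```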

Discussion and Comparison

The Wasserstein GAN addresses several key limitations of the original GAN formulation:

  1. Meaningful gradients with disjoint supports: Unlike KL and JS divergences, the Wasserstein distance provides continuous gradients even when distributions have disjoint supports, as demonstrated in the parallel lines example.
  2. Better correlation with sample quality: The Wasserstein distance correlates more closely with the empirical quality of generated samples, providing a more meaningful training signal.
  3. Improved stability: WGAN training is generally more stable and less prone to mode collapse than the original GAN formulation.

However, WGAN has its own limitations:

  1. Weight clipping: The original WGAN uses weight clipping to enforce the Lipschitz constraint, which can lead to optimization issues and capacity limitations.
  2. Computational cost: WGAN typically requires more critic updates per generator iteration, increasing computational cost.
  3. Convergence speed: While more stable, WGAN may converge more slowly than some improved GAN variants.

Subsequent improvements like WGAN-GP replace weight clipping with a gradient penalty, addressing some of these limitations. The fundamental insight remains: by using a weaker metric (the Wasserstein distance), which induces a coarser topology under which convergence is easier, we can achieve more stable and meaningful GAN training.

The Wasserstein GAN exemplifies how deep theoretical understanding can lead to practical improvements in machine learning algorithms. By recognizing the topological limitations of the original GAN formulation and drawing on optimal transport theory, researchers developed a more robust framework that addresses fundamental challenges in generative modeling.

"The Wasserstein GAN exemplifies how deep theoretical understanding can lead to practical improvements in machine learning algorithms. By recognizing the topological limitations of the original GAN formulation and drawing on optimal transport theory, researchers developed a more robust framework that addresses fundamental challenges in generative modeling."

5. Conclusion

This thesis has presented a comprehensive mathematical survey of Generative Adversarial Networks, examining their theoretical foundations through three complementary lenses: game theory, information theory, and optimal transport. By synthesizing these perspectives, we have illuminated the mathematical principles that govern GAN behavior and performance.

In Section 2, we framed GANs as a strategic game between competing neural networks, revealing the minimax strategy where the discriminator adopts $D(x) = {1 \over 2}$ for all $x$. This equilibrium represents maximum uncertainty, creating a challenging optimization landscape where the generator's actions become ineffective. This game-theoretic perspective explains why GAN training often suffers from instability and mode collapse---phenomena that emerge naturally from adversarial dynamics.

Section 3 provided dual information-theoretic perspectives on GAN training. Theorem 21 examined the asymptotic behavior of GANs, establishing fundamental limits on their performance, while Proposition 18 analyzed the optimization process at each iteration. Together, these results offer both macroscopic and microscopic views of the training process, revealing how information-theoretic measures evolve during adversarial learning.

Section 4 demonstrated how optimal transport theory has revolutionized GAN research. The Kantorovich-Rubinstein distance provides a more stable and meaningful metric for comparing distributions than the Jensen-Shannon divergence used in original GANs. As shown by Arjovsky et al. (2017), this theoretical insight led to Wasserstein GANs, which significantly mitigate training instability. This exemplifies how deep theoretical understanding directly enables practical improvements in learning systems.

Game Theory Perspective

Framing GANs as a strategic game reveals the minimax strategy where the discriminator adopts $D(x) = 1/2$ for all $x$. This equilibrium represents maximum uncertainty, creating a challenging optimization landscape that explains instability and mode collapse.

Information Theory Perspective

The analysis shows that GAN optimization minimizes Jensen-Shannon divergence between real and generated distributions, avoiding limitations of traditional maximum likelihood methods that minimize forward KL divergence.

Optimal Transport Perspective

The Kantorovich-Rubinstein distance provides a more stable metric than Jensen-Shannon divergence, leading to Wasserstein GANs that significantly mitigate training instability and provide meaningful gradients even with disjoint supports.

Future Directions

Looking forward, the mathematical frameworks examined in this thesis offer fertile ground for advancing GAN research. Game theory suggests new approaches to equilibrium selection and incentive design; information theory provides tools for analyzing representation learning and generalization; and optimal transport offers geometric insights for developing more robust architectures. The convergence of these perspectives will likely yield next-generation generative models with improved stability, sample efficiency, and theoretical guarantees.

Beyond technical advancements, this research carries significant societal implications. As GANs make synthetic media generation increasingly accessible, they present both opportunities and challenges. On one hand, these technologies enable creative applications in art, design, and data augmentation. On the other hand, they facilitate the creation of convincing fake images and videos that can be weaponized for misinformation campaigns. The counterfeiter-police analogy from Goodfellow et al. (2014) aptly captures this duality:

"The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles."

This adversarial dynamic is now playing out in the realm of digital media, where researchers developing detection methods must continually adapt to increasingly sophisticated generation techniques. As documented in recent studies, this technological arms race demands ongoing innovation in both generation and detection algorithms.

Ultimately, the mathematical theory of GANs represents more than an academic exercise---it provides essential tools for understanding and shaping the future of artificial intelligence. By deepening our theoretical foundations, we not only improve technical capabilities but also develop the frameworks needed to address the profound societal challenges posed by generative technologies. As this field continues to evolve, the mathematical perspectives surveyed here will remain indispensable for researchers seeking to harness the power of adversarial learning responsibly and effectively.

"Beyond technical advancements, this research carries significant societal implications. As GANs make synthetic media generation increasingly accessible, they present both opportunities and challenges. The mathematical theory of GANs provides essential tools for understanding and shaping the future of artificial intelligence, enabling us to harness the power of adversarial learning responsibly and effectively."

6. Advances in Generative Models (2019-2025)

Since the completion of this thesis in 2019, the field of generative models has undergone revolutionary changes. While GANs continue to be influential, new paradigms have emerged that have transformed the landscape of generative AI. This addendum provides an overview of the most significant developments from 2019 to 2025.

The Rise of Diffusion Models

The most significant development has been the ascendancy of diffusion models as the dominant paradigm for generative modeling.

Denoising Diffusion Probabilistic Models (DDPM)

In 2020, Ho et al. introduced DDPMs, which simplified the diffusion process and achieved remarkable image generation quality. Unlike GANs, diffusion models are trained by gradually adding noise to data and then learning to reverse this process. The forward process is defined as:

\[q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)\]

where $\beta_t$ is a variance schedule controlling the noise addition at step $t$. The reverse process is parameterized by a neural network $\theta$ that learns to predict the noise component:

\[p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t))\]
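A convenient consequence of the Gaussian forward process is the closed form $q(x_t | x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$, so any noising step can be sampled directly. A minimal NumPy sketch (linear schedule as in Ho et al.; other values illustrative):

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)       # linear variance schedule (Ho et al.)
alpha_bar = np.cumprod(1.0 - beta)      # cumulative signal-retention factor

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) directly, without iterating through steps 1..t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)             # stand-in for a data point
for t in [0, 99, 499, 999]:
    xt = q_sample(x0, t, rng)
    # The signal coefficient sqrt(alpha_bar[t]) decays toward 0: x_t approaches pure noise.
    print(t, float(np.sqrt(alpha_bar[t])))
```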

Latent Diffusion Models

Rombach et al. (2022) introduced latent diffusion models, which operate in a compressed latent space rather than pixel space. This approach dramatically improved computational efficiency and enabled high-resolution image generation. An encoder $E$ maps input images $x$ to a latent representation $z = E(x)$, and a decoder $D$ reconstructs the image from the latent space: $\hat{x} = D(z)$. Diffusion then runs in this latent space, while the autoencoder itself is trained with a combined objective:

\[\mathcal{L} = \mathcal{L}_{\text{perceptual}} + \gamma\mathcal{L}_{\text{latent}} + \lambda\mathcal{L}_{\text{commitment}}\]

where $\mathcal{L}_{\text{perceptual}}$ ensures high-quality reconstruction, $\mathcal{L}_{\text{latent}}$ regularizes the latent space, and $\mathcal{L}_{\text{commitment}}$ prevents encoder output drift.

Stable Diffusion

The release of Stable Diffusion in 2022 by Rombach et al. democratized access to high-quality image generation. Its open-source nature and efficiency made it widely adopted, leading to an explosion of applications.

Large-Scale Text-to-Image Models

Several groundbreaking models have demonstrated unprecedented text-to-image generation capabilities:

DALL-E 2 and 3

OpenAI's DALL-E 2 (2022) and DALL-E 3 (2023) demonstrated unprecedented text-to-image generation capabilities, using a combination of diffusion models and CLIP-based text understanding. DALL-E 3 introduced significant improvements in understanding complex prompts and generating coherent text within images.

Midjourney

Midjourney, released in 2022 and continuously improved, became known for its artistic image generation capabilities and became a cultural phenomenon, particularly in creative communities.

Google's Imagen and Parti

Google introduced Imagen (2022) and Parti (2022), which pushed the boundaries of photorealistic image generation and text understanding. Imagen used a large frozen T5 text encoder to encode text, and diffusion models to generate high-fidelity images.

Video Generation

Recent advances have extended generative models to video content:

Make-A-Video

Meta's Make-A-Video (2022) extended text-to-image generation to video, using diffusion models to generate short video clips from text prompts. The approach leverages pre-trained image generation models and extends them to temporal consistency.

Sora

OpenAI's Sora (2024) represented a major breakthrough in video generation, using a diffusion transformer architecture to generate videos up to a minute long with remarkable coherence, visual quality, and consistency.

3D and Multimodal Generation

Recent advances have extended generative models to 3D content and multimodal applications:

3D Generation

Models like DreamFusion (2022) and Magic3D (2023) enabled text-to-3D generation by extending diffusion models to 3D spaces. These approaches typically use a 2D diffusion model as a prior and optimize a 3D representation (such as NeRF) to match the 2D renderings.

Multimodal Models

GPT-4 (2023) and Gemini (2023) demonstrated the power of multimodal models that can generate and understand text, images, audio, and video in a unified framework. These models represent a shift toward more general-purpose AI systems that can handle multiple modalities seamlessly.

Theoretical Advances

Significant theoretical progress has been made in understanding generative models:

Score-Based Generative Modeling

Song and Ermon (2019-2021) developed a unified framework connecting diffusion models, score-based generative models, and energy-based models, providing theoretical foundations for the success of diffusion models. The score function $s_\theta(x,t) = \nabla_x \log p_t(x)$ is learned to reverse the diffusion process.

Flow Matching

Lipman et al. (2023) introduced flow matching, a simpler and more flexible approach to generative modeling that has shown promise as an alternative to diffusion models. Flow matching directly learns a vector field that transforms noise into data, avoiding the iterative sampling process.
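The sketch below implements the straight-path (rectified-flow style) instance of this idea in PyTorch on toy one-dimensional data; it is an illustration of the training objective under that simplifying choice of path, not Lipman et al.'s exact formulation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
v = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # vector field v_theta(x, t)
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(2000):
    x1 = 2.0 + 0.5 * torch.randn(128, 1)     # data samples (toy target N(2, 0.5^2))
    x0 = torch.randn(128, 1)                 # noise samples
    t = torch.rand(128, 1)
    xt = (1 - t) * x0 + t * x1               # straight-line interpolation
    target = x1 - x0                         # velocity of the straight path
    loss = ((v(torch.cat([xt, t], dim=1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate dx/dt = v_theta(x, t) from t = 0 to t = 1 with Euler steps.
x = torch.randn(1000, 1)
with torch.no_grad():
    for i in range(100):
        t = torch.full((1000, 1), i / 100.0)
        x = x + v(torch.cat([x, t], dim=1)) / 100.0
print(x.mean().item())   # should drift toward the data mean of 2.0
```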

Consistency Models

Song et al. (2023) developed consistency models, which can generate samples in a single step, addressing the computational inefficiency of iterative sampling in diffusion models. These models learn to map any point on a probability flow trajectory to its starting point.

GANs in the New Era

While diffusion models have dominated, GANs have continued to evolve:

StyleGAN Series

StyleGAN2 (2020) and StyleGAN3 (2021) by Karras et al. continued to push the boundaries of GAN-based image generation, particularly for face synthesis. These models introduced architectural improvements and training techniques that significantly improved image quality and controlled generation.

GANs for Video

GAN-based video generators, building on earlier architectures such as MoCoGAN, have adapted the adversarial framework to video generation tasks, addressing challenges of temporal consistency and motion modeling.

Conditional and Controllable Generation

Research has focused on making GANs more controllable and interpretable, with applications in medical imaging, design, and content creation. Techniques like StyleSpace manipulation allow for fine-grained control over generated images.

Efficiency and Scalability

Recent research has focused on improving the efficiency and scalability of generative models:

Knowledge Distillation

Techniques for distilling large diffusion models into smaller, faster models have become crucial for practical applications. Methods like progressive distillation can reduce the number of sampling steps from hundreds to just a few.

Few-Shot and Zero-Shot Learning

Models like DALL-E Mini (2022) and Stable Diffusion XL (2023) have demonstrated impressive few-shot and zero-shot generation capabilities, allowing users to generate high-quality images with minimal examples or even just textual descriptions.

Societal Impact and Challenges

The improved quality of generative models has raised important societal concerns:

Deepfakes and Misinformation

The improved quality of generative models has raised concerns about deepfakes and misinformation, leading to research in detection and watermarking techniques. Watermarking methods aim to embed imperceptible signals in generated content that can be used to verify its origin.

Copyright and Legal Issues

The training of generative models on copyrighted data has led to legal challenges and debates about fair use and data rights. Several lawsuits have been filed against companies developing generative AI models, raising questions about the legality of training on copyrighted content without permission.

Bias and Fairness

Research has focused on understanding and mitigating biases in generative models, which can amplify societal biases present in training data. Techniques like dataset curation, prompt engineering, and post-processing have been developed to address these issues.

Applications and Industry Adoption

Generative models have been widely adopted across various industries:

Creative Industries

Generative models have been widely adopted in creative industries for content creation, design, and entertainment. Artists and designers use these tools as creative aids, while game developers use them for asset generation and world-building.

Healthcare

Applications in medical imaging, drug discovery, and synthetic data generation have shown significant promise. Generative models can create synthetic medical images for training diagnostic systems, helping address data scarcity and privacy concerns.

Scientific Research

Generative models have been applied to scientific problems, including protein structure prediction, material design, and climate modeling. These applications leverage the ability of generative models to explore complex, high-dimensional spaces efficiently.

Future Directions

Looking ahead, several promising directions are emerging in generative modeling:

Multimodal and Interactive Generation

Future research is likely to focus on more sophisticated multimodal generation and interactive systems that can respond to user feedback in real-time. This includes models that can generate and modify content across multiple modalities based on user input.

3D and Video Generation

Improvements in 3D and video generation quality and efficiency are active areas of research. This includes better handling of temporal consistency, physical realism, and semantic coherence in generated content.

Theoretical Understanding

Despite empirical success, the theoretical understanding of diffusion models and large-scale generative models remains incomplete, presenting opportunities for future research. This includes developing better theoretical frameworks for understanding the generalization properties, sample efficiency, and optimization dynamics of these models.

Conclusion

The period from 2019 to 2025 has seen generative AI transition from a research curiosity to a transformative technology with widespread applications. While GANs laid important foundations, diffusion models and large-scale multimodal systems have become the dominant paradigms. The field continues to evolve rapidly, with ongoing research addressing efficiency, controllability, theoretical understanding, and societal impact.

"The period from 2019 to 2025 has seen generative AI transition from a research curiosity to a transformative technology with widespread applications. While GANs laid important foundations, diffusion models and large-scale multimodal systems have become the dominant paradigms. The field continues to evolve rapidly, with ongoing research addressing efficiency, controllability, theoretical understanding, and societal impact."