Normalizing Flows
Normalizing flows provide a way of constructing probability
distributions over continuous random variables. In flow-based
modelling, we would like to express a D-dimensional vector $\mathbf{x}$ as a transformation $T$ of a vector $\mathbf{u}$ sampled from a base distribution $p_u(\mathbf{u})$: $\mathbf{x} = T(\mathbf{u})$ where $\mathbf{u} \sim p_u(\mathbf{u})$.
The transformation $T$ must be invertible, and both $T$ and its inverse $T^{-1}$ must be differentiable. The density of $\mathbf{x}$ then follows from the change of variables: $p_x(\mathbf{x}) = p_u(\mathbf{u}) \left|\det J_T(\mathbf{u})\right|^{-1}$ where $\mathbf{u} = T^{-1}(\mathbf{x})$.
The density can also be equivalently written in terms of the Jacobian of $T^{-1}$: $p_x(\mathbf{x}) = p_u\left(T^{-1}(\mathbf{x})\right) \left|\det J_{T^{-1}}(\mathbf{x})\right|$.
In practice, a flow-based model is constructed by implementing $T$ (or $T^{-1}$) as a neural network with parameters $\phi$, and taking $p_u(\mathbf{u})$ to be a simple density, such as a multivariate Gaussian, with parameters $\psi$.
One can think of the transformation $T$ as expanding and contracting $\mathbb{R}^D$ in order to mould the base density $p_u(\mathbf{u})$ into the target density $p_x(\mathbf{x})$.
These invertible, differentiable transformations are composable: complex transformations can be constructed by composing simple ones, since the composition $T = T_K \circ \dots \circ T_1$ remains invertible and differentiable, and its Jacobian determinant is the product of the sub-flows' determinants.
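As a small illustration (the `forward`/log-determinant interface below is assumed for the sketch, not taken from the paper), composing sub-flows amounts to chaining their outputs and summing their log-determinant terms:

```python
# Hypothetical sub-flow interface for this sketch: forward(x) -> (y, log|det J|).
# Composition chains the outputs and sums the log-determinants,
# since det(J_{T2 o T1}) = det(J_T2) * det(J_T1).
def compose_forward(u, sub_flows):
    x, log_det = u, 0.0
    for t in sub_flows:
        x, ld = t.forward(x)
        log_det = log_det + ld
    return x, log_det
```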
Flow-based models provide two operations, with differing computational complexities (a minimal sketch of both follows this list):
- Sampling from the model, which requires the ability to sample from $p_u(\mathbf{u})$ and to compute the forward transformation $T$.
- Evaluating the model's density, which requires computing the inverse transformation $T^{-1}$ and its Jacobian determinant.
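As a minimal sketch of both operations (the `AffineFlow` class and its parameter names are illustrative, not a library API), an element-wise affine flow with a standard Gaussian base supports sampling via the forward transformation and density evaluation via the inverse and its log-determinant:

```python
import torch

# Illustrative element-wise affine flow: x = T(u) = u * exp(log_scale) + shift,
# with a standard Gaussian base distribution.
class AffineFlow(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.log_scale = torch.nn.Parameter(torch.zeros(dim))
        self.shift = torch.nn.Parameter(torch.zeros(dim))
        self.base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))

    def sample(self, n):
        # Operation 1: sample u ~ p_u, then apply the forward transformation T.
        u = self.base.sample((n,))
        return u * self.log_scale.exp() + self.shift

    def log_prob(self, x):
        # Operation 2: invert T, evaluate p_u, and add log |det J_{T^{-1}}(x)|.
        u = (x - self.shift) * (-self.log_scale).exp()
        log_det_inv = -self.log_scale.sum()   # Jacobian of T^{-1} is diagonal
        return self.base.log_prob(u).sum(-1) + log_det_inv
```

For instance, `AffineFlow(2).sample(5)` draws five samples, while `log_prob` evaluates the model density via the change-of-variables formula above.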
Flow-based models can represent any distribution $p_x(\mathbf{x})$, provided the target density satisfies mild conditions (for example, that it is positive everywhere).
Flow-based models for modeling and inference
Fitting a flow-based model $p_x(\mathbf{x}; \theta)$ to a target distribution $p_x^*(\mathbf{x})$ is done by minimizing a divergence between them with respect to the model parameters $\theta = \{\phi, \psi\}$.
For example, one can use the forward KL divergence:

$\mathcal{L}(\theta) = D_{\mathrm{KL}}\left[\, p_x^*(\mathbf{x}) \,\|\, p_x(\mathbf{x}; \theta) \,\right] = -\mathbb{E}_{p_x^*(\mathbf{x})}\left[ \log p_x(\mathbf{x}; \theta) \right] + \text{const.}$
The forward KL divergence is well-suited when we have samples from
the target distribution, but cannot necessarily evaluate the target
density $p_x^*(\mathbf{x})$.
Fitting the model with this objective requires computing $T^{-1}$, its Jacobian determinant, and the base density $p_u(\mathbf{u}; \psi)$; sampling from the model is not needed during training.
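A maximum-likelihood training loop, reusing the illustrative `AffineFlow` sketch above (the target samples here are synthetic stand-ins), only ever calls `log_prob`, i.e. $T^{-1}$ and its Jacobian determinant:

```python
# Forward-KL / maximum-likelihood fit: minimize the negative log-likelihood
# of (stand-in) target samples under the model. Sampling from the model is
# never needed in this loop.
flow = AffineFlow(dim=2)
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-2)
data = 2.0 * torch.randn(1024, 2) + 1.0          # synthetic samples from "p*_x"

for step in range(500):
    loss = -flow.log_prob(data).mean()           # Monte Carlo estimate of forward KL + const
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```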
The reverse KL divergence is suitable when we have the ability to
evaluate the target density $p_x^*(\mathbf{x})$ (possibly only up to a normalizing constant), but cannot necessarily sample from it.
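Conversely, a reverse-KL objective only needs samples from the base and an evaluable (possibly unnormalized) target log-density. A sketch, again using the illustrative `AffineFlow` and a hypothetical `target_log_prob` function:

```python
# Reverse-KL objective (up to an additive constant): push base samples
# through T, then compare the model's log-density with the (possibly
# unnormalized) target log-density.
def reverse_kl(flow, target_log_prob, n=256):
    u = flow.base.sample((n,))
    x = u * flow.log_scale.exp() + flow.shift                      # forward transformation T
    log_q = flow.base.log_prob(u).sum(-1) - flow.log_scale.sum()   # log p_x(x; theta)
    return (log_q - target_log_prob(x)).mean()
```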
There is some duality between the forward and reverse KL divergences in flow-based models. Because the KL divergence is invariant under an invertible change of variables, fitting the model to the target via the reverse KL divergence is equivalent to fitting the base distribution $p_u(\mathbf{u}; \psi)$ to $p_u^*(\mathbf{u})$, the distribution obtained by passing the target $p_x^*(\mathbf{x})$ through $T^{-1}$ (and analogously for the forward direction).
Alternative divergences include f-divergences, which are based on density ratios, and integral probability metrics (IPMs), which are based on density differences.
Computational Complexities
Increasing the “depth” (number of composed sub-flows) of the
transformation results in only linear, $\mathcal{O}(K)$, growth in the cost of sampling and density evaluation, where $K$ is the number of sub-flows.
One crucial operation is the computation of the Jacobian determinant.
For a generic $D$-dimensional transformation, automatic-differentiation frameworks need $\mathcal{O}(D)$ backward passes to build the full Jacobian, and taking its determinant costs a further $\mathcal{O}(D^3)$, which is prohibitive in high dimensions. Sub-flows are therefore designed so that their Jacobian determinant can be computed cheaply, typically in $\mathcal{O}(D)$ time, for example by constraining the Jacobian to be triangular (see the coupling-layer sketch after the list below).
Examples of such efficient sub-flow transformations include:
- autoregressive flows
- linear flows
- residual flows
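As a sketch of the triangular-Jacobian idea, here is an affine coupling layer (a special case of an autoregressive flow; the class below is illustrative, not a library implementation): half of the dimensions pass through unchanged and parameterize an affine transformation of the other half, so the log-determinant is an $\mathcal{O}(D)$ sum.

```python
import torch

# Illustrative affine coupling layer: the Jacobian is triangular, so its
# log-determinant is just the sum of the predicted log-scales.
class AffineCoupling(torch.nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = torch.nn.Sequential(
            torch.nn.Linear(self.d, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, u):
        u1, u2 = u[:, :self.d], u[:, self.d:]
        log_s, t = self.net(u1).chunk(2, dim=-1)
        x2 = u2 * log_s.exp() + t
        log_det = log_s.sum(-1)               # O(D): sum over the diagonal terms
        return torch.cat([u1, x2], dim=-1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        u2 = (x2 - t) * (-log_s).exp()
        return torch.cat([x1, u2], dim=-1), -log_s.sum(-1)
```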
Practical Considerations
Composing a large number of flows brings its own challenges.
Normalization
As with deep neural networks, normalizing the intermediate
representations is crucial for maintaining stable gradients throughout the flow.
Models such as Glow employ variants of batch normalization. Batch
normalization can be implemented as a composition of two affine
transformations: the first has scale and translation parameters set
by the batch statistics, and the second has free (learned) scale and translation parameters.
Glow uses a variant called activation normalization, which is preferable when training with small mini-batches since batch norm’s statistics become noisy and can destabilize training.
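A sketch of the actnorm idea for vector-valued inputs of shape (batch, dim) (the class and its data-dependent initialization below are illustrative; Glow applies this per channel to image activations): an affine transformation whose scale and bias are initialized from the first batch's statistics and trained freely thereafter.

```python
import torch

# Illustrative actnorm-style layer: affine transformation with
# data-dependent initialization from the first batch it sees.
class ActNorm(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.log_scale = torch.nn.Parameter(torch.zeros(dim))
        self.bias = torch.nn.Parameter(torch.zeros(dim))
        self.initialized = False

    def forward(self, x):
        if not self.initialized:
            with torch.no_grad():
                # First batch comes out roughly zero-mean, unit-variance.
                self.bias.copy_(-x.mean(0))
                self.log_scale.copy_(-torch.log(x.std(0) + 1e-6))
            self.initialized = True
        y = (x + self.bias) * self.log_scale.exp()
        log_det = self.log_scale.sum()        # per-example log |det J|
        return y, log_det
```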
Multi-scale architectures
Because $T$ is invertible, every intermediate representation must keep the full dimensionality $D$ of the input, which becomes expensive for high-dimensional data. Multi-scale architectures address this by clamping (factoring out) a subset of the dimensions at intermediate steps of the flow: the clamped dimensions skip the remaining sub-flows and are passed directly to the output, so only the rest are processed by the full depth of the flow.
This kind of optimization is natural when dealing with granular data types such as pixels.
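A sketch of the factoring-out step for vector-valued data (`flow_a` and `flow_b` are assumed sub-flows with the same hypothetical map-and-log-determinant interface used in the sketches above): after `flow_a`, half of the dimensions are emitted directly, and only the remaining half passes through `flow_b`.

```python
import torch

# Multi-scale sketch: after flow_a, half the dimensions are "clamped" and sent
# straight to the output; only the other half is processed by flow_b.
# flow_a and flow_b are assumed to map x -> (y, log|det J|).
def multiscale_forward(u, flow_a, flow_b):
    h, log_det_a = flow_a(u)
    d = h.shape[1] // 2
    clamped, rest = h[:, :d], h[:, d:]
    out_rest, log_det_b = flow_b(rest)
    x = torch.cat([clamped, out_rest], dim=-1)
    return x, log_det_a + log_det_b
```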
Continuous Flows
We can construct flows in continuous time by parameterizing the flow’s infinitesimal dynamics, and then integrating to find the corresponding transformation. The flow is defined by an ordinary differential equation (ODE) that describes the flow’s evolution in time.
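A minimal sketch of this construction (the dynamics network, the plain Euler integrator, and the exact trace computation below are all illustrative choices; practical implementations use adaptive ODE solvers and stochastic trace estimators): the state follows $\mathrm{d}\mathbf{z}/\mathrm{d}t = f(\mathbf{z})$, and the log-density evolves by the instantaneous change of variables, $\mathrm{d}\log p(\mathbf{z}(t))/\mathrm{d}t = -\operatorname{tr}(\partial f / \partial \mathbf{z})$.

```python
import torch

# Illustrative continuous-time flow: dynamics dz/dt = f(z) given by a small
# network (time dependence omitted for brevity), integrated with Euler steps.
# The log-density change accumulates -trace(df/dz) along the path.
f = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

def exact_divergence(fz, z):
    # Exact trace of the Jacobian df/dz, one dimension at a time (fine for small D).
    div = torch.zeros(z.shape[0])
    for i in range(z.shape[1]):
        grad_i = torch.autograd.grad(fz[:, i].sum(), z, create_graph=True)[0]
        div = div + grad_i[:, i]
    return div

def transform(u, steps=100):
    z = u.clone().requires_grad_(True)
    delta_logp = torch.zeros(u.shape[0])
    dt = 1.0 / steps
    for _ in range(steps):
        fz = f(z)
        delta_logp = delta_logp - exact_divergence(fz, z) * dt
        z = z + fz * dt
    # log p_x(x) = log p_u(u) + delta_logp, with x = z at the final time.
    return z, delta_logp
```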
Resources
- Normalizing Flows for Probabilistic Modeling and Inference (Papamakarios et al. 2019)
Bibliography
Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2019. “Normalizing Flows for Probabilistic Modeling and Inference.” http://arxiv.org/abs/1912.02762v1.