
Normalizing Flows

Normalizing flows provide a way of constructing probability distributions over continuous random variables. In flow-based modelling, we would like to express a $D$-dimensional vector $x$ as a transformation $T$ of a real vector $u$ sampled from $p_u(u)$:

$$x = T(u) \quad \text{where} \quad u \sim p_u(u)$$

The transformation $T$ must be invertible, and both $T$ and $T^{-1}$ must be differentiable. Such transformations are known as diffeomorphisms, and they require that $u$ also be $D$-dimensional. Under these conditions, the density of $x$ is well-defined and can be obtained via the change-of-variables theorem:

$$p_x(x) = p_u(u)\,\lvert\det J_T(u)\rvert^{-1} \quad \text{where} \quad u = T^{-1}(x)$$

The density can also be written equivalently in terms of the Jacobian of $T^{-1}$:

$$p_x(x) = p_u(T^{-1}(x))\,\lvert\det J_{T^{-1}}(x)\rvert$$

In practice, a flow-based model is constructed by implementing $T$ or $T^{-1}$ with a neural network, and $p_u(u)$ as a simple density such as a multivariate normal.
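To make the change-of-variables formula concrete, here is a minimal sketch in Python (NumPy) of a flow whose transformation is an affine map $T(u) = Au + b$ with a standard-normal base density. The matrix $A$, offset $b$, and function names are illustrative assumptions, not part of the reference.

```python
import numpy as np

# A minimal sketch of the change-of-variables formula, assuming an affine
# transformation T(u) = A u + b and a standard-normal base density p_u(u).
# The matrix A, offset b, and function names are illustrative, not from the text.

rng = np.random.default_rng(0)
D = 3
A = rng.normal(size=(D, D)) + 3.0 * np.eye(D)   # an (almost surely) invertible matrix
b = rng.normal(size=D)

def log_base_density(u):
    """log p_u(u) for a standard D-dimensional normal."""
    return -0.5 * (u @ u + D * np.log(2.0 * np.pi))

def log_prob(x):
    """log p_x(x) = log p_u(T^{-1}(x)) + log |det J_{T^{-1}}(x)|."""
    u = np.linalg.solve(A, x - b)            # T^{-1}(x)
    _, logabsdet = np.linalg.slogdet(A)      # J_T(u) = A for an affine map
    return log_base_density(u) - logabsdet   # |det J_{T^{-1}}(x)| = |det A|^{-1}

def sample():
    """Draw x = T(u) with u ~ p_u(u)."""
    u = rng.normal(size=D)
    return A @ u + b

x = sample()
print(log_prob(x))
```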

One can think of the transformation $T$ as expanding and contracting the space $\mathbb{R}^D$ in order to mold the density $p_u(u)$ into $p_x(x)$.

These invertible, differentiable transformations are composable, so complex transformations can be built up by composing simple ones:

$$(T_2 \circ T_1)^{-1} = T_1^{-1} \circ T_2^{-1}$$
$$\det J_{T_2 \circ T_1}(u) = \det J_{T_2}(T_1(u)) \cdot \det J_{T_1}(u)$$
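As a quick illustration of these composition rules, the following sketch composes two elementwise invertible maps and checks that the inverse reverses the order and that the Jacobian log-determinants add; the particular maps (an affine scaling and an elementwise exponential) are arbitrary illustrative choices.

```python
import numpy as np

# Sketch: composing two elementwise invertible maps and checking that the
# inverse of the composition reverses the order and that the Jacobian
# log-determinants add. The particular maps are arbitrary choices.

def t1(u):                 # T1: elementwise affine, z = 2u + 1
    return 2.0 * u + 1.0

def t1_inv(z):
    return (z - 1.0) / 2.0

def logdet_j_t1(u):
    return u.size * np.log(2.0)    # diagonal Jacobian with entries 2

def t2(z):                 # T2: elementwise exponential, x = exp(z)
    return np.exp(z)

def t2_inv(x):
    return np.log(x)

def logdet_j_t2(z):
    return np.sum(z)               # d exp(z_i)/dz_i = exp(z_i), so log-det = sum z_i

u = np.array([0.3, -1.2, 0.7])
x = t2(t1(u))                                 # sampling direction: x = (T2 ∘ T1)(u)
u_back = t1_inv(t2_inv(x))                    # (T2 ∘ T1)^{-1} = T1^{-1} ∘ T2^{-1}
logdet = logdet_j_t2(t1(u)) + logdet_j_t1(u)  # det rule from above, in log space
print(np.allclose(u, u_back), logdet)
```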

Flow-based models provide two operations, with differing computational complexity:

  1. Sampling from the model, which requires the ability to sample from $p_u(u)$ and to compute the forward transformation $T$.
  2. Evaluating the model’s density, which requires computing the inverse transformation $T^{-1}$ and its Jacobian determinant.

Flow-based models can represent any distribution $p_x(x)$, under reasonable conditions on $p_x(x)$.

Flow-based models for modeling and inference

Fitting a flow-based model $p_x(x;\theta)$ to a target distribution $p^*_x(x)$ can be done by minimizing some divergence or discrepancy between them. The minimization is performed with respect to the model’s parameters $\theta = \{\phi, \psi\}$, where $\phi$ are the parameters of $T$ and $\psi$ are the parameters of $p_u(u)$.

For example, one can use the forward KL divergence:

$$
\begin{aligned}
\mathcal{L}(\theta) &= D_{\mathrm{KL}}\left[\,p^*_x(x)\,\|\,p_x(x;\theta)\,\right] \\
&= -\mathbb{E}_{p^*_x(x)}\left[\log p_x(x;\theta)\right] + \text{const.} \\
&= -\mathbb{E}_{p^*_x(x)}\left[\log p_u\left(T^{-1}(x;\phi);\psi\right) + \log\left|\det J_{T^{-1}}(x;\phi)\right|\right] + \text{const.}
\end{aligned}
$$

The forward KL divergence is well-suited to situations where we have samples from the target distribution but cannot necessarily evaluate the target density $p^*_x(x)$. We can estimate the expectation by Monte Carlo using samples from $p^*_x(x)$.

Fitting the model requires computing $T^{-1}$, its Jacobian determinant, and the density $p_u(u;\psi)$, and differentiating through all three. We do not need to compute $T$ or sample from $p_u(u;\psi)$, although these operations will be needed if we want to sample from the model after fitting.
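The following is a minimal sketch of this fitting procedure for a one-dimensional affine flow $x = au + b$ with a standard-normal base, trained by gradient ascent on the Monte Carlo estimate of the expected log-likelihood (the forward KL objective up to a constant). The target distribution, learning rate, and parameter names are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of maximum-likelihood (forward KL) fitting for a
# one-dimensional affine flow x = a*u + b with u ~ N(0, 1). The target
# (a Gaussian with mean 5 and standard deviation 2), the learning rate,
# and the parameter names are illustrative assumptions.

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)   # samples from the target p*_x(x)

a, b = 1.0, 0.0        # parameters phi of T(u) = a*u + b
lr = 0.1
for _ in range(5000):
    u = (data - b) / a                   # T^{-1}(x) for each sample
    # Mean log-likelihood: E[log p_u(u)] - log|a|; ascend its gradient.
    grad_b = np.mean(u) / a
    grad_a = np.mean(u**2) / a - 1.0 / a
    a += lr * grad_a
    b += lr * grad_b

print(a, b)   # should end up close to the target's std (2) and mean (5)
```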

The reverse KL divergence is suitable when we have the ability to evaluate the target density $p^*_x(x)$, but are not necessarily able to sample from it.
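Written out in the same notation (using the change-of-variables formula in the forward direction), the reverse KL objective becomes an expectation under the base distribution, which is what makes it computable without samples from the target:

$$\mathcal{L}(\theta) = D_{\mathrm{KL}}\left[\,p_x(x;\theta)\,\|\,p^*_x(x)\,\right] = \mathbb{E}_{p_u(u;\psi)}\left[\log p_u(u;\psi) - \log\left|\det J_T(u;\phi)\right| - \log p^*_x(T(u;\phi))\right]$$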

There is a duality between the forward and reverse KL divergence in flow-based models. Fitting the model to the target via the reverse KL divergence is equivalent to fitting $p^*_u(u;\phi)$, the base-space distribution induced by the target (i.e. the distribution of $u = T^{-1}(x;\phi)$ with $x \sim p^*_x(x)$), to the base $p_u(u;\psi)$ via the forward KL divergence.

Alternative divergences include f-divergences, which use density ratios, and integral probability metrics (IPMs), which use density differences.

Computational Complexities

Increasing the “depth” (number of composed sub-flows) of the transformation results in only $O(K)$ growth in computational complexity, where $K$ is the depth of the flow.

One crucial operation is the computation of the Jacobian determinant. In automatic-differentiation frameworks, this has a computational cost of $O(D^3)$, where $D$ is the number of inputs and outputs of the neural network. For practical applications we choose neural network designs that reduce the cost to $O(D)$; a sketch of one such design follows the list below.

Examples of such efficient sub-flow transformations include:

  • autoregressive flows
  • linear flows
  • residual flows
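As a sketch of the kind of structure that makes the determinant cheap, consider an affine autoregressive-style transform whose Jacobian is lower triangular, so the log-determinant is just a sum over the diagonal. The fixed strictly lower-triangular conditioner below stands in for a masked neural network; all names and values are illustrative.

```python
import numpy as np

# Sketch of a structured (lower-triangular) Jacobian: an affine
# autoregressive-style transform x_i = u_i * exp(s_i) + m_i, where s_i and
# m_i depend only on u_{<i}. The log-determinant is then the sum of the
# log-scales, an O(D) computation. The fixed strictly lower-triangular
# matrix is a stand-in for a masked neural-network conditioner.

rng = np.random.default_rng(0)
D = 5
W = np.tril(rng.normal(size=(D, D)), k=-1)   # strictly lower triangular

def forward(u):
    h = W @ u                  # h_i depends only on u_{<i}
    s = np.tanh(h)             # log-scales
    m = h                      # shifts (same autoregressive structure)
    x = u * np.exp(s) + m
    logdet = np.sum(s)         # triangular Jacobian: log|det J_T(u)| = sum_i s_i
    return x, logdet

u = rng.normal(size=D)
x, logdet_cheap = forward(u)

# Brute-force check against a finite-difference Jacobian (an O(D^3) determinant):
eps = 1e-6
J = np.stack([(forward(u + eps * e)[0] - x) / eps for e in np.eye(D)], axis=1)
print(logdet_cheap, np.linalg.slogdet(J)[1])   # the two values should match closely
```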

Practical Considerations

Composing a large number of flows brings its own challenges.

Normalization

As with deep neural networks, normalizing the intermediate representations is crucial for maintaining stable gradients throughout the flow. Models such as Glow employ variants of batch normalization. Batch normalization can be implemented as a composition of two affine transformations. The first has scale and translation parameters set by the batch statistics, and the second has free parameters $\alpha$ (scale) and $\beta$ (translation):

$$\mathrm{BN}(z) = \alpha \frac{z - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \beta, \qquad \mathrm{BN}^{-1}(z) = \hat{\mu} + \frac{z - \beta}{\alpha}\sqrt{\hat{\sigma}^2 + \epsilon}$$

Glow uses a variant called activation normalization, which is preferable when training with small mini-batches since batch norm’s statistics become noisy and can destabilize training.
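A small sketch of batch normalization as an invertible affine map, following the $\mathrm{BN}$ / $\mathrm{BN}^{-1}$ formulas above with the batch statistics treated as fixed constants; the batch and the $\alpha$, $\beta$ values are arbitrary.

```python
import numpy as np

# Sketch of batch normalization as an invertible affine transformation,
# following the BN / BN^{-1} formulas above. Batch statistics are treated
# as fixed constants; alpha and beta are the free parameters.

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(256, 4))   # a batch of intermediate values

mu_hat = z.mean(axis=0)
var_hat = z.var(axis=0)
eps = 1e-5
alpha = np.array([1.5, 0.5, 2.0, 1.0])              # scale (free parameter)
beta = np.array([0.0, 1.0, -1.0, 0.5])              # translation (free parameter)

def bn(z):
    return alpha * (z - mu_hat) / np.sqrt(var_hat + eps) + beta

def bn_inv(y):
    return mu_hat + (y - beta) / alpha * np.sqrt(var_hat + eps)

# Per-example log|det J_BN| is constant: sum_i log(|alpha_i| / sqrt(var_i + eps))
logdet = np.sum(np.log(np.abs(alpha)) - 0.5 * np.log(var_hat + eps))

print(np.allclose(bn_inv(bn(z)), z), logdet)
```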

Multi-scale architectures

Because $x$ and $u$ must have the same dimensionality, and each sub-flow $T_k$ must preserve this dimensionality, the transformations can be extremely expensive. To combat this issue, one can clamp sub-dimensions of the intermediate representation $z_k$ so that no additional transformation is applied to them. Doing so allows subsequent steps to operate on a subset of the dimensions, which is less costly.

This kind of optimization is natural when dealing with granular data types such as pixels.
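A rough sketch of the clamping idea, where half of the dimensions are passed through unchanged and only the rest are transformed further; the elementwise exponential stands in for a real sub-flow, and the even split is an arbitrary choice.

```python
import numpy as np

# Rough sketch of clamping in a multi-scale architecture: half the dimensions
# are passed through unchanged ("clamped") to the output and only the
# remaining half is transformed further. The elementwise exponential is a
# placeholder for a real sub-flow; the even split is an arbitrary choice.

def step(z):
    d = z.shape[-1] // 2
    clamped, active = z[..., :d], z[..., d:]
    transformed = np.exp(active)             # placeholder invertible sub-flow
    logdet = np.sum(active, axis=-1)         # log|det| of the elementwise exp
    return clamped, transformed, logdet

z = np.random.default_rng(0).normal(size=8)
clamped, active, logdet = step(z)
# 'clamped' goes straight to the output; only 'active' (now 4-dimensional)
# is fed to the next, cheaper step.
print(clamped.shape, active.shape, logdet)
```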

Continuous Flows

We can construct flows in continuous time by parameterizing the flow’s infinitesimal dynamics, and then integrating to find the corresponding transformation. The flow is defined by an ordinary differential equation (ODE) that describes the flow’s evolution in time.
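A minimal sketch of this idea, assuming simple linear dynamics $\mathrm{d}z/\mathrm{d}t = Az$ in place of a neural network: the state and its log-density are integrated jointly with Euler steps, using the fact that for continuous flows the log-density changes at a rate equal to the negative trace of the Jacobian of the dynamics.

```python
import numpy as np

# A minimal sketch of a continuous-time flow, assuming linear dynamics
# dz/dt = A z as a stand-in for a neural network. The state z(t) and its
# log-density are integrated jointly with Euler steps; the log-density
# evolves at a rate of -trace(d(Az)/dz) = -trace(A).

rng = np.random.default_rng(0)
D = 3
A = 0.1 * rng.normal(size=(D, D))

def dynamics(z, t):
    return A @ z          # dz/dt (time-independent in this toy example)

z = rng.normal(size=D)                               # u ~ p_u(u), standard normal
log_p = -0.5 * (z @ z + D * np.log(2.0 * np.pi))     # log p_u(u)

dt, t_final = 0.01, 1.0
for k in range(int(t_final / dt)):
    z = z + dt * dynamics(z, k * dt)   # Euler step for the state
    log_p = log_p - dt * np.trace(A)   # Euler step for the log-density

print(z, log_p)   # x = T(u) and an approximation of log p_x(x)
```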

Resources

Bibliography

Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2019. “Normalizing Flows for Probabilistic Modeling and Inference.” http://arxiv.org/abs/1912.02762v1.
