Variance Transfer¶
TLDR: Function-preserving growth with i) good weight initialisation and ii) growth-aware learning rates goes a long way.
Many growing methods frame growth as the solution of a local optimisation problem for the new weights at each growth step. Instead, Variance Transfer [YSM23] uses an (approximately) function-preserving initialisation and focuses on training dynamics, preserving desirable statistical properties that benefit future optimisation of the network. Variance Transfer has four main components:
Maximal Update Parameterisation [YHB+21] for the learning rate and weight initialisation.
Function-preserving transformations with random weights, rather than e.g. splitting.
Adapting the learning rate to different growth stages.
(Optionally) Adapting the batch size to maximise GPU throughput and reduce training time.
1. Maximal update parameterisation (\(\mu P\))¶
In general, the optimal choices of hyperparameters such as the learning rate and weight-initialisation variance are not the same for networks of different sizes. How should these hyperparameters depend on the layer size?
Consider layers of the form \(y_l = W_l x_l + b_l\), and assume that each element of \(x_l\) and \(W_l\) is independently distributed and that the biases are zero. The variance of the preactivations \(y_l\) is then given by

\[
\textrm{Var}[y_l] = \textrm{Var}\Big[\sum_{i=1}^{n_l} w_{l,i}\, x_{l,i}\Big] = n_l \,\textrm{Var}[w_l x_l] = n_l \,\textrm{Var}[w_l]\, \mathbb{E}[x_l^2],
\]

where \(n_l\) denotes the \(\textrm{fan\_in}\) of layer \(l\).
The last equality assumes that the weights have zero mean, although the inputs \(x_l\) may not. For linear activation functions \(\mathbb{E}[x_l^2] = \textrm{Var}[y_{l-1}]\), but for ReLU activations \(\mathbb{E}[x_l^2] = \tfrac{1}{2} \textrm{Var}[y_{l-1}]\) and thus

\[
\textrm{Var}[y_l] = \tfrac{1}{2}\, n_l \,\textrm{Var}[w_l]\, \textrm{Var}[y_{l-1}].
\]
In order to preserve the magnitude of the pre-activations from layer to layer, Kaiming Initialisation proposes to set \(\tfrac{1}{2} n_l \textrm{Var}[w_l] = 1\), thus initialising each layer’s weights with a zero-mean Gaussian with variance \(\propto 1 / \textrm{fan\_in}\).
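As a quick numerical sanity check of this derivation (an illustrative sketch, not taken from the paper), the preactivation variance stays roughly constant with depth under Kaiming initialisation:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, n_layers, batch = 1024, 10, 4096

x = rng.standard_normal((batch, fan_in))
for _ in range(n_layers):
    # Kaiming initialisation: zero-mean Gaussian with Var[w] = 2 / fan_in,
    # i.e. (1/2) * fan_in * Var[w] = 1.
    W = rng.standard_normal((fan_in, fan_in)) * np.sqrt(2.0 / fan_in)
    y = x @ W.T              # pre-activations
    x = np.maximum(y, 0.0)   # ReLU
    # Var[y] settles to a constant (about 2 for standard-normal inputs)
    # instead of exploding or vanishing with depth.
    print(f"Var[y] = {y.var():.2f}")
```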
This “PyTorch” parameterisation is not the only option. In the large-width limit, gradient updates under the PyTorch parameterisation cause the activations to blow up. Maximal Update Parameterisation (\(\mu P\), [YHB+21]) therefore imposes an alternative requirement: the effect of a gradient step on the activations should be approximately width-independent in the large-width limit. This results in the following parameterisation (with the PyTorch parameterisation in parentheses where it differs):
|  | Input weights & all biases | Output weights | Hidden weights |
|---|---|---|---|
| Init. Var. | \(\frac{1}{\mathrm{fan\_in}}\) | \(\frac{1}{\mathrm{fan\_in}^2}\) \(\left(\frac{1}{\mathrm{fan\_in}}\right)\) | \(\frac{1}{\mathrm{fan\_in}}\) |
| SGD LR | \(\mathrm{fan\_out}\) \((1)\) | \(\frac{1}{\mathrm{fan\_in}}\) \((1)\) | \(1\) |
Empirically, when \(\mu P\) is applied to finite-width models, the optimal choice of hyperparameters such as the learning rate remains roughly constant, largely independent of layer size. This motivates its use both for hyperparameter transfer between networks of different sizes [YHB+21] and, more generally, for growing neural networks.
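As an illustration of the table above, a minimal sketch of a three-layer MLP with \(\mu P\) initialisation and per-layer SGD learning rates (the widths, base learning rate, and the grouping of biases with their layer's weights are illustrative choices, not from the paper):

```python
import torch
from torch import nn

def build_mup_mlp(d_in: int, d_hidden: int, d_out: int, base_lr: float = 0.1):
    lin_in = nn.Linear(d_in, d_hidden)       # input weights & biases
    lin_hid = nn.Linear(d_hidden, d_hidden)  # hidden weights
    lin_out = nn.Linear(d_hidden, d_out)     # output weights

    # Init. Var.: 1/fan_in for input and hidden weights, 1/fan_in^2 for output weights.
    nn.init.normal_(lin_in.weight, std=(1.0 / d_in) ** 0.5)
    nn.init.normal_(lin_hid.weight, std=(1.0 / d_hidden) ** 0.5)
    nn.init.normal_(lin_out.weight, std=1.0 / d_hidden)
    for lin in (lin_in, lin_hid, lin_out):
        nn.init.zeros_(lin.bias)  # biases zero-initialised for simplicity

    model = nn.Sequential(lin_in, nn.ReLU(), lin_hid, nn.ReLU(), lin_out)

    # SGD LR: fan_out * lr for the input layer, lr for hidden, lr / fan_in for output.
    optimizer = torch.optim.SGD(
        [
            {"params": lin_in.parameters(), "lr": base_lr * d_hidden},
            {"params": lin_hid.parameters(), "lr": base_lr},
            {"params": lin_out.parameters(), "lr": base_lr / d_hidden},
        ],
        lr=base_lr,
    )
    return model, optimizer

model, optimizer = build_mup_mlp(d_in=32, d_hidden=256, d_out=10)
```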
2. Function-preserving growth¶
Figure: Variance Transfer growth for hidden layers, expanding the \(\rm fan\_in\) from \(C_t^u\) to \(C_{t+1}^u\) and the \(\rm fan\_out\) from \(C_t^h\) to \(C_{t+1}^h\).
Splitting individual neurons is not the only form of function-preserving growth. Indeed, for any matrices \(V \in \mathbb{R}^{k/2\times C_{l-2}}\) and \(Z \in \mathbb{R}^{C_l \times k/2}\), adding \(k\) new neurons to layer \(l-1\) with a minus sign inserted,

\[
W_{l-1} \rightarrow \begin{pmatrix} W_{l-1} \\ V \\ V \end{pmatrix},
\qquad
W_{l} \rightarrow \begin{pmatrix} W_{l} & Z & -Z \end{pmatrix},
\]

ensures that the contributions of the new weights cancel, preserving the network function. See the Neuron Addition Problem for details on this notation. The new weights \(V\) and \(Z\) are initialised following \(\mu P\), i.e. as zero-mean Gaussians with variance \(\propto 1/\textrm{fan\_in}\).
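A minimal numerical check of this construction (illustrative shapes and scales; the rescaling of the pre-existing weights is discussed next):

```python
import numpy as np

rng = np.random.default_rng(0)
C_prev, C_mid, C_next, k = 8, 16, 12, 4   # add k new neurons (k even) to the middle layer

W1 = rng.standard_normal((C_mid, C_prev))   # W_{l-1}
W2 = rng.standard_normal((C_next, C_mid))   # W_l
x = rng.standard_normal(C_prev)
relu = lambda a: np.maximum(a, 0.0)

out_before = W2 @ relu(W1 @ x)

# New incoming weights V are duplicated; new outgoing weights appear as +Z and -Z.
V = rng.standard_normal((k // 2, C_prev)) * (1.0 / C_prev) ** 0.5   # muP-style scale (illustrative)
Z = rng.standard_normal((C_next, k // 2)) * (1.0 / C_mid) ** 0.5
W1_new = np.vstack([W1, V, V])
W2_new = np.hstack([W2, Z, -Z])

out_after = W2_new @ relu(W1_new @ x)
print(np.allclose(out_before, out_after))   # True: the +Z and -Z contributions cancel
```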
To preserve variance, the old weights are rescaled as \(\boldsymbol{W}_{t+1}=\boldsymbol{W}_t \cdot \frac{C_t}{C_{t+1}}\). This is an approximation that only holds strictly at initialisation. More carefully, as described in Appendix A of [YSM23], one can explicitly enforce unit variance of the preactivations, \(\textrm{Var}[y_l] = 1\), after growth by rescaling based on the empirical weight variance rather than simply the \(\textrm{fan\_in}\). In practice, however, this does not outperform the \(\textrm{fan\_in}\) approximation.
The running mean \(\mu\) and variance \(\sigma^2\) of Batch Normalization layers are also rescaled: for a scale factor \(c\), the running mean becomes \(c \mu\) and the running variance \(c^2 \sigma^2\).
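A sketch of this rescaling for a linear layer whose \(\textrm{fan\_in}\) has grown, assuming the scale factor \(c\) equals the same \(\textrm{fan\_in}\) ratio and that the Batch Normalization layer directly follows the grown layer (both assumptions are mine, for illustration):

```python
import torch
from torch import nn
from typing import Optional

@torch.no_grad()
def variance_rescale(layer: nn.Linear, fan_in_old: int, fan_in_new: int,
                     bn: Optional[nn.BatchNorm1d] = None) -> None:
    """Fan_in-approximation rescaling after growth (a sketch, not the paper's code)."""
    c = fan_in_old / fan_in_new
    # Rescale the pre-existing weights, W_{t+1} = W_t * C_t / C_{t+1};
    # the newly added input columns keep their fresh muP initialisation.
    layer.weight[:, :fan_in_old].mul_(c)
    if bn is not None:
        # Running statistics of the following BatchNorm: mean -> c * mean, var -> c^2 * var.
        bn.running_mean.mul_(c)
        bn.running_var.mul_(c ** 2)
```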
3. Learning-rate adaptation¶
Typically, the learning rate is global: the same for all layers in the network. Following \(\mu P\), Variance Transfer instead assigns a layer-dependent learning rate that scales with the layer's width (\(\textrm{fan\_in}\)/\(\textrm{fan\_out}\)), as given in the \(\mu P\) table above.
Furthermore, the learning rate is adapted to the growth cycle of each sub-network. Partitioning the weights \(\mathbf{W}\) of the entire network according to the growth stage \(t \in \{0, \ldots, T\}\) at which they were added, \(\mathbf{W} = \{W_0, W_{\Delta 1}, \ldots, W_{\Delta T} \}\), each sub-network \(W_{\Delta t}\) is assigned its own learning rate \(\eta_{\Delta t}\), obtained by rescaling the base learning rate \(\eta_0\) as a function of the Frobenius norms \(\|\cdot\|_F\) of the sub-network weights. This compensates for the fact that different sub-networks are trained for different numbers of epochs.
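Mechanically, this amounts to one optimiser parameter group per growth stage; a minimal sketch, where the per-stage learning rates \(\eta_{\Delta t}\) are assumed to be computed elsewhere from the paper's Frobenius-norm rule:

```python
import torch

def lra_sgd(stage_params, stage_lrs, momentum=0.9, weight_decay=5e-4):
    """Build SGD with one parameter group per growth stage.

    stage_params[t] holds the weights W_{Delta t} added at stage t, and
    stage_lrs[t] the corresponding learning rate eta_{Delta t}."""
    groups = [{"params": params, "lr": lr} for params, lr in zip(stage_params, stage_lrs)]
    return torch.optim.SGD(groups, lr=stage_lrs[0],
                           momentum=momentum, weight_decay=weight_decay)
```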
4. Batch size adaptation¶
One of the motivations for growing neural networks is accelerated training; however, a reduction in parameter count does not generally translate into a significant wall-clock speedup. For example, for ResNet-20 on CIFAR-10, growing alone only results in a 15% speedup. A variant of Variance Transfer is therefore proposed which also scales the batch size to maximise GPU utilisation throughout growth, resulting in a 70% speedup in exchange for a small decrease in performance.
5. When, how many, and where to grow?¶
Where to grow?¶
Like Net2Net, the paper grows the width of all layers simultaneously.
How many?¶
The paper proposes an exponential growth schedule for the number of channels \(C_t\) at growth stage \(t\), controlled by a growth rate \(p_C\). A typical value is \(p_C = 0.2\), which results in 8 growth steps.
When to grow?¶
The paper also proposes an exponential schedule for the training time: each growth stage \(t\) is assigned \(T_t\) epochs, growing at rate \(p_T\), with typical values \(p_T = p_C\), \(T_0 \in [4,10]\) and \(T_\textrm{final} \in [90, 200]\) epochs.
This was compared with a constant growth schedule (\(p_T = 0\)), which performs similarly to the exponential schedule. The authors claim that the exponential schedule (i.e. \(p_T > 0\)) performs better, but the experiments are not fully convincing.
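As a concrete illustration of such schedules, a small sketch assuming the multiplicative form \(x_t = \lceil x_0 (1 + p)^t \rceil\) for both channels and epochs (the exact formulas in the paper may differ; the starting values are illustrative):

```python
import math

def exponential_schedule(x0, rate, n_steps):
    """x_t = ceil(x_0 * (1 + rate)^t) for t = 0, ..., n_steps."""
    return [math.ceil(x0 * (1 + rate) ** t) for t in range(n_steps + 1)]

# Channels per growth stage (p_C = 0.2, 8 growth steps, illustrative C_0 = 16).
print(exponential_schedule(16, 0.2, 8))  # [16, 20, 24, 28, 34, 40, 48, 58, 69]
# Epochs per growth stage (p_T = p_C = 0.2, T_0 = 8).
print(exponential_schedule(8, 0.2, 8))   # [8, 10, 12, 14, 17, 20, 24, 29, 35]
```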
Results¶
The following table ablates the components of Variance Transfer for ResNets on CIFAR-10/100: the function-preserving growth morphism (Growing, vs Net2Net), variance rescaling of the old weights (VRS), and learning-rate adaptation (LRA), as well as all components combined (Full). The full method performs similarly to the non-grown baseline.
| Variant | Res-20 on C-10 (%) | Res-18 on C-100 (%) |
|---|---|---|
| Net2Net | \(91.60 \pm 0.21\) (+0.00) | \(76.48 \pm 0.20\) (+0.00) |
| Growing | \(91.62 \pm 0.12\) (+0.02) | \(76.82 \pm 0.17\) (+0.34) |
| Growing+VRS | \(92.00 \pm 0.10\) (+0.40) | \(77.27 \pm 0.14\) (+0.79) |
| Growing+LRA | \(92.24 \pm 0.11\) (+0.64) | \(77.74 \pm 0.16\) (+1.26) |
| Full | \(92.53 \pm 0.11\) (+0.93) | \(78.12 \pm 0.15\) (+1.64) |
| Non-growing baseline | \(92.62 \pm 0.15\) | \(78.36 \pm 0.12\) |
Hyperparameters:
Growth schedule: \(p_T = p_C = 0.2\), \(T_0 = 8, 10\) for C-10, C-100 respectively, \(T_\textrm{final} = 100, 200\) for C-10, C-100 respectively.
Optimizer: SGD with momentum 0.9, weight decay \(5 \cdot 10^{-4}\), base learning rate \(\eta_0 = 0.1\) with cosine decay.
Open Questions¶
Variance Transfer’s growth morphism alone (the Growing row above) improves performance over Net2Net on CIFAR-100 but not on CIFAR-10. This difference may be due to the increased propensity to overfit on CIFAR-100, since random weight initialisation can provide additional regularisation compared to neuron splitting.
In general, scaling the batch size requires simultaneously scaling the learning rate in order to preserve the training dynamics, see e.g. [GDG+17]. However, Variance Transfer’s batch-size adaptation does not do this.
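For reference, the linear scaling rule of [GDG+17] is simply (values illustrative):

```python
def linear_scaling_rule(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale the learning rate proportionally to the batch size
    (Goyal et al. combine this with a warmup phase)."""
    return base_lr * new_batch / base_batch

print(linear_scaling_rule(0.1, 128, 512))  # 0.4
```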
References¶
[GDG+17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677v2, 2017. URL: https://arxiv.org/abs/1706.02677v2.
[YHB+21] Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. In NeurIPS, 2021. arXiv:2203.03466. URL: http://arxiv.org/abs/2203.03466, doi:10.48550/arXiv.2203.03466.
[YSM23] Xin Yuan, Pedro Savarese, and Michael Maire. Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation. In NeurIPS, December 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/359ffa88712bd688963a0ca641d8330b-Abstract-Conference.html.