Variance Transfer¶
Instead of focusing solely on function preservation, Variance Transfer [YSM23] builds on the preceding function-preserving initialization with two modifications. Inspired by studies on how to optimally transfer hyperparameters across networks of different widths [YHB+21], new fan-in weights are drawn as \(V \sim \mathcal{N}(0, 1/C_{l-2}^2)\) and new fan-out weights as \(Z \sim \mathcal{N}(0, 1/(C_{l-1}+k)^2)\), rather than with the usual \(1/\textrm{fan\_in}\) variance; and when the weight matrix is enlarged, the existing weights are rescaled as \(\boldsymbol{W}_{t+1}=\boldsymbol{W}_t \cdot \frac{C_t}{C_{t+1}}\) to preserve the activation variance.
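The growth step above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: it assumes a single fully-connected weight matrix grown by `k` output units, with `fan_in_prev` standing in for \(C_{l-2}\), and it applies the rescaling \(W_{t+1}=W_t \cdot C_t/C_{t+1}\) with \(C_t\) taken to be the width before growth.

```python
import numpy as np

def grow_width(W, k, fan_in_prev, rng):
    """Grow a (C_out, C_in) weight matrix by k new output units,
    following the Variance Transfer recipe quoted above (sketch).

    - Surviving weights are rescaled by C_t / C_{t+1} so that the
      variance of the layer's output is preserved after growth.
    - New fan-in weights are drawn V ~ N(0, 1/fan_in_prev^2),
      i.e. standard deviation 1/fan_in_prev.
    """
    C_out, _ = W.shape
    # Rescale existing weights: W_{t+1} = W_t * C_t / C_{t+1}
    W_old = W * (C_out / (C_out + k))
    # New rows: V ~ N(0, 1/C_{l-2}^2)  (std = 1/fan_in_prev)
    V = rng.normal(0.0, 1.0 / fan_in_prev, size=(k, W.shape[1]))
    return np.vstack([W_old, V])

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
W_grown = grow_width(W, k=2, fan_in_prev=3, rng=rng)
```

Growing the fan-out side of the *previous* layer with \(Z \sim \mathcal{N}(0, 1/(C_{l-1}+k)^2)\) would follow the same pattern, stacking new columns instead of rows.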
Furthermore, the learning rate is adapted to the growth cycle of each sub-network. Partitioning the weights \(\mathbf{W}_T\) of the entire network according to the growth stage \(t \in [0, T]\) at which they were added, \(\mathbf{W} = \{W_0, W_{\Delta 1}, \ldots, W_{\Delta T} \}\), each sub-network \(W_{\Delta t}\) is assigned its own learning rate, derived from the base learning rate \(\eta_0\) and the Frobenius norm \(\|\cdot\|_F\) of the sub-network's weights. This compensates for the fact that different sub-networks are trained for different numbers of epochs. Together, these modifications to the vanilla function-preserving morphism yield a surprisingly strong baseline.
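One plausible instantiation of this norm-based adaptation can be sketched as follows. The exact rule is given in [YSM23]; here we *assume*, purely for illustration, that each stage's rate is \(\eta_0\) scaled by the ratio of Frobenius norms \(\|W_{0}\|_F / \|W_{\Delta t}\|_F\), so later-added sub-networks with smaller weight norm receive a proportionally larger rate.

```python
import numpy as np

def subnetwork_lrs(subnets, eta0):
    """Assign each growth-stage sub-network its own learning rate.

    Hypothetical rule (the text only states that the rate depends on
    eta0 and Frobenius norms): scale eta0 by ||W_0||_F / ||W_dt||_F,
    where W_0 is the seed sub-network's weight matrix.
    """
    ref = np.linalg.norm(subnets[0])  # Frobenius norm for a 2-D array
    return [eta0 * ref / np.linalg.norm(W) for W in subnets]

# Example: the seed network keeps eta0; a smaller-norm later stage
# gets a proportionally larger rate.
stages = [np.eye(2) * 2.0, np.eye(2)]
rates = subnetwork_lrs(stages, eta0=0.1)
```

In a framework such as PyTorch, this would map naturally onto one optimizer parameter group per growth stage, each with its own `lr`.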
References¶
Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. In NeurIPS. 2021. arXiv:2203.03466. URL: http://arxiv.org/abs/2203.03466, doi:10.48550/arXiv.2203.03466.
Xin Yuan, Pedro Savarese, and Michael Maire. Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation. In NeurIPS. December 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/359ffa88712bd688963a0ca641d8330b-Abstract-Conference.html.