Variance Transfer
=================

Instead of focusing solely on function preservation, Variance Transfer :cite:p:`yuan_accelerated_2023` builds upon the preceding function-preserving initialization with the following modifications. Inspired by studies on how to optimally transfer hyperparameters across networks of different widths :cite:p:`yang_tensor_2021`, new fan-in weights are drawn as :math:`V \sim \mathcal{N}(0, 1/C_{l-2}^2)` and new fan-out weights as :math:`Z \sim \mathcal{N}(0, 1/(C_{l-1}+k)^2)`, rather than with the usual :math:`1/\textrm{fan\_in}` variance, and the existing weights are rescaled when the weight matrix is enlarged, :math:`\boldsymbol{W}_{t+1}=\boldsymbol{W}_t \cdot \frac{C_t}{C_{t+1}}`, to preserve variance. Furthermore, the learning rate is adapted to the growth cycle of each sub-network. Partitioning the weights :math:`\boldsymbol{W}_T` of the entire network according to the growth stage :math:`t \in [0, T]` at which they were added, :math:`\boldsymbol{W} = \{\boldsymbol{W}_0, \boldsymbol{W}_{\Delta 1}, \ldots, \boldsymbol{W}_{\Delta T} \}`, each sub-network :math:`\boldsymbol{W}_{\Delta t}` is assigned the learning rate

.. math::

   \eta_t = \eta_0 \, \frac{\|\boldsymbol{W}_{\Delta t}\|_F}{\|\boldsymbol{W}_{0}\|_F},

where :math:`\eta_0` is the base learning rate and :math:`\|\cdot\|_F` is the Frobenius norm. This compensates for the fact that different sub-networks are trained for different numbers of epochs. These modifications to the vanilla function-preserving morphism result in a surprisingly strong baseline.
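The growth step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer widths, base learning rate, and the choice of a single fan-in dimension are hypothetical, and the new-column variance simply instantiates the modified fan-in scheme quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical widths: the layer grows from C_old to C_new inputs.
C_old, C_new, fan_out = 64, 96, 128

# Existing weights, initialized with the usual 1/fan_in variance.
W0 = rng.normal(0.0, np.sqrt(1.0 / C_old), size=(fan_out, C_old))

# Growth: rescale existing weights by C_t / C_{t+1} to preserve variance.
W_scaled = W0 * (C_old / C_new)

# New fan-in columns drawn with the modified variance 1 / C^2
# (instead of the standard 1 / fan_in).
W_delta = rng.normal(0.0, np.sqrt(1.0 / C_old**2),
                     size=(fan_out, C_new - C_old))

# Widened weight matrix: old (rescaled) block plus new block.
W_grown = np.concatenate([W_scaled, W_delta], axis=1)

# Per-stage learning rate: scale the base rate by the ratio of
# Frobenius norms between this stage's weights and the initial ones.
eta0 = 0.1  # hypothetical base learning rate
eta_t = eta0 * np.linalg.norm(W_delta) / np.linalg.norm(W0)
```

Because the freshly added block has a much smaller norm than the mature sub-network, ``eta_t`` comes out smaller than ``eta0``, matching the intuition that newer weights have been trained for fewer epochs.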