When to grow?
================

Much of the focus of the depth-growing literature has been on *when to grow*. Counter-intuitively, waiting for the current network to converge fully harms performance; new layers should instead be grown well before convergence :cite:p:`wen_autogrow_2020,dong_towards_2020,wu_when_2024`. Two explanations are commonly suggested. First, the converged weights of the current sub-network :math:`\boldsymbol{W}_t` may provide a poor warm-start initialisation for optimising the larger network. Second, the newly-added sub-networks may simply be undertrained :cite:p:`wu_when_2024`, giving rise to a regularising effect and possibly finding flatter minima than standard training :cite:p:`caillon_growing_2024`.

The key growth schedules proposed are:

- *Periodic growth*: grow every :math:`K` epochs.
- *Convergent growth*: grow when the increase in validation accuracy over the last :math:`K` epochs is less than :math:`\tau`.
- *FraGrow*: arguing that the speed of growth determines the degree of under- or overfitting, FraGrow :cite:p:`wu_when_2024` uses the gap between training and validation accuracy as the signal that triggers growth.
- *LipGrow*: grow so as to limit increases in the Lipschitz constant :cite:p:`dong_towards_2020`.

FraGrow, although heuristic, performs well on a wide range of datasets. Nevertheless, periodic growth is a simple and widely used baseline, and it outperforms convergent growth. It is currently unclear how many of these observations generalise beyond layer addition to provide a general answer to *when to grow*. Residual networks, the focus of much of the literature, have unusual properties: a network with :math:`n` residual connections has :math:`2^n` implicit paths through the network, giving rise to ensemble-like behaviour, and removing any individual layer (apart from downsampling layers) has a negligible impact on test accuracy :cite:p:`veit_residual_2016`.
Working in reverse, we might expect that growing residual layers shares some similarities with adding ensemble members.
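The trigger conditions behind the schedules above can be made concrete with a small sketch. This is an illustrative simplification, not the authors' exact formulations: the thresholds :math:`K` and :math:`\tau`, the ``gap_threshold`` parameter, and the function names are all assumptions introduced here, and the FraGrow rule is reduced to a plain train/validation accuracy-gap test.

.. code-block:: python

    def periodic_trigger(epoch: int, K: int = 10) -> bool:
        """Periodic growth: grow every K epochs."""
        return epoch > 0 and epoch % K == 0


    def convergent_trigger(val_acc_history: list[float],
                           K: int = 10, tau: float = 0.005) -> bool:
        """Convergent growth: grow when validation accuracy has
        improved by less than tau over the last K epochs."""
        if len(val_acc_history) <= K:
            return False  # not enough history yet
        return val_acc_history[-1] - val_acc_history[-1 - K] < tau


    def fragrow_style_trigger(train_acc: float, val_acc: float,
                              gap_threshold: float = 0.05) -> bool:
        """A simplified reading of FraGrow: grow when the gap between
        training and validation accuracy (an overfitting signal)
        exceeds a threshold."""
        return train_acc - val_acc > gap_threshold

A LipGrow-style trigger is omitted because it requires estimating the network's Lipschitz constant, which depends on the architecture at hand.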