When to grow?

Much of the depth-growing literature has focused on when to grow. Counter-intuitively, waiting for the current network to fully converge harms performance; instead, new layers should be grown well before convergence [DLLS20, WYCL20, WWM+24]. Two explanations are commonly suggested. First, the converged weights of the current sub-network \(\boldsymbol{W}_t\) may provide a poor warm-start initialisation for optimising the larger network. Second, the newly added sub-networks may simply be undertrained [WWM+24], giving rise to a regularising effect and possibly finding flatter minima than standard training does [CC24]. The key growth schedules proposed are:

  • Periodic Growth: grow every \(K\) epochs.

  • Convergent Growth: grow when the increase in validation accuracy over the last \(K\) epochs is less than \(\tau\).

  • FraGrow: arguing that the speed of growth determines the degree of under/overfitting, FraGrow [WWM+24] uses the gap between training and validation accuracy as the signal to trigger growth.

  • LipGrow: grow to limit gains in the Lipschitz constant [DLLS20].

FraGrow, although heuristic, performs well across a wide range of datasets. Nevertheless, periodic growth remains a simple and widely used baseline, and it outperforms convergent growth.
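These triggers can be expressed as simple predicates over training statistics. The following is a minimal sketch, assuming an epoch-level training loop; the function names, window sizes, and thresholds are illustrative and are not taken from the cited papers:

```python
def periodic_trigger(epoch: int, K: int) -> bool:
    """Periodic growth: grow every K epochs."""
    return epoch > 0 and epoch % K == 0


def convergent_trigger(val_acc_history: list[float], K: int, tau: float) -> bool:
    """Convergent growth: grow once the gain in validation accuracy
    over the last K epochs falls below tau."""
    if len(val_acc_history) <= K:
        return False  # not enough history to measure progress yet
    return val_acc_history[-1] - val_acc_history[-1 - K] < tau


def fragrow_trigger(train_acc: float, val_acc: float, gap_threshold: float) -> bool:
    """FraGrow-style trigger (simplified): treat a large train/validation
    accuracy gap as an overfitting signal; the threshold is hypothetical."""
    return (train_acc - val_acc) > gap_threshold
```

In practice FraGrow's policy is richer than this single-threshold simplification, but the sketch captures the shared structure: each schedule is a cheap per-epoch test that decides whether to insert new layers now or keep training the current sub-network.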

It is currently unclear how many of these observations generalise beyond layer addition to give a general answer to when to grow. Residual networks, the focus of much of the literature, have unusual properties: a network with \(n\) residual connections contains \(2^n\) implicit paths, giving rise to ensemble-like behaviour, whereby removing any individual layer (apart from downsampling layers) has a negligible impact on test accuracy [VWB16]. Working in reverse, we might expect that growing residual layers shares some similarities with adding ensemble members.
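The \(2^n\) path count follows from unrolling each residual block \(x \mapsto x + f_i(x)\) into a binary choice between the skip branch and the residual branch, as in [VWB16]. A toy enumeration makes this concrete (the function name is illustrative):

```python
from itertools import product


def enumerate_paths(n_blocks: int) -> list[tuple[str, ...]]:
    """List every implicit path through n residual blocks.

    Each block x -> x + f(x) contributes a binary choice: follow the
    identity skip connection or the residual branch. The n choices are
    independent, so n blocks induce 2**n distinct paths.
    """
    return list(product(("skip", "residual"), repeat=n_blocks))
```

For three residual blocks this yields eight paths, ranging from the all-skip path (the identity) to the all-residual path that traverses every block, which is the ensemble-like structure [VWB16] identify.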

References

[CC24]

Paul Caillon and Christophe Cerisara. Growing Neural Networks have Flat Optima and Generalize Better. September 2024. URL: https://hal.science/hal-04697428.

[DLLS20]

Chengyu Dong, Liyuan Liu, Zichao Li, and Jingbo Shang. Towards Adaptive Residual Network Training: A Neural-ODE Perspective. In ICML. 2020.

[VWB16]

Andreas Veit, Michael Wilber, and Serge Belongie. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. October 2016. arXiv:1605.06431. URL: http://arxiv.org/abs/1605.06431, doi:10.48550/arXiv.1605.06431.

[WYCL20]

Wei Wen, Feng Yan, Yiran Chen, and Hai Li. AutoGrow: Automatic Layer Growing in Deep Convolutional Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, 833–841. New York, NY, USA, August 2020. Association for Computing Machinery. URL: https://dl.acm.org/doi/10.1145/3394486.3403126, doi:10.1145/3394486.3403126.

[WWM+24]

Haihang Wu, Wei Wang, Tamasha Malepathirana, Damith Senanayake, Denny Oetomo, and Saman Halgamuge. When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks. In AAAI, volume 38, 5994–6002. 2024. URL: https://ojs.aaai.org/index.php/AAAI/article/view/28414, doi:10.1609/aaai.v38i6.28414.