When to grow?¶
Much of the focus of the depth-growing literature has been on when to grow. Counter-intuitively, waiting for the current network to fully converge harms performance; new layers should instead be grown well before convergence [DLLS20, WYCL20, WWM+24]. Two explanations are commonly suggested. First, the converged weights of the current sub-network \(\boldsymbol{W}_t\) may provide a poor warm-start initialisation for optimising the larger network. Second, the newly added sub-networks may simply be undertrained [WWM+24], giving rise to a regularising effect and possibly finding flatter minima than standard training [CC24]. The key growth schedules proposed are:
Periodic Growth: grow every \(K\) epochs.
Convergent Growth: grow when the increase in validation accuracy is less than \(\tau\) in the last \(K\) epochs.
FRAGrow: arguing that the speed of growth determines the degree of under- or overfitting, FRAGrow [WWM+24] uses the gap between training and validation accuracy as a signal to trigger growth.
LipGrow: grow to limit gains in the Lipschitz constant [DLLS20].
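The schedules above can be sketched as boolean trigger functions evaluated once per epoch. This is a minimal sketch: the function names, the default thresholds, and the exact direction of the FRAGrow rule are illustrative assumptions here, not the papers' specifications. LipGrow is omitted because it additionally requires an estimate of the network's Lipschitz constant.

```python
def periodic_trigger(epoch: int, K: int = 10) -> bool:
    """Periodic Growth: grow every K epochs."""
    return epoch > 0 and epoch % K == 0


def convergent_trigger(val_acc_history: list, K: int = 5, tau: float = 0.01) -> bool:
    """Convergent Growth: grow once validation accuracy has improved
    by less than tau over the last K epochs."""
    if len(val_acc_history) <= K:
        return False  # not enough history to assess convergence yet
    return val_acc_history[-1] - val_acc_history[-1 - K] < tau


def fragrow_trigger(train_acc: float, val_acc: float, gap_threshold: float = 0.05) -> bool:
    """FRAGrow-style rule (a simplifying assumption): treat the train/validation
    accuracy gap as a fitting-risk signal and trigger growth when it exceeds a
    threshold; the actual policy in [WWM+24] is more involved."""
    return (train_acc - val_acc) > gap_threshold
```

In practice a training loop would check the chosen trigger at the end of each epoch and, when it fires, insert and initialise the new layers before resuming optimisation.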
FRAGrow, although heuristic, performs well across a wide range of datasets. Periodic growth nonetheless remains a simple and widely used baseline, and itself outperforms convergent growth.
It is currently unclear how many of these observations generalise beyond layer addition to give a general answer to when to grow. Residual networks, the focus of much of the literature, have unusual properties. A network with \(n\) residual connections has \(2^n\) implicit paths through the network, giving rise to ensemble-like behaviour: removing any individual layer (apart from downsampling layers) has a negligible impact on test accuracy [VWB16]. Working in reverse, we might expect growing residual layers to share some similarities with adding ensemble members.
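The \(2^n\) path count can be made concrete with a short enumeration: expanding the composition \((I + f_n) \circ \dots \circ (I + f_1)\), each block contributes either its identity branch or its residual branch to every path. This is a sketch for illustration; `residual_paths` is a name introduced here, not from the literature.

```python
from itertools import product


def residual_paths(n: int) -> list:
    """Enumerate the implicit paths through a chain of n residual blocks:
    each block contributes either its identity branch ('.') or its
    residual branch ('fi') to a given path."""
    paths = []
    for choice in product((False, True), repeat=n):
        paths.append(''.join(f'f{i}' if use else '.'
                             for i, use in enumerate(choice, start=1)))
    return paths


# A chain of 3 residual blocks yields 2**3 = 8 implicit paths,
# from the pure identity '...' to the full-depth path 'f1f2f3'.
```

Most of these paths are short, which is the basis of the ensemble interpretation in [VWB16]: deleting one block removes only the half of the paths that pass through it.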
References¶
Paul Caillon and Christophe Cerisara. Growing Neural Networks have Flat Optima and Generalize Better. September 2024. URL: https://hal.science/hal-04697428.
Chengyu Dong, Liyuan Liu, Zichao Li, and Jingbo Shang. Towards Adaptive Residual Network Training: A Neural-ODE Perspective. In ICML. 2020.
Andreas Veit, Michael Wilber, and Serge Belongie. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. October 2016. arXiv:1605.06431. URL: http://arxiv.org/abs/1605.06431, doi:10.48550/arXiv.1605.06431.
Wei Wen, Feng Yan, Yiran Chen, and Hai Li. AutoGrow: Automatic Layer Growing in Deep Convolutional Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, 833–841. New York, NY, USA, August 2020. Association for Computing Machinery. URL: https://dl.acm.org/doi/10.1145/3394486.3403126, doi:10.1145/3394486.3403126.
Haihang Wu, Wei Wang, Tamasha Malepathirana, Damith Senanayake, Denny Oetomo, and Saman Halgamuge. When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks. In AAAI, volume 38, 5994–6002. 2024. URL: https://ojs.aaai.org/index.php/AAAI/article/view/28414, doi:10.1609/aaai.v38i6.28414.