When to stop?

Most growing methods follow a predetermined schedule and provide no notion of when to stop; only a small number have an explicit (or implicit) stopping criterion. AutoGrow [WYCL20] proposes an explicit stopping policy, permanently freezing further growth of a block once an addition fails to improve performance. Splitting methods naturally terminate when a local minimum of the loss is reached [WYL+21], while grow-prune methods such as SENN [MMKM24] reach a steady state by iteratively growing and pruning.
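An AutoGrow-style freeze-on-stall policy can be sketched as follows; `train_eval`, `grow`, and the patience rule are illustrative assumptions rather than the paper's actual interface:

```python
def grow_until_stalled(train_eval, grow, init_metric, patience=1, max_steps=20):
    """Grow until additions stop improving validation performance, in the
    spirit of AutoGrow's freeze-on-stall policy. `train_eval`, `grow`, and
    the patience rule here are illustrative assumptions, not the paper's API."""
    best = init_metric
    stalls = 0
    for _ in range(max_steps):
        candidate = grow()               # propose an enlarged architecture
        metric = train_eval(candidate)   # train briefly, then evaluate
        if metric > best:                # improvement: accept, keep growing
            best, stalls = metric, 0
        else:                            # no improvement: count a stall
            stalls += 1
            if stalls >= patience:       # freeze further growth
                break
    return best

# toy check: performance saturates once the "size" reaches 3
state = {"size": 1}
best = grow_until_stalled(
    train_eval=lambda size: min(size, 3),
    grow=lambda: state.update(size=state["size"] + 1) or state["size"],
    init_metric=1,
)
```

In AutoGrow the freeze applies per block, so other blocks may continue growing after one stalls; the sketch above collapses this to a single growth dimension.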

Following the Pareto front. By considering growth operations \(\mathcal{T}\) constrained to a neighbourhood of the existing architecture, one imposes an implicit constraint on the complexity \(C(A_t)\), such as the parameter count, FLOPs, or walltime of the model, throughout the growing process. In practice, however, the objective is a constraint on the final complexity: we seek architectures on the Pareto-optimal front, which deliver the best possible performance for the given resources, and we wish to stop growing once the improvements no longer justify the additional capacity. None of the stopping methods above identifies a desired performance-complexity tradeoff. Furthermore, the observation of Sec. 3.1 that new growth should occur before the old network has converged makes it difficult to explore the Pareto front directly.
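Concretely, given a set of trained models scored as (complexity, performance) pairs, the Pareto front consists of the non-dominated points; a minimal sketch (the example values are hypothetical):

```python
def pareto_front(points):
    """Return the (complexity, performance) pairs not dominated by any other
    point: (c2, p2) dominates (c, p) if it is no more complex and no worse,
    and strictly better in at least one of the two. Naive O(n^2) sketch."""
    return sorted(
        (c, p) for c, p in points
        if not any(
            c2 <= c and p2 >= p and (c2 < c or p2 > p)
            for c2, p2 in points
        )
    )

# hypothetical (complexity, accuracy) measurements for five models
models = [(1, 0.50), (2, 0.70), (2, 0.60), (3, 0.70), (4, 0.90)]
front = pareto_front(models)   # only the best tradeoff at each complexity
```

The difficulty raised above is that a growing run produces intermediate networks that are not yet converged, so their measured performance understates the point they would occupy on this front.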

Notions of complexity. What is the most appropriate notion of complexity \(C(A, \theta)\)? A reduction in the FLOPs of smaller models does not necessarily translate into a significant reduction in training time. As a point of comparison, Variance Transfer [YSM23], whose growing method requires negligible additional overhead, observes a mere 1.2x speedup when growing a \(1/4\) channel-width ResNet-18 on CIFAR-100. For this reason, they propose adapting the batch size to maximise GPU memory usage, achieving a 1.7x speedup with a negligible change in final accuracy. In general, however, large-batch training often results in poor generalisation performance [KMN+17] and is itself a topic of research. The largest-scale evidence for walltime efficiency comes from transformer growth experiments (Sec. 6), which demonstrate significant speedups but also reveal that efficiency gains depend on architecture and scale in ways that current theory does not predict.
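The FLOPs-walltime gap is easy to quantify with the standard multiply-accumulate count for a convolution (layer dimensions below are illustrative, not taken from the cited experiments): quartering the channel width cuts per-layer FLOPs by roughly 16x, yet the observed training speedup is only ~1.2x, largely because narrow layers underutilise the accelerator.

```python
def conv2d_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates of one stride-1, 'same'-padded k*k convolution
    over an h*w feature map (a standard counting convention; biases ignored)."""
    return c_in * c_out * k * k * h * w

full    = conv2d_macs(64, 64, 3, 32, 32)   # a full-width layer
quarter = conv2d_macs(16, 16, 3, 32, 32)   # the same layer at 1/4 width
ratio = full / quarter                      # ~16x fewer FLOPs per layer
```

Since both input and output channels shrink by 4x, the product shrinks by 16x, which is precisely why FLOPs alone is a misleading proxy for \(C(A, \theta)\) when the real budget is walltime.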

Supernetworks. Apart from growing architectures, building a supernetwork that encompasses many operational sub-models in a nested manner is an alternative approach to recovering the entire Pareto front, at the cost of a single training procedure (in this case, joint training of all sub-models). To train a supernetwork, one can either impose a sub-modular structure through joint loss minimization, where each loss corresponds to the error of a given sub-model [DKK+24, TMK16, YH19, YYX+18], or first train a large neural network and then distill its predictive power through fine-tuning smaller sub-modules [CGW+20]. Depending on the method, the nested structure can be obtained by varying the width of representations [DKK+24, YH19, YYX+18], the depth of the neural network [TMK16], or both width and depth [CGW+20]. In general, increasing the number of sub-modules provides greater flexibility in selecting a model that trades off model size against performance.
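The nested-width structure can be illustrated with a single shared weight matrix from which each sub-model takes a prefix of the output units, in the style of slimmable networks; this is a minimal one-layer sketch, whereas the cited methods slice every layer and train the sub-models jointly:

```python
import numpy as np

def sub_forward(W, x, width):
    """One-layer forward pass of the width-`width` sub-model obtained by
    slicing the first `width` output units of the shared weight matrix W
    (a minimal slimmable-style sketch; real methods slice every layer)."""
    return np.maximum(W[:width] @ x, 0.0)   # ReLU on the sliced weights

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))             # full model: 8 hidden units
x = rng.standard_normal(4)

outputs = {w: sub_forward(W, x, w) for w in (2, 4, 8)}
# nesting: each narrower output is a prefix of every wider one
nested = np.allclose(outputs[8][:4], outputs[4])
```

Joint training would then sum the losses of the sub-models at each sampled width, so that a single set of shared parameters serves every point on the size-performance tradeoff.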

References

[CGW+20]

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: train one network and specialize it for efficient deployment. In ICLR. 2020.

[DKK+24]

Fnu Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hanna Hajishirzi, Sham Kakade, Ali Farhadi, and others. MatFormer: nested transformer for elastic inference. NeurIPS, 2024.

[KMN+17]

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. February 2017. arXiv:1609.04836. URL: http://arxiv.org/abs/1609.04836, doi:10.48550/arXiv.1609.04836.

[MMKM24]

Rupert Mitchell, Robin Menzenbach, Kristian Kersting, and Martin Mundt. Self-Expanding Neural Networks. 2024. arXiv:2307.04526. URL: http://arxiv.org/abs/2307.04526, doi:10.48550/arXiv.2307.04526.

[TMK16]

Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. BranchyNet: fast inference via early exiting from deep neural networks. In 2016 23rd ICPR, 2464–2469. IEEE, 2016.

[WYCL20]

Wei Wen, Feng Yan, Yiran Chen, and Hai Li. AutoGrow: Automatic Layer Growing in Deep Convolutional Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, 833–841. New York, NY, USA, August 2020. Association for Computing Machinery. URL: https://dl.acm.org/doi/10.1145/3394486.3403126, doi:10.1145/3394486.3403126.

[WYL+21]

Lemeng Wu, Mao Ye, Qi Lei, Jason D. Lee, and Qiang Liu. Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting. 2021. arXiv:2003.10392. URL: http://arxiv.org/abs/2003.10392, doi:10.48550/arXiv.2003.10392.

[YH19]

Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In ICCV, 1803–1811. 2019.

[YYX+18]

Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. arXiv preprint arXiv:1812.08928, 2018.

[YSM23]

Xin Yuan, Pedro Savarese, and Michael Maire. Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation. In NeurIPS. December 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/359ffa88712bd688963a0ca641d8330b-Abstract-Conference.html.