Introduction

Neural Networks (NNs) are typically trained by first fixing the architecture \(A \in \mathcal{A}\), after which the parameters \(\theta \in \Theta_A\) are optimized to minimize a training objective. The architectural choice is crucial: it defines the function class explored during training, and architectural innovation has been a workhorse of deep learning [HZRS16, VSP+17]. Despite this, architecture search is still typically performed manually, requiring significant expertise and wasting compute on repeated retraining.

Ideally, we would like to jointly optimize over the architecture space \(\mathcal{A}\) and its space of parameters \(\Theta_A\) by solving

\[\begin{aligned} \label{eqn:ideal_obj} A^*, \theta^* = \mathop{\mathrm{\arg\!\min}}_{A\in \mathcal{A},\; \theta \in \Theta_A} \mathcal{L}(f_{A,\theta}) \end{aligned}\]

where \(f_{A,\theta}\) denotes the function induced by architecture \(A\) with parameters \(\theta\), and \(\mathcal{L}\) is the empirical risk over dataset \(\mathcal{D}\). The closest approach to this objective is Neural Architecture Search (NAS) [ZL17]. However, a full NAS loop is often prohibitively expensive, requiring multiple retrainings, and also ignores a key constraint: we frequently start from a pre-trained model that we would like to adapt rather than discard.
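To make the nested structure of this objective concrete, the sketch below solves it by brute force on a toy regression task: for each architecture in a small space \(\mathcal{A}\), the inner problem over \(\theta\) is solved, and the outer \(\arg\min\) is taken over the results. The random-feature model, the five-element architecture space, and the function names are illustrative assumptions, not part of any actual NAS method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: regress a 1-D sine wave.
X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X).ravel()

def train_and_evaluate(width):
    """Inner optimisation over theta for a fixed architecture A (here, the
    width of one hidden layer of random tanh features); the output layer is
    fit in closed form by ridge regression."""
    W = rng.normal(size=(1, width))   # fixed random first layer
    b = rng.normal(size=width)
    H = np.tanh(X @ W + b)
    theta = np.linalg.solve(H.T @ H + 1e-3 * np.eye(width), H.T @ y)
    return np.mean((H @ theta - y) ** 2)  # empirical risk L(f_{A, theta})

# Naive outer search over a tiny architecture space A = {2, 4, 8, 16, 32}:
# solve the inner problem for each A, then take the arg min.
space = [2, 4, 8, 16, 32]
risk_of = {A: train_and_evaluate(A) for A in space}
A_star = min(risk_of, key=risk_of.get)
```

Even on this toy problem, the cost of the exhaustive outer loop scales with the size of \(\mathcal{A}\) and each candidate is trained from scratch, which is precisely what makes full NAS expensive.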

This motivates growing Neural Network architectures: starting with a small “seed” architecture and expanding its capacity during training by applying local architecture transformations \(\mathcal{T}\) (also known as network morphisms), such as widening existing layers or adding new ones, and appropriately adapting the existing weights. Concretely, let \(f_{A_t, \theta_t}\) denote the model at growth step \(t\), and let \(\mathop{\mathrm{\text{Opt}}}_{\theta} (\text{goal}(\theta), \theta_{\text{init}})\) denote a few steps of, e.g., stochastic gradient descent that optimize \(\text{goal}(\theta)\) over \(\theta\) starting from \(\theta_{\text{init}}\). We alternate between the optimization step and the growth operator \(\mathcal{T}\):

\[\begin{split}\label{eqn:grow_decomposition} \begin{aligned} &\theta_t' = \mathop{\mathrm{\text{Opt}}}_{\theta}(\mathcal{L}(f_{A_t, \theta}), \theta_t )\\ &A_{t+1}, \theta_{t+1} = \mathcal{T}(A_t, \theta_{t}') \end{aligned}\end{split}\]

in the hope that the final architecture \(A_T\) and weights \(\theta_T\) are a good approximation to the original objective [eqn:ideal_obj]. The growth operator \(\mathcal{T}\) is typically constrained to a neighbourhood \(\mathcal{N}(A_t)\) of architectures, making local architecture modifications (e.g., adding neurons or layers).
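The alternation of Equation [eqn:grow_decomposition] can be sketched on the same toy sine-regression task. Here \(\mathcal{T}\) widens a one-hidden-layer network by a few neurons with zero output weights (so the growth step itself leaves the represented function unchanged), and \(\mathop{\mathrm{\text{Opt}}}\) performs a few gradient-descent steps on the output weights only, as a stand-in for SGD over all of \(\theta\); the specific widening rule, step counts, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X).ravel()

def risk(W, b, v):
    """Empirical risk L(f_{A, theta}) of a one-hidden-layer tanh network."""
    return np.mean((np.tanh(X @ W + b) @ v - y) ** 2)

def opt(W, b, v, steps=200):
    """Opt: a few gradient-descent steps from the given initialisation,
    here on the output weights only (a stand-in for SGD on all of theta)."""
    H = np.tanh(X @ W + b)
    lr = 0.5 / np.linalg.eigvalsh(H.T @ H / len(y))[-1]  # stable step size
    for _ in range(steps):
        v = v - lr * 2 * H.T @ (H @ v - y) / len(y)
    return W, b, v

def grow(W, b, v, k=4):
    """T: widen the hidden layer by k neurons.  New input weights are random,
    new output weights are zero, so the represented function (and hence the
    risk) is unchanged by the growth step itself."""
    W = np.concatenate([W, rng.normal(size=(1, k))], axis=1)
    b = np.concatenate([b, rng.normal(size=k)])
    v = np.concatenate([v, np.zeros(k)])
    return W, b, v

# Seed architecture of width 2; alternate Opt and T for T = 5 growth steps.
W, b, v = rng.normal(size=(1, 2)), rng.normal(size=2), np.zeros(2)
risks = []
for t in range(5):
    W, b, v = opt(W, b, v)      # theta_t' = Opt(L(f_{A_t, theta}), theta_t)
    risks.append(risk(W, b, v))
    W, b, v = grow(W, b, v)     # A_{t+1}, theta_{t+1} = T(A_t, theta_t')
```

Because the growth step is function-preserving and the step size is chosen conservatively, the recorded risks are non-increasing across growth steps: each new round of optimisation starts from a model at least as good as the last.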

In practice, the behaviour of \(\mathop{\mathrm{\text{Opt}}}\) depends heavily on the initialisation \(\theta_{t+1}\) of the transformed architecture, making the choice of new weights key to good performance. The lottery ticket phenomenon [CCW+21] highlights that the particular initialization and training path can matter as much as the final architecture, suggesting that growth methods, which leverage a fixed set of initial weights, have the potential to outperform NAS-like methods, which ignore this.

The abstract formulation of Equation [eqn:grow_decomposition] leaves much of the growing problem unspecified, which is often summarised as where to grow, when to grow, and how to grow. The focus in the literature has been overwhelmingly on the last question, which we term the neuron addition problem: how to best choose the new parameters in the case of neuron addition.
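One classical answer to the neuron addition problem, shown here purely as a sketch of what a function-preserving choice of new parameters looks like, is the Net2WiderNet splitting rule: duplicate an existing hidden unit (copying its incoming weights and bias) and halve the outgoing weights of both copies, so the widened network computes exactly the same function. The small one-hidden-layer network below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# A one-hidden-layer network: f(x) = tanh(x @ W1 + b1) @ W2
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2 = rng.normal(size=(4, 2))

def forward(x, W1, b1, W2):
    return np.tanh(x @ W1 + b1) @ W2

def widen_by_splitting(W1, b1, W2, j):
    """Add one neuron by duplicating hidden unit j and halving the outgoing
    weights of both copies (Net2WiderNet-style splitting), so the widened
    network represents exactly the same function."""
    W1n = np.concatenate([W1, W1[:, j:j + 1]], axis=1)  # copy incoming weights
    b1n = np.concatenate([b1, b1[j:j + 1]])             # copy bias
    W2n = np.concatenate([W2, W2[j:j + 1, :]], axis=0)  # copy outgoing weights
    W2n[j, :] /= 2.0                                    # ...then split them
    W2n[-1, :] /= 2.0                                   # between both copies
    return W1n, b1n, W2n

x = rng.normal(size=(5, 3))
W1n, b1n, W2n = widen_by_splitting(W1, b1, W2, j=2)
assert np.allclose(forward(x, W1, b1, W2), forward(x, W1n, b1n, W2n))
```

Function preservation is only one possible criterion; much of the literature surveyed later instead chooses new parameters to maximise the subsequent decrease in loss.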

Motivations and applications

Motivations for growing neural networks broadly fall into two settings. In the first, the end point is known: growth is a training strategy for reaching a predefined target architecture \(A_T\), for example in continual learning [LZW+19, YLZF21, YYLH18], where the interest lies primarily in the intermediate sequence of models \(A_t, \theta_t\), or for improved optimisation dynamics [EMU+22, YSM23].

In contrast, we focus on the second setting, where the end point is unknown: growth is used as a frugal form of architecture search, aiming to discover an architecture that is “just large enough” for the task at hand. Such frugal learning is becoming ever more important as state-of-the-art performance increasingly relies on scaling model size and compute, with energy consumption and \(\mathrm{CO_2}\) emissions growing exponentially and outpacing improvements in hardware [MLN25, TGLM21]. Growing neural networks is often compared to other computation-reduction methods, such as compression, pruning, and data scaling; the relative advantages of each from an energy-efficiency perspective are not yet well understood [BBP+23]. We also discuss ways in which growing can complement these methods.

Overall, this survey is the first methodological overview of growing architectures during training. Methodological contributions on growing neural architectures are scattered across the literature, proposed in different communities with diverse objectives. This paper shows that the diverse methods for neuron addition can be unified under a common optimization objective, which forms a foundational building block of growth. Beyond neuron addition, we also consider extensions to layer addition and general computation graphs, as well as to non-stationary data distributions and transformer architectures. In contrast, prior surveys target either sparsity and pruning in neural networks [HABN+21], dynamic architectures for inference [HHS+21], or comparative studies restricted to transformers [PGQS24].

References

[BBP+23]

Anais Boumendil, Walid Bechkit, Pierre-Edouard Portier, Frédéric Le Mouël, and Malcolm Egan. Grow, prune or select data: which technique allows the most energy-efficient neural network training? In ICTAI. Atlanta, USA, 2023. URL: https://ieeexplore.ieee.org/document/10356560/, doi:10.1109/ICTAI59109.2023.00051.

[CCW+21]

Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Jingjing Liu, and Zhangyang Wang. The Elastic Lottery Ticket Hypothesis. In NeurIPS. 2021. arXiv:2103.16547. URL: http://arxiv.org/abs/2103.16547, doi:10.48550/arXiv.2103.16547.

[EMU+22]

Utku Evci, Bart van Merrienboer, Thomas Unterthiner, Fabian Pedregosa, and Max Vladymyrov. GradMax: Growing Neural Networks using Gradient Information. In ICLR. 2022. URL: https://openreview.net/forum?id=qjN4h_wwUO.

[HHS+21]

Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: a survey. TPAMI, 44(11):7436–7456, 2021.

[HZRS16]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR. December 2016. arXiv:1512.03385. URL: http://arxiv.org/abs/1512.03385, doi:10.48550/arXiv.1512.03385.

[HABN+21]

Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. JMLR, 22(241):1–124, 2021.

[LZW+19]

Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting. In ICML. 2019. arXiv:1904.00310. URL: http://arxiv.org/abs/1904.00310.

[MLN25]

Clément Morand, Anne-Laure Ligozat, and Aurélie Névéol. The Environmental Impacts of Machine Learning Training Keep Rising Evidencing Rebound Effect. October 2025. arXiv:2510.09022. URL: http://arxiv.org/abs/2510.09022, doi:10.48550/arXiv.2510.09022.

[PGQS24]

Dhroov Pandey, Jonah Ghebremichael, Zongqing Qi, and Tong Shu. A comparative survey: reusing small pre-trained models for efficient large model training. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 56–63. IEEE, 2024.

[TGLM21]

Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso. Deep Learning's Diminishing Returns: The Cost of Improvement is Becoming Unsustainable. IEEE Spectrum, 58(10):50–55, October 2021. URL: https://ieeexplore.ieee.org/document/9563954, doi:10.1109/MSPEC.2021.9563954.

[VSP+17]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In NeurIPS. 2017. URL: http://arxiv.org/abs/1706.03762, doi:10.48550/arXiv.1706.03762.

[YLZF21]

Li Yang, Sen Lin, Junshan Zhang, and Deliang Fan. GROWN: GRow Only When Necessary for Continual Learning. 2021. arXiv:2110.00908. URL: http://arxiv.org/abs/2110.00908, doi:10.48550/arXiv.2110.00908.

[YYLH18]

Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong Learning with Dynamically Expandable Networks. In ICLR. 2018. arXiv:1708.01547. URL: http://arxiv.org/abs/1708.01547.

[YSM23]

Xin Yuan, Pedro Savarese, and Michael Maire. Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation. In NeurIPS. December 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/359ffa88712bd688963a0ca641d8330b-Abstract-Conference.html.

[ZL17]

Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. In ICLR. 2017. URL: http://arxiv.org/abs/1611.01578, doi:10.48550/arXiv.1611.01578.