Neuron addition problem¶
We consider a feedforward neural network with layers indexed by \(l\). The size of layer \(l\) is \(C_l\), and \(n\) denotes the number of data points. The forward pass through consecutive layers is:

\[\boldsymbol{z}^{(l)} = \boldsymbol{W}^{(l)} \boldsymbol{h}^{(l-1)}, \qquad \boldsymbol{h}^{(l)} = \sigma\left(\boldsymbol{z}^{(l)}\right),\]
where \(\boldsymbol{z}^{(l)}: \mathbb{R}^{C_0} \to \mathbb{R}^{C_l}\) and \(\boldsymbol{h}^{(l)}: \mathbb{R}^{C_0} \to \mathbb{R}^{C_l}\) denote the functions giving the pre- and post-activations at layer \(l\), respectively. For \(n\) batched samples, we stack activations row-wise: \(\boldsymbol{H}^{(l)} \in \mathbb{R}^{n \times C_l}\). The (negative) gradient of the loss with respect to the pre-activations is stacked similarly: \(\boldsymbol{G}^{(l)} \in \mathbb{R}^{n \times C_{l}}\). Without loss of generality, we omit the index \(l\) when we consider an object associated with the current layer \(l\).
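As a concrete illustration, here is a minimal NumPy sketch of this batched forward pass. All sizes are hypothetical, and ReLU stands in for the activation \(\sigma\); the code is illustrative, not taken from any specific growth method.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical layer sizes (C_0, C_1, C_2) and batch size n.
n, C = 5, [4, 8, 3]
W = [rng.standard_normal((C[l], C[l - 1])) for l in range(1, len(C))]

# Stack activations row-wise: H[l] has shape (n, C_l).
H = [rng.standard_normal((n, C[0]))]   # H^{(0)}: the n input samples
Z = [None]                             # no pre-activation at layer 0
for Wl in W:
    Z.append(H[-1] @ Wl.T)             # Z^{(l)} = H^{(l-1)} W^{(l)T}, shape (n, C_l)
    H.append(relu(Z[-1]))              # H^{(l)} = sigma(Z^{(l)})

assert Z[1].shape == (n, C[1]) and H[2].shape == (n, C[2])
```

In a training setting, \(\boldsymbol{G}^{(l)}\) would be obtained from backpropagation with the same \((n, C_l)\) shape as `Z[l]`.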
Illustration of neuron addition in a growing network.¶
Neuron addition. We aim to expand layer \(l-1\) by adding \(k\) new neurons. This is done by adding fan-in weights \(\boldsymbol{\Psi}\in \mathbb{R}^{k \times C_{l-2}}\) and fan-out weights \(\boldsymbol{\Omega}\in \mathbb{R}^{C_l \times k}\):

\[\boldsymbol{W}^{(l-1)} \leftarrow \begin{pmatrix} \boldsymbol{W}^{(l-1)} \\ \boldsymbol{\Psi} \end{pmatrix}, \qquad \boldsymbol{W}^{(l)} \leftarrow \begin{pmatrix} \boldsymbol{W}^{(l)} & \boldsymbol{\Omega} \end{pmatrix},\]
where \(\delta_z(x) = \boldsymbol{\Omega}\,\sigma\!\left(\boldsymbol{\Psi}\,\boldsymbol{h}^{(l-2)}(x)\right)\). Hence, the change to the pre-activation at layer \(l\) is \(\boldsymbol{z}^{(l)} \leftarrow \boldsymbol{z}^{(l)} + \delta_z\), as shown in Fig. [fig:neuron-addition].
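The expansion can be checked numerically. The following NumPy sketch (hypothetical sizes, ReLU activation) stacks \(\boldsymbol{\Psi}\) under the fan-in weights of layer \(l-1\), appends \(\boldsymbol{\Omega}\) to the fan-out weights of layer \(l\), and verifies that the grown pre-activation equals the old one plus the batched \(\delta_z\):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical sizes: grow layer l-1 by k neurons.
n, C_lm2, C_lm1, C_l, k = 6, 4, 5, 3, 2
H_lm2 = rng.standard_normal((n, C_lm2))          # H^{(l-2)}, batched inputs to layer l-1
W_lm1 = rng.standard_normal((C_lm1, C_lm2))      # fan-in weights of layer l-1
W_l   = rng.standard_normal((C_l, C_lm1))        # fan-out weights into layer l

Psi   = rng.standard_normal((k, C_lm2))          # new fan-in weights
Omega = rng.standard_normal((C_l, k))            # new fan-out weights

# Grown network: stack Psi under W^{(l-1)}, append Omega to W^{(l)}.
W_lm1_new = np.vstack([W_lm1, Psi])              # shape (C_lm1 + k, C_lm2)
W_l_new   = np.hstack([W_l, Omega])              # shape (C_l, C_lm1 + k)

Z_l_old = relu(H_lm2 @ W_lm1.T) @ W_l.T          # pre-activation before growth
Z_l_new = relu(H_lm2 @ W_lm1_new.T) @ W_l_new.T  # pre-activation after growth
delta_Z = relu(H_lm2 @ Psi.T) @ Omega.T          # delta_z batched over n samples

assert np.allclose(Z_l_new, Z_l_old + delta_Z)
```

The final assertion holds because the activation acts element-wise, so the new neurons' contribution separates additively from the existing ones.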
Theoretical Perspectives¶
Gradient boosting provides a general framework for constructing additive models by iteratively adding weak learners to minimize a given loss function [Fri01]. The weak learners are chosen from a predefined set of functions \(\mathcal{H}\). From an optimization perspective, each boosting iteration can be interpreted as a greedy descent step in the functional space spanned by \(\mathcal{H}\). The output function \(F\) is built as a weighted sum of weak learners:

\[F_M(x) = \sum_{m=1}^{M} \omega_m\, \gamma_m(x),\]
where \(\gamma_m\) are the weak learners and \(\omega_m\) their corresponding weights. The goal is to minimize a loss function \(\mathcal{L}(y, F(x))\) over the training data. At iteration \(m\), the weak learner \(\gamma_m\) and its weight \(\omega_m\) are chosen to minimize the loss:

\[(\gamma_m, \omega_m) = \operatorname*{arg\,min}_{\gamma \in \mathcal{H},\; \omega \in \mathbb{R}} \; \sum_{i=1}^{n} \mathcal{L}\left(y_i,\, F_{m-1}(x_i) + \omega\, \gamma(x_i)\right),\]
where \(F_{m-1}\) is the current model. Boosting algorithms typically select the weak learner that is maximally aligned with the negative gradient of the loss with respect to the current model’s output.
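For concreteness, here is a minimal sketch of this greedy procedure for the squared loss, using decision stumps as a hypothetical weak-learner set \(\mathcal{H}\): each iteration picks the stump most aligned with the negative gradient (which, for squared loss, is the residual) and sets its weight by a closed-form line search. All details (data, stump grid, iteration count) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (hypothetical example, squared loss).
x = np.linspace(-1, 1, 200)
y = np.sin(3 * x)

def stump(threshold):
    """Weak learner: a decision stump returning -1 or +1."""
    return lambda x: np.where(x < threshold, -1.0, 1.0)

thresholds = np.linspace(-1, 1, 41)
F = np.zeros_like(y)                        # F_0 = 0
for m in range(50):
    g = y - F                               # negative gradient of 1/2 (y - F)^2
    # Select the stump maximally aligned (correlated) with the negative gradient.
    gamma = max((stump(t) for t in thresholds),
                key=lambda h: abs(np.dot(h(x), g)))
    hx = gamma(x)
    omega = np.dot(hx, g) / np.dot(hx, hx)  # closed-form line search for squared loss
    F = F + omega * hx                      # F_m = F_{m-1} + omega_m * gamma_m

print(np.mean((y - F) ** 2))                # training loss shrinks with m
```

The alignment criterion in the selection step is exactly the "maximally correlated with the negative gradient" rule described above.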
In the context of neural network growth, adding a neuron at layer \(l-1\) induces a functional perturbation \(\delta_z\) of the pre-activation at layer \(l\), which can be interpreted as a boosting step, though over a continuous set of possible weak learners. Unlike classical boosting methods, except when growth occurs in the last hidden layer, the functional step is added to an intermediate representation. As in classical boosting, it can be chosen to correlate maximally with the negative gradient of the loss with respect to the pre-activation at layer \(l\), a quantity that is available through backpropagation.
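A crude way to realize this selection is random search: sample candidate \((\boldsymbol{\Psi}, \boldsymbol{\Omega})\) pairs and keep the one whose batched perturbation \(\delta_Z\) best aligns with \(\boldsymbol{G}^{(l)}\). This NumPy sketch is purely illustrative; practical methods such as TINY solve for candidates far more efficiently than random sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical sizes: one candidate neuron (k = 1) at a time.
n, C_lm2, C_l, k = 8, 4, 3, 1
H_lm2 = rng.standard_normal((n, C_lm2))   # post-activations at layer l-2
G = rng.standard_normal((n, C_l))         # neg. gradient wrt Z^{(l)} (from backprop)

def delta_Z(Psi, Omega):
    """Batched perturbation of the layer-l pre-activation, shape (n, C_l)."""
    return relu(H_lm2 @ Psi.T) @ Omega.T

candidates = [(rng.standard_normal((k, C_lm2)), rng.standard_normal((C_l, k)))
              for _ in range(100)]

def alignment(cand):
    dZ = delta_Z(*cand)
    # Normalized inner product <dZ, G>: scale-invariant alignment score.
    return np.sum(dZ * G) / (np.linalg.norm(dZ) + 1e-12)

best = max(candidates, key=alignment)     # candidate most aligned with G
```

Since \(\boldsymbol{G}\) is the negative gradient, the best-aligned \(\delta_Z\) is, to first order, the perturbation that decreases the loss fastest.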
Finding the optimal weights \(\boldsymbol{\Omega}\) and \(\boldsymbol{\Psi}\) for the newly added neurons to match a desired functional variation of the pre-activation at layer \(l\) leads to a non-convex optimization problem that is NP-hard in general (see [Bac17, MR18]). Exact methods are thus unsuitable for large-scale settings [LWW19]. Moreover, since the targeted functional variation is typically derived from a first-order approximation of the loss, solving the problem exactly may require adding more neurons than necessary and offers little benefit in practice. Most practical approaches therefore rely on heuristics to add new neurons. Theoretical guarantees nevertheless exist for certain methods: TINY [VRCC24] provides convergence guarantees, while splitting methods [LWW19, WYL+21] achieve locally optimal, function-preserving transformations within the class of splitting morphisms.
Classification of methods. We categorize neuron addition methods by the goal of their initialization strategy:
Purely function-preserving: the network output is unchanged after growth (but not its gradient!);
Training dynamics-based: new weights are initialized to optimize the training dynamics;
Function geometry-based: new weights optimize a local objective such as the gradient norm or the loss decrease; these methods can be function-preserving or function-improving.
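As an example of the first category, setting the fan-out weights \(\boldsymbol{\Omega}\) to zero makes \(\delta_z = 0\), so the grown network computes exactly the same function while the new fan-in weights \(\boldsymbol{\Psi}\) can be chosen freely. This zero-fan-out scheme is one common function-preserving choice; the sketch below uses hypothetical sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical sizes: grow layer l-1 by k neurons, function-preserving style.
n, C_lm2, C_lm1, C_l, k = 6, 4, 5, 3, 2
H_lm2 = rng.standard_normal((n, C_lm2))
W_lm1 = rng.standard_normal((C_lm1, C_lm2))
W_l   = rng.standard_normal((C_l, C_lm1))

Psi   = rng.standard_normal((k, C_lm2))   # arbitrary new fan-in weights
Omega = np.zeros((C_l, k))                # zero fan-out => delta_z = 0

Z_old = relu(H_lm2 @ W_lm1.T) @ W_l.T
Z_new = relu(H_lm2 @ np.vstack([W_lm1, Psi]).T) @ np.hstack([W_l, Omega]).T

assert np.allclose(Z_new, Z_old)          # network output unchanged after growth
```

Note that while the function is preserved, the gradients are not: \(\boldsymbol{\Omega}\) receives a nonzero gradient at the next update, so the new neurons start contributing as soon as training resumes.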
References¶
Francis Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. JMLR, 18(19):1–53, 2017. URL: http://jmlr.org/papers/v18/14-546.html.
Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.
Qiang Liu, Lemeng Wu, and Dilin Wang. Splitting Steepest Descent for Growing Neural Architectures. In NeurIPS. 2019. arXiv:1910.02366. URL: http://arxiv.org/abs/1910.02366, doi:10.48550/arXiv.1910.02366.
Pasin Manurangsi and Daniel Reichman. The Computational Complexity of Training ReLU(s). 2018. arXiv:1810.04207. URL: https://arxiv.org/abs/1810.04207.
Manon Verbockhaven, Théo Rudkiewicz, Sylvain Chevallier, and Guillaume Charpiat. Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally. TMLR, July 2024.
Lemeng Wu, Mao Ye, Qi Lei, Jason D. Lee, and Qiang Liu. Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting. 2021. arXiv:2003.10392. URL: http://arxiv.org/abs/2003.10392, doi:10.48550/arXiv.2003.10392.