Exploiting function geometry
============================

The methods in this section use local function information to optimize the
initialization of new neurons. We define
:math:`\boldsymbol{G}\in \mathbb{R}^{n \times C_{l}}` to be the (negative)
gradient of the loss w.r.t. pre-activations, stacked over :math:`n` samples,
and the residual gradient :math:`\boldsymbol{G}^\perp` as the gradient
component that cannot be addressed by updating the existing fan-out weights of
the increased layer; it identifies an *expressivity bottleneck*.

Gradient geometry and K-FAC
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Differentials (and thus gradients) are defined for a fixed topology, as they
give the local behaviour of functions, where locality is defined by the
topology. For classical gradient descent, the gradient is computed for a
change of parameters :math:`d\theta` that is small in Euclidean norm. For the
natural gradient, locality is instead defined on functions: parameters
:math:`\theta` and :math:`\theta'` are close if
:math:`\left|\left|f_{\theta} - f_{\theta'}\right|\right|` is small. K-FAC
:cite:p:`dangelKroneckerfactoredApproximateCurvature2025` proposes to
approximate this norm, when changing :math:`\boldsymbol{W}_l` to
:math:`\boldsymbol{W}_l+\delta \boldsymbol{W}_l`, as
:math:`\left|\left|\boldsymbol{S}^{1/2}\delta \boldsymbol{W}_l \boldsymbol{A}^{1/2}_{l-1}\right|\right|_F`,
where
:math:`\boldsymbol{A}_{l-1} := \boldsymbol{h}^{(l-1)} \left(\boldsymbol{h}^{(l-1)}\right)^\top`
is the covariance of the input of the layer and
:math:`\boldsymbol{S} := \left(\frac{\partial f}{\partial \boldsymbol{z}^{(l)}}\right)^\top \frac{\partial f}{\partial \boldsymbol{z}^{(l)}}`
is the covariance of the Jacobians of the loss with respect to the layer
output. In other words, :math:`\boldsymbol{A}_{l-1}` accounts for curvature
before layer :math:`l` and :math:`\boldsymbol{S}` for curvature after it.

Methods
~~~~~~~

The following methods all optimize
:math:`(\boldsymbol{\Psi}, \boldsymbol{\Omega})` to maximize trainability;
they differ in whether they preserve the network function at initialization,
i.e. whether they initialize with :math:`\delta_z = 0` or allow
:math:`\delta_z \neq 0`.
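As a concrete illustration of the targets these methods optimize against, the residual gradient :math:`\boldsymbol{G}^\perp` defined above can be computed as the least-squares residual of :math:`\boldsymbol{G}` against the existing activations: whatever part of the gradient lies in the span of :math:`\boldsymbol{H}^{(l-1)}` can be absorbed by updating the existing fan-out weights, and only the orthogonal remainder calls for new neurons. The following is a minimal numpy sketch under that interpretation; the shapes and the explicit projection step are illustrative assumptions, not any method's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C_prev, C_l = 64, 16, 8              # hypothetical sample count and layer widths

H = rng.standard_normal((n, C_prev))    # activations h^{(l-1)}, stacked over n samples
G = rng.standard_normal((n, C_l))       # (negative) gradient w.r.t. pre-activations z^{(l)}

# Component of G reachable by updating the existing fan-out weights:
# the least-squares projection of G onto the column space of H.
W_fit, *_ = np.linalg.lstsq(H, G, rcond=None)
G_perp = G - H @ W_fit                  # residual gradient: the expressivity bottleneck

# Sanity check: the residual is orthogonal to the existing activations,
# so no update of the existing weights can reduce it.
assert np.allclose(H.T @ G_perp, 0.0, atol=1e-8)
```

A large :math:`\|\boldsymbol{G}^\perp\|` in this sketch signals that the layer, as currently sized, cannot follow the gradient — the situation the methods below exploit when placing new neurons.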
Function-preserving geometric methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These methods initialize with :math:`\delta_z = 0`, preserving the network
function.

- [[SENN]]
- [[NORTH]]
- [[GradMax]]

Function-improving geometric methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These methods initialize with :math:`\delta_z \neq 0`, directly improving the
loss at initialization.

- [[Tiny]]
- [[NeST|nest]]

A unified framework for geometric methods
-----------------------------------------

Many of the above methods, even if they seem different at first sight, can be
unified under the same objective with different choices of target, metric, and
constraints. Following the work of
:cite:p:`verbockhavenSpottingExpressivityBottlenecks2025`, which compares
[[TINY|tiny]] to [[GradMax]] and [[NORTH]], we propose the following theorem.

.. container:: theorem

   **Theorem 1** (Unified objective for neuron addition). *Many of the
   gradient-informed methods above can be derived from the following
   optimization problem:*

   .. math::

      \begin{aligned}
      \label{eqn:unified_obj}
      & \mathop{\mathrm{\arg\!\min}}_{\boldsymbol{\Psi}, \boldsymbol{\Omega}} \left|\left|T \boldsymbol{S}^{-1/2}- \sigma(\boldsymbol{H}^{(l-2)} \boldsymbol{\Psi}^\top) \boldsymbol{\Omega}^\top \boldsymbol{S}^{1/2}\right|\right|_F\\
      &\propto \mathop{\mathrm{\arg\!\max}}_{\boldsymbol{\Psi}, \boldsymbol{\Omega};\; \left|\left|\boldsymbol{A}_{\text{ext}}^{1/2}\boldsymbol{\Omega}^\top \boldsymbol{S}^{1/2}\right|\right|_F \le 1} \left\langle T \boldsymbol{S}^{-1/2}, \frac{1}{n} \boldsymbol{H_{\text{ext}}}\boldsymbol{\Omega}^\top \boldsymbol{S}^{1/2} \right\rangle
      \end{aligned}

   *where:*

   - *:math:`T \in \mathbb{R}^{n \times C_{l}}` is the target: the gradient
     :math:`\boldsymbol{G}` or the residual gradient
     :math:`\boldsymbol{G}^\perp`*
   - *:math:`\boldsymbol{S}\in \mathbb{R}^{C_{l}\times C_{l}}` is the
     post-network metric
     :math:`\boldsymbol{S}:= \mathbb{E}_x\left[\left(\frac{\partial f}{\partial \boldsymbol{z}^{(l)}}(x)\right)^\top \frac{\partial f}{\partial \boldsymbol{z}^{(l)}}(x) \right]`*
   - *:math:`\boldsymbol{A}_{\text{ext}}\in \mathbb{R}^{C_{l-2} \times C_{l-2}}`
     is the covariance of the extended activations,
     :math:`\boldsymbol{A}_{\text{ext}}:= \frac{1}{n} \left(\boldsymbol{H_{\text{ext}}}\right)^\top \boldsymbol{H_{\text{ext}}}`,
     and :math:`\widehat{\boldsymbol{A}_{\text{ext}}}` its approximation for a
     linearized activation function (which is justified for
     :math:`\boldsymbol{\Psi}\approx 0`)*
   - *:math:`\sigma` is the activation function (or its linear approximation)*
   - *Additional constraints may be imposed on :math:`\boldsymbol{\Psi}` and
     :math:`\boldsymbol{\Omega}`*

   *Table* `[tab:unified] <#tab:unified>`__ *summarizes the specialization for
   each method. We also add a column indicating whether the method tries to
   satisfy the [[NORTH]] orthogonality constraint
   :math:`\boldsymbol{H_{\text{ext}}}\perp \boldsymbol{H}^{(l-1)}` (in the
   case of [[TINY|tiny]] it is only satisfied for linearized activation
   functions).*

The key insight is that all methods minimize the distance between the new
neurons' contribution and some target gradient signal, differing in:

#. **Target**: whether they use the raw gradient :math:`\boldsymbol{G}` or the
   residual gradient :math:`\boldsymbol{G}^\perp` (which avoids redundancy
   with existing neurons).
#. **Metric** on parameters: whether they use the natural norm, integrating
   pre- and post-network curvature; the semi-natural norm, integrating only
   pre-network curvature (:math:`\boldsymbol{S}= I_{C_{l}}`); or the standard
   Euclidean norm (:math:`\boldsymbol{S}= I_{C_{l}}` and
   :math:`\boldsymbol{A}_{\text{ext}}= I_{C_{\text{ext}}}`).
#. **Constraints**: sparsity (:math:`\|\cdot\|_0 = 1`), or function
   preservation (:math:`\boldsymbol{\Psi}= 0` or
   :math:`\boldsymbol{\Omega}= 0`).

.. _`sec:beyond_neuron_addition`: