Exploiting function geometry
============================

The methods in this section use local function information to optimize the
initialization of new neurons. We define
:math:`\boldsymbol{G}\in \mathbb{R}^{n \times C_{l}}` to be the (negative)
gradient of the loss w.r.t. pre-activations, stacked over :math:`n` samples,
and the residual gradient :math:`\boldsymbol{G}^\perp` as the gradient
component that cannot be addressed by updating the existing fan-out weights of
the increased layer; it identifies an *expressivity bottleneck*.

Gradient geometry and K-FAC
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Differentials (and thus gradients) are defined for a fixed topology, as they
give the local behaviour of functions, where locality is defined by the
topology. For classical gradient descent, the gradient is computed for a
change of parameters :math:`d\theta` that is small in Euclidean norm. For the
natural gradient, locality is instead defined on functions: parameters
:math:`\theta` and :math:`\theta'` are close if
:math:`\left|\left|f_{\theta} - f_{\theta'}\right|\right|` is small. K-FAC
:cite:p:`dangelKroneckerfactoredApproximateCurvature2025` proposes to
approximate this norm, when changing :math:`\boldsymbol{W}_l` to
:math:`\boldsymbol{W}_l+\delta \boldsymbol{W}_l`, as
:math:`\left|\left|\boldsymbol{S}^{1/2}\delta \boldsymbol{W}_l \boldsymbol{A}^{1/2}_{l-1}\right|\right|_F`,
where
:math:`\boldsymbol{A}_{l-1} := \boldsymbol{h}^{(l-1)} \left(\boldsymbol{h}^{(l-1)}\right)^\top`
is the covariance of the input of the layer and
:math:`\boldsymbol{S} := \left(\frac{\partial f}{\partial \boldsymbol{z}^{(l)}}\right)^\top \frac{\partial f}{\partial \boldsymbol{z}^{(l)}}`
is the covariance of the Jacobians of the loss with respect to the layer
output. In other words, :math:`\boldsymbol{A}_{l-1}` accounts for curvature
before layer :math:`l` and :math:`\boldsymbol{S}` for curvature after it.

Methods
~~~~~~~

The following methods all optimize
:math:`(\boldsymbol{\Psi}, \boldsymbol{\Omega})` to maximize trainability;
they differ in whether they preserve the network function at initialization,
i.e. whether they initialize with :math:`\delta_z = 0` or allow
:math:`\delta_z \neq 0`.
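As a concrete illustration of the targets these methods optimize against, the residual gradient :math:`\boldsymbol{G}^\perp` defined above can be computed as the least-squares residual of :math:`\boldsymbol{G}` against the existing activations: whatever part of the gradient lies in the span of :math:`\boldsymbol{H}^{(l-1)}` can be absorbed by updating the existing fan-out weights, and only the orthogonal remainder calls for new neurons. The following is a minimal numpy sketch under that interpretation; the shapes and the explicit projection step are illustrative assumptions, not any method's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C_prev, C_l = 64, 16, 8              # hypothetical sample count and layer widths

H = rng.standard_normal((n, C_prev))    # activations h^{(l-1)}, stacked over n samples
G = rng.standard_normal((n, C_l))       # (negative) gradient w.r.t. pre-activations z^{(l)}

# Component of G reachable by updating the existing fan-out weights:
# the least-squares projection of G onto the column space of H.
W_fit, *_ = np.linalg.lstsq(H, G, rcond=None)
G_perp = G - H @ W_fit                  # residual gradient: the expressivity bottleneck

# Sanity check: the residual is orthogonal to the existing activations,
# so no update of the existing weights can reduce it.
assert np.allclose(H.T @ G_perp, 0.0, atol=1e-8)
```

A large :math:`\|\boldsymbol{G}^\perp\|` in this sketch signals that the layer, as currently sized, cannot follow the gradient — the situation the methods below exploit when placing new neurons.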
Function-preserving geometric methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These methods initialize with :math:`\delta_z = 0`, preserving the network
function.

- [[SENN]]
- [[NORTH]]
- [[GradMax]]

Function-improving geometric methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These methods initialize with :math:`\delta_z \neq 0`, directly improving the
loss at initialization.

- [[Tiny]]
- [[NeST|nest]]

A unified framework for geometric methods
-----------------------------------------

Many of the above methods, even if they seem different at first sight, can be
unified under the same objective with different choices of target, metric, and
constraints. Following the work of
:cite:p:`verbockhavenSpottingExpressivityBottlenecks2025`, which compares
[[TINY|tiny]] to [[GradMax]] and [[NORTH]], we propose the following theorem.

.. container:: theorem

   **Theorem 1** (Unified objective for neuron addition). *Many of the
   gradient-informed methods above can be derived from the following
   optimization problem:*

   .. math::

      \begin{aligned}
      \label{eqn:unified_obj}
      & \mathop{\mathrm{\arg\!\min}}_{\boldsymbol{\Psi}, \boldsymbol{\Omega}} \left|\left|T \boldsymbol{S}^{-1/2}- \sigma(\boldsymbol{H}^{(l-2)} \boldsymbol{\Psi}^\top) \boldsymbol{\Omega}^\top \boldsymbol{S}^{1/2}\right|\right|_F\\
      &\propto \mathop{\mathrm{\arg\!\max}}_{\boldsymbol{\Psi}, \boldsymbol{\Omega};\; \left|\left|\boldsymbol{A}_{\text{ext}}^{1/2}\boldsymbol{\Omega}^\top \boldsymbol{S}^{1/2}\right|\right|_F \le 1} \left\langle T \boldsymbol{S}^{-1/2}, \frac{1}{n} \boldsymbol{H_{\text{ext}}}\boldsymbol{\Omega}^\top \boldsymbol{S}^{1/2} \right\rangle
      \end{aligned}

   *where:*

   - *:math:`T \in \mathbb{R}^{n \times C_{l}}` is the target: the gradient
     :math:`\boldsymbol{G}` or the residual gradient
     :math:`\boldsymbol{G}^\perp`*
   - *:math:`\boldsymbol{S}\in \mathbb{R}^{C_{l}\times C_{l}}` is the
     post-network metric
     :math:`\boldsymbol{S}:= \mathbb{E}_x\left[\left(\frac{\partial f}{\partial \boldsymbol{z}^{(l)}}(x)\right)^\top \frac{\partial f}{\partial \boldsymbol{z}^{(l)}}(x) \right]`*
   - *:math:`\boldsymbol{A}_{\text{ext}}\in \mathbb{R}^{C_{l-2} \times C_{l-2}}`
     is the covariance of the extended activations,
     :math:`\boldsymbol{A}_{\text{ext}}:= \frac{1}{n} \left(\boldsymbol{H_{\text{ext}}}\right)^\top \boldsymbol{H_{\text{ext}}}`,
     and :math:`\widehat{\boldsymbol{A}_{\text{ext}}}` its approximation for a
     linearized activation function (which is justified for
     :math:`\boldsymbol{\Psi}\approx 0`)*
   - *:math:`\sigma` is the activation function (or its linear approximation)*
   - *Additional constraints may be imposed on :math:`\boldsymbol{\Psi}` and
     :math:`\boldsymbol{\Omega}`*

   *Table* `[tab:unified] <#tab:unified>`__ *summarizes the specialization for
   each method. We also add a column indicating whether the method tries to
   satisfy the [[NORTH]] orthogonality constraint
   :math:`\boldsymbol{H_{\text{ext}}}\perp \boldsymbol{H}^{(l-1)}` (in the
   case of [[TINY|tiny]] it is only satisfied for linearized activation
   functions).*

The key insight is that all methods minimize the distance between the new
neurons' contribution and some target gradient signal, differing in:

#. **Target**: whether they use the raw gradient :math:`\boldsymbol{G}` or the
   residual gradient :math:`\boldsymbol{G}^\perp` (which avoids redundancy
   with existing neurons).
#. **Metric** on parameters: whether they use the natural norm, integrating
   pre- and post-network curvature; the semi-natural norm, integrating only
   pre-network curvature (:math:`\boldsymbol{S}= I_{C_{l}}`); or the standard
   Euclidean norm (:math:`\boldsymbol{S}= I_{C_{l}}` and
   :math:`\boldsymbol{A}_{\text{ext}}= I_{C_{\text{ext}}}`).
#. **Constraints**: sparsity (:math:`\|\cdot\|_0 = 1`), or function
   preservation (:math:`\boldsymbol{\Psi}= 0` or
   :math:`\boldsymbol{\Omega}= 0`).

.. _`sec:beyond_neuron_addition`: