Splitting
=========

**[[Splitting]] methods.** One might wonder whether the [[Net2Net]] split of one neuron into two, with equally divided weights, is optimal. In S2D :cite:p:`liu_splitting_2019` it was shown that, for an infinitesimal change in the parameters :math:`\|\theta_{t+1} - \theta_t\| \le \epsilon`, this choice of split leads to the fastest decrease of the loss. Consider the post-activation output of a particular neuron :math:`i` at layer :math:`l-1`. Splitting replaces the original neuron with two neurons:

.. math::

   \sigma(\boldsymbol{z}^{(l-1)}_i) \to \frac{1}{2} \left( \sigma(\theta_1 \cdot \boldsymbol{h}^{(l-2)}) + \sigma(\theta_2 \cdot \boldsymbol{h}^{(l-2)}) \right)

The influence of splitting on the loss is characterised by the minimum eigenvalue :math:`\lambda_{\min}` of the *splitting matrix*

.. math::
   :label: eqn:splitting

   S(\theta) = \underset{x \sim \mathcal{D}}{\mathbb{E}}\left[ \nabla_{\boldsymbol{z}^{(l-1)}} \mathcal{L}(f(x))\, \nabla_{\theta\theta}^2 \sigma(\theta \cdot \boldsymbol{h}^{(l-2)}(x)) \right].

This “semi-Hessian” matrix provides a notion of splitting stability: when :math:`\lambda_{\min} > 0`, the loss cannot be improved by splitting; when :math:`\lambda_{\min} < 0`, the maximum decrease is achieved with the parameter updates

.. math::

   \boldsymbol{\psi}_{1} = \theta + \epsilon\, v_{\min}(S(\theta)), \qquad \boldsymbol{\psi}_{2} = \theta - \epsilon\, v_{\min}(S(\theta)),

yielding a loss change :math:`\Delta \mathcal{L} \geq \frac{\epsilon^2}{2} \lambda_{\min} + \mathcal{O}(\epsilon^3)`. Since the contribution to the loss appears at :math:`\mathcal{O}(\epsilon^2)`, splitting can be thought of as a second-order method for escaping local minima.
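The eigen-decision above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference S2D implementation: it assumes the splitting matrix :math:`S(\theta)` for a neuron has already been estimated, and the function names ``split_neuron`` and ``split_output`` are illustrative.

.. code-block:: python

   import numpy as np

   def split_neuron(theta, S, eps=1e-2):
       """Split a neuron along the minimum eigen-direction of its
       splitting matrix S (illustrative sketch; S is assumed given).

       theta : (d,) incoming weight vector of the neuron
       S     : (d, d) symmetric splitting matrix S(theta)
       eps   : magnitude of the infinitesimal split

       Returns (psi1, psi2), or None when lambda_min >= 0, i.e. the
       neuron is splitting-stable and the loss cannot be decreased.
       """
       eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
       lam_min, v_min = eigvals[0], eigvecs[:, 0]
       if lam_min >= 0:
           return None                        # splitting cannot help
       psi1 = theta + eps * v_min             # offspring 1
       psi2 = theta - eps * v_min             # offspring 2
       return psi1, psi2

   def split_output(psi1, psi2, h, sigma=np.tanh):
       """Post-split activation: average of the two offspring neurons."""
       return 0.5 * (sigma(psi1 @ h) + sigma(psi2 @ h))

Note that the two offspring average back to the original weights (``psi1 + psi2 == 2 * theta``), so at :math:`\epsilon = 0` the split leaves the network function unchanged, matching the equally-divided-weights picture above.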
In :cite:p:`wang_energy-aware_2020`, S2D is extended with energy-aware constraints and a fast gradient-based approximation of :math:`S(\theta)`, while S3D :cite:p:`wu_steepest_2021` generalises the types of split considered to non-convex combinations with weights of arbitrary sign.