Splitting

Splitting methods. One might wonder whether the Net2Net split of one neuron into two, with equally divided weights, is optimal. S2D [LWW19] shows that, for an infinitesimal change in the parameters \(\|\theta_{t+1} - \theta_t\| \le \epsilon\), this choice of split yields the fastest decrease of the loss. Consider the post-activation output of a particular neuron \(i\) at layer \(l-1\), with incoming weights \(\theta\). Splitting replaces the original neuron with the average of two neurons:

\[\begin{aligned} \sigma(\boldsymbol{z}^{(l-1)}_i) \to \frac{1}{2} \left( \sigma(\theta_1 \cdot \boldsymbol{h}^{(l-2)} ) + \sigma(\theta_2 \cdot \boldsymbol{h}^{(l-2)}) \right) \end{aligned}\]
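The averaging above is equivalent to duplicating the neuron's incoming weights and halving its outgoing weights, which leaves the network's function unchanged (the Net2Net identity split). A minimal numpy sketch, with a hypothetical two-layer ReLU network (`W1`, `W2` and the neuron index `i` are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer MLP: y = W2 @ relu(W1 @ x).
W1 = rng.normal(size=(4, 3))   # 4 hidden neurons, 3 inputs
W2 = rng.normal(size=(2, 4))   # 2 outputs

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Split hidden neuron i into two copies with identical incoming
# weights; each copy's outgoing weight is halved, so the two copies
# together contribute exactly the average of the equation above.
i = 1
W1_split = np.vstack([W1, W1[i:i + 1]])      # duplicate row i (incoming)
W2_split = np.hstack([W2, W2[:, i:i + 1]])   # duplicate column i (outgoing)
W2_split[:, i] *= 0.5
W2_split[:, -1] *= 0.5

x = rng.normal(size=3)
# The split network computes the same function as the original.
assert np.allclose(forward(W1, W2, x), forward(W1_split, W2_split, x))
```

After the split, \(\theta_1\) and \(\theta_2\) are free to move apart, which is what the steepest-descent analysis below exploits.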

The influence of splitting on the loss is characterised by the minimum eigenvalue \(\lambda_{\min}\) of the splitting matrix:

\[\label{eqn:splitting} S(\theta) = \underset{x \sim \mathcal{D}}{\mathbb{E}}\left[ \frac{\partial \mathcal{L}(f(x))}{\partial\, \sigma(\boldsymbol{z}^{(l-1)}_i)}\, \nabla_{\theta\theta}^2\, \sigma(\theta \cdot \boldsymbol{h}^{(l-2)}(x)) \right].\]

This “semi-Hessian” matrix provides a notion of splitting stability: when \(\lambda_{\min} \ge 0\), the loss cannot be improved (to second order) by any split; when \(\lambda_{\min} < 0\), the maximal decrease is achieved with the parameter updates:

\[\begin{aligned} \boldsymbol{\psi}_{1} &= \theta + \epsilon\, v_{\min}(S(\theta)), \quad \boldsymbol{\psi}_{2} = \theta - \epsilon\, v_{\min}(S(\theta)) \end{aligned}\]

yielding loss change \(\Delta \mathcal{L} = \frac{\epsilon^2}{2} \lambda_{\min} + \mathcal{O}(\epsilon^3) < 0\). Since the contribution to the loss appears at \(\mathcal{O}(\epsilon^2)\), splitting can be thought of as a second-order method for escaping local minima. In [WLW+20], S2D is improved to include energy-aware constraints and a fast gradient-based approximation of \(S(\theta)\), while S3D [WYL+21] generalises the splits considered to non-convex combinations with weights of arbitrary sign.
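The whole procedure can be checked numerically. The following is a minimal sketch, assuming a single \(\tanh\) neuron with squared loss \(\Phi(y) = (y - t)^2\), for which the splitting matrix reduces to \(S(\theta) = \mathbb{E}\!\left[\Phi'(\sigma(\theta\cdot x))\,\sigma''(\theta\cdot x)\, x x^{\top}\right]\); the data `X`, targets `t`, and step size `eps` are illustrative choices, not from the source:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy problem: one neuron y = sigma(theta . x),
# per-sample loss Phi(y) = (y - t)^2, averaged over the data.
n, d = 512, 3
X = rng.normal(size=(n, d))
t = rng.normal(size=n)
theta = rng.normal(size=d)

sigma = np.tanh
dsigma2 = lambda z: -2.0 * np.tanh(z) * (1.0 - np.tanh(z) ** 2)  # sigma''
dPhi = lambda y: 2.0 * (y - t)                                    # Phi'

z = X @ theta
# Splitting matrix: S = E[ Phi'(sigma(z)) * sigma''(z) * x x^T ].
w = dPhi(sigma(z)) * dsigma2(z)          # per-sample scalar weight
S = (X * w[:, None]).T @ X / n

eigvals, eigvecs = np.linalg.eigh(S)     # eigh: ascending eigenvalues
lam_min, v_min = eigvals[0], eigvecs[:, 0]

# Split along +/- eps * v_min and measure the actual loss change.
eps = 1e-3
loss = lambda pred: np.mean((pred - t) ** 2)
base = loss(sigma(z))
split = loss(0.5 * (sigma(X @ (theta + eps * v_min))
                    + sigma(X @ (theta - eps * v_min))))

measured = split - base                  # actual loss change
predicted = 0.5 * eps ** 2 * lam_min     # second-order prediction
```

For small `eps` the measured change agrees with \(\frac{\epsilon^2}{2}\lambda_{\min}\) up to the higher-order remainder, and the split only helps when `lam_min` is negative, matching the stability criterion above.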

References

[LWW19]

Qiang Liu, Lemeng Wu, and Dilin Wang. Splitting Steepest Descent for Growing Neural Architectures. In NeurIPS. 2019. arXiv:1910.02366. URL: http://arxiv.org/abs/1910.02366, doi:10.48550/arXiv.1910.02366.

[WLW+20]

Dilin Wang, Meng Li, Lemeng Wu, Vikas Chandra, and Qiang Liu. Energy-Aware Neural Architecture Optimization with Fast Splitting Steepest Descent. 2020. arXiv:1910.03103. URL: http://arxiv.org/abs/1910.03103, doi:10.48550/arXiv.1910.03103.

[WYL+21]

Lemeng Wu, Mao Ye, Qi Lei, Jason D. Lee, and Qiang Liu. Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting. 2021. arXiv:2003.10392. URL: http://arxiv.org/abs/2003.10392, doi:10.48550/arXiv.2003.10392.