Splitting
Splitting methods. One might wonder whether the Net2Net split of one neuron into two, with equally divided weights, is optimal. In S2D [LWW19] it is shown that, for an infinitesimal parameter change \(\|\theta_{t+1} - \theta_t\| \le \epsilon\), an equally weighted split, with the two copies offset in opposite directions, achieves the fastest possible decrease of the loss. Consider the post-activation output of a particular neuron \(i\) at layer \(l-1\). Splitting replaces this neuron with two neurons.
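In the notation of [LWW19], writing \(\sigma(\theta, x)\) for this output given input \(x\) and neuron parameters \(\theta\), the equally weighted split reads

\[\sigma(\theta, x) \;\longrightarrow\; \tfrac{1}{2}\,\sigma(\theta_1, x) + \tfrac{1}{2}\,\sigma(\theta_2, x), \qquad \|\theta_i - \theta\| \le \epsilon,\]

which leaves the network's function unchanged when \(\theta_1 = \theta_2 = \theta\).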
The influence of splitting on the loss is characterised by the minimum eigenvalue \(\lambda_{\min}\) of the splitting matrix:
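Writing the loss as \(\mathcal{L}(\theta) = \mathbb{E}_{x}\left[\Phi\big(\sigma(\theta, x)\big)\right]\), with \(\Phi\) absorbing the rest of the network together with the loss function, the splitting matrix of [LWW19] is

\[S(\theta) = \mathbb{E}_{x}\!\left[\Phi'\big(\sigma(\theta, x)\big)\,\nabla^{2}_{\theta}\,\sigma(\theta, x)\right].\]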
This “semi-Hessian” matrix provides a notion of splitting stability: when \(\lambda_{\min} \ge 0\), the loss cannot be improved (to second order) by any split; when \(\lambda_{\min} < 0\), the maximum decrease is achieved with the parameter updates:
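\[\theta_{1,2} = \theta \pm \epsilon\, v_{\min},\]

where \(v_{\min}\) is the unit eigenvector of \(S(\theta)\) associated with \(\lambda_{\min}\),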
yielding the loss change \(\Delta \mathcal{L} = \frac{\epsilon^2}{2} \lambda_{\min} + \mathcal{O}(\epsilon^3)\); any other \(\epsilon\)-split can do no better, satisfying \(\Delta \mathcal{L} \ge \frac{\epsilon^2}{2} \lambda_{\min} + \mathcal{O}(\epsilon^3)\). Since the gradient term vanishes at a local minimum and the splitting contribution first appears at \(\mathcal{O}(\epsilon^2)\), splitting can be thought of as a second-order method for escaping local minima that first-order descent cannot leave. In [WLW+20], S2D is extended with energy-aware constraints and a fast, gradient-based approximation of \(S(\theta)\), while S3D [WYL+21] generalises the admissible splits beyond convex combinations, allowing output weights of arbitrary sign.
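To make the recipe concrete, here is a minimal sketch that forms \(S(\theta)\) by brute force for a toy single-neuron model and splits along \(v_{\min}\). The model, the read-out \(\Phi(u) = (au - t)^2\), the data, and all constants are hypothetical choices for illustration, not taken from the papers.

```python
import torch

# Toy single-neuron model: sigma(theta, x) = tanh(theta . x), followed by a
# fixed read-out Phi(u) = (a*u - t)^2. All shapes, names, and constants here
# are illustrative choices, not from the S2D papers.
torch.manual_seed(0)
d = 3
theta = torch.randn(d)
a = 2.0                    # fixed read-out weight (hypothetical)
X = torch.randn(32, d)     # toy inputs
t = torch.randn(32)        # toy regression targets

def sigma(th, x):
    """Neuron output for a single input x."""
    return torch.tanh(th @ x)

def phi_prime(u, tgt):
    """dPhi/du, the derivative of the downstream loss w.r.t. the neuron output."""
    return 2.0 * a * (a * u - tgt)

# Splitting matrix S(theta) = E_x[ Phi'(sigma(theta, x)) * Hess_theta sigma(theta, x) ]
S = torch.zeros(d, d)
for x, tgt in zip(X, t):
    H = torch.autograd.functional.hessian(lambda th: sigma(th, x), theta)
    S += phi_prime(sigma(theta, x), tgt) * H
S /= len(X)

# Minimum eigenvalue and eigenvector; eigh returns eigenvalues in ascending order.
lam, V = torch.linalg.eigh(S)
lam_min, v_min = lam[0], V[:, 0]
print(f"lambda_min = {lam_min.item():.4f}")

# If lambda_min < 0 the neuron is splittable: offspring at theta +/- eps * v_min,
# each carrying half the original output weight.
if lam_min < 0:
    eps = 0.1
    theta_1, theta_2 = theta + eps * v_min, theta - eps * v_min
```

Forming the per-sample Hessian explicitly is only feasible for tiny \(\theta\); avoiding this cost is precisely what the gradient-based approximation of [WLW+20] is designed for.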
References
[LWW19] Qiang Liu, Lemeng Wu, and Dilin Wang. Splitting Steepest Descent for Growing Neural Architectures. In NeurIPS, 2019. arXiv:1910.02366. URL: http://arxiv.org/abs/1910.02366, doi:10.48550/arXiv.1910.02366.
[WLW+20] Dilin Wang, Meng Li, Lemeng Wu, Vikas Chandra, and Qiang Liu. Energy-Aware Neural Architecture Optimization with Fast Splitting Steepest Descent. 2020. arXiv:1910.03103. URL: http://arxiv.org/abs/1910.03103, doi:10.48550/arXiv.1910.03103.
[WYL+21] Lemeng Wu, Mao Ye, Qi Lei, Jason D. Lee, and Qiang Liu. Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting. 2021. arXiv:2003.10392. URL: http://arxiv.org/abs/2003.10392, doi:10.48550/arXiv.2003.10392.