When to stop?
================

Most growing methods follow a predetermined schedule and provide no notion of when to stop; only a few have an explicit (or implicit) stopping criterion. [[AutoGrow]] :cite:p:`wen_autogrow_2020` proposes an explicit stopping policy, permanently freezing further growth of a block once an addition fails to improve performance. [[Splitting]] methods naturally terminate when a local minimum of the loss is reached :cite:p:`wu_steepest_2021`, while grow-prune methods such as [[SENN]] :cite:p:`mitchell_self-expanding_2024` reach a steady-state behaviour through iterative growing and pruning.

**Following the Pareto front.** By restricting growth operations :math:`\mathcal{T}` to a neighbourhood of the existing architecture, one imposes an implicit constraint on the complexity :math:`C(A_t)` of the model during the growing process, whether measured in parameters, FLOPs, or walltime. In practice, however, the objective is a constraint on the *final complexity*: we want to reach an architecture on the Pareto-optimal front, which provides the best possible performance for the given resources, and to *stop* growing once the improvements no longer justify the additional capacity. None of the stopping methods above targets a desired performance-complexity tradeoff. Furthermore, the observation of Sec. `3.1 <#sec:adding_layers>`__ that new growth should occur before the old network has converged makes it difficult to explore the Pareto front directly.

**Notions of complexity.** What is the most appropriate notion of complexity :math:`C(A, \theta)`? The reduction in FLOPs of smaller models does not necessarily translate into a significant reduction in training time. As a point of comparison, [[Variance Transfer|variance_transfer]] :cite:p:`yuan_accelerated_2023`, whose growing method incurs negligible additional overhead, observes a mere 1.2x speedup when growing a :math:`1/4` channel-width ResNet-18 on CIFAR-100.
Due to this, they propose adapting the batch size to maximise GPU memory usage, achieving a 1.7x speedup for a negligible change in final accuracy. In general, however, large-batch training often results in poor generalisation performance :cite:p:`keskar_large-batch_2017` and is itself a topic of research. The largest-scale evidence for walltime efficiency comes from transformer growth experiments (Sec. `6 <#sec:transformers>`__), which demonstrate significant speedups but also reveal that efficiency gains depend on architecture and scale in ways that current theory does not predict.

**Supernetworks.** Apart from growing architectures, building a *supernetwork* that encompasses many operational sub-models in a nested manner is an alternative approach to recovering the entire Pareto front at the cost of a single training procedure (in this case, joint training of all sub-models). To train a supernetwork, one can either impose a sub-modular structure through joint loss minimisation, where each loss corresponds to the error of a given sub-model :cite:p:`matformer,slimmable,universally,branchynet`, or first train a large neural network and then distill its predictive power by fine-tuning smaller sub-modules :cite:p:`ofa`. Depending on the method, the nested structure can be obtained by varying the width of the representations :cite:p:`slimmable,universally,matformer`, the depth of the network :cite:p:`branchynet`, or both width and depth :cite:p:`ofa`. In general, a larger number of sub-modules provides greater flexibility in selecting a model that trades off size against performance.
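The joint-loss training of nested sub-models can be sketched in a toy setting: a pure-Python linear model whose width-:math:`k` sub-model uses the first :math:`k` coordinates of a shared weight vector. The widths, data, and optimiser below are illustrative assumptions rather than the setup of any cited method; real slimmable networks share convolutional channels across sub-models and use switchable batch-norm statistics per width.

.. code:: python

   import random

   # Toy "slimmable" model: the width-k sub-model predicts with the
   # first k coordinates of a shared weight vector, so sub-models are
   # nested. Joint training sums the losses of all chosen widths.
   random.seed(0)
   WIDTHS = [1, 2, 4]                     # nested sub-model widths (illustrative)
   DIM = 4
   TRUE_W = [0.5, -0.3, 0.8, 0.1]         # hypothetical ground-truth weights

   def predict(w, x, k):
       """Prediction of the width-k sub-model (shared weight prefix)."""
       return sum(w[i] * x[i] for i in range(k))

   # Synthetic training data from the full-width ground truth.
   data = []
   for _ in range(200):
       x = [random.uniform(-1, 1) for _ in range(DIM)]
       y = sum(TRUE_W[i] * x[i] for i in range(DIM))
       data.append((x, y))

   # Gradient descent on the joint loss: sum over widths of each
   # sub-model's mean squared error. Shared coordinates receive
   # gradients from every sub-model that uses them.
   w = [0.0] * DIM
   lr = 0.05
   for _ in range(300):
       grad = [0.0] * DIM
       for x, y in data:
           for k in WIDTHS:
               err = predict(w, x, k) - y
               for i in range(k):
                   grad[i] += 2 * err * x[i] / len(data)
       w = [wi - lr * gi for wi, gi in zip(w, grad)]

   def mse(k):
       return sum((predict(w, x, k) - y) ** 2 for x, y in data) / len(data)

   # One point on the size/accuracy trade-off curve per width.
   errors = {k: mse(k) for k in WIDTHS}

After training, each width yields one point on the size/accuracy trade-off curve, with wider sub-models fitting better; selecting a deployment model then amounts to picking an operating point on this recovered front rather than growing and stopping explicitly.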