FRAGrow
=======

    **TLDR:** Grow fast enough to avoid underfitting, and slowly enough to retain the regularising benefit of growth on overfitting models. FRAGrow adapts the growth interval at run time based on the train/validation accuracy gap.

**FRAGrow** :cite:p:`wu_when_2024` revisits the *when to grow* question studied
by [[AutoGrow|autogrow]]. The authors argue that non-function-preserving neural growth has an
*inherent regularisation effect*: because the blocks added later in
training spend fewer epochs being updated than the first block, their
final weights stay closer to their initialisation, which dampens what
the network can memorise. The strength of this regularisation is
controlled by the growth schedule (faster growth → more training
epochs per block on average → weaker regularisation). FRAGrow turns
this observation into an adaptive *when to grow* policy that
automatically grows faster on models that underfit (e.g. ResNet/VGG
on ImageNet) and slower on models that overfit (e.g. ResNet/VGG on
CIFAR-10/100).


Method
------

Vocabulary
^^^^^^^^^^

The notation ``ResNet-2-2-2-2`` denotes a ResNet with two residual
blocks in each of four stages, ``VGG-1-1-1-1-1`` a VGG with one
block in each of five stages, and similarly for MobileNetV2. A
*stage* groups blocks with the same output spatial size. The seed
networks (``ResNet-2-2-2-2``, ``VGG-1-1-1-1-2``,
``MobileNetV2-1-1-1-1-1``) are grown into the targets
(``ResNet-8-8-8-8``, ``VGG-2-2-4-4-4``, ``MobileNetV2-2-3-4-3-3``).
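
The depth notation can be turned into the number of blocks that growth
has to add. The sketch below is only an illustration: the helper names
``parse_depths`` and ``blocks_to_add`` are hypothetical, and it assumes
that every number in the name counts blocks in one stage, as described
above.

```python
def parse_depths(name: str) -> tuple[str, list[int]]:
    """Split 'ResNet-2-2-2-2' into ('ResNet', [2, 2, 2, 2])."""
    family, *depths = name.split("-")
    return family, [int(d) for d in depths]


def blocks_to_add(seed: str, target: str) -> int:
    """Number of blocks that growth must add to turn seed into target."""
    seed_family, seed_depths = parse_depths(seed)
    target_family, target_depths = parse_depths(target)
    if seed_family != target_family or len(seed_depths) != len(target_depths):
        raise ValueError("seed and target must share family and stage count")
    if any(t < s for s, t in zip(seed_depths, target_depths)):
        raise ValueError("target must be at least as deep as the seed")
    return sum(t - s for s, t in zip(seed_depths, target_depths))


# Seed/target pairs exactly as listed in the text above.
pairs = [
    ("ResNet-2-2-2-2", "ResNet-8-8-8-8"),
    ("VGG-1-1-1-1-2", "VGG-2-2-4-4-4"),
    ("MobileNetV2-1-1-1-1-1", "MobileNetV2-2-3-4-3-3"),
]
for seed, target in pairs:
    print(f"{seed} -> {target}: {blocks_to_add(seed, target)} blocks to add")
```

With the pairs listed above this gives 24 blocks for ResNet and 10 each
for VGG and MobileNetV2.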

How
^^^

A new block is initialised by **duplicating the weights of the
preceding block** in the same stage :cite:p:`dong_towards_2020`,
except when the preceding block is a downsampling block; in that
case the new block is randomly initialised (Kaiming initialisation).
During the entire growth phase the learning rate is held at a large
constant value (as recommended by [[AutoGrow|autogrow]]); cosine
decay with no restart is then used for the final fine-tuning.
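
The initialisation rule can be sketched on a toy representation in
which a block is a plain dictionary holding a flat weight list. This
layout, and the names ``kaiming_normal`` and ``init_new_block``, are
assumptions made for illustration only; they are not the paper's code.

```python
import copy
import math
import random


def kaiming_normal(n_weights: int, fan_in: int, rng: random.Random) -> list[float]:
    """Draw weights from a zero-mean normal with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [rng.gauss(0.0, std) for _ in range(n_weights)]


def init_new_block(prev_block: dict, n_weights: int, fan_in: int,
                   rng: random.Random) -> dict:
    """Initialise a block appended right after prev_block in the same stage."""
    if prev_block["is_downsampling"]:
        # The preceding block changes the tensor shape, so its weights
        # cannot be reused: fall back to random (Kaiming) initialisation.
        weights = kaiming_normal(n_weights, fan_in, rng)
    else:
        if len(prev_block["weights"]) != n_weights:
            raise ValueError("cannot duplicate a block of a different shape")
        # Duplicate the preceding block; deepcopy keeps the two blocks
        # independent during later updates.
        weights = copy.deepcopy(prev_block["weights"])
    return {"weights": weights, "fan_in": fan_in, "is_downsampling": False}


rng = random.Random(0)
regular = {"weights": [0.1, -0.2, 0.3], "fan_in": 3, "is_downsampling": False}
copied = init_new_block(regular, n_weights=3, fan_in=3, rng=rng)
print(copied["weights"])
```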

Where
^^^^^

Growth is **sequential, front-to-back**: new blocks are appended to
the current stage until that stage reaches its target depth, after
which the algorithm moves on to the next stage. (A round-robin
variant — the [[AutoGrow|autogrow]] *circulation* order — is
benchmarked in the ablation and gives comparable results.) The
target architecture is fixed by the user, so the total number of
blocks to add is known up front.
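
The two growth orders can be compared with a small sketch that yields,
for every added block, the index of the stage that receives it. The
function names are hypothetical, and the round-robin variant is a
simplified reading of the circulation order (one block per unfinished
stage per pass), not a reproduction of the [[AutoGrow|autogrow]]
implementation.

```python
from typing import Iterator


def sequential_order(seed: list[int], target: list[int]) -> Iterator[int]:
    """Fill each stage up to its target depth before moving to the next."""
    for stage, (s, t) in enumerate(zip(seed, target)):
        for _ in range(t - s):
            yield stage


def round_robin_order(seed: list[int], target: list[int]) -> Iterator[int]:
    """Cycle over the stages, adding one block to each unfinished stage."""
    depths = list(seed)
    while any(d < t for d, t in zip(depths, target)):
        for stage, t in enumerate(target):
            if depths[stage] < t:
                depths[stage] += 1
                yield stage


seed, target = [1, 1, 1, 1, 2], [2, 2, 4, 4, 4]
print(list(sequential_order(seed, target)))
print(list(round_robin_order(seed, target)))
```

Both orders add the same total number of blocks; only the position of
each addition in the schedule differs.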

When
^^^^

This is the central contribution. Let :math:`E_T` be the total
number of training epochs, :math:`E_F^{\min}` the minimum number of
fine-tuning epochs reserved at the end, and :math:`n` the number of
blocks to add. The *maximum* allowed growth interval is

.. math::
   I_{\max} = \frac{E_T - E_F^{\min}}{n}.

At every growth check, FRAGrow computes the **overfitting risk
level** as the train/validation accuracy gap

.. math::
   \mathrm{ORL} = \text{train accuracy} - \text{validation accuracy},

then picks the next growth interval as

.. math::
   I = \frac{I_{\max}}{1 + e^{\alpha - \mathrm{ORL}}},

with :math:`\alpha` the only hyper-parameter (default
:math:`\alpha = 4`). When the model underfits (small ORL), the
exponential drives :math:`I` toward 0 and growth accelerates, which
weakens the regularising effect of growth. When the model overfits
(large ORL), :math:`I` saturates at :math:`I_{\max}` and growth
slows down, which strengthens the regularising effect. The
validation set used for ORL is :math:`1\%` of the training data, so
the overhead of evaluating ORL once per epoch is negligible.
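
The three formulas above translate directly into code. The sketch below
assumes that accuracies are expressed in percent, so that ORL is in
percentage points and comparable to :math:`\alpha = 4`; the function
names are hypothetical, and how a fractional interval is rounded to
whole epochs is left open here. The example values
(:math:`E_T = 180`, :math:`E_F^{\min} = 30`, :math:`n = 24`) correspond
to growing ``ResNet-2-2-2-2`` into ``ResNet-8-8-8-8`` on CIFAR.

```python
import math


def max_growth_interval(total_epochs: float, min_finetune_epochs: float,
                        n_blocks: int) -> float:
    """I_max = (E_T - E_F^min) / n."""
    if n_blocks <= 0:
        raise ValueError("n_blocks must be positive")
    return (total_epochs - min_finetune_epochs) / n_blocks


def overfitting_risk_level(train_acc: float, val_acc: float) -> float:
    """ORL = train accuracy - validation accuracy (assumed in percent)."""
    return train_acc - val_acc


def growth_interval(i_max: float, orl: float, alpha: float = 4.0) -> float:
    """I = I_max / (1 + exp(alpha - ORL))."""
    return i_max / (1.0 + math.exp(alpha - orl))


i_max = max_growth_interval(total_epochs=180, min_finetune_epochs=30, n_blocks=24)
for train_acc, val_acc in [(60.0, 60.0), (70.0, 66.0), (95.0, 85.0)]:
    orl = overfitting_risk_level(train_acc, val_acc)
    print(f"ORL = {orl:4.1f} pp -> I = {growth_interval(i_max, orl):.3f} epochs")
```

In this example :math:`I_{\max} = 6.25` epochs. The interval is exactly
:math:`I_{\max}/2` when :math:`\mathrm{ORL} = \alpha`, shrinks toward 0
for smaller gaps, and approaches :math:`I_{\max}` for larger ones.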


Experiments
-----------

Setup
^^^^^

Three datasets are used: CIFAR-10, CIFAR-100 and ImageNet. Three
architectures are grown: ResNet, VGG and MobileNetV2. Training uses
SGD with momentum :math:`0.9`, cosine learning rate decay during
fine-tuning, Kaiming (He) initialisation, and :math:`E_F^{\min}=30` epochs of
fine-tuning. Total training is :math:`E_T = 180` epochs on CIFAR
and :math:`E_T = 120` epochs on ImageNet. The initial learning rate
is :math:`0.5/0.1/0.1` for ResNet/VGG/MobileNetV2 on CIFAR and
:math:`0.1` for all models on ImageNet. Each experiment is repeated
:math:`3` times.
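
The learning-rate schedule (constant while growing, a single cosine
decay during fine-tuning) can be sketched as follows. The function name,
the decay to zero, and the choice of epoch 150 as the end of growth are
assumptions for illustration; the text above only fixes the constant
phase, the absence of restarts, and the initial learning rates.

```python
import math


def learning_rate(epoch: float, growth_end_epoch: float, total_epochs: float,
                  base_lr: float) -> float:
    """Constant base_lr during growth, then one cosine decay to zero."""
    if total_epochs <= growth_end_epoch:
        raise ValueError("fine-tuning phase must have positive length")
    if epoch < growth_end_epoch:
        return base_lr
    progress = (epoch - growth_end_epoch) / (total_epochs - growth_end_epoch)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# ResNet on CIFAR: base learning rate 0.5, 180 epochs in total,
# growth assumed to end at epoch 150 (30 fine-tuning epochs remain).
for epoch in (0, 100, 150, 165, 180):
    print(epoch, round(learning_rate(epoch, 150, 180, 0.5), 4))
```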

The contenders are:

- *Periodic*: grow at a fixed interval :math:`I_{\max}`.
- *Convergent*: grow whenever the validation accuracy
  stagnates (the [[AutoGrow|autogrow]] *c-AutoGrow* setting).
- *Lipgrow* :cite:p:`dong_towards_2020`: grow whenever the
  Lipschitz constant of the model exceeds a threshold, doubling the
  blocks at every growth.
- *FRAGrow*: the adaptive interval above with
  :math:`\alpha = 4`.
- *Vanilla*: train the target ``Large`` network from scratch (no
  growth).
- *Small*: train the seed shallow network from scratch.
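
The difference between the *Periodic* and *Convergent* triggers can be
illustrated with two small predicates. The stagnation test used here (no
improvement of the best validation accuracy over the last ``patience``
epochs) is a hypothetical criterion chosen for the sketch; the exact
rule used by the baselines is not specified in this summary.

```python
def periodic_trigger(epochs_since_growth: int, interval: float) -> bool:
    """Grow once a fixed number of epochs has passed since the last growth."""
    return epochs_since_growth >= interval


def convergent_trigger(val_acc_history: list[float], patience: int,
                       min_delta: float = 0.0) -> bool:
    """Grow when validation accuracy stopped improving (assumed criterion)."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    if len(val_acc_history) <= patience:
        return False  # not enough history to judge stagnation
    best_before = max(val_acc_history[:-patience])
    best_recent = max(val_acc_history[-patience:])
    return best_recent <= best_before + min_delta


print(periodic_trigger(epochs_since_growth=7, interval=6.25))
print(convergent_trigger([50.0, 60.0, 61.0, 61.0, 60.5], patience=2))
print(convergent_trigger([50.0, 60.0, 61.0, 62.0, 63.0], patience=2))
```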

The paper additionally uses the terms *slow growth* and *fast
growth* (Table 1 of the paper) to illustrate the regularisation
effect: both are periodic schedules, *slow growth* having a larger
growth interval and *fast growth* a smaller one. **The exact
interval values for these two schedules are not given in the
paper**; they are described only in relative terms, so the
mapping between these two schedules and the *Periodic*
(:math:`I = I_{\max}`) configuration is left unspecified.


Main "when to grow" comparison
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following table consolidates the *when to grow* experiments
(Tables 1, 5, 6, 7 and 8 of :cite:p:`wu_when_2024`). Each cell
reports test error (%) and normalised training time (%, with
*FRAGrow* at :math:`\alpha = 4` set to 100%); lower is better for both
values. MobileNetV2 results on ImageNet are only partially available
(see the † note below the table).

.. table:: When-to-grow comparison across models and datasets. Test error (%) / normalised training time (%). Time is normalised so that FRAGrow = 100 within each (model, dataset) cell. Best test error per (model, dataset) in bold.
    :align: center

    +---------------+-------------------------------+------------------+------------------+------------------+
    | Model         | Method                        | CIFAR-10         | CIFAR-100        | ImageNet         |
    +===============+===============================+==================+==================+==================+
    | ResNet        | Vanilla (large)               | 6.66 / 121.29    | 29.56 / 125.11   | **24.14** / 111.1|
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Small (seed)                  | 8.37 / 39.23     | 32.97 / 36.95    | 29.06 / 58.79    |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Convergent                    | 6.35 / 104.39    | 29.25 / 107.42   | 25.30 / 86.59    |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Slow growth                   | – / –            | **28.91** / –    | 24.73 / –        |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Fast growth                   | – / –            | 29.33 / –        | 24.29 / –        |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Periodic                      | 6.58 / 102.33    | 29.29 / 98.62    | 24.79 / 93.39    |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Lipgrow                       | 7.18 / 67.22     | 29.23 / 96.09    | 25.10 / 88.86    |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | FRAGrow, :math:`\alpha=2`     | – / –            | 29.11 / 99.04    | 24.86 / 98.23    |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | **FRAGrow,** :math:`\alpha=4` | **6.32 / 100**   | 29.14 / 100      | 24.32 / 100      |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | FRAGrow, :math:`\alpha=6`     | – / –            | 29.20 / 97.70    | 24.27 / 100.9    |
    +---------------+-------------------------------+------------------+------------------+------------------+
    | VGG           | Vanilla (large)               | 6.22 / 122.48    | 26.96 / 119.05   | **24.15** / 112.7|
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Small (seed)                  | 8.27 / 48.21     | 31.12 / 51.71    | 31.01 / 54.35    |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Convergent                    | 6.33 / 99.72     | 26.83 / 92.65    | 26.42 / 77.61    |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Slow growth                   | – / –            | 27.34 / –        | 25.70 / –        |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Fast growth                   | – / –            | 26.86 / –        | 24.44 / –        |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Periodic                      | 6.40 / 92.23     | 26.73 / 93.02    | 25.70 / 92.40    |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Lipgrow                       | 7.05 / 75.92     | 29.82 / 83.26    | 27.03 / 103.91   |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | FRAGrow, :math:`\alpha=2`     | – / –            | 26.75 / 95.11    | 24.22 / 101.20   |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | **FRAGrow,** :math:`\alpha=4` | **6.20 / 100**   | **26.57** / 100  | 24.39 / 100      |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | FRAGrow, :math:`\alpha=6`     | – / –            | 26.89 / 110.58   | 24.32 / 99.17    |
    +---------------+-------------------------------+------------------+------------------+------------------+
    | MobileNetV2   | Vanilla (large)               | **5.22** / 132.92| 23.95 / 123.83   | **29.71** / –†   |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Small (seed)                  | 7.32 / 38.12     | 27.49 / 40.07    | 36.97 / –†       |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Periodic                      | 5.66 / 95.11     | 24.35 / 99.67    | – / –†           |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Convergent                    | 5.50 / 105.86    | 24.11 / 106.87   | – / –†           |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | Lipgrow                       | 5.64 / 117.30    | 24.32 / 116.65   | – / –†           |
    +               +-------------------------------+------------------+------------------+------------------+
    |               | **FRAGrow,** :math:`\alpha=4` | 5.60 / 100       | **23.94 / 100**  | 30.25 / 100      |
    +---------------+-------------------------------+------------------+------------------+------------------+

† MobileNetV2 numbers on ImageNet are incomplete: the *Periodic*,
*Convergent* and *Lipgrow* baselines have no reported results, and
*Small* and *Vanilla* have a test error but no training time. Only
*FRAGrow* has both (Table 5). The corresponding cells are left blank.

Three findings stand out:

1. **Convergent growth is not clearly inferior to periodic growth.**
   The two policies trade places several times across (model,
   dataset) pairs: *Convergent* beats *Periodic* on ResNet/CIFAR-10
   (6.35 % vs. 6.58 %) and MobileNetV2/CIFAR-100 (24.11 % vs.
   24.35 %), but loses on ResNet/ImageNet (25.30 % vs. 24.79 %) and
   VGG/ImageNet (26.42 % vs. 25.70 %). This contradicts the clean
   ordering "Periodic » Convergent" claimed by [[AutoGrow|autogrow]],
   although the setup here differs (different :math:`I_{\max}`,
   different initialiser, different target architectures, and
   different :math:`K`).

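2. **The right schedule depends on the fitting regime.** On the
   overfitting CIFAR datasets most growth schedules come close to or
   slightly beat the *Vanilla* baseline (all of them beat it on
   ResNet/CIFAR-100), which is consistent with the regularising
   effect of growth helping. The exceptions are *Lipgrow*, which
   trails *Vanilla* by 0.5 to 2.9 pp on ResNet/CIFAR-10 and on VGG,
   and MobileNetV2/CIFAR-10, where *Vanilla* remains best. On the
   underfitting ImageNet dataset, by contrast, *Periodic*,
   *Convergent* and *Lipgrow* all lose accuracy to *Vanilla* (by 0.65
   to 1.16 pp on ResNet and by 1.55 to 2.88 pp on VGG), while
   *FRAGrow* with :math:`\alpha = 4`, which detects the underfitting
   via the ORL and grows faster, recovers most of the gap (0.18 pp
   behind *Vanilla* on ResNet and 0.24 pp on VGG).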

3. **Results do not fully align with the "slow growth overfits less"
   narrative.** On ImageNet, *Fast growth* beats *Slow growth* for
   both ResNet (24.29 % vs. 24.73 %) and VGG (24.44 % vs. 25.70 %).
   On CIFAR the conclusions are more mixed: on ResNet/CIFAR-100
   *Slow growth* is the best schedule (28.91 %), but on VGG/CIFAR-100
   *Fast growth* (26.86 %) and *Periodic* (26.73 %) both outperform
   *Slow growth* (27.34 %), which is the opposite of what the
   regularisation narrative would predict.
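
The ImageNet gaps to *Vanilla* can be recomputed from the table. The
snippet below copies the relevant test errors (FRAGrow at
:math:`\alpha = 4`) and prints each method's gap in percentage points;
the dictionary layout and the helper name ``gaps_to_vanilla`` are only
an illustration.

```python
# Test errors (%) on ImageNet, copied from the table above.
imagenet_error = {
    "ResNet": {"Vanilla": 24.14, "Convergent": 25.30, "Periodic": 24.79,
               "Lipgrow": 25.10, "FRAGrow": 24.32},
    "VGG": {"Vanilla": 24.15, "Convergent": 26.42, "Periodic": 25.70,
            "Lipgrow": 27.03, "FRAGrow": 24.39},
}


def gaps_to_vanilla(errors: dict[str, float]) -> dict[str, float]:
    """Extra test error (pp) of each method relative to Vanilla."""
    vanilla = errors["Vanilla"]
    return {method: round(err - vanilla, 2)
            for method, err in errors.items() if method != "Vanilla"}


for model, errors in imagenet_error.items():
    print(model, gaps_to_vanilla(errors))
```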

Effect of the growth-phase learning rate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A constant large learning rate during the growth phase outperforms
cosine annealing with restart, mirroring the
[[AutoGrow|autogrow]] finding.

.. table:: Constant vs. cosine-annealing learning rate during the growth phase, test error (%) on CIFAR-100 (Table 3 of :cite:p:`wu_when_2024`).
    :align: center

    +-----------+----------------------+-------------+
    | Model     | Learning rate        | CIFAR-100   |
    +===========+======================+=============+
    | ResNet    | Constant             | **28.91**   |
    +-----------+----------------------+-------------+
    | ResNet    | Cosine annealing     | 29.32       |
    +-----------+----------------------+-------------+
    | VGG       | Constant             | **27.34**   |
    +-----------+----------------------+-------------+
    | VGG       | Cosine annealing     | 28.06       |
    +-----------+----------------------+-------------+

Robustness ablations
^^^^^^^^^^^^^^^^^^^^

The remaining ablations are reported only qualitatively here, since
they do not change the picture:

- **Where to grow.** Replacing the sequential front-to-back order
  with the [[AutoGrow|autogrow]] round-robin order leaves the
  ordering between methods intact: FRAGrow remains as good as or
  better than *Periodic* and *Convergent* on CIFAR-100 and
  ImageNet, with a notable :math:`\sim 2` pp accuracy improvement
  on VGG/ImageNet (Table 9 of :cite:p:`wu_when_2024`).
- **Initialisation.** Replacing the *duplicate the preceding block*
  initialiser with *moment growth* :cite:p:`li_autoprog_2022`
  (copy the historical exponential-moving-average of the preceding
  block's weights) does not change the ranking: FRAGrow matches the
  baselines on overfitting CIFAR-100 and gains :math:`\sim 1` pp on
  the underfitting ImageNet (Table 10 of :cite:p:`wu_when_2024`).
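
The *moment growth* initialiser can be pictured as follows: an
exponential moving average (EMA) of the preceding block's weights is
maintained during training, and the new block starts from a copy of it.
This is a minimal sketch on flat weight lists; the decay value ``0.99``
and the function names are assumptions, not values taken from either
paper.

```python
def ema_update(ema: list[float], weights: list[float],
               decay: float = 0.99) -> list[float]:
    """One EMA step: keep `decay` of the average, blend in the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def init_from_ema(prev_block_ema: list[float]) -> list[float]:
    """The new block starts from a copy of the preceding block's EMA."""
    return list(prev_block_ema)


# Toy trajectory of the preceding block's weights over three updates.
trajectory = [[1.0, -1.0], [0.8, -0.6], [0.9, -0.7]]
ema = list(trajectory[0])
for weights in trajectory[1:]:
    ema = ema_update(ema, weights, decay=0.9)
new_block = init_from_ema(ema)
print(new_block)
```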


Remarks
-------

- FRAGrow does not change *where* or *how* a block is added; it
  only changes *when*. The improvements on ImageNet therefore come
  from breaking the (over-)regularising effect of slow growth, not
  from a better initialiser or better target depth (the target
  depth is given by the user).
- The *Convergent vs. Periodic* tension with
  [[AutoGrow|autogrow]] is interesting but the two papers do not
  use the same setup: FRAGrow uses sequential growth with
  *duplicate-preceding-block* initialisation and
  :math:`I_{\max} = (E_T - E_F^{\min})/n`, whereas
  [[AutoGrow|autogrow]] uses circulation growth with random
  (``GauInit``) initialisation and :math:`K = 3`. The conclusions
  of either paper cannot be transferred to the other without care.


Open questions
--------------

- The *slow growth* and *fast growth* baselines used to illustrate
  the regularisation effect (Table 1) are described only in
  relative terms; the absolute growth intervals are not given. It
  is therefore impossible to cleanly position them on the
  Periodic-:math:`I_{\max}` axis or to reproduce them exactly.