FRAGrow

TLDR: Grow fast enough to avoid underfitting, and slowly enough to retain the regularising benefit of growth on overfitting models. FRAGrow adapts the growth interval at run time based on the train/validation accuracy gap.

FRAGrow [WWM+24] revisits the “when to grow” question studied by AutoGrow. The authors argue that non-function-preserving neural growth has an inherent regularisation effect: because blocks added later in training are updated for fewer epochs than the first block, their final weights stay closer to their initialisation, which limits what the network can memorise. The strength of this regularisation is controlled by the growth schedule (faster growth → more training epochs per block on average → weaker regularisation). FRAGrow turns this observation into an adaptive “when to grow” policy that automatically grows faster on models that underfit (e.g. ResNet/VGG on ImageNet) and slower on models that overfit (e.g. ResNet/VGG on CIFAR-10/100).

Method

Vocabulary

The notation ResNet-2-2-2-2 denotes a ResNet with two residual blocks in each of four stages, VGG-1-1-1-1-1 a VGG with one block in each of five stages, and similarly for MobileNetV2. A stage groups blocks with the same output spatial size. The seed networks (ResNet-2-2-2-2, VGG-1-1-1-1-2, MobileNetV2-1-1-1-1-1) are grown into the targets (ResNet-8-8-8-8, VGG-2-2-4-4-4, MobileNetV2-2-3-4-3-3).

How

A new block is initialised by duplicating the weights of the preceding block in the same stage [DLLS20], except when the preceding block is a downsampling block — in that case the new block is randomly initialised (Kaiming initialisation). During the entire growth phase the learning rate is held at a large constant value (as recommended by AutoGrow); cosine decay with no restart is then used for the final fine-tuning.
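The learning-rate policy described above (constant during growth, then a single cosine decay during fine-tuning) can be sketched as a function of the epoch. This is only an illustration under assumptions that the text does not state: the decay is the standard half-cosine, it runs per epoch, and it ends at zero. The names `learning_rate` and `growth_end_epoch` are hypothetical.

```python
import math


def learning_rate(epoch: int, growth_end_epoch: int, total_epochs: int, base_lr: float) -> float:
    """Constant learning rate while growing, cosine decay (no restart) afterwards.

    Assumption: half-cosine decay from base_lr down to 0 over the fine-tuning epochs.
    """
    if total_epochs <= growth_end_epoch:
        raise ValueError("fine-tuning needs at least one epoch after the growth phase")
    if epoch < growth_end_epoch:
        return base_lr  # growth phase: large constant value
    progress = (epoch - growth_end_epoch) / (total_epochs - growth_end_epoch)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Example with 180 total epochs, growth finished at epoch 150, base rate 0.5.
for epoch in (0, 149, 150, 165, 180):
    print(epoch, round(learning_rate(epoch, 150, 180, 0.5), 4))
```

Because the decay starts at the constant value, there is no jump in the learning rate when fine-tuning begins.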

Where

Growth is sequential, front-to-back: new blocks are appended to the current stage until that stage reaches its target depth, after which the algorithm moves on to the next stage. (A round-robin variant — the AutoGrow circulation order — is benchmarked in the ablation and gives comparable results.) The target architecture is fixed by the user, so the total number of blocks to add is known up front.
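To make the two orders concrete, the sketch below lists the stage index that receives each new block. It is an illustration only: the round-robin function is a plain reading of "one block per stage in turn" and is not claimed to match the AutoGrow circulation rule in every detail, and the three-stage depths in the example are hypothetical.

```python
from typing import List, Sequence


def _check(seed: Sequence[int], target: Sequence[int]) -> None:
    """Reject depth lists that cannot be reached by adding blocks."""
    if len(seed) != len(target) or any(t < s for s, t in zip(seed, target)):
        raise ValueError("target must have the same stages as seed and be at least as deep")


def sequential_order(seed: Sequence[int], target: Sequence[int]) -> List[int]:
    """Front-to-back: fill stage 0 to its target depth, then stage 1, and so on."""
    _check(seed, target)
    order: List[int] = []
    for stage, (s, t) in enumerate(zip(seed, target)):
        order.extend([stage] * (t - s))
    return order


def round_robin_order(seed: Sequence[int], target: Sequence[int]) -> List[int]:
    """Cycle over the stages, adding one block to each stage that is still too shallow."""
    _check(seed, target)
    depths = list(seed)
    order: List[int] = []
    while depths != list(target):
        for stage in range(len(depths)):
            if depths[stage] < target[stage]:
                depths[stage] += 1
                order.append(stage)
    return order


# Toy example: three stages grown from depths (1, 1, 1) to (2, 3, 2).
print(sequential_order((1, 1, 1), (2, 3, 2)))   # [0, 1, 1, 2]
print(round_robin_order((1, 1, 1), (2, 3, 2)))  # [0, 1, 2, 1]
```

Both orders add the same number of blocks per stage; only the timing of each addition differs.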

When

This is the central contribution. Let \(E_T\) be the total number of training epochs, \(E_F^{\min}\) the minimum number of fine-tuning epochs reserved at the end, and \(n\) the number of blocks to add. The maximum allowed growth interval is

\[I_{\max} = \frac{E_T - E_F^{\min}}{n}.\]
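As a worked example, the sketch below evaluates \(I_{\max}\) for the seed/target pairs listed in the Vocabulary paragraph and the epoch budgets from the Setup paragraph. It assumes that \(n\) is the sum of the per-stage depth differences between target and seed; the resulting values are derived here and are not quoted from the paper. The function names are hypothetical.

```python
from typing import Sequence


def blocks_to_add(seed: Sequence[int], target: Sequence[int]) -> int:
    """n: total number of blocks separating the seed from the target (assumed definition)."""
    if len(seed) != len(target):
        raise ValueError("seed and target must have the same number of stages")
    return sum(t - s for s, t in zip(seed, target))


def max_growth_interval(total_epochs: int, min_finetune_epochs: int, n_blocks: int) -> float:
    """I_max = (E_T - E_F_min) / n."""
    return (total_epochs - min_finetune_epochs) / n_blocks


# Seed and target stage depths as written in the Vocabulary paragraph.
models = {
    "ResNet": ((2, 2, 2, 2), (8, 8, 8, 8)),
    "VGG": ((1, 1, 1, 1, 2), (2, 2, 4, 4, 4)),
    "MobileNetV2": ((1, 1, 1, 1, 1), (2, 3, 4, 3, 3)),
}

for name, (seed, target) in models.items():
    n = blocks_to_add(seed, target)
    cifar = max_growth_interval(180, 30, n)     # E_T = 180, E_F_min = 30
    imagenet = max_growth_interval(120, 30, n)  # E_T = 120, E_F_min = 30
    print(f"{name}: n={n}, I_max CIFAR={cifar}, I_max ImageNet={imagenet}")
# ResNet: n=24, I_max CIFAR=6.25, I_max ImageNet=3.75
# VGG: n=10, I_max CIFAR=15.0, I_max ImageNet=9.0
# MobileNetV2: n=10, I_max CIFAR=15.0, I_max ImageNet=9.0
```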

At every growth check, FRAGrow computes the overfitting risk level as the train/validation accuracy gap

\[\mathrm{ORL} = \text{train accuracy} - \text{validation accuracy},\]

then picks the next growth interval as

\[I = \frac{I_{\max}}{1 + e^{\alpha - \mathrm{ORL}}},\]

with \(\alpha\) the only hyper-parameter (default \(\alpha = 4\)). When the model underfits (small ORL), the exponential drives \(I\) toward 0 and growth accelerates, which weakens the regularising effect of growth. When the model overfits (large ORL), \(I\) saturates at \(I_{\max}\) and growth slows down, which strengthens the regularising effect. The validation set used for ORL is \(1\%\) of the training data, so the overhead of evaluating ORL once per epoch is negligible.
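The interval rule can be written in a few lines. The sketch below is a minimal illustration, assuming that accuracies are expressed in percentage points (so that ORL is on the same scale as \(\alpha = 4\)); the text above does not state the unit, and the function name `growth_interval` is hypothetical.

```python
import math


def growth_interval(train_acc: float, val_acc: float, i_max: float, alpha: float = 4.0) -> float:
    """Next growth interval I = I_max / (1 + exp(alpha - ORL)).

    Assumption: train_acc and val_acc are percentages in [0, 100].
    """
    orl = train_acc - val_acc  # overfitting risk level
    return i_max / (1.0 + math.exp(alpha - orl))


i_max = 6.25  # hypothetical value, e.g. (180 - 30) / 24

# No gap: the interval collapses towards 0, so growth is fast.
print(growth_interval(60.0, 60.0, i_max))
# Gap equal to alpha: exactly half of I_max.
print(growth_interval(90.0, 86.0, i_max))  # 3.125
# Large gap: the interval saturates near I_max, so growth is slow.
print(growth_interval(99.0, 79.0, i_max))
```

The rule is a logistic function of ORL centred at \(\alpha\), so \(\alpha\) sets the accuracy gap at which the interval reaches half of \(I_{\max}\).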

Experiments

Setup

Three datasets are used: CIFAR-10, CIFAR-100 and ImageNet. Three architectures are grown: ResNet, VGG and MobileNetV2. Training uses SGD with momentum \(0.9\), cosine learning rate decay during fine-tuning, He initialisation, and \(E_F^{\min}=30\) epochs of fine-tuning. Total training is \(E_T = 180\) epochs on CIFAR and \(E_T = 120\) epochs on ImageNet. The initial learning rate is \(0.5/0.1/0.1\) for ResNet/VGG/MobileNetV2 on CIFAR and \(0.1\) for all models on ImageNet. Each experiment is repeated \(3\) times.

The contenders are:

  • Periodic: grow at a fixed interval \(I_{\max}\).

  • Convergent: grow whenever the validation accuracy stagnates (the AutoGrow c-AutoGrow setting).

  • Lipgrow [DLLS20]: grow whenever the Lipschitz constant of the model exceeds a threshold, doubling the blocks at every growth.

  • FRAGrow: the adaptive interval above with \(\alpha = 4\).

  • Vanilla: train the target Large network from scratch (no growth).

  • Small: train the seed shallow network from scratch.

The paper additionally uses the terms slow growth and fast growth (Table 1 of the paper) to illustrate the regularisation effect: both are periodic schedules, slow growth having a larger growth interval and fast growth a smaller one. The exact interval values for these two schedules are not given in the paper — they are described only in relative terms — so the mapping between fast growth and the Periodic (\(I = I_{\max}\)) configuration is left unspecified.

Main “when to grow” comparison

The following table consolidates the when to grow experiments (Tables 1, 5, 6, 7 and 8 of [WWM+24]). Each cell reports test error (%) and normalised training time (%, with FRAGrow = 100%); lower is better for both values. MobileNetV2 results on ImageNet are only partially reported in the paper (entries marked † are missing).

Table 14 When-to-grow comparison across models and datasets. Test error (%) / normalised training time (%). Time is normalised so that FRAGrow = 100 within each (model, dataset) cell. Best test error per (model, dataset) in bold.

| Model | Method | CIFAR-10 | CIFAR-100 | ImageNet |
|---|---|---|---|---|
| ResNet | Vanilla (large) | 6.66 / 121.29 | 29.56 / 125.11 | **24.14** / 111.1 |
| ResNet | Small (seed) | 8.37 / 39.23 | 32.97 / 36.95 | 29.06 / 58.79 |
| ResNet | Convergent | 6.35 / 104.39 | 29.25 / 107.42 | 25.30 / 86.59 |
| ResNet | Slow growth | – / – | **28.91** / – | 24.73 / – |
| ResNet | Fast growth | – / – | 29.33 / – | 24.29 / – |
| ResNet | Periodic | 6.58 / 102.33 | 29.29 / 98.62 | 24.79 / 93.39 |
| ResNet | Lipgrow | 7.18 / 67.22 | 29.23 / 96.09 | 25.10 / 88.86 |
| ResNet | FRAGrow, \(\alpha=2\) | – / – | 29.11 / 99.04 | 24.86 / 98.23 |
| ResNet | FRAGrow, \(\alpha=4\) | **6.32** / 100 | 29.14 / 100 | 24.32 / 100 |
| ResNet | FRAGrow, \(\alpha=6\) | – / – | 29.20 / 97.70 | 24.27 / 100.9 |
| VGG | Vanilla (large) | 6.22 / 122.48 | 26.96 / 119.05 | **24.15** / 112.7 |
| VGG | Small (seed) | 8.27 / 48.21 | 31.12 / 51.71 | 31.01 / 54.35 |
| VGG | Convergent | 6.33 / 99.72 | 26.83 / 92.65 | 26.42 / 77.61 |
| VGG | Slow growth | – / – | 27.34 / – | 25.70 / – |
| VGG | Fast growth | – / – | 26.86 / – | 24.44 / – |
| VGG | Periodic | 6.40 / 92.23 | 26.73 / 93.02 | 25.70 / 92.40 |
| VGG | Lipgrow | 7.05 / 75.92 | 29.82 / 83.26 | 27.03 / 103.91 |
| VGG | FRAGrow, \(\alpha=2\) | – / – | 26.75 / 95.11 | 24.22 / 101.20 |
| VGG | FRAGrow, \(\alpha=4\) | **6.20** / 100 | **26.57** / 100 | 24.39 / 100 |
| VGG | FRAGrow, \(\alpha=6\) | – / – | 26.89 / 110.58 | 24.32 / 99.17 |
| MobileNetV2 | Vanilla (large) | **5.22** / 132.92 | 23.95 / 123.83 | **29.71** / –† |
| MobileNetV2 | Small (seed) | 7.32 / 38.12 | 27.49 / 40.07 | 36.97 / –† |
| MobileNetV2 | Periodic | 5.66 / 95.11 | 24.35 / 99.67 | – / –† |
| MobileNetV2 | Convergent | 5.50 / 105.86 | 24.11 / 106.87 | – / –† |
| MobileNetV2 | Lipgrow | 5.64 / 117.30 | 24.32 / 116.65 | – / –† |
| MobileNetV2 | FRAGrow, \(\alpha=4\) | 5.60 / 100 | **23.94** / 100 | 30.25 / 100 |

† Not reported in the paper. For MobileNetV2 on ImageNet, no numbers are listed for the Periodic, Convergent and Lipgrow baselines, and no training time is listed for Small and Vanilla; FRAGrow is the only method with both a test error and a normalised time (Table 5). The corresponding entries are left blank.

Three findings stand out:

  1. Convergent growth is not clearly inferior to periodic growth. The two policies trade places several times across (model, dataset) pairs — e.g. Convergent beats Periodic on ResNet/CIFAR-10 (6.35 % vs. 6.58 %) and MobileNetV2/CIFAR-100, but loses on ResNet/ImageNet and VGG/ImageNet. This contradicts the clean ordering “Periodic » Convergent” claimed by AutoGrow, although the setup here differs (different \(I_{\max}\), different initialiser, different target architectures, and different \(K\)).

  2. The right schedule depends on the fitting regime. On the overfitting CIFAR datasets most growth schedules (including the slowest ones) land within about 0.5 pp of the Vanilla baseline or slightly beat it, indicating that the regularising effect of growth helps; Lipgrow is the main exception, trailing Vanilla by up to 2.86 pp on VGG/CIFAR-100. On the underfitting ImageNet dataset, by contrast, Periodic, Convergent and Lipgrow all lose accuracy to Vanilla (test error up to 2.88 pp higher on VGG), while FRAGrow, which detects the underfitting via the ORL and grows faster, recovers most of the gap.

  3. The results do not fully align with the “slow growth overfits less” narrative. On ImageNet the fastest growth schedules beat the slower ones, but on CIFAR the picture is more mixed. For example, on VGG/CIFAR-100, Fast growth (26.86 %) and Periodic (26.73 %) both beat Slow growth (27.34 %), and Periodic also edges out Convergent (26.83 %), which is the opposite of what the regularisation narrative would predict.
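The ImageNet gaps discussed in the second finding can be recomputed from the Table 14 entries. The short sketch below does only that arithmetic; the dictionary simply copies the ImageNet test errors for ResNet and VGG from the table, and the helper name `gap_to_vanilla` is hypothetical.

```python
from typing import Dict

# ImageNet test error (%) copied from Table 14.
imagenet_error: Dict[str, Dict[str, float]] = {
    "ResNet": {"Vanilla": 24.14, "Periodic": 24.79, "Convergent": 25.30,
               "Lipgrow": 25.10, "FRAGrow": 24.32},
    "VGG": {"Vanilla": 24.15, "Periodic": 25.70, "Convergent": 26.42,
            "Lipgrow": 27.03, "FRAGrow": 24.39},
}


def gap_to_vanilla(errors: Dict[str, float]) -> Dict[str, float]:
    """Test-error increase over Vanilla, in percentage points (positive = worse)."""
    base = errors["Vanilla"]
    return {method: round(err - base, 2)
            for method, err in errors.items() if method != "Vanilla"}


for model, errors in imagenet_error.items():
    print(model, gap_to_vanilla(errors))
# ResNet {'Periodic': 0.65, 'Convergent': 1.16, 'Lipgrow': 0.96, 'FRAGrow': 0.18}
# VGG {'Periodic': 1.55, 'Convergent': 2.27, 'Lipgrow': 2.88, 'FRAGrow': 0.24}
```

On both architectures FRAGrow has the smallest gap to Vanilla among the growth policies listed here.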

Effect of the growth-phase learning rate

A constant large learning rate during the growth phase outperforms cosine annealing with restart, mirroring the AutoGrow finding.

Table 15 Constant vs. cosine-annealing learning rate during the growth phase, test error (%) on CIFAR-100 (Table 3 of [WWM+24]).

| Model | Learning rate | CIFAR-100 test error (%) |
|---|---|---|
| ResNet | Constant | 28.91 |
| ResNet | Cosine annealing | 29.32 |
| VGG | Constant | 27.34 |
| VGG | Cosine annealing | 28.06 |

Robustness ablations

The remaining ablations are reported only qualitatively here, since they do not change the picture:

  • Where to grow. Replacing the sequential front-to-back order with the AutoGrow round-robin order leaves the ordering between methods intact: FRAGrow remains as good as or better than Periodic and Convergent on CIFAR-100 and ImageNet, with a notable \(\sim 2\) pp accuracy improvement on VGG/ImageNet (Table 9 of [WWM+24]).

  • Initialisation. Replacing the “duplicate the preceding block” initialiser with moment growth [LZW+22] (copy the historical exponential moving average of the preceding block’s weights) does not change the ranking: FRAGrow matches the baselines on overfitting CIFAR-100 and gains \(\sim 1\) pp on the underfitting ImageNet (Table 10 of [WWM+24]).

Remarks

  • FRAGrow does not change where or how a block is added; it only changes when. The improvements on ImageNet therefore come from breaking the (over-)regularising effect of slow growth, not from a better initialiser or better target depth (the target depth is given by the user).

  • The Convergent vs. Periodic tension with AutoGrow is interesting but the two papers do not use the same setup: FRAGrow uses sequential growth with duplicate-preceding-block initialisation and \(I_{\max} \approx (E_T - E_F^{\min})/n\), whereas AutoGrow uses circulation growth with random (GauInit) initialisation and \(K = 3\). The conclusions of either paper cannot be transferred to the other without care.

Open questions

  • The slow growth and fast growth baselines used to illustrate the regularisation effect (Table 1) are described only in relative terms; the absolute growth intervals are not given. It is therefore impossible to cleanly position them on the Periodic-\(I_{\max}\) axis or to reproduce them exactly.

References

[DLLS20]

Chengyu Dong, Liyuan Liu, Zichao Li, and Jingbo Shang. Towards Adaptive Residual Network Training: A Neural-ODE Perspective. In ICML. 2020.

[LZW+22]

Changlin Li, Bohan Zhuang, Guangrun Wang, Xiaodan Liang, Xiaojun Chang, and Yi Yang. Automated Progressive Learning for Efficient Training of Vision Transformers. In CVPR. 2022. arXiv:2203.14509. URL: http://arxiv.org/abs/2203.14509, doi:10.48550/arXiv.2203.14509.

[WWM+24]

Haihang Wu, Wei Wang, Tamasha Malepathirana, Damith Senanayake, Denny Oetomo, and Saman Halgamuge. When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks. In AAAI, volume 38, number 6, 5994–6002. 2024. URL: https://ojs.aaai.org/index.php/AAAI/article/view/28414, doi:10.1609/aaai.v38i6.28414.