Growing Transformers
Transformers provide the largest-scale empirical validation of growing methods. While the goal differs from classical neuron addition—accelerating training toward a predefined architecture rather than architecture discovery—these experiments test whether the theoretical framework of Sec. 2 holds at the billion-parameter scale. The results are striking: growing methods achieve substantial speedups, yet the evidence reveals significant gaps in our understanding of what makes growth effective.
What works. The largest-scale study [DLQ+24] demonstrates that depthwise stacking achieves 54.6% speedup when training a 7B parameter model on 750B tokens. Compound growth [GYL+21], expanding depth, width, and sequence length simultaneously, achieves 73.6-82.2% speedup on BERT. At smaller scales, MSG [YZLW24] achieves 2.2\(\times\) speedup on BERT-Large, and AutoProg [LZW+22] reaches 85.1% speedup on Vision Transformers. These results establish growing methods as a practical, frugal alternative to training from scratch at scale.
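Depthwise stacking is conceptually simple: the trained layer stack of a shallow model is copied to initialize a deeper one, which is then trained further. The following is a minimal numpy sketch of that idea only; the function name `stack_depthwise` and the toy per-layer weight matrices are our illustrative assumptions, not the implementation of [DLQ+24], which operates on full transformer blocks.

```python
import numpy as np

def stack_depthwise(layers, factor=2):
    """Grow a model depthwise by repeating its trained layer stack.

    `layers` is a list of per-layer parameter arrays; stacking copies
    the whole trained block `factor` times to initialize the deeper
    model, which is then trained further (not shown here).
    """
    return [np.copy(w) for _ in range(factor) for w in layers]

# A toy 4-layer "model": one weight matrix per layer.
rng = np.random.default_rng(0)
small = [rng.standard_normal((8, 8)) for _ in range(4)]

# Initialize an 8-layer model from the trained 4-layer one.
large = stack_depthwise(small, factor=2)
```

Note that the stacked model does not compute the same function as the small one; the copied layers merely provide a warm start, which is exactly why stacking conflicts with the function-preservation principle discussed below.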
What fails. Notably, widthwise growth offers no advantage at scale [DLQ+24]—a finding that challenges the centrality of the neuron addition problem for modern architectures. This asymmetry between depth and width growth is not predicted by the theoretical framework of Sec. 2 and remains unexplained.
The function preservation contradiction. The theoretical framework of Sec. 2 emphasizes function-preserving initialization, yet the empirical evidence is contradictory. MSG [YZLW24] achieves strong results using strict function preservation, while AutoProg [LZW+22] finds that function preservation harms Vision Transformer performance (\(-3.21\%\)). Most strikingly, simple stacking [DLQ+24]—which violates function preservation entirely—achieves the best results at the largest scale. This suggests that the value of function preservation depends on architecture and scale in ways the current theory does not capture. Beyond efficiency, stacking provides an unexpected inductive bias toward reasoning [SKK+24]: models show improved performance on reading comprehension and mathematical reasoning despite similar perplexity, an emergent property not predicted by existing theory.
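To make the contested principle concrete: function-preserving width growth in the Net2Net tradition duplicates hidden units and splits their outgoing weights so that the widened network computes exactly the same function as the original. Below is a toy numpy sketch for a two-layer linear map; the helper name `widen_preserving` and the random choice of which units to duplicate are illustrative assumptions, not the MSG algorithm of [YZLW24], which instead masks new parameters to zero and anneals the mask.

```python
import numpy as np

def widen_preserving(W1, W2, new_width, seed=1):
    """Function-preserving widening of a two-layer map y = W2 @ (W1 @ x).

    Each new hidden unit copies an existing unit's incoming weights;
    the copied unit's outgoing weights are split evenly among its
    duplicates, so the widened network computes the same function.
    """
    old_width = W1.shape[0]
    rng = np.random.default_rng(seed)
    # Keep every original unit, then pick (randomly) which ones to duplicate.
    idx = np.concatenate([np.arange(old_width),
                          rng.integers(0, old_width, new_width - old_width)])
    W1_new = W1[idx]                           # duplicate incoming weights
    counts = np.bincount(idx, minlength=old_width)
    W2_new = W2[:, idx] / counts[idx]          # split outgoing weights
    return W1_new, W2_new

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W1 = rng.standard_normal((8, 16))   # 8 hidden units
W2 = rng.standard_normal((4, 8))

W1_big, W2_big = widen_preserving(W1, W2, new_width=12)
# The widened network's output matches the original exactly.
assert np.allclose(W2_big @ (W1_big @ x), W2 @ (W1 @ x))
```

The same construction preserves the function through an elementwise nonlinearity such as ReLU, since duplicated units produce identical activations. The empirical puzzle above is that this guarantee, however elegant, does not reliably translate into better final models.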
These experiments validate growing methods as a frugal training strategy at scale, but expose limits in current understanding. A significant scale gap remains: the largest experiments reach 7B parameters, while production language models exceed 70B. The contradictory evidence on function preservation, the asymmetry between depth and width growth, and the unexplained reasoning bias all point to dynamics that the framework of Sec. 2 does not yet capture, motivating the open questions of Sec. 8.
References
Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Wang, and Jing Shang. Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training. In NeurIPS. 2024. arXiv:2405.15319. URL: http://arxiv.org/abs/2405.15319, doi:10.48550/arXiv.2405.15319.
Xiaotao Gu, Liyuan Yao, Hainan Liu, Jiafeng Liu, Hongyin Xu, Wei Song, Wenjing Han, Zhen Xu, and Weiwei Chen. On the Transformer Growth for Progressive BERT Training. In Conference of the North American Chapter of the Association for Computational Linguistics. 2021. arXiv:2010.12562. URL: http://arxiv.org/abs/2010.12562, doi:10.48550/arXiv.2010.12562.
Changlin Li, Bohan Zhuang, Guangrun Wang, Xiaodan Liang, Xiaojun Chang, and Yi Yang. Automated Progressive Learning for Efficient Training of Vision Transformers. In CVPR. 2022. arXiv:2203.14509. URL: http://arxiv.org/abs/2203.14509, doi:10.48550/arXiv.2203.14509.
Nikunj Saunshi, Stefani Karp, Shankar Krishnan, Sobhan Miryoosefi, Sashank J. Reddi, and Sanjiv Kumar. On the Inductive Bias of Stacking Towards Improving Reasoning. In Advances in Neural Information Processing Systems. 2024. arXiv:2409.19044. URL: http://arxiv.org/abs/2409.19044, doi:10.48550/arXiv.2409.19044.
Yiqun Yao, Zheng Zhang, Jing Li, and Yequan Wang. Masked Structural Growth for 2x Faster Language Model Pre-training. In ICLR. 2024. arXiv:2305.02869. URL: http://arxiv.org/abs/2305.02869, doi:10.48550/arXiv.2305.02869.