Bulletin of Chinese Academy of Sciences - Core Media of the National Science Think Tank

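LI Xingteng, FENG Feng, HUANG Liqiang. Breaking through the "data bottleneck" of AI large models: Reflections on building a national corpus operation platform [J]. Bulletin of Chinese Academy of Sciences, 2025, 40(3): 522-529.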

Breaking through the "data bottleneck" of AI large models: Reflections on building a national corpus operation platform

LI Xingteng (李兴腾)¹*
School of Public Affairs, Zhejiang University, Hangzhou 310058, China
FENG Feng (冯锋)²
School of Management, University of Science and Technology of China, Hefei 230026, China
HUANG Liqiang (黄鹂强)³
School of Management, Zhejiang University, Hangzhou 310058, China

artificial intelligence; large models; corpus; data bottleneck

Competition in the global artificial intelligence (AI) large model industry is intensifying, and corpora have become critical to improving the technical performance and application effectiveness of AI large models. However, China's corpora fall short in both quantity and quality and struggle to meet the training demands of rapidly developing AI large models. Globally, countries are accelerating corpus development, in particular the construction and application of high-quality corpora. Drawing on benchmarking against other countries and an analysis of the domestic environment, this article proposes recommendations for building a national corpus operation platform along four dimensions: platform positioning, overall architecture, operating entities, and core content.