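LI Xingteng, FENG Feng, HUANG Liqiang. Breaking through the "data bottleneck" of AI large models: Reflections on building a national corpus operation platform [J]. Bulletin of Chinese Academy of Sciences, 2025, 40(3): 522-529.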

Breaking through the "data bottleneck" of AI large models: Reflections on building a national corpus operation platform
Authors
LI Xingteng1*
School of Public Affairs, Zhejiang University, Hangzhou 310058, China
FENG Feng2
School of Management, University of Science and Technology of China, Hefei 230026, China
HUANG Liqiang3
School of Management, Zhejiang University, Hangzhou 310058, China
Keywords
        artificial intelligence; large models; corpus; data bottleneck
Abstract
        At present, competition within the global artificial intelligence (AI) large model industry is intensifying, and corpus resources are emerging as a critical determinant of the technical performance and practical efficacy of AI large models. Nevertheless, China's corpora fall short in both quantity and quality and struggle to meet the training demands of rapidly developing AI large models. Internationally, countries are accelerating corpus development, in particular the construction and application of high-quality corpora. In this context, based on a comparative analysis of international benchmarks and domestic conditions, this study proposes recommendations for building a national corpus operation platform along four dimensions: platform positioning, overall architecture, operating entities, and core content.
DOI: 10.16418/j.issn.1000-3045.20240510001