Breaking through the "Data Bottleneck" of AI Large Models: Reflections on Building a National Corpus Operation Platform
Authors
Li Xingteng1* (School of Public Affairs, Zhejiang University, Hangzhou 310058)
Feng Feng2 (School of Management, University of Science and Technology of China, Hefei 230026)
Huang Liqiang3 (School of Management, Zhejiang University, Hangzhou 310058)
Keywords
artificial intelligence; large models; corpus; data bottleneck
Abstract
At present, competition in the global artificial intelligence (AI) large model industry is intensifying, and corpus resources are emerging as a critical determinant of the technical performance and practical efficacy of AI large models. Nevertheless, China's corpora fall short in both quantity and quality, struggling to meet the escalating training demands of the rapidly evolving AI large model sector. Internationally, nations are ramping up efforts to develop their corpus infrastructures, particularly prioritizing the creation and deployment of high-quality linguistic datasets. In this context, through comparative analysis of international benchmarks and domestic conditions, this study proposes a strategic framework for establishing a national corpus operation platform, covering four pivotal dimensions: platform orientation, architectural design, governing entities, and key functional components.
DOI: 10.16418/j.issn.1000-3045.20240510001