Research

Program Language Processing Research Group

PI: Jia Li

Research focus: Program Language Processing (PLP)

Group Overview

The group focuses on Program Language Processing (PLP) and is dedicated to exploring frontier applications of artificial intelligence in the automatic understanding and generation of programs, aiming to advance intelligent software engineering and related fields.

Its specific research interests include:

1. Models and Algorithms for Program Language Processing

This direction centers on the design, training strategies, and inference mechanisms of deep learning models for program source code, improving neural networks' understanding of program structure and semantics, and strengthening the accuracy, efficiency, and security of code generation.

2. Interdisciplinary Integration of Program Language Processing

This direction explores how program language processing can empower other disciplines. In software engineering, for example, its applications span automatic code generation, test case generation, and program optimization, raising the level of intelligence in software development. The group is also interested in cross-disciplinary applications in embodied intelligence and neuroscience.

Research Achievements

The group has effectively advanced code generation techniques based on large language models. It has led or participated in training multiple code-oriented large language models that achieve internationally leading results on downstream tasks such as code generation, providing the research community with solid foundation models. It has proposed deep-reasoning-based code generation techniques that fully unleash the reasoning capability of large models and improve their ability to meet complex development requirements. It has also proposed evaluation benchmarks for code generation on real-world software projects, promoting the application of large models in real software development.

In the past five years, the group has published more than twenty papers in CCF Class A conferences and journals such as NeurIPS, ACL, ICSE, ASE, and FSE, including several oral papers. These papers have accumulated over a thousand citations from researchers at institutions including MIT, Stanford University, Nanyang Technological University, and the Chinese University of Hong Kong, and the group's work has been covered by mainstream media outlets such as China Science and Technology Network (中国科技网), China Daily (中国日报), and Synced (机器之心).

Group Culture and Mentoring Philosophy

The Tsinghua University Program Language Processing Group (THU-PLP) follows the principle of "顶天立地" (reaching the sky while standing on the ground): "standing on the ground" means studying fundamental problems of the field in depth and producing high-quality academic research, while "reaching the sky" means building effective cross-disciplinary applications that solve real-world pain points.

The group values each student's personal interests and, within its main research directions, encourages students to freely explore research problems they find valuable. It actively guides students to engage deeply with industry and to discover worthwhile research problems in real applications, with the goal of training researchers who can think independently and solve problems on their own.

The group holds regular weekly meetings and one-on-one discussions, and provides ample computing resources and professional research mentoring.

For more details, please visit the group homepage: https://lj2lijia.github.io/

Publications

Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li, Bin Gu, and Mengfei Yang. 2024. DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories. In Findings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), pages 3603–3614. Association for Computational Linguistics.

Jia Li, Ge Li, Xuanming Zhang, Yunfei Zhao, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, and Yongbin Li. 2024. EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations. In the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), pages 57619–57641.

Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin. 2024. AceCoder: An Effective Prompting Technique Specialized in Code Generation. ACM Transactions on Software Engineering and Methodology (TOSEM), Volume 33, Issue 8, Pages 1–26.

Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2025. Structured Chain-of-Thought Prompting for Code Generation. ACM Transactions on Software Engineering and Methodology (TOSEM), Volume 34, Issue 2, Pages 1–23.

Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, and Xing Hu. 2023. SkCoder: A Sketch-Based Approach for Automatic Code Generation. In the 45th International Conference on Software Engineering (ICSE 2023). IEEE Press, 2124–2135.

Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. 2023. CodeEditor: Learning to Edit Source Code with Pre-trained Models. ACM Transactions on Software Engineering and Methodology (TOSEM), Volume 32, Issue 6, Pages 1–22.

Jia Li, Yongmin Li, Ge Li, Xing Hu, Xin Xia, and Zhi Jin. 2021. EditSum: A Retrieve-and-Edit Framework for Source Code Summarization. In the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE 2021). IEEE Press, 155–166.

Jia Li, Zhuo Li, Huangzhao Zhang, Ge Li, Zhi Jin, Xing Hu, and Xin Xia. 2024. Poison Attack and Poison Detection on Deep Source Code Processing Models. ACM Transactions on Software Engineering and Methodology (TOSEM), Volume 33, Issue 3, Pages 1–31.

Siyuan Jiang*, Jia Li* (co-first authors), He Zong, Huanyu Liu, Hao Zhu, Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning, Gen Wang, Yihong Dong, Kechi Zhang, and Ge Li. 2025. aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing. In the 47th International Conference on Software Engineering (ICSE 2025). Just Accepted (December 2024).

Yuqi Zhu, Jia Li, Ge Li, Yunfei Zhao, Jia Li, Zhi Jin, and Hong Mei. 2024. Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models. In the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024), Vol. 38. AAAI Press, Article 50, 437–445.

Haojie Zhang, Ge Li, Jia Li, Zhongjin Zhang, Yuqi Zhu, and Zhi Jin. 2022. Fine-Tuning Pre-trained Language Models Effectively by Optimizing Subnetworks Adaptively. In the 36th International Conference on Neural Information Processing Systems (NeurIPS 2022). Curran Associates Inc., Red Hook, NY, USA, Article 1558, 21442–21454.

Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. 2023. Self-Edit: Fault-Aware Code Editor for Code Generation. In the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), pages 769–787. Association for Computational Linguistics.

Kechi Zhang, Ge Li, Jia Li, Yihong Dong, Jia Li, Zhi Jin. 2025. Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points. In Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Just Accepted (May 2025).

Jia Li, Fang Liu, Jia Li, Yunfei Zhao, Ge Li, and Zhi Jin. 2023. MCodeSearcher: Multi-View Contrastive Learning for Code Search. In the 14th Asia-Pacific Symposium on Internetware (Internetware 2023). Association for Computing Machinery, New York, NY, USA, 270–280.

Jia Li, Chongyang Tao, Jia Li, Ge Li, Zhi Jin, Huangzhao Zhang, Zheng Fang, and Fang Liu. 2025. Large Language Model-Aware In-Context Learning for Code Generation. ACM Transactions on Software Engineering and Methodology (TOSEM). Just Accepted (February 2025).

Zhen Yang, Fang Liu, Zhongxing Yu, Jacky Wai Keung, Jia Li, Shuo Liu, Yifan Hong, Xiaoxue Ma, Zhi Jin, and Ge Li. 2024. Exploring and Unleashing the Power of Large Language Models in Automated Code Translation. In the ACM International Conference on the Foundations of Software Engineering (FSE 2024), Volume 1, Issue FSE, Pages 1585–1608.

Jia Li, Chongyang Tao, Zhi Jin, Fang Liu, Jia Li, and Ge Li. 2023. ZC3: Zero-Shot Cross-Language Code Clone Detection. In the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023). IEEE Press, 875–887.

Huangzhao Zhang, Kechi Zhang, Zhuo Li, Jia Li, Jia Li, Yongmin Li, Yunfei Zhao, Yuqi Zhu, Fang Liu, Ge Li, and Zhi Jin. 2024. Deep Learning for Code Generation: A Survey. SCIENCE CHINA Information Sciences, Volume 67, Issue 9, Article 191101. ISSN 1674-733X.

Jia Li, Xuyuan Guo, Lei Li, Kechi Zhang, Ge Li, Jia Li, et al. 2025. LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding. In the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Just Accepted (May 2025).

Group Members

  • Yitong Zhang (张奕彤)

    22373337@buaa.edu.cn

  • Zhengxiang Cheng (程正翔)

    zhengxiangc@hust.edu.cn

  • Lekang Yang (杨乐康)

    ylk22@mails.tsinghua.edu.cn

News
