Abstract: The rapid development of artificial intelligence (AI) has brought unprecedented opportunities to chemistry education, while also raising a range of concerns. This study critically examines the capability of GPT-4 to score scientific essays in an undergraduate electrochemistry course. Using a multidimensional rubric, we compared GPT-4's evaluations with those of a human instructor across grammar, citation, logical structure, scientific accuracy, and critical thinking. The results revealed significant discrepancies: GPT-4 aligned moderately with the instructor on grammar but performed poorly on content-driven dimensions, with weak correlations both across individual dimensions and for the total score. The model also exhibited score compression and ranking misclassifications, potentially disadvantaging technically strong students. These findings underscore the risks of relying on GPT-4 as the sole grader of scientific writing. We recommend a hybrid framework that combines GPT-4's efficiency on objective aspects with human judgment in evaluating scientific reasoning and depth, ensuring fairness and academic integrity.
YANG Chun-Peng. Reliability Assessment of Large Language Models in Scoring Academic Essays: Case Study of an Undergraduate Electrochemistry Course[J]. Chinese Journal of Chemical Education, 2025, 46(24): 115-120.