Abstract: The rapid development of artificial intelligence (AI) has brought unprecedented opportunities to chemistry education, while also raising a range of concerns. This study critically examines the capability of GPT-4 to score scientific essays in an undergraduate electrochemistry course. Using a multidimensional rubric, we compared GPT-4's evaluations with those of a human instructor across grammar, citation, logical structure, scientific accuracy, and critical thinking. The results revealed significant discrepancies: GPT-4 aligned moderately with the instructor on grammar but performed poorly on content-driven dimensions, with weak correlations both across individual dimensions and for the total score. The model also exhibited score compression and ranking misclassifications, potentially disadvantaging technically strong students. These findings underscore the risks of relying on GPT-4 as the sole grader of scientific writing. We recommend a hybrid framework that combines GPT-4's efficiency on objective aspects with human judgment in evaluating scientific reasoning and depth, ensuring fairness and academic integrity.
YANG Chun-Peng. Reliability Assessment of Large Language Models in Scoring Academic Essays: Case Study of an Undergraduate Electrochemistry Course[J]. Chinese Journal of Chemical Education, 2025, 46(24): 115-120.