Assessing with AI?

Insights from our Exploration of AI-Assisted Grading

Following Pascal’s initial research, we recently experimented with AI-assisted grading, trying to assess both:

  • How (un)reliable the basic approach available to most teachers is

  • How much improvement more advanced techniques could yield

We started by exploring a relatively straightforward scenario: using AI to grade humanities assignments from 7th-grade students. Our objective was to use AI to give a grade, from 0 to 50, to three fictional students. We created a standard rubric based on one you might see in a typical school. The student assignments varied in their quality, and we fed the work to multiple AI bots with identical prompts.
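To make the setup concrete, here is a minimal sketch of that basic approach, assuming the OpenAI Python client; the function name, prompt wording, and model choice are illustrative rather than the exact setup we used.

```python
# A minimal sketch of the "vanilla" approach: one prompt containing the rubric
# and the student's work, sent to a model that is asked for a grade out of 50.
# The prompt, function name, and model are illustrative, not our exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_once(rubric: str, student_work: str, model: str = "gpt-4o-mini") -> str:
    """Ask one model, once, for a grade out of 50; return the raw reply."""
    prompt = (
        "You are grading a 7th-grade humanities assignment.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Student work:\n{student_work}\n\n"
        "Give a single grade from 0 to 50 and a one-sentence justification."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Repeating the identical call is enough to expose the variability we observed:
# for trial in range(5):
#     print(grade_once(rubric_text, essay_text))
```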

Early Results: Inconsistency and the Challenge of Automated Grading

The initial results were eye-opening. Although one student’s work was consistently graded very high across different AI bots and different trials, the other students’ work showed large variability. Some of the bots were more consistent in their assessments than others, but overall, our initial experiments showed that the results were all over the place. This led us to question the reliability of AI when used directly for assigning grades.

This initial exploration confirmed our fears: AI, used without a deep understanding of its inner workings, was unlikely to provide reliable and consistent grading. It further reinforced what many educators already know: grading is a complex human process, relying on teachers’ expertise, their understanding of students and the curriculum, and their own biases.

Introducing Detailed Markschemes

These initial results spurred us to delve deeper. We realized the importance of rubrics and their impact on grading outcomes. Thanks to our experience with the International Baccalaureate (IB) curriculum, we began to question the standard rubrics that most schools utilize, many of which are generic and not detailed enough. This led to our next step: using AI not to assign grades, but to generate a markscheme, i.e., a detailed breakdown of expectations within a rubric that includes examples of how a teacher would assess different submissions.

This process created an important shift in our exploration: we were moving from AI as a replacement for human grading (which we were opposed to) to a model where AI assists human grading. The goal was not to rely on AI to assign the grades, but to leverage AI to enhance the teacher’s ability to assign grades more reliably. As we started to discover, efficiency and effectiveness might go hand in hand here: connecting the unique wording of each essay to a generic rubric is taxing for teachers, and therefore unreliable, and it is exactly the kind of work that AI can facilitate.

We were quite amazed by the results. With the new detailed markscheme (created by a reasoning model, o1-mini, from comments generated by 4o-mini based on a rubric and a human-graded example), we saw a drastic improvement in both grading consistency and accuracy.
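For readers curious about the shape of this pipeline, here is a hedged sketch of the two-step markscheme generation, again assuming the OpenAI Python client; the prompts and helper names are illustrative, not our exact implementation.

```python
# Step 1: a smaller model drafts criterion-by-criterion comments from the rubric
# and a human-graded example. Step 2: a reasoning model consolidates those
# comments into a detailed markscheme. Prompts and names are illustrative.
from openai import OpenAI

client = OpenAI()

def draft_comments(rubric: str, graded_example: str) -> str:
    """Generate grading comments from the rubric and a human-graded example."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # low temperature for more repeatable output
        messages=[{
            "role": "user",
            "content": (
                "Using this rubric and this example graded by a teacher, write detailed "
                "comments explaining how each rubric criterion was applied.\n\n"
                f"Rubric:\n{rubric}\n\nGraded example:\n{graded_example}"
            ),
        }],
    )
    return response.choices[0].message.content

def build_markscheme(rubric: str, comments: str) -> str:
    """Have a reasoning model expand the rubric into a detailed markscheme."""
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{
            "role": "user",
            "content": (
                "Turn this rubric and these grading comments into a detailed markscheme: "
                "for each criterion, describe the expectations at each level and give "
                "short examples of submissions that would meet them.\n\n"
                f"Rubric:\n{rubric}\n\nComments:\n{comments}"
            ),
        }],
    )
    return response.choices[0].message.content

# markscheme = build_markscheme(rubric_text, draft_comments(rubric_text, example_text))
```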

This workflow not only put teachers at the center (e.g., reviewing the markscheme and determining final grades), using AI simply as a step in the grading process that helps them apply their expertise. It also avoided the negative consequences we had initially associated with AI grading, as it helped teachers understand both their rubrics and their students better.

Exploring AI through Multi-Step Processes & Multiple Models

The next step in our exploration focused on understanding how different models interpret the same assignment and the same markschemes. We started to explore multiple models, including GPT, Gemini, and DeepSeek. What we noticed is that each of these models has what we came to call a “grading personality”: GPT models tend to be generous and provide higher grades than Gemini models, which tend to be much stricter. While this low inter-rater reliability confirmed our belief that AI should not be entrusted with grading, it also added to our finding that AI can provide assistance in the process - here, by providing multiple perspectives and synthesizing them in a concise report for teachers to use - and potentially discuss with their students.
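A rough sketch of this multi-model step might look as follows; the per-model grader functions are hypothetical placeholders (each provider has its own client library), and only the synthesis step is shown in full.

```python
# The same assignment and markscheme are graded by several models, and their
# answers are synthesized into a short report for the teacher to review.
from typing import Dict
from openai import OpenAI

client = OpenAI()

def synthesize_report(grades: Dict[str, str]) -> str:
    """Combine each model's grade and reasoning into one concise report."""
    combined = "\n\n".join(f"{name}:\n{answer}" for name, answer in grades.items())
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Several AI graders assessed the same assignment with the same markscheme. "
                "Summarize where they agree, where they disagree, and what the teacher "
                "should look at before assigning the final grade.\n\n" + combined
            ),
        }],
    )
    return response.choices[0].message.content

# Hypothetical grader callables, one per provider:
# graders = {"GPT": grade_with_gpt, "Gemini": grade_with_gemini, "DeepSeek": grade_with_deepseek}
# grades = {name: grade(markscheme, essay_text) for name, grade in graders.items()}
# print(synthesize_report(grades))
```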

Key Findings and Recommendations

Through this journey, we identified key points that we believe are essential for school leaders and educators to consider when thinking about AI in student assessment.

  • Unreliable “Vanilla” AI Grading: The quality of AI-assisted grading depends on the AI literacy of the teacher using it: technical sophistication and reliability go hand in hand. Directly submitting student work for grading does not generate reliable results. A multi-step process, with a low temperature and a detailed markscheme, is required to achieve better outcomes.

  • Low Inter-Rater Reliability: Even though test-retest reliability can be high with detailed markschemes, different AI models will still give different results for the same inputs. It is important to consider these differences (a small sketch after this list shows how both kinds of variability can be quantified). If your school invests in a certain AI model, it is essential that you understand the "personality" of this model, and the potential effect it will have on outputs and outcomes.

  • (Un)Intended Consequences: Our initial opposition to AI grading stemmed not only from its unreliability, but also from the costs associated with this time-saving technology: loss of agency, skills, and relationships; and therefore poor AI role modeling. While we now believe that a best-of-both-worlds strategy is possible, it needs to be intentional, and requires some technical steps.

  • Shadow AI Grading: Our workflow is an example of what Ethan Mollick calls “co-intelligence”. Another one of his concepts that is highly relevant here is “shadow AI”, which refers to the undisclosed, and therefore unguided, use of AI - due to multiple factors, including unclear institutional stances and policies. Because AI grading is both so tempting and so unreliable, we believe it is critical for schools to be proactive, lay out how teachers should and should not use these technologies, and of course provide the necessary training and resources.

  • AI as a “Third Point”: A small study conducted by one of our teachers (Anna Drake, HS English Teacher at ISP) suggests that AI feedback is more likely to be well-received and acted upon, as it is not perceived as a personal judgment from the teacher. This can help build a more psychologically safe and constructive learning environment. AI’s objective and “impartial” perspective may also give students some new insights - although more testing is needed around equity.

  • Next Steps: Our workflow is not yet ready to be shared in an easy-to-use app. The user interface would be easy to create, but we still need proper red-teaming (e.g., to protect against prompt injection) and further testing. Others have also suggested potential improvements; Ryan Tannenbaum, for instance, drew our attention to the manipulation of logits. We are also still exploring the right balance between improved workflows (which require additional steps and more complex processing) and environmental impact (including making the latter explicit and offering ways to offset it).

  • The danger of the “True Grade”: As we try to improve AI-assisted grading, a final insight is that we might want (or need) to rethink our assumption that a piece of student work has one “true grade” encapsulating its inherent quality. Even a highly detailed instrument (grading document) might never be able to measure and assign a consistent, accurate, and informative number to a student’s demonstration of learning. AI may very well be just a symptom here, highlighting the limitations of grading itself, and therefore encouraging us to consider “un-grading” approaches, where teachers and students discuss rich, qualitative feedback.
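As referenced in the “Low Inter-Rater Reliability” point above, here is a small sketch of how the two kinds of variability could be quantified from repeated grading runs; the grades below are made up purely for illustration.

```python
# Quantifying test-retest variability (one model across identical runs) and
# inter-rater spread (different models on the same work). Numbers are invented.
from statistics import mean, stdev

# grades[model] = grades out of 50 from repeated trials on the same essay
grades = {
    "GPT": [42, 41, 43, 42],
    "Gemini": [35, 36, 34, 35],
    "DeepSeek": [39, 40, 38, 41],
}

# Test-retest: how much does one model vary across identical runs?
for model, scores in grades.items():
    print(f"{model}: mean={mean(scores):.1f}, spread across trials={stdev(scores):.1f}")

# Inter-rater: how far apart are the models' average grades for the same work?
means = [mean(scores) for scores in grades.values()]
print(f"Gap between most generous and strictest model: {max(means) - min(means):.1f} points")
```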

Moving Forward: A Call for Collaboration

An important benefit of AI-assisted grading not mentioned above is that it is an opportunity to foster collaboration - for instance, by supporting moderation during the exchange and review of markschemes. Likewise, alternative solutions exist that would require schools around the world to come together, such as the creation of datasets for custom, fine-tuned language models. If you are interested, please do leave a comment!