Assessment Credibility Plan
Preparing for What AI Has in Store for Education
There are moments in history when systems that seem immovable change almost overnight. They appear permanent right up until the moment the conditions supporting them shift, and then the transformation happens with surprising speed.
We saw this clearly during the COVID-19 pandemic, when schools across the world closed within days, universities rewrote admissions policies within weeks, and national examination systems that had operated for generations were suddenly suspended. Systems that had taken decades to construct pivoted almost instantly once the underlying assumptions no longer held.
Education assessment may now be approaching a similar threshold.
For more than a century, large-scale examinations have formed the backbone of modern educational accountability. Standardized assessments promised a clear answer to a deceptively difficult question: how do we know what students know, and how do we compare learning across large populations?
Entire infrastructures grew around that promise. Global examination systems capable of administering, scoring, and moderating assessments for millions of students were developed over generations. In the United States, examinations such as the SAT and ACT became powerful signals used by universities to guide admissions decisions.
These systems were not arbitrary. They were built on a long tradition of research in psychometrics, the scientific study of measurement in education. Statistical frameworks such as Item Response Theory allowed exam designers to calibrate question difficulty and estimate student ability with remarkable precision across very large populations.
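To make that concrete, the two-parameter logistic model at the heart of much Item Response Theory expresses the probability of a correct answer as a function of a student's ability and an item's difficulty and discrimination. The sketch below is a minimal illustration of that formula, not a description of any particular testing program's implementation, and the parameter values are invented for the example.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) IRT model: probability that a student
    with ability theta answers an item with discrimination a and
    difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Illustrative values only: a student slightly above average ability (theta = 0.5)
# facing a moderately hard, reasonably discriminating item (b = 1.0, a = 1.2).
print(f"{p_correct(theta=0.5, a=1.2, b=1.0):.2f}")  # about 0.35
```

Fitting parameters like these across millions of responses is what allows students who sat different test forms to be placed on a common scale.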
The elegance of the system rested on a simple principle: if carefully designed questions were administered under controlled conditions, the resulting scores could function as reliable signals of learning.
For a long time, that assumption held.
Yet the stability of testing systems has always depended on something rarely stated aloud: the belief that the work produced during an examination represents the independent thinking of the student. The credibility of the entire system quietly rests on that shared assumption.
Artificial intelligence is beginning to unsettle it.
Generative systems such as ChatGPT, Claude, and other large language models can now produce essays, explanations, code, and analytical responses that closely resemble human work. Researchers at organizations including Educational Testing Service have already demonstrated that AI-generated responses can perform surprisingly well on many forms of traditional academic assessment.
Attempts to detect AI-generated work have proven far less reliable than many hoped. Detection tools produce both false positives and false negatives, and as language models improve, the line between human and machine-generated writing becomes increasingly difficult to draw with confidence.
This development creates a challenge that goes well beyond questions of academic integrity, because the deeper issue is the credibility of the assessment system itself.
Tests only function when institutions trust that the results represent genuine learning. Once that trust begins to weaken, the meaning of the score begins to dissolve. Universities start questioning the reliability of exam results, employers begin doubting credentials, and students themselves may come to see the process as arbitrary.
Governments are especially sensitive to this kind of credibility problem, because assessment systems serve several crucial functions simultaneously. They help determine access to higher education, they guide resource allocation across school systems, and they provide the public with evidence that educational institutions are fulfilling their responsibilities.
For decades, high-stakes testing provided a reasonably stable answer to those demands.
The age of artificial intelligence complicates that answer in ways that are only beginning to become visible.
The scale of the shift is already measurable. A 2025 survey of 1,041 full-time undergraduate students conducted by the Higher Education Policy Institute and Kortext, authored by Josh Freeman, found that 88 percent of students now use generative AI for assessed work, up from 53 percent the year before. The share reporting no AI use for assessment dropped from 47 percent to just 12 percent. AI use has become essentially universal. This is not a discipline problem. It is a structural one.
At the same time, another pressure has been quietly building beneath the surface of the system. The capabilities that societies increasingly value in human beings are not always the capabilities that standardized tests are best equipped to measure.
Reports such as the World Economic Forum's Future of Jobs Report consistently highlight complex problem solving, creativity, collaboration, and adaptability as central competencies for modern economies. These capabilities emerge through extended inquiry, experimentation, and interaction with real problems, yet they are difficult to capture through narrow test formats administered during a single sitting.
The paradox is striking. As technology grows more powerful, the qualities that distinguish human intelligence become both more valuable and harder to measure through traditional assessment systems.
And yet the need for credible evaluation does not disappear.
Before going further, it is worth saying something personal about this landscape.
During the 1980s and 1990s I worked as an examiner and moderator for the University of London examination system and later for Edexcel, now Pearson Education. These institutions carried immense responsibility for safeguarding the credibility of national qualifications, and they approached that responsibility with seriousness and intellectual care.
During those years I witnessed one of the most significant assessment transformations of the late twentieth century, which was the transition from the rigid O-Level (Ordinary Level) examination system to the GCSE (General Certificate of Secondary Education).
The earlier structure relied almost entirely on a single examination taken at the end of a course. A student’s final grade often depended on a single moment of performance under intense pressure, and that moment carried enormous weight in shaping future opportunities.
The introduction of the GCSE changed the architecture of assessment in important ways. Coursework and portfolio components allowed students to demonstrate learning developed over time, while moderation networks ensured that evaluation remained consistent across schools and regions.
At the time many observers believed that such a change would be impossible. The system was simply too large, too established, and too deeply embedded in institutional expectations.
And yet it happened.
Looking back, the transformation appears both courageous and pragmatic. The system did not abandon standards, but it expanded the ways in which evidence of learning could be gathered and interpreted.
That historical memory feels relevant now, because the assessment challenges emerging in the age of artificial intelligence may require a rethinking of similar scale. What that rethinking looks like, and whether it is even possible within the constraints of public accountability, is genuinely uncertain.
Across education, alternative approaches to evaluation have been developing quietly for many years, and it is worth understanding what already exists before reaching for entirely new solutions. Performance-based assessment systems such as those developed by the New York Performance Standards Consortium ask students to demonstrate learning through sustained research projects, scientific investigations, and oral defenses of their work. Competency-based education models, developed through the work of the Aurora Institute (now Full Scale), frame student progression around demonstrated mastery rather than single exam events. Both approaches have shown promise, and both have encountered real difficulties around consistency, scalability, and public trust that have limited their reach.
Within the public school system itself, ongoing assessment is already embedded in ways that rarely surface in policy conversations. The Every Student Succeeds Act explicitly invited states to use portfolios, projects, and extended performance tasks as part of their assessment systems, and created an Innovative Assessment Demonstration Authority allowing states to pilot new approaches in place of traditional statewide tests. Few states have found it straightforward to use. The Multi-Tiered System of Supports, now operating in districts across nearly every state, requires schools to screen students three times a year, track individual progress continuously, and adjust instruction in response to what the data reveal. Portfolio-based assessment, developed in earnest since the 1990s, asks students to curate bodies of work over time and, in some schools, to defend that work before panels of teachers and peers. These are not fringe experiments. They are operating at scale, with varying degrees of success, inside the same system that is now struggling with the implications of AI.
In Montessori environments, ongoing observation has been the primary means of understanding a child’s development for more than a century. It is worth describing what that actually looks like, not because it resolves the larger question, but because it may be one of several things worth examining carefully.
Every day, the trained Montessori guide watches without intervening, recording what each child chooses, how long they sustain engagement, where they struggle, and where they exceed what the lesson anticipated. The child’s relationship to error, to repetition, to collaboration, and to independent problem-solving is noted, not as data points extracted under test conditions, but as evidence of development accumulating across months and years. The materials themselves are designed to reveal understanding. Public Montessori schools have been documenting how children think, not simply what they produce, for decades, supported by research-based observation frameworks that have formalized what trained practitioners have long done. There are more than 590 public Montessori schools operating across the United States.
What is not yet clear is whether approaches like this can be made reliable and comparable at the scale that public accountability requires. That is a serious and unresolved question, and anyone proposing observation-based assessment as part of a policy response needs to engage with it honestly rather than set it aside. The psychometric tradition exists for reasons. The demands of fairness across large and diverse populations are real. Any credible path forward will need to satisfy those demands, not simply gesture toward richer forms of evidence and hope that the details resolve themselves.
What does seem clear is that the field is not starting from nothing. The components of a different approach to assessment have been developing for years across multiple traditions, and they are more widespread than policy conversations tend to acknowledge. The harder question is whether those components can be assembled into something that is simultaneously more honest about what learning is, and rigorous enough to earn the trust that assessment systems must carry.
That question does not belong to any single group. It belongs to researchers who understand measurement, to teachers who understand children, to policymakers who understand what accountability systems are actually asked to do, and to parents and students who live inside those systems every day. It is also a question that probably cannot wait very long for an answer.
For such systems to gain policy acceptance, they must still satisfy the demands that standardized testing once fulfilled. Reliability, comparability, and public trust remain essential components of any credible assessment system.
This is why preparation matters.
Institutional change in education often follows a pattern that scholars describe as punctuated equilibrium, in which long periods of stability are interrupted by sudden transformation once underlying assumptions shift.
If universities begin questioning the meaning of traditional test scores, if legal challenges arise around AI detection technologies, or if public trust erodes because assessments no longer appear to measure genuine learning, governments may find themselves needing credible alternatives very quickly.
The difference between chaos and thoughtful transformation will depend on whether those alternatives have already been developed.
This is where the idea of an Assessment Credibility Plan becomes worth considering.
Such a plan would not dismantle existing testing systems overnight. Instead it would prepare the infrastructure required if a rapid transition becomes necessary, ensuring that credible forms of evaluation are ready before the system reaches a breaking point.
At the heart of such a plan might be what could be described as an Evidence Stack, a layered approach to assessment in which learning is documented through multiple forms of evidence rather than a single high-stakes score. Portfolios of student work, teacher observation records, moderated performance tasks, and low-stakes retrieval assessments might collectively form a more complete picture of learning. Whether they can do so with sufficient consistency and fairness is exactly the kind of question that needs to be worked through carefully, with people who understand both the promise and the difficulty.
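As a purely illustrative sketch, and assuming nothing about how any real system would name or store such records, the layers of an Evidence Stack for a single student might be grouped together something like this:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceStack:
    """Hypothetical layered record of one student's learning evidence.
    Field names and types are illustrative, not a proposed standard."""
    student_id: str
    portfolio_items: list[str] = field(default_factory=list)        # curated work gathered over time
    observation_notes: list[str] = field(default_factory=list)      # teacher observation records
    performance_task_scores: dict[str, float] = field(default_factory=dict)  # moderated task results
    retrieval_checks: list[float] = field(default_factory=list)     # low-stakes retrieval assessment scores

record = EvidenceStack(
    student_id="anon-0042",
    portfolio_items=["research essay, draft 3", "design project write-up"],
    observation_notes=["sustained forty minutes of independent inquiry"],
    performance_task_scores={"oral defence": 4.0},
    retrieval_checks=[0.70, 0.80, 0.85],
)
```

The point of the sketch is only that no single field dominates; any judgment about learning would draw on the stack as a whole.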
Moderation networks could play a central role in maintaining reliability, allowing educators across institutions to calibrate their judgments using shared exemplars and statistical checks. Sampling-based accountability could also carry more weight, enabling governments to monitor the health of education systems without subjecting every child to constant high-stakes testing. These are not new ideas. The question is whether the conditions now exist to develop them seriously.
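One simple example of the kind of statistical check a moderation network can use is a chance-corrected agreement measure such as Cohen's kappa, applied to two moderators scoring the same exemplar pieces of work. The sketch below assumes an invented set of scores on a 1-to-5 scale and is meant only to show the shape of the calculation.

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Chance-corrected agreement between two raters scoring the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected)

# Invented scores: two moderators rating the same ten exemplar portfolios on a 1-5 scale.
moderator_1 = [4, 3, 5, 2, 4, 3, 3, 5, 2, 4]
moderator_2 = [4, 3, 4, 2, 4, 3, 2, 5, 2, 4]
print(f"Cohen's kappa: {cohens_kappa(moderator_1, moderator_2):.2f}")  # about 0.73
```

In practice, checks like this sit alongside discussion of the exemplars themselves, so that disagreement becomes a prompt for recalibration rather than a verdict.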
Artificial intelligence itself may ultimately support this shift rather than only undermine it, particularly when assessment focuses on processes that are difficult for machines to fabricate, such as iterative design, collaborative inquiry, oral explanation, and reflective documentation of learning decisions. That possibility deserves exploration rather than assumption.
Universities and employers would also need to be part of the conversation, since any shift in how learning is assessed depends partly on whether new forms of credentialing are recognized and trusted by the institutions that use them.
The shift would require sustained experimentation, honest evaluation of what works and what does not, and the kind of careful policy design that tends to happen slowly and then, when conditions change, very fast.
I have seen a transformation of this kind before.
When the GCSE replaced the O-Level, it required courage from institutions and trust in new forms of evidence. It was not the end of assessment.
It was the evolution of it.
Before drawing any conclusions about what the next evolution might look like, it is worth spending a few minutes in a public Montessori classroom. Observe the children. Talk to their parents. Talk to the educators. A pattern tends to emerge, not just in what children know, but in how they think and how they feel about thinking. It is quiet and it is consistent and it is not accidental.
Whether what happens in those classrooms holds answers relevant to the larger challenge of assessment in the age of AI is a question worth asking seriously, without either dismissing it or overstating what it can offer. It is one of several places where people have been quietly working on the problem of how to know what a child actually understands. There are others. The task ahead may be to bring those people and those traditions into genuine conversation with each other, and with the researchers and policymakers who will ultimately need to be persuaded.
It is in those classrooms, which function at heart as research environments, that AI can be studied safely as yet another tool of human creation. Not as something that replaces thinking. Not as a shortcut past the hard work of understanding. But as a subject of serious inquiry, examined by children who have been taught to think independently, to question, to create, and to take their time with difficult things.
Not users of AI. Inventors. Thinkers. The next generation of people who will decide what this technology is actually for.
This discussion is not a criticism of educators or students in the mainstream classroom. The vast majority of teachers are working with extraordinary dedication inside systems that were not designed for the world we are now living in. The question here is specific and it is structural. It is about assessment. It is about what we measure, how we measure it, and whether the signals we rely on still mean what we need them to mean.
That question belongs to all of us.
If you wish to follow the research and thinking that inform this work, the books Mapping Montessori Materials for AI Competency Development and Montessori & AI - Volume I are available through my website, katebroughton.com.


The point about trust is key. If we can no longer be confident that assessment reflects independent thinking, then the score loses meaning—and everything built on it starts to wobble. In schools, we’re already feeling that tension. The question isn’t whether students are using AI, but what our assessments are actually asking them to do. From my experience in international settings, there’s already more space to explore these models—portfolios, coursework, ongoing teacher judgement—but they bring their own challenges around consistency and trust. That balance between richness and reliability is the hard part. It does feel like we’re approaching a similar moment to past reforms: not the end of assessment, but a shift in what counts as evidence. The question is whether we prepare for that shift deliberately—or wait until the system is forced to change.