How to evaluate AI-generated teaching resources before using them
Most teachers I speak to have generated at least one resource with an AI tool by now. The first one usually feels miraculous. The fifth one is when you start noticing the cracks: a definition that is almost right but not quite, a question that does not match the specification, an example that subtly mischaracterises a group of people.
AI-generated resources are not bad. They are also not finished. The teacher's job has shifted, in this corner of the work, from drafting to editing. And editing well requires a clear set of criteria you can apply quickly, ideally in less time than it would have taken to write the resource from scratch.
This guide is a practical quality checklist for evaluating AI-generated teaching resources before they hit the classroom. It focuses on the kinds of resources teachers actually generate day to day: lesson plans, worksheets, retrieval questions, model answers, and knowledge organisers.
The five things that go wrong most often
Before getting into the checklist, it helps to know what you are looking for. AI-generated resources tend to fail in a small number of predictable ways.
Factual inaccuracy is the obvious one. AI tools confidently produce content that is wrong, especially on technical detail or anything where the answer hinges on a specific number. The confidence is often the problem: there is no sheepish hedge, just a clean assertion that happens to be incorrect.
Specification mismatch is the second. A resource on respiration might be accurate but pitched at A-Level depth when you needed GCSE, or framed using a model your exam board does not use. Accurate is not the same as aligned.
Flattened nuance is the third. AI tends to round off the edges of contested or complex topics. A worksheet on the causes of the First World War might list four neatly balanced factors when the actual historiography is messier.
Bias and representation issues are the fourth. AI-generated examples skew towards Western, male, and English-speaking defaults unless you specifically prompt otherwise. Names in maths problems, characters in language scenarios, scientists referenced in history: all tend to lack the range a thoughtful human author would build in.
Hallucinated sources are the fifth. If a resource cites a journal article, a textbook page, or a quotation, there is a meaningful chance the citation is invented. The author might be real, the title plausible, but the specific reference may not exist.
Check every citation. Citation errors are common enough that any reference in AI-generated content must be checked against the primary source before classroom use.
The accuracy check
Start with accuracy because everything else depends on it. If the resource contains factual errors, nothing about it is salvageable until those are fixed.
Read through the resource as if you were a sceptical colleague who teaches the same subject. Pause at any specific claim: a number, a date, a definition, a named example, a process step. For each one, ask yourself whether you would stake your reputation on it being correct. If you would not, check it against a source you trust.
Pay particular attention to anything at the edges of your own knowledge. AI tools tend to be most reliable on the central, well-trodden parts of a syllabus and progressively less reliable as topics get more specialised. For maths and science specifically, work through any worked examples yourself. Calculation errors and sign errors are common, and they spread to pupils quickly if you do not catch them at the source.
A useful rule of thumb: if you cannot easily verify a claim from your own training and a quick check, do not put it in front of pupils. Replace it with content you can stand behind, or generate a different question that lives on firmer ground.
The specification alignment check
An accurate resource that does not match your exam specification is only half useful. It might teach pupils something true but irrelevant, or use language and emphasis that do not match what an examiner will reward.
The quickest alignment check is to have your specification document open while you read the resource. For each section, find the bullet point in the spec it maps to. If you cannot find one, the resource is probably going off-piste. For each command word ("describe", "explain", "compare", "evaluate"), check that it matches the command words your board uses at this level.
Watch out for board-specific terminology. Different exam boards use slightly different language for the same concept, and AI tools default to the most common version, which may not be yours. If your board uses a particular term in their mark schemes, your resources should use that term too.
Level of demand is the other thing to check. AI tools sometimes produce content that is technically about your topic but at the wrong cognitive level for your year group. Read each question and ask: is this what an average student at this stage should be able to engage with?
The bias and representation check
This one is easy to skip and important not to. AI-generated content reflects the patterns in its training data, which means it carries the biases of that data. Without active prompting, you tend to get a narrow set of names, contexts, and reference points.
Look at the examples used in the resource. If it is a maths worksheet, are the names varied? If it is a language reading comprehension, who are the protagonists and what backgrounds do they have? If it is a science worksheet referencing scientists, is the list a roll call of dead European men? None of these are dealbreakers in isolation, but a pattern across all your AI-generated resources will quietly shape what pupils see as the norm.
The other angle is whose perspectives the resource centres. A history resource on the British Empire generated with a generic prompt will tend to default to a particular narrative. The version with imperial subjects' perspectives, anti-colonial scholarship, or contemporary critique included is usually a different resource, and you have to ask for it explicitly.
Fix this at the prompt level rather than after the fact. "Use a varied set of names including non-Western ones" or "include perspectives from at least two of the following groups" baked into your default prompt template will save you editing time on every resource.
The pedagogical check
Accurate, aligned, and representative is still not enough. The resource also needs to actually teach well, which is a question about pedagogy, not content.
Ask whether the resource works the way good resources work. For practice questions: do they ramp in difficulty, or are they all at the same level? For knowledge organisers: is the layout actually usable, or is it dense text that pupils will not engage with? For lesson plans: is there a clear learning outcome, a check for understanding, and a sensible balance of teacher input and pupil work?
AI tools are pattern-matching engines. They produce resources that look like the average resource in their training data, which means they tend towards the middle of what exists rather than what is best. A worksheet that is structurally fine but does not include retrieval, interleaving, or hinge questions is a perfectly valid worksheet that misses some of the most evidence-backed moves in classroom teaching.
If you find yourself adjusting the same pedagogical elements every time, that is a sign your prompt template needs upgrading. Build the moves you want (retrieval starters, worked examples with explicit reasoning, varied practice) into the default prompt rather than adding them by hand each lesson.
Quick evaluation rubric
The table below is a compressed version of the checks above. Run each AI-generated resource through it before classroom use. Most resources will fail at least one row first time, which is normal: the job is to catch and fix, not to expect perfection from the model.
| Check | What to look for | Typical failure mode |
|---|---|---|
| Accuracy | Every specific claim, number, or definition verified against a trusted source. | Confidently stated but subtly wrong facts in less-common topics. |
| Specification alignment | Maps to your spec, uses your board's terminology, pitched at the right level. | Right topic, wrong board's language or wrong cognitive demand. |
| Bias and representation | Varied names, perspectives, and contexts. No defaulting to a narrow set. | Western, male, English-speaking defaults across all examples. |
| Pedagogy | Ramps in difficulty, includes retrieval, has clear learning intent. | Flat practice with all questions at the same level. No diagnostic moves. |
| Sources | Any cited references verified. Quotations checked against the original text. | Invented citations and plausible-but-fake page numbers. |
| Tone | Appropriate register, no patronising language, age-appropriate phrasing. | Either too childish for the age group or unnecessarily formal. |
When to bin the resource and start again
Sometimes the best move is to throw the AI output away and either rewrite the prompt or write the resource yourself. The sunk-cost trap is real: you spent ten minutes generating something, so you feel obliged to spend another twenty editing it. Often the edited version takes longer and ends up weaker than starting fresh would have done.
A few signals should make you walk away. If the underlying premise of the resource is wrong, no amount of editing fixes that. If you find more than two or three factual errors in a single resource, the tool was on shaky ground for this topic and the errors you have not spotted yet are probably worse. If the structure is fundamentally not the kind of resource your class needs, swapping examples will not save it.
Keep a running note of the topics and tasks where AI tools have served you well and where they have not. AI tends to be reliable for certain things in your subject and unreliable for others, and knowing which is which saves a lot of wasted time.
The honest test is time. If editing the AI resource is taking longer than writing the resource yourself would have done, the AI is not helping. Walk away and use a different starting point, even if that feels like admitting defeat in the moment.
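The walk-away signals in this section can be written down as a simple decision rule. The sketch below is one hypothetical way to do that in Python. The function name, the parameter names, and the choice to read "more than two or three" as "more than three" are all assumptions made for illustration, and the time figures are your own estimates.

```python
def should_start_again(
    premise_is_wrong: bool,
    structure_fits_class: bool,
    factual_errors_found: int,
    minutes_editing: float,
    minutes_to_write_yourself: float,
    max_factual_errors: int = 3,  # assumption: "more than two or three" read as "more than three"
) -> bool:
    """Return True when the signals above suggest binning the resource."""
    # A wrong premise or an unsuitable structure cannot be edited away.
    if premise_is_wrong or not structure_fits_class:
        return True
    # Several errors in one resource suggest the tool was on shaky ground.
    if factual_errors_found > max_factual_errors:
        return True
    # The time test: editing should not cost more than writing from scratch.
    return minutes_editing > minutes_to_write_yourself


# Example: one error found, ten minutes spent editing, thirty to write it yourself.
print(should_start_again(False, True, 1, 10, 30))
```

The threshold is deliberately a parameter: a teacher working at the edge of their specialism might reasonably set it lower.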
Building reusable prompt templates
Most teachers we have spoken to find that prompt quality matters more than tool choice. The same model can produce a useless lesson plan from a vague prompt and a very usable one from a structured prompt. Building a small library of templates is one of the best uses of an inset hour.
A workable template usually includes the year group, the exam board and specification reference, the learning outcome, the prior knowledge you can assume, the kinds of activities you want, and any tone or representation requirements. The more of this you front-load, the less editing you do afterwards.
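As a rough illustration, the fields listed above could be captured in a small Python structure that renders a prompt. This is a minimal sketch, not a recommended tool: the class name `ResourcePrompt`, its field names, and the wording of the rendered prompt are assumptions made for illustration, and the angle-bracket placeholders stand in for your own board and specification details.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePrompt:
    """Hypothetical container for the details a template front-loads."""

    resource_type: str
    year_group: str
    exam_board: str
    spec_reference: str
    learning_outcome: str
    prior_knowledge: str
    activities: list[str]
    # Default wording borrowed from the representation advice earlier in this guide.
    requirements: list[str] = field(
        default_factory=lambda: ["Use a varied set of names including non-Western ones."]
    )

    def render(self) -> str:
        """Return the prompt text to paste into whichever AI tool you use."""
        lines = [
            f"Write a {self.resource_type} for {self.year_group}.",
            f"Exam board and specification: {self.exam_board}, {self.spec_reference}.",
            f"Learning outcome: {self.learning_outcome}",
            f"Assume this prior knowledge: {self.prior_knowledge}",
            "Include these activities: " + "; ".join(self.activities) + ".",
            "Requirements:",
        ]
        lines += [f"- {req}" for req in self.requirements]
        return "\n".join(lines)


# Placeholder values only; replace them with your own details.
starter = ResourcePrompt(
    resource_type="retrieval starter",
    year_group="Year 10",
    exam_board="<your exam board>",
    spec_reference="<specification reference>",
    learning_outcome="<learning outcome>",
    prior_knowledge="<prior knowledge you can assume>",
    activities=["five short-answer questions", "one worked example with explicit reasoning"],
)
print(starter.render())
```

Keeping the requirements as a list means a recurring problem can be fixed once, by adding a line to the default, rather than in every generated resource.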
Store your templates somewhere you can find them. Most teachers we know keep a single document with five or six templates: one for retrieval starters, one for worked examples, one for differentiated worksheets, one for model answers, and one or two subject-specific ones. Updating the template when you find a recurring problem (a representation gap, a recurring inaccuracy) is more efficient than fixing the same issue in every generated resource.
Where Cognito-style content fits
For high-stakes work like exam questions and mark schemes, some teachers prefer to start from a curated bank of pre-vetted content rather than generating from scratch. Platforms like Cognito have exam-aligned quiz libraries for GCSE and A-Level sciences and maths, which can act as a quality baseline against which AI-generated questions can be compared. The combination of a vetted starting bank for the highest-stakes content and AI-generated material for everything else tends to work better than either approach on its own.
A pre-use checklist
Before you hand it out: AI resource pre-use checklist
Spend five minutes with this checklist before any AI-generated resource reaches a classroom. The first few passes will feel slow. After a fortnight it becomes automatic.
- Every specific claim, number, and definition verified against a trusted source
- Specification reference checked against the resource: right board, right level, right command words
- Names, examples, and contexts varied. No defaulting to a narrow demographic
- Worked examples and calculations checked by working through them yourself
- Any cited sources, quotations, or references confirmed as real
- Pedagogical structure includes retrieval, ramping difficulty, and clear outcome
- Tone and register appropriate for the age group, not patronising or overformal
- Time spent editing under the time it would have taken to write from scratch
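
For anyone who prefers to track the checklist in a script or shared notebook rather than on paper, the eight items above could be recorded along these lines. This is only a sketch: the short item keys and the function name `outstanding_checks` are hypothetical labels invented here, not part of any tool.

```python
# Short, hypothetical keys for the eight checklist items, in the order listed above.
PRE_USE_CHECKLIST = (
    "claims_verified",
    "spec_checked",
    "examples_varied",
    "calculations_worked",
    "sources_confirmed",
    "pedagogy_structured",
    "tone_appropriate",
    "editing_time_reasonable",
)


def outstanding_checks(passed: set[str]) -> list[str]:
    """Return the checklist items not yet ticked off, in checklist order."""
    unknown = passed - set(PRE_USE_CHECKLIST)
    if unknown:
        # Guard against typos silently counting as a passed check.
        raise ValueError(f"Unrecognised checklist items: {sorted(unknown)}")
    return [item for item in PRE_USE_CHECKLIST if item not in passed]


# Example: only the first two checks have been done so far.
remaining = outstanding_checks({"claims_verified", "spec_checked"})
print(remaining)
```

A resource is ready for the classroom only when the returned list is empty.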