
The Critic Needs a Contract: Why AI Review Systems Need Profiles, Schemas, and Evidence

A useful AI reviewer is not just a vision model with a prompt. It needs versioned behavior, schema-constrained output, evidence, retrieval rules, and failure metadata that make every decision auditable.

AI Business Consultant, Engineering Lead, Human-in-the-loop AI

The moment an AI system starts reviewing generated work, it becomes part of the product's decision machinery. It is no longer enough for the model to "sound right." Its decisions need to be explainable, repeatable enough to debug, and connected to the same evidence the business cares about.

In one product-design workflow, a generation model produced batches of hardware concept images. A local vision-language model acted as a Critic: it ranked images, identified physical or prompt failures, explained why an output worked or failed, and helped reduce the volume that humans needed to inspect. Human reviewers still made the final call.

That last sentence is the important boundary. The Critic was not a magic approval machine. It was a review layer in a larger system. That means it needed a contract.

The Bad Pattern: A Prompt Hidden In The Runtime

A surprisingly fragile AI review pattern is to bake the behavior into a runtime prompt, container image, or one-off worker script. The model has instructions somewhere. The worker asks for JSON. The app tries to parse it. If the result looks plausible, the system proceeds.

This works until something changes. The prompt gets edited without a version. The model starts using different tags. A rule is interpreted differently after an upgrade. A malformed response is retried with a looser format. The app stores a decision but not the raw evidence that produced it.

AI review is not useful because it returns an opinion. It is useful when the opinion can be traced back to rules, inputs, examples, and a schema.

Profiles Make Behavior Reviewable

The better pattern is to treat Critic behavior as repo-backed product configuration. A Critic profile should define the system prompt, batch instructions, hard rejection rules, ignore rules, scoring definitions, and retrieval budget. The model server should remain a model server. It should not secretly own product judgment.

Each Critic run should snapshot the effective profile into metadata: profile id, version, path, selected examples, model id, request parameters, schema version, image object keys, timings, and errors. That turns a model response into a record the team can inspect later.

review job
  -> load critic profile
  -> select retrieval examples
  -> send image batch and context to the model
  -> validate structured output
  -> store parsed decision plus raw response
  -> preserve profile, schema, timings, and errors
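The profile and the per-run snapshot described above can be sketched as two small data structures. This is a minimal illustration, not the workflow's actual code: every field name, the example profile path, the model id, and the sample values are hypothetical placeholders chosen to mirror the items listed in the text.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class CriticProfile:
    """Repo-backed Critic configuration (field names are assumptions)."""
    profile_id: str
    version: str
    path: str
    system_prompt: str
    batch_instructions: str
    hard_reject_rules: tuple[str, ...]
    ignore_rules: tuple[str, ...]
    scoring_definitions: dict[str, str]
    retrieval_budget: int


def snapshot_run(profile, model_id, request_params, schema_version,
                 image_keys, example_ids, timings, errors):
    """Copy everything the run depended on into one plain, storable dict."""
    return {
        "profile": asdict(profile),          # full effective profile, not just its id
        "model_id": model_id,
        "request_params": dict(request_params),
        "schema_version": schema_version,
        "image_object_keys": list(image_keys),
        "selected_examples": list(example_ids),
        "timings": dict(timings),
        "errors": list(errors),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder values for illustration only.
profile = CriticProfile(
    profile_id="hardware-concepts",
    version="3",
    path="critic/profiles/hardware-concepts.yaml",
    system_prompt="You review hardware concept images.",
    batch_instructions="Rank the images and explain each failure.",
    hard_reject_rules=("impossible mouthpiece",),
    ignore_rules=("background color",),
    scoring_definitions={"ergonomics": "How plausibly the device fits a hand."},
    retrieval_budget=4,
)

run_metadata = snapshot_run(
    profile,
    model_id="local-vlm",
    request_params={"temperature": 0.0},
    schema_version="1",
    image_keys=["batches/0001/img-01.png"],
    example_ids=[],
    timings={"model_seconds": 4.2},
    errors=[],
)
```

Storing the whole profile rather than only its id is a deliberate choice in this sketch: the record stays readable even if the profile file is later edited or moved.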

This is not bureaucracy. It is how an AI review system becomes improvable. If a reviewer disagrees with the Critic, the team can ask a useful question: did the model fail, did the profile fail, did retrieval select poor examples, or did the schema fail to capture the judgment?

Tags Are Training Data, Not Decorations

Review tags often begin as UI shortcuts: "good," "bad," "incorrect," "low quality." Those tags are easy to click and almost useless for learning. They repeat the decision instead of explaining it.

In a self-improving review workflow, tags need to be small reasons. Approved tags might name width, mouthpiece, reservoir layout, materials, screen size, part borders, or ergonomics. Rejected tags should explain the failure: too wide, poor ergonomics, impossible mouthpiece, messy part borders, nonfunctional details, or render artifact.

The difference matters because the tag is not just for the current reviewer. It becomes retrieval context, disagreement evidence, future prompt feedback, and eventually training data.
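One way to keep tags from collapsing back into verdicts is to check them against a small vocabulary at write time. The sketch below assumes two fixed vocabularies built from the tag names in this article, plus a hypothetical `check_tags` helper; a real system might load the vocabularies from the Critic profile instead.

```python
# Reason vocabularies taken from the examples in the text (assumed to be exhaustive here).
APPROVED_REASONS = frozenset({
    "width", "mouthpiece", "reservoir_layout", "materials",
    "screen_size", "part_borders", "ergonomics",
})
REJECTED_REASONS = frozenset({
    "too_wide", "poor_ergonomics", "impossible_mouthpiece",
    "messy_part_borders", "nonfunctional_details", "render_artifact",
})
# Tags that only restate the decision and explain nothing.
VERDICT_ECHOES = frozenset({"good", "bad", "incorrect", "low_quality"})


def normalize(tag: str) -> str:
    """Lowercase a tag and join its words with underscores."""
    return "_".join(tag.strip().lower().split())


def check_tags(decision: str, tags: list[str]) -> list[str]:
    """Return normalized reason tags, or raise ValueError if any tag is unusable."""
    vocabularies = {"approved": APPROVED_REASONS, "rejected": REJECTED_REASONS}
    if decision not in vocabularies:
        raise ValueError(f"unknown decision: {decision!r}")
    cleaned = [normalize(tag) for tag in tags]
    if not cleaned:
        raise ValueError("a decision needs at least one reason tag")
    for tag in cleaned:
        if tag in VERDICT_ECHOES:
            raise ValueError(f"{tag!r} repeats the verdict instead of explaining it")
        if tag not in vocabularies[decision]:
            raise ValueError(f"{tag!r} is not a known reason for {decision!r}")
    return cleaned
```

For example, `check_tags("rejected", ["Too Wide", "render artifact"])` returns the two normalized reasons, while `check_tags("rejected", ["bad"])` raises because the tag only restates the decision.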

Schemas Are How The System Says No

Loose JSON prompting is not enough for unattended review. A model can return something that looks like JSON but misses required fields, changes labels, numbers images incorrectly, or mixes summary text with data. When that happens, silently retrying with a looser request is a product bug.

A Critic job should use schema-constrained structured output. If the runtime cannot support the schema, that is a runtime compatibility failure. If parsing fails, the queue job should fail visibly and keep the raw model response, chunk metadata, model id, usage, and image-number range attached to the run.

This protects the human reviewer and the future dataset. A malformed Critic run should not be promoted into review truth just because the app wanted to finish the queue.
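The fail-visibly rule might look like the following in a worker. This is a sketch under stated assumptions: the output shape (`results` with `image_number`, `decision`, `tags`, `reason`), the decision labels, and the `CriticOutputError` type are all hypothetical, and the hand-written checks stand in for whatever schema-constrained decoding and schema library the real runtime provides.

```python
import json


class CriticOutputError(Exception):
    """Raised when a Critic response violates the output contract."""

    def __init__(self, message, evidence):
        super().__init__(message)
        self.evidence = evidence  # kept so the failed run stays inspectable


ALLOWED_DECISIONS = {"approve", "reject"}  # assumed label set
REQUIRED_FIELDS = {"image_number": int, "decision": str, "tags": list, "reason": str}


def parse_critic_response(raw, first_image, last_image, model_id):
    """Return the validated results list, or raise with the raw evidence attached."""
    evidence = {
        "raw_response": raw,
        "model_id": model_id,
        "image_range": [first_image, last_image],
    }

    def fail(message):
        raise CriticOutputError(message, evidence)

    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        fail(f"response is not valid JSON: {exc}")

    if not isinstance(payload, dict) or set(payload) != {"results"}:
        fail("top level must be an object with exactly one key: results")
    results = payload["results"]
    if not isinstance(results, list):
        fail("results must be a list")

    seen = []
    for item in results:
        if not isinstance(item, dict) or set(item) != set(REQUIRED_FIELDS):
            fail("a result has missing or extra fields")
        for name, kind in REQUIRED_FIELDS.items():
            value = item[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, kind):
                fail(f"field {name!r} has the wrong type")
        if item["decision"] not in ALLOWED_DECISIONS:
            fail(f"unknown decision label: {item['decision']!r}")
        seen.append(item["image_number"])

    if sorted(seen) != list(range(first_image, last_image + 1)):
        fail("image numbers do not match the requested range")
    return results
```

There is deliberately no retry branch in this sketch. The caller is expected to mark the queue job as failed and store `error.evidence` with the run, rather than ask the model again with a looser format.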

Retrieval Needs Rules Too

Retrieval is powerful in AI review systems because it lets the Critic compare new outputs against prior human decisions and known failure patterns. But retrieval can also become another hidden behavior path.

The profile should define the retrieval budget. The run should record which examples were selected. If no embedded memory exists yet, the Critic should run without examples rather than fall back to arbitrary recent samples. That keeps the path honest: either the model had relevant memory, or it did not.
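A budgeted, recorded retrieval step could be sketched as below. The article does not say how examples are ranked, so cosine similarity over stored embeddings is an assumption here, as are the `MemoryExample` type and the shape of the retrieval record.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryExample:
    """A prior human decision stored with its embedding (hypothetical shape)."""
    example_id: str
    embedding: tuple[float, ...]
    decision: str
    tags: tuple[str, ...]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_examples(query_embedding, memory, budget):
    """Pick at most `budget` nearest examples and describe what was picked."""
    scored = [
        (cosine(query_embedding, example.embedding), example)
        for example in memory
        if example.embedding  # skip rows that were never embedded
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    chosen = [example for _, example in scored[:max(budget, 0)]]
    record = {
        "budget": budget,
        "embedded_memory_size": len(scored),
        "selected_example_ids": [example.example_id for example in chosen],
        # No fallback to recent samples: an empty selection is recorded as such.
        "ran_without_examples": not chosen,
    }
    return chosen, record
```

The returned record is meant to be stored with the run metadata, so a later reader can tell whether the Critic saw relevant memory or none at all.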

Taken together, a stored run should be able to answer four questions:

Profile: What behavior, rules, scoring definitions, and retrieval budget were active?
Schema: What exact output structure did the model have to satisfy?
Evidence: Which image object keys, prompts, settings, and examples did the model see?
Truth: Which decision is canonical now, and which older decisions remain only as history?

Current Truth Is Not History

Review systems need a clean separation between history and current truth. A human may review an image today and change their mind tomorrow. A Critic profile may be improved. A later run may disagree with an earlier run. The system should preserve that history without letting old decisions drive current dashboard counts.

The practical rule is simple: keep at most one canonical human decision and one canonical Critic decision per image. New decisions demote older canonical decisions for the same actor type, while older rows remain available for analysis.
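The demotion rule can be enforced by the database rather than by convention. The sketch below uses Python's built-in `sqlite3` module with a partial unique index; the table name, column names, and decision labels are assumptions made for illustration, and a production store would likely carry more columns (run id, profile version, timestamps).

```python
import sqlite3

SCHEMA = """
CREATE TABLE decisions (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    image_key    TEXT NOT NULL,
    actor_type   TEXT NOT NULL CHECK (actor_type IN ('human', 'critic')),
    decision     TEXT NOT NULL,
    is_canonical INTEGER NOT NULL DEFAULT 1
);
-- At most one canonical row per image and actor type.
CREATE UNIQUE INDEX one_canonical_per_actor
    ON decisions (image_key, actor_type) WHERE is_canonical = 1;
"""


def open_store():
    """Create an in-memory store with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_decision(conn, image_key, actor_type, decision):
    """Demote the current canonical row, then insert the new one, atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE decisions SET is_canonical = 0 "
            "WHERE image_key = ? AND actor_type = ? AND is_canonical = 1",
            (image_key, actor_type),
        )
        conn.execute(
            "INSERT INTO decisions (image_key, actor_type, decision) VALUES (?, ?, ?)",
            (image_key, actor_type, decision),
        )


def current_truth(conn, image_key):
    """Return {actor_type: decision} using canonical rows only."""
    rows = conn.execute(
        "SELECT actor_type, decision FROM decisions "
        "WHERE image_key = ? AND is_canonical = 1",
        (image_key,),
    ).fetchall()
    return dict(rows)


def history(conn, image_key):
    """Return every decision ever recorded for the image, oldest first."""
    return conn.execute(
        "SELECT actor_type, decision, is_canonical FROM decisions "
        "WHERE image_key = ? ORDER BY id",
        (image_key,),
    ).fetchall()
```

Dashboard counts would read from `current_truth`, while disagreement analysis reads from `history`. The unique index means a bug that forgets the demotion step fails loudly on insert instead of silently producing two canonical rows.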

That separation is what allows disagreement to become useful. The system can compare current human truth against current Critic truth, then use the historical trail to improve prompts, retrieval, rubrics, and future training data.

The Consulting Lesson

AI review systems become business assets when their judgments are governed by contracts, not vibes. Profiles define behavior, schemas define acceptable output, evidence makes decisions auditable, and canonical state keeps the product honest.

The Critic is not valuable because it is autonomous. It is valuable because it can reduce review load while preserving the information a human expert would need to trust, challenge, and improve it. That only happens when the system makes the model's work inspectable all the way down.