How agencies actually get AI live with clients
The 4 levels of maturity, why most stall at level 2, and the 8-step playbook for the agencies who intend to win.
A lot of big agencies are going through “AI transformation” right now.
But most are just using Microsoft Copilot (or maybe Gemini) at a surface level. They get small but frequent value: a bit of time saved, nothing particularly groundbreaking.
Clients want lower prices and are bringing more work in-house, and management are figuring out how to use AI to mitigate that whilst maintaining their own profits.
Times are getting tougher.
Know this right now
Thriving agencies will know how to level up their AI initiatives by creating products and services protected by a competitive moat, usually built around first-party data, which is something most agencies don’t have.
So, on one side we have a competitive position. On the other, just an LLM wrapper that anyone can copy very easily.
Where does your agency live?
The gap between the two is widening fast.
Those that have moved up the AI maturity ladder are pricing differently, scoping differently, delivering work in half the hours and figuring out their compounding advantage.
There are four levels of AI maturity. Each one gives the AI more autonomy, more capability, and each one demands more of the wrapper around it.
Everyone is talking about level four. The reality is that most companies right now can’t climb past level two.
Not because they don’t want to. Because of three things their suppliers, their IT teams, and their clients haven’t sorted out yet.
This piece walks the ladder, names the hardest part, and gives you a road map to move past it.
The four levels
I’ll walk through each one using one example a PM or account manager will recognise: taking a client kickoff recording and turning it into the deliverables that follow (scope doc, status update, internal brief).
Level 1 — Chatbots
You paste the kickoff transcript into your LLM and ask for a scope doc.
It comes back with something. It reads fine. It uses too many bullet points and doesn’t sound like your agency. It doesn’t know your client. It doesn’t know your last three scopes of work. It doesn’t know the process your client’s IT team follows for hosting and deployment.
So you rewrite it.
You can improve this by pasting in your scope template, your tone of voice, your project history, some client guidelines.
You can store some of it in a Project or a Gem. But it’s all static context, and you’re still the one pasting it in every time. The chatbot doesn’t go and do anything. It waits.
Most “we’re using AI” stories in agencies stop here.
Level 2 — AI workflows
You build a Power Automate or n8n flow. Every time a kickoff lands in Drive, the flow pulls the transcript, sends it to an AI node with your scope template hardcoded into the prompt, and drops a draft into a review folder.
It feels like magic the first month. You’ve gone from rewriting scopes from scratch to editing a draft that’s already 70% there. You build the same flow for status updates, retros, meeting summaries.
Here’s where it stops being magic. The workflow can’t think. If the client is in a regulated industry and your last scope needed a compliance addendum, the workflow doesn’t know.
If the transcript covers two projects instead of one, it can’t split them. If your scope template was right six months ago and is now out of date, the flow keeps producing out-of-date scopes until someone goes in and rewrites the prompt.
The workflow does the work. It doesn’t make decisions. Same steps, same order, regardless of the input.
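To make the "same steps, same order" point concrete, here is a minimal Python sketch of what a level two flow boils down to. It is not how Power Automate or n8n are built internally. The function name, the template text and the `call_llm` parameter are all assumptions for illustration; `call_llm` stands in for whichever AI node or API your flow uses.

```python
from pathlib import Path
from typing import Callable

# Hypothetical template; a real flow would hardcode the agency's own.
SCOPE_TEMPLATE = "Sections: Objectives, Deliverables, Timeline, Risks"


def run_scope_workflow(
    transcript_path: Path,
    review_dir: Path,
    call_llm: Callable[[str], str],
) -> Path:
    """Run the same three steps, in the same order, for every transcript."""
    # Step 1: pull the transcript.
    transcript = transcript_path.read_text(encoding="utf-8")

    # Step 2: send it to the model with the template baked into the prompt.
    prompt = (
        "Write a scope document using this template.\n"
        f"{SCOPE_TEMPLATE}\n\nTranscript:\n{transcript}"
    )
    draft = call_llm(prompt)

    # Step 3: drop the draft into the review folder.
    review_dir.mkdir(parents=True, exist_ok=True)
    draft_path = review_dir / f"{transcript_path.stem}-scope-draft.md"
    draft_path.write_text(draft, encoding="utf-8")
    return draft_path
```

Notice there is no branch anywhere. A regulated client, a transcript covering two projects, and a stale template all go down exactly the same path, and the template only changes when someone edits the constant.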
This is where most agencies sit in 2026. And it’s where the conversation usually stops, because climbing higher means dealing with three blockers nobody in the agency tells you about.
The hard part
Between level two and level three, the conversation changes. Not technically. Commercially and contractually.
Levels one and two are easy. An enterprise LLM contract. A workflow running on your current IT stack. Some updates to your SOWs.
Level three is when a tool starts reading your files, running on your machine, calling APIs with enterprise-level credentials. That’s the moment IT, security, and your client’s procurement team all wake up at once.
Three things make this the hard part.
1. Tooling access
Claude Code, Cursor, Codex, MCP servers — these get installed and run on the same machine that holds your client files.
Most agency IT teams run approved-software lists. None of these tools are on it yet. A sole trader installs what they want. An agency PM files a ticket and waits six weeks. By the time you’re approved, the client project is over.
2. Enterprise IT and security approval
The moment a tool reads files and calls external APIs, it triggers your InfoSec process. Your client’s vendor compliance questionnaire: incoming!
And now you wish your agency held ISO 27001 or SOC 2, because your client has high expectations for your data security controls and practices.
Most of these tools route through US-based model providers. That alone is a procurement conversation for any UK financial services, healthcare, or public sector client.
Add the DPA from the model provider, the sub-processor disclosure to your client, and the security questionnaire your client will fire back at you, and you’ve added two to four months to your timeline.
3. Client contracts
Most MSAs and SOWs you signed before 2025 say nothing about AI. Some explicitly forbid third-party data processors. A few regulated clients require human-only work for parts of the engagement. And a small but growing number now mandate AI use, with a discount baked in.
None of this is in your standard contract template, and none of it is being negotiated by your account manager on the fly during a renewal call.
The agencies winning this year aren’t waiting for clients to ask. They’re leading the conversation.
This is the hard part. Most agencies hit it and retreat back to level two, where the same n8n flow keeps running and nothing fundamental changes.
You don’t get past the hard part by being clever about the tools. You get past it by doing the slow, unglamorous work of getting your IT team, your client’s IT team, and your contracts onside before you need to scale.
And the agencies that hold ISO 27001, SOC 2 and ISO 42001 are going to win.
Level 3 — Agentic workflows
Past the hard part, level three is where the time starts to come back.
You open Claude Code, point it at the kickoff transcript, and tell it: “Draft the scope doc, the internal brief, and the first status update for this project. Use the templates in the templates folder and the brand voice in voice.md.”
It reads your templates. It reads your voice file. It looks at the transcript and decides what goes where: what’s a deliverable, what’s a risk, what’s a dependency. It drafts the scope doc. It checks it against your voice file. It rewrites the parts that don’t pass. It drops everything in a review folder.
You didn’t write the steps. The model did. And it did them in the order that made sense, with the context it pulled in at the right moments.
The ReAct loop
The technical name for this is the ReAct loop: Reason, Act, Observe, repeat. The model reasons about what to do, acts on it, looks at the result, and decides whether to keep going or change tack. You’re not the orchestrator anymore. You’re the reviewer.
The wrapper around the model — Claude Code, Cursor, Codex — is what makes this reliable. Without it, you’ve got a chatbot in a tab. With it, the model has access to your files, your tools, your conventions. Same model. Different scaffolding around it. The technical name for that scaffolding is a harness.
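If it helps to see the loop as code, here is a toy sketch of Reason, Act, Observe in Python. It is not how Claude Code, Cursor or Codex are implemented. The `Step` class, the `decide` callable (standing in for the model) and the `read_file` tool are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    action: str    # a tool name, or "finish"
    argument: str  # the tool input, or the final answer


def react_loop(
    goal: str,
    decide: Callable[[str, list[str]], Step],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> str:
    """Reason, act, observe, and repeat until the model says it is done."""
    observations: list[str] = []
    for _ in range(max_steps):
        # Reason: the model sees the goal plus everything observed so far.
        step = decide(goal, observations)
        if step.action == "finish":
            return step.argument
        # Act: the harness runs the tool the model asked for.
        tool = tools.get(step.action)
        result = tool(step.argument) if tool else f"unknown tool: {step.action}"
        # Observe: the result is fed back in for the next round.
        observations.append(f"{step.action}({step.argument}) -> {result}")
    raise RuntimeError("Step limit reached before the goal was met")


def scripted_decide(goal: str, observations: list[str]) -> Step:
    """A stand-in for the model: read one file, then finish."""
    if not observations:
        return Step("read_file", "voice.md")
    return Step("finish", f"Draft written after {len(observations)} tool call(s)")


tools = {"read_file": lambda name: f"contents of {name}"}
print(react_loop("Draft the scope doc", scripted_decide, tools))
# prints: Draft written after 1 tool call(s)
```

The harness is everything except `decide`: the tool registry, the step limit, and the plumbing that feeds results back in. Swap the scripted function for a real model call and the shape stays the same.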
This level is enough for most agency PMs. One agent, one goal, one session in VS Code or Claude Cowork.
It tops out when you want it to run multiple workflows in parallel, remember what worked last month, and coordinate across functions.
Level 4 — Agentic AI systems
Level four is what people are calling harness engineering. It’s where the agency operations actually start to run on AI rather than just being helped by it.
It’s a coordinated team. Not multiple agents shouting at each other. A set of skills, each with its own instructions, its own quality bar, its own output format, loaded only when the system needs them.
A delivery system for one client engagement might include:
A scoping skill that takes a kickoff transcript and produces a scope doc, with a checklist of items that need human input
A status reporting skill that pulls from Jira or Monday and drafts the weekly RAG report in the client’s preferred format
A QA skill that reviews any deliverable against the client’s brand guide and flags issues before send
A retro skill that runs at the end of each sprint and writes both the internal post-mortem and the client-facing wrap-up
The system loads the right skill at the right moment. It carries memory between sessions. It remembers which formats this client prefers, which deliverables always come back with the same set of revisions. It checks its own work. It flags what needs you and handles the rest.
Underneath all of this, there’s no magic. It’s folders of markdown files on top of Claude Code. The brand voice is a file. The QA checklist is a file. The memory of what worked last month is a file. The model reads them when needed and updates as it learns.
Which means the work of building it isn’t engineering. It’s writing clear instructions. Which makes agency PMs and operators — the people who already write SOPs, briefs, and templates — the ones best placed to build it.
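To show how little machinery "folders of markdown files" needs, here is a rough Python sketch. It assumes a simplified layout of my own (one markdown file per skill, first line is a one-line summary, plus a memory file), not Claude Code's actual skill format. All function names are hypothetical.

```python
from pathlib import Path


def index_skills(skills_dir: Path) -> dict[str, str]:
    """Map each skill name to its one-line summary (the file's first line)."""
    index: dict[str, str] = {}
    for path in sorted(skills_dir.glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        # Drop a leading markdown heading marker, if there is one.
        index[path.stem] = lines[0].lstrip("# ").strip() if lines else ""
    return index


def load_skill(skills_dir: Path, name: str) -> str:
    """Read the full instructions only when the skill is actually needed."""
    return (skills_dir / f"{name}.md").read_text(encoding="utf-8")


def remember(memory_file: Path, note: str) -> None:
    """Append a lesson learned so a later session can read it back."""
    with memory_file.open("a", encoding="utf-8") as handle:
        handle.write(f"- {note}\n")
```

The design choice worth noticing is the split between the index and the full file. The system only needs the short summaries to decide which skill applies, and pulls in the long instructions for that one skill when it is time to use it.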
The playbook for the hard part
You don’t move from level two to level four in one quarter. You move by doing eight unglamorous things in parallel over six to twelve months. This is the runway to get AI live and working on client engagements.
You have objectives, right?
Initiatives that could have a big impact on the business need to be driven by a company vision and a set of objectives. And they should drive your team’s objectives too.
Use OKRs. Find out what they are and how to set them up at whatmatters.com
Back to the playbook:
1. Get a senior sponsor and name an owner.
Nothing moves without a named exec (MD, CEO, CFO) putting their weight behind this, and an operational lead (Head of Delivery, Head of Ops) running it day-to-day. Everything stalls without these two people. They’ll also need a support team of individuals for enablement.
2. Run an internal pilot and document it.
Pick three core processes that are time-consuming: your own scope drafting, status report generation, retro write-ups. Run them through Claude Cowork for two to four weeks. Capture four numbers: hours before, hours after, error rate, and a quality score from a senior reviewer.
Produce a one-page case study with those numbers. This will help to demonstrate the value of AI quickly.
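The arithmetic behind the case study is simple enough to sketch in a few lines of Python. The class name, the 1 to 5 quality scale and the example figures below are assumptions made up for illustration, not results from a real pilot.

```python
from dataclasses import dataclass


@dataclass
class PilotResult:
    process: str
    hours_before: float
    hours_after: float
    error_rate: float     # share of outputs with an error, 0.0 to 1.0
    quality_score: float  # senior reviewer's score, assumed to be out of 5

    @property
    def hours_saved(self) -> float:
        return self.hours_before - self.hours_after

    @property
    def percent_saved(self) -> float:
        if self.hours_before == 0:
            return 0.0
        return self.hours_saved / self.hours_before * 100


def case_study_line(result: PilotResult) -> str:
    """Format the four pilot numbers as one line for the case study."""
    return (
        f"{result.process}: {result.hours_before:g}h before, "
        f"{result.hours_after:g}h after ({result.percent_saved:.0f}% saved), "
        f"error rate {result.error_rate:.0%}, "
        f"quality {result.quality_score:g}/5"
    )


# Made-up numbers, purely to show the output format.
example = PilotResult("Scope drafting", 6, 2.5, 0.1, 4)
print(case_study_line(example))
# prints: Scope drafting: 6h before, 2.5h after (58% saved), error rate 10%, quality 4/5
```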
3. Get ISO 27001 and SOC 2 in motion.
This is the certification gate most agencies underestimate. Regulated clients in financial services, healthcare, and public sector will not let AI tools touch their data without one of these certifications on file.
Even non-regulated clients are now asking, because their compliance teams have started flagging AI tools as third-party data processors that need due diligence.
ISO 27001 is the international standard for information security management. It proves your agency runs a working ISMS — policies, controls, risk assessments, access management, incident response. Budget £15–30k and six to nine months for a first certification through a UKAS-accredited body.
SOC 2 matters mainly for the US market. SOC 2 Type 2 is the AICPA report covering the trust services criteria (security, availability, confidentiality, processing integrity, privacy), measured over a 6–12 month observation window. Type 1 is a point-in-time snapshot; Type 2 shows evidence built up over time, and it is what clients actually want. Budget £30–60k for an agency under 100 people.
Which one to get first depends on your main market. ISO 27001 is faster to achieve and more recognised in UK and EU markets. SOC 2 Type 2 is non-negotiable for US enterprise sales.
The agencies winning regulated AI work in 2027 are the ones whose audits started in 2026.
4. Get your tools onto the approved register.
Claude Enterprise for organisational AI usage (gives you the DPA, SSO, data residency commitments your IT team will demand).
Your chosen agentic harness (Claude Code, Cowork, Cursor, or Codex), configured through your enterprise account.
The MCP servers that connect to your delivery stack (Jira, Monday, Notion, Confluence, your file storage). Each tool needs: vendor security review, signed DPA, sub-processor disclosure, internal data flow assessment.
Allow 8–12 weeks per tool. Run them in parallel.
5. Draft the AI contract package
Three documents your legal counsel writes once and you reuse across every renewal and new SOW.
(a) An AI addendum for your MSA covering what you use AI for, what data is processed where, where the human review steps sit, and the client’s right to opt out.
(b) An AI use disclosure form attached to each engagement scope.
(c) A sub-processor list you keep current.
Bring this package to clients proactively at renewal. They will respect you more for raising it than for being asked to disclose it later.
6. Pick three client-safe use cases for the first live quarter
Not everything at once. Three.
Examples that work:
Scope drafting where the PM authors and AI assists
Meeting transcript summaries with client consent
Internal status report drafts that the PM reviews before sending to the client
Each use case gets a documented workflow with a defined human review point and a named PM accountable for output quality.
Run them on two or three willing clients first, not your whole portfolio.
7. Train delivery teams on workflows, not prompts
Prompt-writing isn’t the skill. The skills are: knowing when to reach for AI versus when not to, what to disclose to the client and when, how to spot when the output is wrong before the client sees it, and how to escalate if a deliverable goes out with an AI error.
Run a two-day workshop for delivery leads, then a structured one-hour rollout for every PM and AM on the team. Build an internal AI playbook that captures the agreed answers, and make sure it’s backed by a solid business-level AI policy.
8. Set up tracking and a quarterly review
Per use case: hours saved versus the internal baseline from step two, output quality score from senior reviewers, errors caught at review, errors that reached the client.
Review the numbers quarterly with the exec sponsor from step one. Kill use cases that aren’t paying back. Scale the ones that are.
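One way to keep the quarterly review honest is to agree the kill-or-scale rule before you look at the numbers. Here is a minimal Python sketch of such a rule. The thresholds (20% time saving, a quality score of 3.5 out of 5, zero errors reaching the client) are assumptions for illustration only; set your own with the exec sponsor.

```python
from dataclasses import dataclass


@dataclass
class UseCaseQuarter:
    name: str
    baseline_hours: float  # internal baseline from the pilot
    actual_hours: float    # hours spent this quarter with AI in the loop
    quality_score: float   # senior reviewer score, assumed to be out of 5
    errors_caught: int     # caught at the human review point
    errors_escaped: int    # reached the client


def verdict(
    uc: UseCaseQuarter,
    min_saving: float = 0.20,   # assumed threshold
    min_quality: float = 3.5,   # assumed threshold
    max_escaped: int = 0,       # assumed threshold
) -> str:
    """Return "scale" if the use case is paying back, otherwise "kill"."""
    if uc.baseline_hours > 0:
        saving = (uc.baseline_hours - uc.actual_hours) / uc.baseline_hours
    else:
        saving = 0.0
    paying_back = (
        saving >= min_saving
        and uc.quality_score >= min_quality
        and uc.errors_escaped <= max_escaped
    )
    return "scale" if paying_back else "kill"
```

In practice a "kill" verdict is a prompt for a conversation rather than an automatic shutdown, but writing the rule down stops a favourite use case surviving on enthusiasm alone.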
Where this doesn’t pay off
Remember. This is still client work.
The agencies failing at level three and four are the ones that confused “agentic” with “autonomous”. They’re not the same. The systems that actually deliver have humans in the loop at the moments that matter: input quality at the start, output quality at the end.
You will still need account managers. You will still need senior PMs. You will still need someone with taste and judgement looking at every deliverable that goes to a client.
What changes is the ratio. One senior reviewer can manage the output of a system that does the work of three juniors.
That’s the unit economics shift. That’s the margin recovery. It’s not “no humans”. It’s fewer humans who run AI systems.
The other thing nobody mentions: the first six months of this are slower, not faster. You’re writing skills, debugging prompts, sitting through IT meetings, updating MSA language. Trying stuff out and learning from the journey.
What to do this week
Pick the level you’re actually at. Be honest. If your agency’s AI strategy is “we have ChatGPT licences and one person built an n8n flow for proposals”, you’re at level two on a good day and level one on a normal one.
Then choose between two ways forward.
The default way: buy more AI tool licences, schedule another all-hands about AI, hope your team figures it out.
The way that actually moves you up: start the IT conversation this week, draft the AI addendum for your next MSA renewal, pick one internal process to run as a level three experiment this month.
The agencies that climb the ladder in the next twelve months won’t be the ones with the smartest people. They’ll be the ones who started the contract conversations early and built the boring infrastructure (and compliance) in the background.
That’s the hard part. That’s the work.
Most won’t do it. The few that do will own the next decade of agency delivery.



