I was on a panel recently about the technology bets leaders need to make now.
I loved the vibe of the conversation. It was practical, direct, and refreshingly far from the usual "AI will change everything" theatre. The feedback afterwards was very positive, so I thought I would write down the main idea I was trying to get across.
Most organizations are still thinking about AI at the wrong level.
They are treating it like an application.
Give engineers Claude Code, Codex, Copilot, or whatever tool is popular this month. Give business users access to a model. Buy some seats. Spend more tokens. Hope the company becomes AI-enabled.
That gives you a few faster people. Useful, yes. But not transformation.
The bigger shift is to stop thinking about AI as a smart chat window and start thinking about it as an operating system that grows with the people using it.
An app helps someone do one thing. An operating system creates the environment where work, context, lessons, mistakes, automations, tests, and patterns compound.
That is where the value starts.
The organization has to learn
The first win with AI is obvious.
An engineer debugs faster. A product manager prepares faster. A finance person analyzes faster. A founder gets from scattered notes to a decision faster.
Great.
But if that learning stays inside one person's session, one person's machine, or one person's head, the organization did not really get smarter. It got a productivity spike, not a capability.
The better leadership question is:
"What did this session teach us that should survive?"
If a prompt had to run three times before it worked, there is a lesson there. If the model made the same mistake twice, there is a lesson there. If the human had to explain a local convention again, there is a lesson there. If the model could not finish because context was missing, there is a lesson there too.
Those lessons should not disappear when the session ends.
They should turn into things the system can reuse:
- configuration changes
- skills
- hooks
- agents
- tests
- checklists
- project context
- startup rules for the next session
That is the beginning of the operating system.
Stop fixing only the output
One mistake I see is that people use AI, get a bad output, and then fix the output.
That is fine once.
But if you keep doing it, you are doing manual labor around the AI.
The better question is:
"Why did the system produce this mistake?"
Was the requirement unclear? Was the context missing? Was the harness wrong? Was the testing weak? Was the convention not written down? Was there a skill that should have existed?
At that point, the work is no longer only fixing the PR. The work is improving the ecosystem that produced the PR.
This is where many companies miss the bigger opportunity. They are trying to make engineers faster at writing code. The higher-leverage move is to make engineers builders of the system that produces better work next time.
The engineer is not just writing code. The engineer is improving the meta-system that delivers the code.
A simple daily loop
If I were putting this into an organization, I would start with something simple.
Every engineer keeps their AI sessions. Once a day, a process reviews those sessions and asks:
- where did the model struggle?
- what did the human repeat?
- what failed before it worked?
- what convention was missing?
- what context would have helped?
- what should become reusable?
Then the system proposes concrete changes:
- add this project instruction
- update this skill
- add this hook
- create this checklist
- change this test-selection rule
- add this architectural note
- create this agent for a repeated workflow
Every useful session should raise the floor for the next one.
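To make the daily loop concrete, here is a minimal sketch of the review step. Everything in it is an assumption for illustration: the `SessionEvent` schema, the event kinds, and the mapping from friction to proposal are hypothetical, and in practice a model would probably do the extraction from raw session logs. The point is only the shape: count recurring friction, then propose a change to the system rather than to one output.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SessionEvent:
    """One observation pulled from a stored AI session (hypothetical schema)."""
    session_id: str
    kind: str    # e.g. "retry", "repeated_explanation", "missing_context"
    detail: str  # short human-readable description


# Hypothetical mapping from the kind of friction to the kind of change to propose.
PROPOSAL_FOR_KIND = {
    "retry": "update skill",
    "repeated_explanation": "add project instruction",
    "missing_context": "add architectural note",
    "missed_test": "change test-selection rule",
}


def propose_changes(events, min_occurrences=2):
    """Return proposals for friction seen at least min_occurrences times."""
    counts = Counter((e.kind, e.detail) for e in events)
    proposals = []
    for (kind, detail), n in counts.most_common():
        if n >= min_occurrences and kind in PROPOSAL_FOR_KIND:
            proposals.append(f"{PROPOSAL_FOR_KIND[kind]}: {detail} (seen {n}x)")
    return proposals
```

The threshold is deliberate: a one-off stumble is noise, but the same explanation given twice is a convention that was never written down.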
So if your LLM keeps making the same mistake, I would ask first whether you are harvesting the lessons from the sessions. If not, of course it repeats itself. You are resetting the learning every time.
Context matters, but the harness builds it faster
One of the other panel points was reliability.
JP made a strong argument that AI agents need to understand not just what a product does, but why it does it. I agree with that completely.
If your system has an old API that one major customer depends on, the AI needs to know that. If one part of your checkout flow is commercially sensitive, the AI needs to know that. If one feature can be degraded safely and another cannot, the AI needs to know that.
Otherwise it will build something statistically reasonable and still wrong for your business.
So yes, context is essential.
My angle is that AI also changes how fast you can build that context.
You do not need to disappear for a year and write the perfect internal encyclopedia before you begin. Start with the minimum useful context. Build the minimum useful harness. Then let the harness discover missing context through real work.
You do a little bit of context. You do a little bit of tooling. The tooling helps you find more context. That context makes the tooling better.
That loop matters more than the first version being perfect.
Company-grade intelligence
The next step is moving from individual setups to shared intelligence.
Imagine one engineer builds a useful skill. Another engineer has a workflow that saves hours. Another team has a strong testing pattern. Someone else has a good incident checklist.
Normally, those things stay local. Maybe they get mentioned in Slack. Maybe they go into a wiki. Maybe they disappear.
In an AI operating system, they should flow into the company safely.
Not blindly. Not with secrets. Not with personal mess.
One detail I talked about on the panel was psychological safety.
People can be afraid to share their setup because they think it looks bad. They assume everyone else has a clean setup and theirs is embarrassing, so they keep it private.
That is a product problem.
You need a sanitizer. It should remove secrets and private data, of course, but it should also protect the person professionally.
Something as simple as: "Make sure nothing leaves my machine that reflects badly on me as an engineer."
That matters.
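The mechanical half of a sanitizer can be sketched in a few lines. This is an illustration under loud assumptions: the two patterns below are examples I picked, not a vetted list, and the "protect me professionally" half needs judgment that a regex cannot provide.

```python
import re

# Illustrative patterns only; a real sanitizer needs a broader, reviewed set.
# Matches things like "API_KEY=abc123" or "password: hunter2".
KEY_VALUE_SECRET = re.compile(
    r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)\S+"
)
# Matches simple email addresses.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def sanitize(text):
    """Redact obvious secrets and email addresses before anything is shared."""
    text = KEY_VALUE_SECRET.sub(r"\1\2[REDACTED]", text)
    text = EMAIL.sub("[EMAIL]", text)
    return text
```

I would run something like this on every artifact before it leaves the machine, and treat it as a floor, not a guarantee.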
Once people feel safe to share, the system can recommend useful skills across teams. In the morning, my setup could look at what was shared yesterday and say:
"These three things are relevant to your project. Do you want to wire them in?"
I say yes. It wires them in. It tests them. My system gets better.
That is company-grade intelligence: not one person with a clever prompt library, but a shared floor that keeps rising.
Side note: if you are working in the Claude Code ecosystem and want to see a concrete example of this direction, Memnyx is worth looking at:
https://github.com/MDGrey33/memnyx
It is not the whole answer for every company, but it shows the shape of the problem: layered memory, session lifecycle, lessons learned, and a skill-improvement loop around AI-assisted work.
Business users will become super-users
This is not only for engineers.
Finance will use AI. Marketing will use it. Operations will use it. Product will use it. Customer teams will use it.
People who used to wait for engineering will start building.
That is powerful, and it will also create chaos.
A business user can have an idea in the morning, build a proof of concept by the afternoon, run it for a week, and come back with real data before asking for a single engineer's time.
But they will also ask for access, connect systems, touch data, and create things that look useful while hiding a real blast radius.
You cannot complain too much, because you asked them to move faster.
So engineering's role changes.
Engineering becomes the part of the organization that builds the runway for people to run fast:
- sandboxes
- access rules
- cost limits
- audit trails
- data boundaries
- review paths
- deployment gates
- production pathways
The goal is not to stop people improvising. The goal is to let them improvise safely.
The safer people feel trying new things, and the less harm they can do while trying, the braver they become. That is how you get innovation out of this.
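As a rough sketch of what part of that runway could look like in code, here is a hypothetical policy object that combines three items from the list: a cost limit, a data boundary, and an audit trail. The class name, fields, and decision strings are all my assumptions; a real version would sit in front of actual infrastructure.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxPolicy:
    """Hypothetical guardrails for one business user's prototype."""
    monthly_cost_limit_usd: float
    allowed_datasets: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def check(self, user, dataset, spent_usd, request_usd):
        """Allow or deny a request, and record the decision either way."""
        if dataset not in self.allowed_datasets:
            decision = "deny: dataset outside boundary"
        elif spent_usd + request_usd > self.monthly_cost_limit_usd:
            decision = "deny: cost limit reached"
        else:
            decision = "allow"
        self.audit_log.append((user, dataset, request_usd, decision))
        return decision
```

Denials are logged as well as approvals, because the denied requests are often the best signal of what people are trying to build next.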
The dark factory
One pattern I mentioned was the dark factory.
The idea is simple: business users should not need to produce production-quality code. They should be able to express requirements.
Then a pipeline takes those requirements, combines them with company context, engineering standards, tests, and architectural rules, and produces something engineering can review.
At first, it will be rough. That is fine.
The mistake would be for engineers to spend all their time fixing each rough PR by hand. The better move is to improve the pipeline that generated it.
If the PR violates a convention, add the convention to the system. If it misses a test, improve the test-selection rule. If it misunderstands the architecture, add the missing architecture context. If the requirement was vague, improve the requirement template.
That is how you get from "AI made a messy PR" to "a ticket can eventually become clean, reviewable code."
Not immediately. Through the loop.
Testing is the contract
In the Q&A, someone asked whether the answer to AI-generated incidents is just better testing.
My answer was that testing alone is not enough.
You need testing, and you need context.
But I do not think about AI testing mainly as adding more unit tests at the end. I think about it as defining the contract.
What does this feature need to prove before we say it works?
Once the requirement is clear, you can fork the work.
One session develops the solution. Another session, which only has the requirements, evaluates the PR and decides what needs to be tested.
That separation matters. If the same reasoning path writes the code and judges the code, you risk writing tests that merely confirm the developer's logic instead of checking it against the requirements.
So the loop becomes:
- clarify requirements
- build the solution
- test from the requirements
- send only the failure reason back
- iterate
- repeat until the contract passes
- then bring in the engineer for serious review
At that point, the code is the replaceable part. The valuable part is the contract, the loop, and the system that can keep iterating until the output is good enough to review.
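The loop above can be sketched as a small function. This is a minimal sketch, not a definitive implementation: `build` and `evaluate` are hypothetical callables standing in for two separate model sessions, and the return format is my own invention.

```python
def run_contract_loop(requirements, build, evaluate, max_iterations=5):
    """Iterate build -> evaluate until the contract passes or the budget runs out.

    build(requirements, feedback) returns a candidate solution.
    evaluate(requirements, candidate) returns None on pass, or a failure reason.
    The evaluator sees only the requirements and the candidate, never the
    builder's reasoning, and only the failure reason flows back to the builder.
    """
    feedback = None
    candidate = None
    for iteration in range(1, max_iterations + 1):
        candidate = build(requirements, feedback)
        feedback = evaluate(requirements, candidate)
        if feedback is None:
            return {
                "status": "ready_for_review",
                "candidate": candidate,
                "iterations": iteration,
            }
    # Budget exhausted: hand over to a human with the last failure reason.
    return {
        "status": "needs_human",
        "candidate": candidate,
        "iterations": max_iterations,
        "last_failure": feedback,
    }
```

The iteration cap is the cost control: the loop either produces something worth an engineer's serious review or stops and says why it could not.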
If that costs another 10 or 30 dollars in model time but saves days of engineering time, the math is not hard.
Incidents make this real
The security part of the panel made the operating-system idea even sharper.
We all want to believe we can build the perfect secure system. Anyone who has been in tech long enough knows that is an unachievable dream.
Something will go wrong.
The question is how you show up when it does.
A serious incident is not only a technical problem. It is a context problem.
One person is checking logs. One person is checking deploys. One person is on the call. One person is worried about revenue. One person is worried about customer data. One person is worried about destroying evidence while fixing the system.
The incident commander has partial data and conflicting goals.
You cannot invent the tools and habits in that moment, so build them before.
And do not build them only for the security incident. Build a general incident management system your teams use regularly, so when the serious incident happens, the muscle already exists.
Imagine each engineer investigates using their AI harness. Every few minutes, or at the end of each turn, the harness syncs useful findings into a shared incident corpus:
- new datasets
- open theories
- rejected theories
- meeting transcript points
- decisions
- rationales
- gaps
- things dismissed too early
Now the incident commander has a live shared book of the incident.
Not scattered fragments. Not five people with five different versions of the truth.
A shared corpus helps the team respond, account for what happened, and learn afterwards.
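A shared corpus does not need to start sophisticated. Here is a minimal in-memory sketch, assuming a hypothetical `Finding` record and an `IncidentCorpus` store that each engineer's harness appends to; a real one would be persistent and shared across machines.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Finding:
    """One entry synced from an engineer's harness (hypothetical schema)."""
    author: str
    kind: str  # e.g. "open_theory", "rejected_theory", "decision", "gap"
    text: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class IncidentCorpus:
    """Hypothetical shared store for everything learned during one incident."""

    def __init__(self):
        self.findings = []

    def sync(self, finding):
        """Append a finding; nothing is ever overwritten or deleted."""
        self.findings.append(finding)

    def by_kind(self):
        """Group finding texts by kind, in the order they were synced."""
        grouped = defaultdict(list)
        for f in self.findings:
            grouped[f.kind].append(f"{f.author}: {f.text}")
        return dict(grouped)
```

Keeping rejected theories as first-class entries is the design choice that matters: it stops a second person from chasing the same dead end, and it preserves the things dismissed too early.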
That is what AI should be doing in incidents: not replacing judgment, but making judgment better informed.
Pre-mortems become cheap
The other point I made is that we should stop thinking only about the incident that already happened.
If AI gives you something like a cheap, tireless analytical workforce, ask a different question:
What would I do if I had an unlimited number of people working on this?
I would not only analyze my own incidents. I would look at incidents that happened elsewhere. I would look at public failures. I would look at threat intelligence.
Then I would ask:
"If this happened in my organization, how would it play out?"
With your context. Your infrastructure. Your controls. Your teams. Your constraints.
The system can ask:
- what would be exposed?
- which systems look similar?
- which logs would matter?
- which playbook would we need?
- which decision would be hardest?
- what should we fix now?
The old constraint was capacity: you had to choose the few scenarios that were worth simulating.
With AI, increasingly, you can run far more of them.
Not perfectly. Not blindly. But broadly enough that when the real incident happens, you are not starting from zero.
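To show how quickly the scenario space grows once you stop rationing it, here is a small sketch that crosses outside incidents with your own systems and the questions above. The function name and prompt wording are hypothetical; each prompt would be handed to a model along with your real context.

```python
from itertools import product

# The six questions from the list above.
QUESTIONS = [
    "What would be exposed?",
    "Which systems look similar?",
    "Which logs would matter?",
    "Which playbook would we need?",
    "Which decision would be hardest?",
    "What should we fix now?",
]


def premortem_prompts(external_incidents, our_systems):
    """Cross every outside incident with every internal system and question."""
    return [
        f"Suppose '{incident}' happened to our {system}. {question}"
        for incident, system, question in product(
            external_incidents, our_systems, QUESTIONS
        )
    ]
```

Ten public incidents against five systems is already 300 prompts, which is far more than a team would ever simulate by hand.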
How I would measure it
Another audience question was how to prove the value.
My view is simple: use the metrics you already trust.
Do not invent a completely new measurement religion just because AI arrived.
Look at:
- time to first PR
- lead time for change
- deployment frequency
- change failure rate
- incident frequency
- time to recovery
- review burden
- cycle time from idea to prototype
- cycle time from prototype to production
- cost of delivery
Then compare before and after.
How long did this task take before? How much human time did it cost? How long does it take now? How much model time, review time, and infrastructure did it require?
When someone says, "This AI run cost 30 dollars," compare that to the week of engineering time it replaced or compressed.
Do not measure token spend in isolation. Measure outcomes.
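The before-and-after comparison is simple arithmetic, sketched here as a function. The parameter names and any hourly rate you plug in are assumptions; use your own numbers.

```python
def net_saving_usd(
    engineer_hours_before,
    engineer_hours_after,
    hourly_cost_usd,
    model_cost_usd,
    infra_cost_usd=0.0,
):
    """Human time saved, priced in dollars, minus what the AI run cost."""
    human_saving = (engineer_hours_before - engineer_hours_after) * hourly_cost_usd
    return human_saving - model_cost_usd - infra_cost_usd
```

For example, with a placeholder rate of 100 dollars an hour, a task that drops from 40 engineer hours to 6 hours of review, at a model cost of 30 dollars, gives `net_saving_usd(40, 6, 100.0, 30.0)`, which is 3370.0. The 30 dollars barely registers; the hours of review that remain are the real cost to watch.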
The takeaway
If I had to compress the whole panel thread into one idea, it would be this:
AI adoption is not about giving everyone a smarter tool. It is about building a learning system.
Think of AI as:
- an operating system, not an app
- a pipeline, not a prompt box
- a continuum, not a one-off session
- a company capability, not an individual trick
Start small.
Build minimum context. Build minimum harness. Store sessions. Harvest lessons. Turn lessons into skills, hooks, agents, tests, and configuration.
Let business users prototype safely. Make engineers builders of the framework. Use testing as a loop. Build incident memory before the incident. Run pre-mortems before you need them. Measure against the metrics you already trust.
Most importantly, do not let your organization learn privately.
That is the waste.
The value is in making every useful session raise the floor for the next one.
That is how you move from AI demos to AI capability.
That is the bet I would make now.
Thumbs up if you would attend a similar panel. Heart if you would like more articles on this topic.
Comment with any topic around AI operating systems, engineering productivity, AI-enabled delivery, incident response, or business super-users that you would like me to write about next.