_zoltan_ 2 days ago [-]
"In this post, I’ll cover a third, not-so-obvious approach: building ways for the agent to validate more of its own work before a human has to step in. "
this has been an obvious thing to do since at least January (since Geoffrey Huntley published "everything is a ralph loop"), and this is how I've been working: build enough orchestration tooling to be able to automate everything: development container bringup, building it, running the unit tests, doing integration testing, and using the software as eventually an end user. then to iterate set performance goals on an already solid basis so the automated agent ("gym") can go and iterate autonomously, and let you know when it's "done".
I understand this probably does not work if you're on some subscription and not using the API (tokens burn fast), but this has been extremely productive for me.
dirtbag__dad 2 days ago [-]
You can get really far with the 20x Claude Code and Codex plans. They are many orders of magnitude cheaper than api calls.
_zoltan_ 2 days ago [-]
Agreed! Until they fit. :-)
oblio 2 days ago [-]
Enjoy that until the token economy comes crashing down on you.
SR2Z 1 days ago [-]
Anthropic is profitable. When will people stop pretending like AI has not found real applications where it creates value?
It's here to stay, and IMO once VLA-driven robots enter the real world there will be enough money to pay for the datacenters. This coding stuff is great but there are only so many engineers to sell to.
Satya Nadella once said (more or less) "if AI is so good, why doesn't it show up in GDP?"
That's gonna be the step where it shows up in the GDP. Being able to train a machine to solve any problem that can be phrased in tokens (i.e.: most of them) is going to remake society.
kami23 2 days ago [-]
This is where most of my productivity gains have come, I have a special harness I move from project to project now that does my testing orchestration, lots of my work day is setting up a prompt or two early and just letting them loop till they return evidence that the feature is working having gone through the big QA loop.
I've slowly been optimizing for token use through the stack and Claude ends up making very tight for loops for most of the process and keeping token count even lower. It's been nice. A lot of my toil at work is just gone.
osigurdson 2 days ago [-]
I can see how you could avoid regressions this way, but what do you add to your harness to prove that a new feature is working?
kami23 2 days ago [-]
I have it record a series of gifs or videos that I look over. If something looks off I'll dig into it, but I break down work into very very small chunks that are usually easily verifiable or don't require multiple steps.
Another thing I have in the general sdlc process is having it add enough logging to verify features are turned on, configured as we expected, and that becomes enough feedback for most of my features.
I've been mostly focusing on being able to replicate this across stacks greater than 3 projects so far (with the eventual goal of having an agent be able to orchestrate our complete infra stack, and this being a large component of a DR plan to rebuild).
None of this is really new for us, I'm just the most knowledgeable in my group in how the different products across teams glue together so I've been creating these rube goldbergs as a prototype, and then having it iterate on codifying the parts that don't need a constant LLM. We were blessed to have an engineer a decade ago build out tooling for local container automation that matches 95% of the deployed infra stack. That last 5% sucks when you fall into it, but that's always been a truth. I've added and expanded the tool over the years with making it act more like the deployed environment networking wise, but a lot of things don't end up working well in docker containers on M series macs when most of our complicated virtualization in our private cloud can't run on them yet...
pstorm 2 days ago [-]
I’ve been building this out too, and your comment made me realize the missing piece for me. I’ve given the agents tools to validate its own work, but I haven’t improved the experience of humans verifying the agents’ work.
kami23 2 days ago [-]
For video/image stuff I found the ability for the LLMs to use ffmpeg and imagemagick to be quite fun.
_zoltan_ 2 days ago [-]
for us it's (usually) very easy as I work on performance optimization. a non-negligible part of this is correctness and verifiability, so we already have some of that.
to give you an example just recently I've coded a feature that for our shuffle operation can report which channel did the bytes flow through (as the PR giving us the plumbing underneath has landed upstream recently). what this basically means is that you run the shuffle, you know you've shuffled X bytes (because you have stats on both ends) and then you need to attribute them to different layers. on the first iteration, the count was off. the agent went, debugged, fixed, iterated, and then it was 1.5% off. again, it went, iterated, ... and now we're fine.
part of the task description was that the breakdown must match the known amount of bytes we're shuffling, so the agent took this on as a self-verification point. so besides running our normal, boring unit tests, integration tests and end-to-end verification harnesses (to which it not only has programmatic/CLI/API access, but which are also documented in .md files for projects), it could use this criterion on top to verify.
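A minimal sketch of what such a self-verification point could look like, assuming the harness exposes a known total and a per-channel breakdown. The function name, the channel names and the tolerance parameter are hypothetical, not taken from the project described above.

```python
def attribution_matches(total_bytes: int,
                        per_channel: dict[str, int],
                        rel_tol: float = 0.0) -> bool:
    """Return True if the per-channel byte counts add up to the known total.

    rel_tol is an allowed relative error (0.0 means an exact match is required).
    """
    attributed = sum(per_channel.values())
    if total_bytes == 0:
        return attributed == 0
    return abs(attributed - total_bytes) <= rel_tol * total_bytes


# Hypothetical numbers: a breakdown that is 1.5% short of the known total.
breakdown = {"channel_a": 600, "channel_b": 385}
print(attribution_matches(1000, breakdown))  # exact match required
print(attribution_matches(1000, breakdown, rel_tol=0.02))  # 2% slack allowed
```

With an exact-match requirement, a breakdown that is 1.5% off fails the check, which is the kind of signal an agent can keep iterating against.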
looking at /usage, my API duration was 2h 43m, and on top of that:
Definitely agree that performance optimization is a good use case for LLMs. Here you have both a measurable goal / objective function and guardrails against functional regressions. It kind of closes the loop in that regard.
One thing, however, is that a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature. Therefore you could still get code degradation.
jaggederest 2 days ago [-]
> One thing however is a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complimentary in nature.
Not in the world of AI - if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough. There's no excuse at this point not to have an incredibly comprehensive test suite, to go with your other agent feedback loop constraints
osigurdson 2 days ago [-]
>> if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough.
Maybe I misunderstand, but this seems like a fairly low bar if the test suite only covers existing bugs.
I'd argue that if you aren't going to look at the code you actually need a fully comprehensive test suite - in the sense that if the tests pass, the code is correct and you don't have to look at it at all. The problem is, that isn't very quick to create it seems. Of course, if there is a way to do it quickly in a way that is reproducible by others I'd love to hear about it.
jaggederest 2 days ago [-]
I don't mean just bugs, I mean any known issues. I test infra, I test UI, I test binary protocols, you name it. There is certainly no fast way to do it, even with AI (an AI generated suite is better than nothing but not as good), and it's a serious investment, but it's worth it. Testing becomes a process of correctness checking that snowballs over time, making everything else easier and better (or else the tests need further adjustment!)
osigurdson 2 days ago [-]
Right. You mean all behaviors are tested, essentially.
So if you / team are going to implement a new feature, what does that look like? Do you write Gherkin or similar, unit tests or both? Can you provide an example of what that might look like? How much of this has changed for you since the pre-AI days?
jaggederest 2 days ago [-]
These days, yes, integration test at the high level (usually a 1-to-3 liner), then unit tests as I go, often some mocked functional tests. This is basically the same but a ton faster in the AI days, you have to hold the AI accountable and demand quality and iterate, but this weekend I've built an entire test suite for a monorepo I just started working on. It's garbage quality but better than no tests, of course, and will improve as I work.
There's also older work on github you can see over the years, a mishmash and grab bag, I would prefer if more of my work were open source but somehow most employers still default to closed source
Edit: While I'm thinking about it, the other thing you can do with AI is demand that it TDD things - I'm more of a "test all the fucking time" adherent, I don't care whether the tests are written first, but AI is perfectly happy to skate by making a tautological test unless you make it write the test first, ensure it fails correctly, make your change, and don't let it modify the test.
teleforce 2 days ago [-]
The rest of the paragraph explanations are more important.
"The goal to make longer unattended sessions safe enough to be useful without fully removing the human from the loop. It should also reduce the number of low-quality PRs your teammates have to review for details the agent should have caught itself."
>safe enough to be useful without fully removing the human from the loop
This is the fundamental concept for AI usage, assistance and adoption in every field, not only code generation.
Essentially AI including LLM, ML, DL, is just a tool, like any other automation tools operating based on the principle of expert-in-the-loop as safety and quality gatekeeper, for sensible and responsible decision making [1].
[1] Domain expertise has always been the real moat (brethorsting.com) (519 comments):
it's fine to remove the human from the loop. set a macro goal, tell the agent how you think it could go there, and let it go nuts.
with enough scaffolding around self-reflectivity and metrics, it will converge.
closeparen 2 days ago [-]
It is too bad that so many companies have built these huge microservices beasts where you cannot run much more than unit tests locally, and you pretty much have to guess at the impact of your change until you have merged it, deployed it, and turned on the feature flag for your own account / test account. This loop is slow, which is perhaps a cost they are willing to pay, but it is also not safe for agents, which will be a massive setback.
dodu_ 2 days ago [-]
> I understand this probably does not work if you're on some subscription and not using the API (tokens burn fast), but this has been extremely productive for me.
Do you have infinite money?
_zoltan_ 1 days ago [-]
compared to salaries, our bill isn't outrageous.
edfletcher_t137 2 days ago [-]
yep, this has been obvious to a lot of people for a while. especially after Cherny posted about exactly this in a massively-popular thread... four months ago: https://x.com/bcherny/status/2007179861115511237
psychoslave 2 days ago [-]
What license do you use then?
_zoltan_ 2 days ago [-]
you can pay by just volume ("API pricing")
lovich 2 days ago [-]
What’s the cost of all that though? I don’t doubt that productivity could be gained but when I see articles like the one on the Open Claw guy spending 1.3 million on tokens in a single month I am reminded of drag racing engines that can reach incredible speeds but also need to be completely rebuilt after a single race.
ip26 2 days ago [-]
Depends on the quality of your validation loop. Can the agent find the bug in a five second unit test, or does it have to run the full deployment test?
It also presents tradeoffs in compute budget. Cycles spent executing large arrays of tests could mean fewer tokens spent debugging.
lovich 2 days ago [-]
> Depends on the quality of your validation loop. Can the agent find the bug in a five second unit test, or does it have to run the full deployment test?
I am not asking about time or completeness. I am asking if this person is spending 1 dollar to make more than a dollar, or if they are spending 1 dollar to make less than a dollar.
Any other criteria is not necessary to consider, if the activity is not profitable.
oblio 2 days ago [-]
Who cares about that?
Vibes, baby, vibes!!!
lovich 23 hours ago [-]
Based on the person I was talking to’s reply, it does seem that way
_zoltan_ 9 hours ago [-]
I don't know if you guys are trolling or not, honestly. Vibe coding is misrepresented. I'm building extremely complex features that are grounded in our codebase but fundamentally rest on the training data, which includes all academic papers on the very subject I am iterating on.
you can't just handwave this all away with "vibe baby, vibe" and then high-fiving each other that oh you're so clever because you manually write code/think the code is too high.
_zoltan_ 1 days ago [-]
I can't give exact numbers but I find the cost more than OK.
xg15 2 days ago [-]
Isn't this a bit of an incorrect usage of the term "backpressure"?
OP quoted the correct definition right at the start:
> In systems engineering, backpressure is the mechanism by which a downstream component signals upstream that it can't accept more work
(the "downstream component" being the human reviewer in this case)
But the measures they propose don't actually do that. They are more like fixed throttle elements which would slow down the rate of submissions of an agent and weed out some low-quality submissions before hitting "downstream".
I'm missing the connection to the actual capacity (or will) that the human developers have to review the submissions.
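For contrast, a minimal sketch of backpressure in the sense of the quoted definition: the downstream side (here a bounded review queue) is what tells the upstream producer to stop. The `submit` function and the queue capacity are illustrative assumptions, not something the article proposes.

```python
import queue


def submit(review_queue: queue.Queue, change: str) -> bool:
    """Try to hand a change to the reviewer; return False if they are at capacity."""
    try:
        review_queue.put_nowait(change)
        return True
    except queue.Full:
        # Downstream cannot accept more work: the producer must wait or slow down.
        return False


# The capacity stands in for how many open reviews a human is willing to hold.
reviews = queue.Queue(maxsize=2)
for name in ["change-1", "change-2", "change-3"]:
    print(name, "accepted" if submit(reviews, name) else "rejected: reviewer is full")
```

Here the rejection depends on the reviewer's actual remaining capacity, which is the connection a fixed pre-submission check lacks.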
kqr 2 days ago [-]
It is incorrect, and that annoys me probably more than it should. Lean people have talked about this for a long time:
- single-piece flow means not making large batches of things and then sending them all downstream at once, but instead working on one thing at a time so downstream has a chance to reject before too much of the wrong thing is produced.
- autonomation (or jidoka) means giving the machine the ability to detect when something is wrong and not continue at that point.
- poka-yoke is a process that forces results to be conformant by construction.
Any and all of these terms would be better than backpressure in this context.
(This made me realise that lean people have been spending decades dealing with the problems we encounter with the new robots that write code. Half of the lean philosophy is about setting up processes and structures that have positive optionality on people's creativity, without undue requirement on their level of responsibility. That's exactly what we want for robots that write code too. We want to capture the benefits of what they do well, without suffering from their innumerable mistakes. But we can't just chastise them for making mistakes, so we have to think the way lean people do.)
lucasfcosta 2 days ago [-]
Author here. Well noted. I do think backpressure might not be the ideal analogy/term.
It comes from previous posts I’ve come across, but I haven’t considered exactly what you mentioned. That’s on me.
ptx 2 days ago [-]
Maybe it's more like shift-left testing [1]? You're trying to move some checks to earlier in the process, if I understood you correctly, and get cheaper feedback loops.
Test Driven Development. There's more than several /tdd skills that are popular.
ricksunny 2 days ago [-]
lol your comment sounds like a Claude apology
root-parent 2 days ago [-]
I had a colleague called Claude. Half of the company was blaming everything on him...
leodavi 2 days ago [-]
Insincere apologies ought to be mocked to shit but this apology seemed well-meant. (I know you're not mocking them and the last sentence is actually something Claude would say.)
I post this angry comment because LLMs are colonizing the language we use for creating an earnest and genuine tone in online discussion and I sometimes wonder if the suspicion surrounding LLM-ish language is worse for the health of our online spaces than the LLM slop itself. Thinking about it, I don't think it is; and it would be impossible to measure anyway.
senderista 2 days ago [-]
Honestly? The point lands.
jeffbee 2 days ago [-]
It is an incorrect use of what was already a flawed metaphor. Pressure is isotropic. Directed pressure makes no sense, like all other fluid analogies in unrelated fields of engineering.
brookst 2 days ago [-]
Wait so cross ventilation, where a breeze will flow through a house if windows are open on opposite sides at a much greater rate than if windows are only open on the upwind side… isn’t really a thing?
bmm6o 2 days ago [-]
I took the analogy to be about the location of the pressure and not the direction. If you allow pressure to build on the input pipe when you can't accept more, the component that is upstream in the flow is able to observe that and respond. Maybe the difference is I envisioned a series of pipes and not a single one.
marcosdumay 2 days ago [-]
The act of "making pressure" means applying a force and is completely directional.
DoctorOetker 2 days ago [-]
so this is about lower or upper back pressure?
pshirshov 2 days ago [-]
A very long post about a simple and very obvious idea with many different implementations.
The three main problems are 1) API usage is deadly expensive 2) Claude is about to make all automation very expensive 3) all the flows where a model has the initiative are strictly biased towards unwarranted stops (checkpointing).
Also, I won't call that "backpressure", there is no producer-consumer imbalance or anything similar. From what I can see, the author just proposes a structured feedback loop. That's a discussion about organizational principles for systems that consist of multiple unreliable but very complex components, and this "backpressure" is just one of the aspects. Personally I find the viable system model framework productive as both a mental model and a literal implementation guideline.
Lesser problem is that agent SDKs are bad and building a custom harness is hard.
2001zhaozhao 23 hours ago [-]
You can still do backpressure and auto review workflows all you want on your CC subscription. You just have to start the task using the Claude interactive CLI, and implement your multi-agent backpressure mechanisms using Claude Code's native Subagent system with skills that tell the model to trigger them (for any subagents you want to be Claude models).
dbmikus 2 days ago [-]
To stop agents from pausing for checkpointing, you can have a deterministic outer loop that re-runs until a stop condition is met.
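A minimal sketch of such an outer loop, under the assumption that one agent run and the stop condition can each be wrapped in a plain callable. `step` and `is_done` are hypothetical placeholders (for example, a non-interactive agent invocation and a test-suite check); no specific agent CLI or SDK is implied.

```python
from typing import Callable


def run_until_done(step: Callable[[int], None],
                   is_done: Callable[[], bool],
                   max_iterations: int = 10) -> int:
    """Re-run `step` until `is_done()` is true; return the number of runs used.

    The iteration cap keeps a non-converging agent from burning tokens forever.
    """
    for iteration in range(1, max_iterations + 1):
        step(iteration)   # placeholder for one agent run
        if is_done():     # deterministic check, e.g. "do the tests pass?"
            return iteration
    raise RuntimeError(f"stop condition not met after {max_iterations} iterations")


# Toy usage: the "agent" makes progress each run, the check needs three runs.
progress = []
used = run_until_done(lambda i: progress.append(i), lambda: len(progress) >= 3)
print(used)
```

The loop itself never asks a model whether the work is finished; only the deterministic check decides.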
I think teams need to be able to write nested workflows that transition between code-led and agent-led, with either supporting human-in-the-loop checkpoints.
Been iterating on what this should look like at our startup (https://www.amika.dev/). Model labs are also improving capabilities here, such as Codex's `/goal` and Claude Code's dynamic workflows[1]
The points about API usage cost still stand, but model intelligence is getting cheaper every month! No need to use the frontier model for every part of the work.
It's hard to get that outer loop done, especially considering that Claude doesn't let you automate the harness anymore (it gets prohibitively expensive). Same for gemini. The only option is Codex.
/goal is a dynamic workflow itself, from what I know. Dynamic workflows do not hold the initiative (and can't use any libraries or I/O).
Dynamic workflows do not prevent checkpointing.
I don't see the actual point of your startup, it's a cheap idea - such as most LLM startups out there.
I don't see how models are getting cheaper - I clearly see the opposite trend.
dbmikus 2 days ago [-]
Claude Code's dynamic workflows are AI-generated JavaScript, so unlike `/goal` they can in theory import libraries and perform I/O (not sure that they can currently).
On checkpointing: I explained myself poorly. You're right that using higher level workflows doesn't turn off checkpointing. One can simply make harnesses non-interactive, but that can make models lose coherence over long tasks (because they can't ask for feedback). A higher level coordinator (/goal, CC dynamic workflows) is designed to provide this feedback without human intervention.
On price: older models keep getting cheaper, and most tasks don't need frontier capability. (I'm ignoring the part about subscription subsidies right now, and just talking about API price for tokens)
On my startup Amika: we run programmable cloud computers for agents, plus the workflow systems to guide them. We let people run any agent (Codex, Claude, etc.), prompt it from anywhere (Slack, web, CLI + SSH, API). It's like devboxes for humans + agents, with guardrails[1] to deterministically ensure things about the changes coding agents make (ie don't let agent modify module boundaries, require every DB query carry a multi-tenant org ID filter).
Maybe our website is bad at explaining it, in which case I appreciate any feedback!
> all the flows where a model has the initiative are strictly biased towards unwarranted stops
Can you elaborate on what you think causes such a bias? My experience is that Qwen3.6, Claude Sonnet 4.6 and Opus 4.6/4.7 will work as far as they can given direction and a way to test their work. My so-far limited experience with Opus 4.8 is that it does stop somewhat earlier for feedback, but in places where I am glad it is checking assumptions or where I agree with it identifying a change in scope (for example, where the following work deserves a separate commit or merge request). I would call those justified stops rather than unwarranted.
pshirshov 2 days ago [-]
Ask Claude! It will quote its constitution aka soulfile. It says the constitution instructs it to perform regular checkpointing no matter what.
The problem is not "backpressure", that's just one of the tools and there are different approaches with the same effect.
You can't express orchestration in terms of "backpressure" only, I think.
Implement-Review-Repeat loop does not involve backpressure in the strict meaning of the term.
monkpit 2 days ago [-]
> 2) Claude is about to make all automation very expensive
Wait, what happened here??
pshirshov 2 days ago [-]
They will charge use of -p and agent SDK at API rates since 14 June. So, x20..x50 price increase.
wellpast 2 days ago [-]
I’m willing to be wrong but this industry-wide emphasis on AI creative/coding workflows seems way over-engineered.
Ime successful creative execution looks like micro-iterations where each output informs the next creative move.
I can build something incredibly fast from essentially caveman grunt instructions through an LLM harness, iterating as I go.
Optimizing for feeding a huge plan to an agent sounds to me like a net waste of time. And looking over the shoulder of industry peers trying to do this, I don’t see their output or throughput as some remarkable improvement over what I can produce with minimal-fanfare usage.
Yokohiii 2 days ago [-]
LLMs are too flaky for high quality code. On tougher problems it's very common for an LLM to contradict itself and run in circles. It simply doesn't know what the right thing is, but on each turn it is super confident to do the right thing.
Maybe I've chosen hardmode to learn C with LLM assistance, plus my pet project turned out to be a bit less trivial than anticipated. But I know that I have to think three times about my choices for how to deal with C problems, and seeing how an LLM struggles to give reasonable answers is a huge red flag and forces me to think about it a fourth time.
Doing all this with a fast autonomous workflow with just little user guidance is asking for trouble.
coffeefirst 2 days ago [-]
It’s not just you. My last dumb pilot program making a Pocket clone in Python also got stuck in a loop regularly, which should be its strong suit...
I suspect that the “right” way to use LMs in coding, including accounting for focus, control, and costs, is not a settled debate. We probably haven’t even seen the best ideas yet. But I really dislike the maximalist approach.
doom2 1 days ago [-]
> Maybe I've chosen hardmode to learn C with LLM assistance
May you speak a little more about how you're approaching this? I was thinking of doing similar
Yokohiii 10 hours ago [-]
Well my process was/is quite messy. It started out as an experiment for a desktop app (which turned out to be a bit more complex than anticipated), mostly using copy and paste to try different approaches for a working prototype.
The progress is bottlenecked by how things are done with C. Most features went through a considerable LLM research phase to find the sweet spot for how I want to solve them. In the feature phase I've coded more and more on my own and used the LLM as a reviewer on the critical parts.
I know a few things about C, but my knowledge is very passive; still, there is already a path carved in my brain by influential C programmers you find on yt and blogs. That being said, LLMs will actively gatekeep and bullshit you about C and system architecture. You need to drill down hard, from different angles, i.e. what is efficient and what is pragmatic, or ask how the hardware actually works. They may still have blind spots and casually keep things from you. Naively asking for code snippets gets you tutorial-style code and often backfires; it's not unheard of for them to rant about their own code in another session. So you need to maintain a healthy balance of knowledge from trustworthy humans and the quickly available, but potentially inaccurate/incomplete, LLM knowledge. Be critical, you are the captain.
LLMs are not bad at reviewing C code. They catch a lot of noob mistakes and are more helpful than compilers. But again they tend to produce tutorial solutions. They litter everything with malloc, which is something I am trying to avoid. I've not touched arenas yet, I want to fall flat on raw C to value them. They can teach you arenas, but I suggest cross-checking with the real world, especially because you want ergonomics and correctness on that front.
Most C adjacent topics are easy to pick up on the go. Build systems, debugging, API vs ABI, static vs dynamic linking, macros, etc. A seasoned programmer can get a quick overview and ad hoc help, for things that are often buried in mediocre docs.
Also, if you happen to come from higher-level or dynamic languages, expect C to require easily 5x more code for literally anything.
If you prefer to learn the language more structured, work your way towards learning how to create data structures and algorithms in C. Make a clear distinction between static and dynamic allocation and learn along the road how C/hardware wants you to deal with memory.
0x696C6961 2 days ago [-]
Yeah it's wild watching so many people decide waterfall is great all of a sudden.
NamTaf 2 days ago [-]
Never mind stumbling into proper engineering principles like having documented, testable requirements specifications.
dcrazy 2 days ago [-]
I’ve been pretty happy with this side effect of the agentic coding bubble.
NamTaf 2 days ago [-]
As a non-tech engineer (mechanical, trains) it's fascinating seeing what is essentially the "not real engineers" SWE crew finally pay the piper because they've invoked what is in essence a non-compliant, cost-focused subcontractor and now need all of the same engineering rigours they never previously understood.
bluGill 2 days ago [-]
Every large project is coming back to waterfall. While the problems are certainly known and it was ultimately developed as a straw man, everything else ends up working worse. That said, you shouldn't be thinking of pure waterfall as it's drawn up in the strawman, but rather of a waterfall variation with feedback loops. But in the end, in very, very many cases, you have to know an end date in order to get things done because so many other things depend on you being done at the same time. If something is going to get done sooner you can't use it anyway without all the other pieces.
NamTaf 2 days ago [-]
ITT we discover that project managers actually serve a purpose and not all of them are the stereotypical useless roles Dilbert riffed on.
0x696C6961 2 days ago [-]
"waterfall variation with feedback loops" lol next we're going to have "agile where you plan everything up front"
bluGill 2 days ago [-]
Pretty much what most agile ends up. Plans are worthless but planning is a valuable exercise. (Attributed to various generals)
piazz 2 days ago [-]
100% agree. Took me ages of working with the agents to circle back around this, which was the best way to get work done before AI automation anyways.
mewpmewp2 2 days ago [-]
For me it's usually that I start with a single agent, but then I won't have anything to do while it is churning and I have other ideas/features that keep building up that I want to do, so I need to scale, and while I'm scaling I need to start to have those workflows, so eventually I end up with many agents, most which are autonomous working on their own worktrees, but I will have a specific agent that I will talk to more iteratively.
So e.g. I may have 1 agent that I ask and iterate on with directly, and 9 agents that work separately on their own.
I will utilize this 1 agent on features I care most about and want to guide and iterate on in as much detail as possible.
artursapek 2 days ago [-]
I agree. I have gotten an incredible amount of work done iterating with 5-30 minute long agent tasks. But it requires I stay engaged, and not go chill on the beach, which I guess is a lot of agentmaxxers’ goal.
bunderbunder 2 days ago [-]
I suspect that letting agents spin away unattended for long stretches of time will become less and less popular as more and more companies blow their token budgets and start requiring some answers to difficult questions before agreeing to further loose the purse strings.
fassssst 2 days ago [-]
The trick is to have 5 of those huge plans running in parallel.
_zoltan_ 2 days ago [-]
you do not need micro iterations. you can set macro goals and let the agent/LLM/model whatever you want to call it figure it out.
it works.
denysvitali 2 days ago [-]
This seems to be the coding agents 101: build a strong feedback loop. Am I missing something?
artursapek 2 days ago [-]
Yeah I don’t really see the backpressure analogy here - it implies that the agent is constantly producing new stuff, which isn’t really possible since the solution is very detailed specs/goals.
lucamark 2 days ago [-]
No, there is nothing new in this article. This methodology is already adopted - and further optimized - by the sw community.
jon-wood 2 days ago [-]
This is what hooks[1] are for, except hooks allow specifying criteria in certain conditions (like the agent believing it’s done and ready to hand back to the user) in a manner that the agent won’t just forget about once it’s a few turns deep, and doesn’t require triggering a whole other LLM instance to read some plain text instructions while you hope it interprets them correctly.
It absolutely makes sense to have a system in place that allows the code generated by an LLM to be automatically validated but there’s no need to resort to a non-deterministic system for these sort of deterministic pass/fail conditions.
I was thinking the exact same thing. There are multiple places to implement hooks (git hooks, Claude hooks, etc.).
One thing I've been wondering about is how to reliably protect specific portions of the system from unexpected/unnecessary change (for example, a failing test that Claude decides to comment out or rewrite to get it to pass). My only thought for this was to automatically revert test changes during specific portions of the implementation, but that feels overly rigid and potentially prevents things like refactoring code.
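One less rigid alternative to automatically reverting test changes is to detect them and flag them for review. A minimal sketch, assuming tests live in files matching `test_*.py`; the function names and the pattern are illustrative assumptions, not part of any hook system mentioned here.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path, pattern: str = "test_*.py") -> dict[str, str]:
    """Map each protected file (relative path) to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob(pattern))
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return protected files that were added, removed, or modified."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Taking a snapshot before the implementation phase and comparing afterwards gives a deterministic list of touched tests, so a refactor that legitimately changes tests can still go through, but only after someone looks at it.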
EMM_386 2 days ago [-]
I always use a standard workflow and it has never been a problem.
- Define the task and the goal, write a short spec document (markdown is fine)
- Point the agent at it in plan mode and have it write the plan to disk with phases. Iterate on its plan if necessary here and now.
- Have each agent tackle a phase and have it update it as a living document (switch models if some phases are more difficult than others)
- Clear and repeat until done
I've never had to overcomplicate this and it's worked both on enterprise-scale projects and personal projects. I am not sure what I'm missing - if anything.
visarga 2 days ago [-]
I think what you are doing is good, I also have a similar workflow, but the idea here is to automate some of your manual approval work with coded tests. Since they are easy to generate, have as many as possible, think hard about what to test for, and the agent will deviate less and be more autonomous.
cadamsdotcom 2 days ago [-]
Everyone looking into this and other verification should be moving away from long prompts and complex skills, and looking into hooks.
If you put all these checks in your stop hook and your git commit hook, your repo docs can tell your agent that checks will run automatically when it stops work, and it should fix any problems found.
It’s wonderful to reintroduce determinism at the QA end of your process. I find it very calming to know the agent can’t skip or forget to check its work because with hooks the checks are run by the harness.
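A rough sketch of what such a hook script could look like, assuming the hook just runs a fixed list of commands and reports failures. The `CHECKS` entries are placeholders, not real project commands, and `run_checks` and `main` are hypothetical names.

```python
import subprocess
import sys

# Placeholder commands; substitute your project's linter, type checker, and tests.
CHECKS = [
    [sys.executable, "-c", "print('lint ok')"],
    [sys.executable, "-c", "print('tests ok')"],
]


def run_checks(checks):
    """Run each command and return (command, returncode, output) for every failure."""
    failures = []
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((cmd, result.returncode, result.stdout + result.stderr))
    return failures


def main(checks=CHECKS):
    """Print failures to stderr and return a process exit status (0 means all passed)."""
    failures = run_checks(checks)
    for cmd, code, output in failures:
        print(f"FAILED ({code}): {' '.join(cmd)}\n{output}", file=sys.stderr)
    return 1 if failures else 0
```

Called as `sys.exit(main())` from a git pre-commit hook, a non-zero status aborts the commit. Which exit status an agent harness treats as blocking for its stop hook is something to check in that harness's documentation.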
manmal 2 days ago [-]
I think pi-subagents (which can form arbitrarily long chains of subagents, with up to 8 in parallel) and Claude Code's new workflows feature are quite convenient abstractions that can be set up quickly.
jasonlotito 2 days ago [-]
100% agree. None of this who watches the watchmen thing. Force it as a pre-commit hook. The best part there is it means you don't have to hope other people have set up their agents in the same way. It just works, every time.
vermilingua 2 days ago [-]
> It should also reduce the number of low-quality PRs your teammates have to review for details the agent should have caught itself.
Oh boy.
Alifatisk 2 days ago [-]
Care to elaborate?
atq2119 2 days ago [-]
That quote shows an utter disregard for basic human decency.
It is the responsibility of the person running the coding agent to make sure the resulting PRs are high quality. Putting that on your team mates, or worse, random open source project maintainers on the internet, is the definition of an extractive contribution.
ovi256 2 days ago [-]
It seems the OP agrees with you, and he's proposing a method for how to do so using agents.
erooke 2 days ago [-]
> It is the responsibility of the person running the coding agent to make sure the resulting PRs are high quality.
And
> he's proposing a method for how to do so using agents
Are not in agreement. The claim being made is that you shouldn't be sending PRs you haven't personally vetted to be high quality. Definitionally a bot cannot be used to personally vet something.
khafra 2 days ago [-]
This is not a contradiction; it's an augmentation. As an operations guy, I can tell you that well-constructed automation to reduce the amount of manual checking a human has to do almost always increases the quality of the overall process's output.
atq2119 1 days ago [-]
Of course it does, but that's beside the point.
As a software developer, you must never subject your team mates to a PR that you yourself believe to be low quality. The point of code review by others is to catch things that you missed.
There are multiple lines of defense for quality. Yes, automation can and should be one of them, but your own self-review always has to come before review by your team mates.
khafra 17 hours ago [-]
And for a dev, that's essential professional ethics, and good personal pride as a craftsman.
However, from an operations perspective, a dev is a piece of the qa pipeline with a nonzero error rate, and an optimal throughput rate, above which that error rate rises dramatically.
As a dev, you'll never merge a bad PR; in ops, we want to help you with that goal, and also have plans for what happens when it fails.
deadbabe 2 days ago [-]
At work, there is a way to combat this behavior: approve everything without even reading the code.
mpalmer 2 days ago [-]
They are probably reacting to the laughable idea that by making PRs 20% better (or whatever), devs will continue to review the code with sufficient rigor to catch even the bugs they're supposedly now preventing. Assuming such rigor was ever present in their work!
Put another way, who are they supposed to hire to tell these low quality PRs apart from the high quality ones? Who even knows how to do something like that?!
mcint 21 hours ago [-]
The overriding of click behavior is quite annoying. 30 years of browser user-agent behavior.
Next and Vercel already handle this correctly. It takes special effort to violate "least surprise" here. Cmd-click on a link should open it in a new tab.
It does appear to be an issue with SimpleAnalytics, now Adobe's,
This gives me the best of both worlds, hand curated reviews and automation. I often get the best quality if I do both, with an agent doing a pass first.
tim-projects 1 days ago [-]
I'm building a tool that automates most of this. What the author didn't even touch on is just how much AI cheats.
The more guardrails you provide the more it cheats.
AI is like a wild animal that needs to do something, and it takes a fair bit of work to corner it. And only when it's cornered and at the point of giving up, can you then offer it a way out.
If you don't do what I said, I can guarantee it's fooling you somehow.
mark_l_watson 2 days ago [-]
Interesting ideas for generalizing goals to reduce human labor in human <—> agent interactions. That said, maybe it is better to set up customized skills and infrastructure for large projects? At our early stage of trying to capture value of agentic systems, the good ideas in this article might be premature optimization.
yearesadpeople 2 days ago [-]
If the system's invariants are well defined, and a suite of conformance + requirements tests (ensuring the invariants are respected) is defined, wouldn't this be a broad - _'base case'_ - approach in general?
socketcluster 2 days ago [-]
I've been advocating for this approach for years. It's useful for any kind of data processing. You can't avoid race conditions without using some kind of queueing mechanism and you need backpressure to measure queue capacity. I built this into every aspect of https://socketcluster.io/ - From pub/sub channels, RPCs to event listeners.
If that's the third, then I have a fourth. Self plug obviously, but I figured that I'd like something between smart autocomplete and an agent -
an autocomplete that has wider context.
Called it rik, and it's on GitHub if anyone's interested in checking it out.
interesting idea, unfortunately programming the structure is equivalent (P=NP) to just programming itself. same as TDD.
as usual, the tool isn't really doing what's listed on its label.
however, people are different so this might improve someone's capability to deploy LLMs. might even provide better evidence where actual brain power is needed.
einpoklum 2 days ago [-]
In other words: Spending more tokens is all you need.
The main kind of pressure I'm feeling is the pressure of the giant AI, GPU & datacenter companies with their insane capital expenditure and circular deals, trying to get enough people to develop an expensive reliance on their service. And the more expensive, the better, so don't just pay for the LLM to code for you, have another LLM interact with the first LLM and pay double, treble, 5x or whatever. Then you can get the most refined slop.
Slowing down development is the wrong goal. I see a desire for slowness come up a lot with developers. If you pursue that goal all the way to its logical conclusion then eventually you would stop all coding completely. Which would prevent new bugs but obviously we can't do that and keep our jobs.
By all means add tons of quality gates to your SDLC pipeline. But thinking about slowness purely for the sake of slowness will not solve your problems.
apsurd 2 days ago [-]
If AI makes code a commodity, then why is the prescription for everyone to ship even more code even more urgently?
My gut reaction, as a professional developer, to my (previous) company's AI mandates was an instinctive "wait but..." -- it didn't logic out to me. Now that I have much more AI experience under my belt, I understand the tension, it's a superpower and net-net ok so more features and more "stuff" will be built. But it's a very hard thing to balance. It's always been a bad idea for a company to position themselves as the one with more "stuff" in it.
bilbo-b-baggins 2 days ago [-]
Bro just rediscovered software best practices and thinks it's a novel AI thing.
Fuck, we’re so cooked.
jwpapi 2 days ago [-]
Such a fantasy, it leads to two problems.
Increased complexity of your systems.
Increased pipelines of your system.
You might reduce the likelihood of errors, but at a disproportionate cost in the time it takes to complete (which some might argue is irrelevant, but it has the cost of human context), and with much more time and focus needed for all the bugs the system doesn't catch.
You’ll have to fix, adapt, and maintain all your verification layers, because just because you set them up doesn't mean they are perfect.
Your testing pipeline becomes incredibly slow and you need to maintain it as well.
It’s tremendously weaker than a hands-on approach.
I wrote this exact same article in January and have since completely switched my position.
Good luck to everyone trying this. You're shoveling your own grave and wasting time.
dnnddidiej 2 days ago [-]
Oh this is 101. Anyone not doing this? If not do it now!
Arodex 2 days ago [-]
Because your token use explodes?
dnnddidiej 2 days ago [-]
Can be done on a claude pro. But if you are low on tokens then yeah probably stick to more of a non-thinking copilot type arrangement (which is fine!)
slow_typist 2 days ago [-]
Who is going to write tests? But I like the fact that this approach implicitly approves of the stochastic parrot model. I mean, given enough computing power and sufficiently well made tests, I could just generate random strings of increasing length until one compiles into a program that passes all tests, mission accomplished. Like one million apes typing on one million typewriters.
lofaszvanitt 1 days ago [-]
Lipstick on a pig.
jasonlotito 2 days ago [-]
I feel like a lot of people just forget you can put this stuff in pre-commit hooks. This forces the AI to deal with issues. You don't have to hope and pray it remembers your "Pretty please, check your work" markdown file.
A pre-commit hook has been wonderful. Sure, you can add instructions, but pre-commit hooks are where you want to put the guards.
I've slowly been optimizing for token use through the stack and Claude ends up making very tight for loops for most of the process and keeping token count even lower. It's been nice. A lot of my toil at work is just gone.
Another thing I have in the general sdlc process is having it add enough logging to verify features are turned on, configured as we expected, and that becomes enough feedback for most of my features.
I've been mostly focusing on being able to replicate this across stacks greater than 3 projects so far (with the eventual goal of having an agent be able to orchestrate our complete infra stack, and this being a large component of a DR plan to rebuild).
None of this is really new for us, I'm just the most knowledgeable in my group in how the different products across teams glue together so I've been creating these rube goldbergs as a prototype, and then having it iterate on codifying the parts that don't need a constant LLM. We were blessed to have an engineer a decade ago build out tooling for local container automation that matches 95% of the deployed infra stack. That last 5% sucks when you fall into it, but that's always been a truth. I've added and expanded the tool over the years with making it act more like the deployed environment networking wise, but a lot of things don't end up working well in docker containers on M series macs when most of our complicated virtualization in our private cloud can't run on them yet...
to give you an example, just recently I've coded a feature that for our shuffle operation can report which channel the bytes flowed through (as the PR giving us the plumbing underneath has landed upstream recently). what this basically means is that you run the shuffle, you know you've shuffled X bytes (because you have stats on both ends) and then you need to attribute them to different layers. on the first iteration, the count was off. the agent went, debugged, fixed, iterated, and then it was 1.5% off. again, it went, iterated, ... and now we're fine.
part of the task description was that the breakdown must match the known amount of bytes we're shuffling, so the agent took this on as a self-verification point. so besides running our normal, boring unit tests, integration tests and end-to-end verification harnesses (to which it not only has programmatic/CLI/API access, but which are also documented in .md files for projects), it could use this criterion on top to verify.
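as a sketch of what that self-verification criterion amounts to (not the actual harness; the function names and the channel labels in the usage note are made up for illustration): sum the per-channel attribution and compare it to the known total.

```python
def attribution_error(total_bytes, per_channel_bytes):
    """Relative gap between the known total and the sum of the per-channel breakdown."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    attributed = sum(per_channel_bytes.values())
    return abs(attributed - total_bytes) / total_bytes


def breakdown_matches(total_bytes, per_channel_bytes, tolerance=0.0):
    """True when the breakdown accounts for the total within the given relative tolerance."""
    return attribution_error(total_bytes, per_channel_bytes) <= tolerance
```

with a total of 1000 bytes and a breakdown of `{"channel_a": 600, "channel_b": 385}`, the relative error is 15/1000, i.e. the "1.5% off" situation, and `breakdown_matches` stays false until the breakdown sums to the total.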
looking at /usage, my API duration was 2h 43m, and on top of that:
One thing, however, is that a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature. Therefore you could still get code degradation.
Not in the world of AI - if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough. There's no excuse at this point not to have an incredibly comprehensive test suite, to go with your other agent feedback loop constraints
Maybe I misunderstand, but this seems like a fairly low bar if the test suite only covers existing bugs.
I'd argue that if you aren't going to look at the code you actually need a fully comprehensive test suite - in the sense that if the tests pass, the code is correct and you don't have to look at it at all. The problem is, that isn't very quick to create it seems. Of course, if there is a way to do it quickly in a way that is reproducible by others I'd love to hear about it.
So if you / team are going to implement a new feature, what does that look like? Do you write Gherkin or similar, unit tests or both? Can you provide an example of what that might look like? How much of this has changed for you since the pre-AI days?
You can find some open source examples on github, either directly https://github.com/pgdogdev/pgdog/commits/main/?author=jagge... or through my profile - that repo has a pure-sql integration suite I wrote essentially entirely with AI: https://github.com/pgdogdev/pgdog/tree/main/integration/sql
There's also older work on github you can see over the years, a mishmash and grab bag, I would prefer if more of my work were open source but somehow most employers still default to closed source
Edit: While I'm thinking about it, the other thing you can do with AI is demand that it TDD things - I'm more of a "test all the fucking time" adherent, I don't care whether the tests are written first, but AI is perfectly happy to skate by making a tautological test unless you make it write the test first, ensure it fails correctly, make your change, and don't let it modify the test.
"The goal to make longer unattended sessions safe enough to be useful without fully removing the human from the loop. It should also reduce the number of low-quality PRs your teammates have to review for details the agent should have caught itself."
>safe enough to be useful without fully removing the human from the loop
This is the fundamental concept for AI usage, assistance and adoption in every field, not only code generation.
Essentially AI including LLM, ML, DL, is just a tool, like any other automation tools operating based on the principle of expert-in-the-loop as safety and quality gatekeeper, for sensible and responsible decision making [1].
[1] Domain expertise has always been the real moat (brethorsting.com) (519 comments):
https://news.ycombinator.com/item?id=48340411
with enough scaffolding around self-reflectivity and metrics, it will converge.
Do you have infinite money?
It also presents tradeoffs in compute budget. Cycles spent executing large arrays of tests could mean less tokens spent debugging.
I am not asking about time or completeness. I am asking if this person is spending 1 dollar to make more than a dollar, or if they are spending 1 dollar to make less than a dollar.
Any other criteria is not necessary to consider, if the activity is not profitable.
Vibes, baby, vibes!!!
you can't just handwave this all away with "vibe baby, vibe" and then high-fiving each other that oh you're so clever because you manually write code/think the code is too high.
OP quoted the correct definition right at the start:
> In systems engineering, backpressure is the mechanism by which a downstream component signals upstream that it can't accept more work
(the "downstream component" being the human reviewer in this case)
But the measures they propose don't actually do that. They are more like fixed throttle elements which would slow down the rate of submissions of an agent and weed out some low-quality submissions before hitting "downstream".
I'm missing the connection to the actual capacity (or will) that the human developers have to review the submissions.
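To make the distinction concrete, here is a minimal sketch of backpressure in the quoted sense, where the signal comes from downstream capacity and not from a fixed gate. `ReviewQueue` and its methods are hypothetical names, and `capacity` stands in for however many open PRs the reviewers say they can handle.

```python
import queue


class ReviewQueue:
    """Bounded queue of PRs awaiting human review; capacity models reviewer bandwidth."""

    def __init__(self, capacity):
        self._pending = queue.Queue(maxsize=capacity)

    def submit(self, pr_id):
        """Try to enqueue a PR. False is the backpressure signal: the agent should pause."""
        try:
            self._pending.put_nowait(pr_id)
            return True
        except queue.Full:
            return False

    def review_next(self):
        """A human picks up one PR, freeing a slot. Returns None if nothing is pending."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None
```

With `capacity=2`, a third `submit` returns False until a reviewer calls `review_next`. The agent's submission rate is then tied to how fast humans actually review, which is the connection the proposed measures lack.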
- single-piece flow means not making large batches of things and then sending them all downstream at once, but instead working on one thing at a time so downstream has a chance to reject before too much of the wrong thing is produced.
- autonomation (or jidoka) means giving the machine the ability to detect when something is wrong and not continue at that point.
- poka-yoke is a process that forces results to be conformant by construction.
Any and all of these terms would be better than backpressure in this context.
(This made me realise that lean people have been spending decades dealing with the problems we encounter with the new robots that write code. Half of the lean philosophy is about setting up processes and structures that have positive optionality on people's creativity, without undue requirement on their level of responsibility. That's exactly what we want for robots that write code too. We want to capture the benefits of what they do well, without suffering from their innumerable mistakes. But we can't just chastise them for making mistakes, so we have to think the way lean people do.)
It comes from previous posts I’ve come across, but I haven’t considered exactly what you mentioned. That’s on me.
[1] https://en.wikipedia.org/wiki/Shift-left_testing
I post this angry comment because LLMs are colonizing the language we use for creating an earnest and genuine tone in online discussion and I sometimes wonder if the suspicion surrounding LLM-ish language is worse for the health of our online spaces than the LLM slop itself. Thinking about it, I don't think it is; and it would be impossible to measure anyway.
The three main problems are 1) API usage is deadly expensive 2) Claude is about to make all automation very expensive 3) all the flows where a model has the initiative are strictly biased towards unwarranted stops (checkpointing).
Also, I wouldn't call that "backpressure": there is no producer-consumer imbalance or anything similar. From what I can see, the author just proposes a structured feedback loop. That's a discussion about organizational principles for systems which consist of multiple unreliable but very complex components, and this "backpressure" is just one of the aspects. Personally I find the viable system model framework productive as both a mental model and a literal implementation guideline.
Lesser problem is that agent SDKs are bad and building a custom harness is hard.
I think teams need to be able to write nested workflows that transition between code-led and agent-led, with either supporting human-in-the-loop checkpoints.
Been iterating on what this should look like at our startup (https://www.amika.dev/). Model labs are also improving capabilities here, such as Codex's `/goal` and Claude Code's dynamic workflows[1]
The points about API usage cost still stand, but model intelligence is getting cheaper every month! No need to use the frontier model for every part of the work.
[1]: https://code.claude.com/docs/en/workflows
/goal is a dynamic workflow itself, from what I know. Dynamic workflows do not hold the initiative (and can't use any libraries or I/O).
Dynamic workflows do not prevent checkpointing.
I don't see the actual point of your startup, it's a cheap idea - such as most LLM startups out there.
I don't see how models are getting cheaper - I clearly see the opposite trend.
On checkpointing: I explained myself poorly. You're right that using higher level workflows doesn't turn off checkpointing. One can simply make harnesses non-interactive, but that can make models lose coherence over long tasks (because they can't ask for feedback). A higher level coordinator (/goal, CC dynamic workflows) is designed to provide this feedback without human intervention.
On price: older models keep getting cheaper, and most tasks don't need frontier capability. (I'm ignoring the part about subscription subsidies right now, and just talking about API price for tokens)
On my startup Amika: we run programmable cloud computers for agents, plus the workflow systems to guide them. We let people run any agent (Codex, Claude, etc.), prompt it from anywhere (Slack, web, CLI + SSH, API). It's like devboxes for humans + agents, with guardrails[1] to deterministically ensure things about the changes coding agents make (ie don't let agent modify module boundaries, require every DB query carry a multi-tenant org ID filter).
Maybe our website is bad at explaining it, in which case I appreciate any feedback!
[1]: https://docs.amika.dev/guides/code-annotations
Can you elaborate on what you think causes such a bias? My experience is that Qwen3.6, Claude Sonnet 4.6 and Opus 4.6/4.7 will work as far as they can given direction and a way to test their work. My so-far limited experience with Opus 4.8 is that it does stop somewhat earlier for feedback, but in places where I am glad it is checking assumptions or where I agree with it identifying a change in scope (for example, where the following work deserves a separate commit or merge request). I would call those justified stops rather than unwarranted.
You can't express orchestration in terms of "backpressure" only, I think.
Implement-Review-Repeat loop does not involve backpressure in the strict meaning of the term.
Wait, what happened here??
Ime successful creative execution looks like micro-iterations where each output informs the next creative move.
I can build something incredibly fast from essentially caveman grunt instructions through an LLM harness, iterating as I go.
Optimizing for feeding a huge plan to an agent sounds to me like a net waste of time. And looking over the shoulder of industry peers trying to do this, I don’t see in their outputs or throughput any remarkable improvement over what I can produce with minimal fanfare usage.
Maybe I've chosen hard mode by learning C with LLM assistance, plus my pet project turned out to be a bit less trivial than anticipated. But I know that I have to think three times about my choices of how to deal with C problems, and seeing how an LLM struggles to give reasonable answers is a huge red flag and forces me to think about it a fourth time.
Doing all this with a fast autonomous workflow with just a little user guidance is asking for trouble.
I suspect that the “right” way to use LMs in coding, including accounting for focus, control, and costs, is not a settled debate. We probably haven’t even seen the best ideas yet. But I really dislike the maximalist approach.
May you speak a little more about how you're approaching this? I was thinking of doing similar
The progress is bottlenecked by how things are done with C. Most features went through a considerable LLM research phase to find the sweet spot for how I want to solve them. In the feature phase I've coded more and more on my own and used the LLM as a reviewer on the critical parts.
I know a few things about C, but my knowledge is very passive, though there is already a path carved in my brain by influential C programmers you find on yt and blogs. That being said, LLMs will actively gatekeep and bullshit you about C and system architecture. You need to drill down hard, from different angles, i.e. what is efficient and what is pragmatic, or question how the hardware actually works. They may still have blind spots and casually keep things from you. Naively asking for code snippets will get you tutorial-style code and often backfires; it's not unheard of for them to rant about their own code in another session. So you need to maintain a healthy balance of knowledge from trustworthy humans and the quickly available, but potentially inaccurate/incomplete, LLM knowledge. Be critical, you are the captain.
LLMs are not bad at reviewing C code. They catch a lot of noob mistakes and are more helpful than compilers. But again they tend to produce tutorial solutions. They litter everything with malloc, which is something I am trying to avoid. I've not touched arenas yet; I want to fall flat on raw C to value them. They can teach you arenas, but I suggest cross-checking with the real world, especially because you want ergonomics and correctness there.
Most C-adjacent topics are easy to pick up on the go. Build systems, debugging, API vs ABI, static vs dynamic linking, macros, etc. A seasoned programmer can get a quick overview and ad hoc help for things that are often buried in mediocre docs.
Also, if you happen to come from higher-level or dynamic languages, expect C to require easily 5x more code for literally anything.
If you prefer to learn the language more structured, work your way towards learning how to create data structures and algorithms in C. Make a clear distinction between static and dynamic allocation and learn along the road how C/hardware wants you to deal with memory.
So e.g. I may have 1 agent that I ask and iterate on with directly, and 9 agents that work separately on their own.
I will utilize this 1 agent on features I care most about and want to guide and iterate on in as much detail as possible.
[1] https://code.claude.com/docs/en/hooks
Next and Vercel already handle this correctly. It takes special effort to violate "least surprise" here. Cmd-click on a link should open it in a new tab.
It does appear to be an issue with SimpleAnalytics, now Adobe's,
Free debugging of how the site tweaks, breaks, the 30 year consensus web standard behavior. Good sites, good blogs, *don't override onclick for links.* Or handle it correctly. I'll leave an issue on the github.
Between your footer, and dotfiles repo, OP does seem to appreciate standards & norms, in principle.
My agent forces this workflow by disabling modifications outside the coding step.
I added looping to this not too long ago. https://github.com/hsaliak/std_slop/blob/main/docs/mail-loop...
This gives me the best of both worlds, hand curated reviews and automation. I often get the best quality if I do both, with an agent doing a pass first.
https://github.com/exlee/rik
https://pura.xyz
https://github.com/puraxyz/puraxyz/blob/main/docs/paper/main...
as usual, the tool isn't really doing what's listed on its label.
however, people are different so this might improve someone's capability to deploy LLMs. might even provide better evidence of where actual brain power is needed.
The main kind of pressure I'm feeling is the pressure of the giant AI, GPU & datacenter companies with their insane capital expenditure and circular deals, trying to get enough people to develop an expensive reliance on their service. And the more expensive, the better, so don't just pay for the LLM to code for you, have another LLM interact with the first LLM and pay double, treble, 5x or whatever. Then you can get the most refined slop.
By all means add tons of quality gates to your SDLC pipeline. But thinking about slowness purely for the sake of slowness will not solve your problems.
My gut reaction, as a professional developer, to my (previous) company's AI mandates was an instinctive "wait but..." -- it didn't logic out to me. Now that I have much more AI experience under my belt, I understand the tension, it's a superpower and net-net ok so more features and more "stuff" will be built. But it's a very hard thing to balance. It's always been a bad idea for a company to position themselves as the one with more "stuff" in it.
Fuck, we’re so cooked.
Increased complexity of your systems. Increased pipelines of your system.
You might reduce the likelihood of errors, but at a disproportionate cost in the time it takes to complete (which some might argue is irrelevant, but it has the cost of human context), and with far more time and focus needed for all the bugs where the system doesn't work.
You’ll have to fix, adapt and maintain all your verification layers, because just setting them up doesn't make them perfect.
Your testing pipeline becomes incredibly slow and you need to maintain it as well.
It’s tremendously weaker than a hands-on approach.
I wrote this exact same article in January and have since completely switched my position.
Good luck to everyone trying this. You're shoveling your own grave and wasting time.
A pre-commit hook has been wonderful. Sure, you can add instructions, but pre-commit hooks are where you want to put the guards.
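A rough sketch of what such a guard could look like as a Python hook script. The command in `GUARDS` is only a placeholder assumption (a syntax check via the standard `compileall` module); swap in whatever linters or tests your project actually uses.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Hypothetical guard list: each entry is one command that must exit 0.
GUARDS: list[list[str]] = [
    [sys.executable, "-m", "compileall", "-q", "."],
]


def run_guards(
    guards: Sequence[Sequence[str]],
    run: Callable = subprocess.run,
) -> list[list[str]]:
    """Run every guard command and return the ones that exited non-zero."""
    failed = []
    for cmd in guards:
        result = run(list(cmd))
        if result.returncode != 0:
            failed.append(list(cmd))
    return failed


def main() -> int:
    """Return 1 if any guard failed, so git aborts the commit."""
    failed = run_guards(GUARDS)
    for cmd in failed:
        print("guard failed:", " ".join(cmd), file=sys.stderr)
    return 1 if failed else 0
```

To use it as a hook, save it as an executable `.git/hooks/pre-commit` with a Python shebang and a final line `raise SystemExit(main())`. Git aborts the commit when the hook exits non-zero, which is what makes it a guard rather than an instruction the agent can ignore.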