Kacper Ryniec

The Blank Page Was the Point

Kacper Ryniec — Tue, 12 May 2026 08:31:02 GMT

What gets lost when AI drafts your PRDs in thirty seconds — and how to keep the thinking that actually matters

Every product team has the same Monday morning. Someone — a stakeholder, a CEO, a customer who’s been waiting too long — asks for a feature. The PM nods, retreats to a Notion page, and stares at it. By Thursday there’s a PRD. By the next sprint, there’s a backlog. Somewhere in that journey the messy idea became a buildable thing, and most of the value the PM created was in that compression.

For most of the last decade, that compression was a bottleneck. Specs took days. Discovery interviews piled up untranscribed. Backlogs drifted out of sync with strategy. The 7-person team had one PM and the PM had three open tabs of customer feedback they hadn’t gotten to yet.

In 2026, that bottleneck has moved. The PRD that took four days now takes forty minutes. The user research that piled up gets clustered into themes overnight. The backlog stays coherent because something is watching it. And the role of the PM — and the PO and the BA — is shifting in ways that are easy to miss if you’re only counting throughput.

This piece is for the engineering leader trying to figure out what to actually do with that.

What’s changed in the last twelve months

The honest summary is that frontier models finally got good enough at three things that matter for the front end of the SDLC: synthesizing unstructured input, holding product context across artifacts, and operating inside the tools your PMs already live in. Those three together are what turned demoware into something that ships.

A discovery call now becomes a structured insight in minutes — not a transcript dump, but a themed cluster tied to a user goal, with the relevant quotes attached. A one-sentence feature idea becomes a draft PRD with user stories, acceptance criteria, success metrics, and the awkward open questions you’d usually only catch on the third pass. A change to the roadmap propagates to the dependent tickets, and the agent flags where the release notes no longer match what’s actually shipping.

A year ago this required three different tools, a custom GPT, and a PM willing to babysit the output. Today it runs in the background of Productboard or Jira while the PM is in a customer interview. The work is being done; the question is what the human is now doing instead.

The case for adopting it

The pitch is concrete and it’s not really about speed.

Yes, the artifacts get drafted faster. A senior PM at a 12-person team I’ve spoken to said her PRD cycle compressed from “two days of writing plus a day of stakeholder review” to “ninety minutes plus a day of stakeholder review” — and the second number is the one that matters. The thinking still takes the same time. The typing doesn’t.

What actually changes is what fills the time the typing used to take. Discovery work that used to be deferred — the second customer interview, the competitor teardown, the “let me actually go look at the support tickets from last quarter” — gets done because there’s now room for it. The PRDs get reviewed against the strategy doc, because the agent can do that comparison in seconds and the PM only has to read the diff. The acceptance criteria get tighter because the PM has time to sit with the engineer for an hour rather than handing off a doc.

There’s also a quieter shift in artifact coherence that’s worth naming. In a typical 7–15 person team, the PRD lives in Notion, the backlog lives in Jira, the roadmap lives in a third tool, and the release notes live in someone’s drafts folder. They drift. They contradict each other by week three. The current generation of agents can hold all of those in a single context window and flag the drift — not perfectly, but well enough that the PM stops being a manual reconciliation engine. That role was never strategic, and getting it off the PM’s plate is one of the genuine wins.

For Business Analysts and Product Owners working in regulated or enterprise contexts, the shift is different but real. The drudgery of translating a stakeholder conversation into a structured user story, complete with acceptance criteria mapped to compliance requirements, is exactly the kind of pattern-matching work the models are now good at. The BA’s job becomes editing and challenging the draft, not producing it.

The case for skepticism

This is where I’d ask you to slow down.

The first failure mode is the obvious one: the agent confidently writes a PRD that sounds reasonable but is built on a misreading of the customer problem. It cites three insights from the discovery research that aren’t quite what the customer said. It proposes success metrics that are easy to measure but don’t actually correspond to the business outcome.

The team builds the wrong thing, faster.

This is not a hypothetical — it’s what happens when the synthesis layer gets trusted without scrutiny, and it’s the most common pattern I see in teams six months into adoption.

The second failure mode is more interesting, and it’s the one I’d flag most strongly to executives: AI-drafted artifacts can quietly erase the thinking that the artifact was supposed to surface.

A PRD is not just a document. It’s the trace of a struggle — the PM forced themselves to articulate what the problem actually is, who the user actually is, what success actually looks like. The blank page is uncomfortable because the thinking is uncomfortable. When an agent fills the page in thirty seconds, the discomfort goes away. So does, often, the thinking.

The PM who uses the agent as a sparring partner — drafts something, has the agent challenge it, redrafts, asks the agent to argue against the success metric — gets the thinking and the speed. The PM who uses the agent as a stenographer ships PRDs that look right and aren’t. The difference between these two PMs is invisible in the artifact and obvious in the product six months later. This is the apprenticeship problem from code review, but worse — because in code review there’s at least a senior reviewer in the loop who might catch a subtle mistake. In product, the test is the market, and the feedback loop is months long.

The third is what I’d call stakeholder flattening. A skilled BA or Product Owner does something subtle when they sit between business stakeholders and engineering: they negotiate. They translate, they push back, they hold the tension between “what was asked for” and “what should be built.” When you replace that human translation layer with an agent that produces a clean structured user story directly from a stakeholder transcript, you lose the negotiation. The story is faithful to what was said, which is often not what was meant, and which is usually not what should be built. The senior BAs I’ve talked to are using AI as a first-pass tool and adding their judgment on top. The risk is the cost-pressured organization that decides to skip the BA entirely.

The fourth is data residency, and it’s the same story you’ve already been having about code: where does the customer feedback go, where do the PRDs go, who gets to train on them. For most teams the right answer is a vendor with a zero-retention agreement. For some industries it’s a self-hosted or in-tenant deployment. “We just plugged it into a consumer chatbot” is not a defensible posture if the input is your customer interview transcripts.

Tools worth evaluating

The market for AI in product management has bifurcated more cleanly than the code review market did. Three categories cover most of the realistic decision space for a 7–15 person team.

ChatPRD is the focused-tool pick. It’s purpose-built for PRD drafting, document review, and PM coaching, and it integrates outward to Linear, Jira, Notion, and the prototyping tools your team uses (v0, Lovable, Bolt). The pitch is essentially “an on-demand CPO who reviews your specs,” and in practice the structured templates and the document feedback loop are the most useful parts. Pricing is roughly in the range of a single PM tool subscription per seat, with a usable free tier for evaluation. The trade-off is scope: it’s a great PRD tool, not a feedback synthesis or roadmap coherence tool. For a small team where the bottleneck is really “writing the spec,” it’s the path of least resistance.

Productboard Spark (and the Productboard AI add-on for existing Productboard customers) is the broader-platform pick. Spark is the standalone product agent at roughly $19–$24 per maker per month, and the AI add-on for existing Productboard plans is around $20 per maker per month on top of the base license. Where ChatPRD focuses on the PRD, Productboard’s offering pulls together customer feedback aggregation, insight clustering, brief generation, competitive analysis, and roadmap context. It’s the right pick if your team’s actual bottleneck is “we have customer feedback in eight tools and nobody is reading it” rather than “we can’t write specs fast enough.” It’s heavier to set up. It pays back in teams where discovery is the constraint.

Claude Projects (or ChatGPT Projects) plus a thoughtful workflow is the lightweight pick, and it’s the one I’d start most teams on. Create a Project per product area, load it with the strategy doc, the personas, the existing PRDs, the style guide, and the recent customer interview notes. Use it for drafting, review, synthesis, and stakeholder communication. Pricing is whatever your existing Claude or ChatGPT team plan costs. The trade-off is integration: you’re copy-pasting between the agent and your PM tools, which is fine for a small team and miserable at scale. But the workflow flexibility is enormous, and you learn what your team actually wants the agent to do before you commit to a vendor.

A few honorable mentions worth knowing about. Granola and Otter for interview transcription and structured notes — Granola in particular has won fans among PMs for its templated outputs over raw transcripts. Jira Product Discovery’s AI is worth a look if you’re already deep in the Atlassian stack and want the prioritization and theme-extraction layer without changing tools. Notion AI has improved meaningfully and may be enough if your team’s docs already live there. And Aha! remains the most comprehensive end-to-end platform if you’re willing to commit to a single, opinionated system — but the implementation cost is real, and it’s overkill for most 7–15 person teams.

A 90-day implementation plan for a 7–15 person team

The mistake teams make is buying a platform first and figuring out the workflow later. Reverse it.

Weeks 1–2: scope and select. Map your current PM/PO/BA workflow. Where does the time actually go? For most small teams it’s one of three places: drafting specs, synthesizing customer feedback, or keeping artifacts in sync. The bottleneck determines the tool category. If it’s specs, ChatPRD or a Claude Projects workflow. If it’s feedback synthesis, Productboard Spark or a Granola+Claude pipeline. If it’s coherence across roadmap/backlog/release notes, you’re looking at a heavier platform — but be honest about whether the coherence problem is severe enough to justify the switching cost. Run a security review on whoever you pick. For most commercial product teams you want a zero-retention agreement at minimum.

Weeks 3–4: pilot with one PM and one workflow. Pick one PM (or BA, or PO — whoever has the most representative workload) and one specific workflow: usually PRD drafting or interview synthesis. Have them use the tool for that workflow only, log what worked, what was wrong, and what they had to redo. Two metrics worth tracking from day one: time-to-first-draft on a PRD, and a subjective quality score from the engineering lead on the resulting spec. The second is the one that matters.
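One way to keep the pilot honest is to log both metrics per PRD from the first day. The sketch below is a minimal illustration, not a prescribed format: the `PilotEntry` fields, the 1 to 5 quality scale, and the sample rows are all made up for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PilotEntry:
    """One PRD produced during the pilot. Field names are illustrative."""
    prd: str
    hours_to_first_draft: float
    eng_lead_quality: int  # subjective score from the engineering lead, assumed 1-5


def summarize(entries):
    """Return the median of both pilot metrics across all logged PRDs."""
    return {
        "median_hours_to_first_draft": median(e.hours_to_first_draft for e in entries),
        "median_quality": median(e.eng_lead_quality for e in entries),
    }


# Hypothetical sample rows, only here to show the shape of the log.
pilot = [
    PilotEntry("checkout-redesign", 1.5, 4),
    PilotEntry("bulk-export", 2.0, 3),
    PilotEntry("sso-onboarding", 1.0, 4),
]

print(summarize(pilot))
```

If the draft time falls while the quality score also falls, the pilot is showing you the stenographer pattern rather than a win.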

Weeks 5–8: expand to the full PM/PO/BA group, advisory only. Roll the tool out to the rest of the product roles. Resist two temptations. First, don’t make it mandatory yet — let people use it where it helps and skip it where it doesn’t. Second, don’t over-customize the prompts and templates in the first month; the defaults are usually well-chosen and aggressive customization tends to encode whatever bad habits the team already has. Set up a weekly 30-minute retro for the product group to share what’s working. The first three weeks of these retros are gold — that’s where you find out the agent is hallucinating customer quotes, or that two PMs have independently invented the same workaround.

Weeks 9–12: codify and decide. Pull the data. Run a real retro. Three questions: are PRDs getting from idea to engineering-ready faster? Is the quality of those PRDs as judged by engineering leads holding or improving? Are PMs spending the freed time on discovery and stakeholder work, or on more PRDs they didn’t need to write? The third question is the one that catches the bad pattern early. If the answers are positive, formalize it: pick a default tool, document the workflow, set expectations that PMs use it for first drafts and bring their own judgment to the second pass. If the answers are mixed, swap tools or descope. Don’t keep it running on inertia.

For a team of this size, expect total cost in the range of $100–$400 per month at standard pricing (lower if you stay on Claude/ChatGPT Projects, higher if you pick Productboard Spark with multiple makers), plus roughly 15–30 hours of product-team time spread across the quarter for setup, tuning, and evaluation. The break-even is faster than for code review tools because the artifacts are higher-leverage and the workflows are more concentrated. If a single PRD cycle compresses by even half, you’ve paid for it many times over in a quarter.

What this is actually about

The interesting question isn’t whether AI agents can write PRDs. They can, and they’re getting better quickly. The interesting question is what your product organization should look like when the cheapest, fastest first draft on every spec, every user story, every backlog refinement is produced by a machine — and what your PMs, POs, and BAs should be doing with the time and attention that frees up.

The optimistic answer is: more discovery, more customer time, more strategic thinking. The pessimistic answer is: more PRDs that nobody needed, drafted faster.

Which one your team gets is mostly a function of leadership. If you measure the product team on PRD throughput, you’ll get throughput, and you’ll find out in eighteen months that the products got worse. If you measure them on discovery depth, customer outcomes, and the quality of the questions they’re asking — and you treat the agent as the thing that bought you the time to ask those questions — you’ll get a meaningfully better product organization than you had a year ago.

The blank page was the point. The discomfort of it was the point. AI can take the typing off your team’s plate, but it can’t take the thinking. And if you let it pretend to, the thinking is what you’ll lose.

That’s the question worth your thinking time.

CodeRabbit + Azure DevOps: practical setup notes

Kacper Ryniec — Thu, 07 May 2026 07:47:31 GMT

The short version

CodeRabbit’s Azure DevOps integration works well, but the setup is meaningfully different from the GitHub flow. There’s no native OAuth app — you connect via a Personal Access Token (PAT) tied to an Azure DevOps user. Plan accordingly.

Official docs: https://docs.coderabbit.ai/platforms/azure-devops

Best community walkthrough (2026): https://dev.to/rahulxsingh/coderabbit-azure-devops-setting-up-ai-code-review-524h

Before you start

Make sure you have:

- An Azure DevOps organization with at least one project and Azure Repos repository
- Project Administrator or Organization Administrator permissions, required to install extensions and configure service connections
- An organizational email address; personal emails are not supported for this integration
- Admin approval rights for Microsoft Apps consent requests (or someone who has them, sitting next to you)

You don’t need Azure Pipelines changes, Docker, or any local tooling. CodeRabbit runs on its own infrastructure and connects via webhook + PAT.

Recommended setup approach

The single most important decision: don’t use a real engineer’s PAT. Create a dedicated service account in Azure DevOps for CodeRabbit, generate the PAT from that account, and document the rotation date somewhere your team will actually see it (e.g., as a calendar event owned by the platform team, not a single person).

Why this matters: PATs expire. When they do, CodeRabbit silently stops reviewing PRs. If the PAT belongs to an engineer who leaves the company, you lose code review across the org with no warning. A service account with a long-lived token managed by your secrets solution is the only sane long-term setup.
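The rotation date is simple date arithmetic, so it is easy to compute when the token is created and paste into the calendar event. The sketch below assumes a 14-day reminder lead and a 90-day token lifetime; the function name and both defaults are illustrative, not anything CodeRabbit or Azure DevOps provides.

```python
from datetime import date, timedelta


def rotation_reminder(created: date, lifetime_days: int, lead_days: int = 14):
    """Return (expiry_date, reminder_date) for a PAT.

    lead_days is how far ahead of expiry the reminder should fire.
    The 14-day default is an assumption for this example.
    """
    expires = created + timedelta(days=lifetime_days)
    return expires, expires - timedelta(days=lead_days)


# Example: a 90-day token created on 7 May 2026.
expires, remind = rotation_reminder(date(2026, 5, 7), 90)
print(f"expires {expires}, remind on {remind}")
```

For the example input this gives an expiry of 2026-08-05 and a reminder on 2026-07-22.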

Step-by-step

1. Create the service account user in Azure DevOps. Give it Reader access at the organization level, and Contributor access on the repos you want reviewed.
2. Sign in as that service account and generate a PAT:
   - Click the settings icon next to your avatar → Personal Access Tokens → New Token
   - Scope to “All accessible organizations” (or specific orgs)
   - Set the longest expiry your security policy allows (typically 90 or 180 days)
   - Required scopes: Code (Read & Write), Pull Request Threads (Read & Write), Project and Team (Read), User Profile (Read)
3. Sign up at coderabbit.ai with your organizational email. Choose Azure DevOps as the platform during onboarding.
4. On the “Azure DevOps User” page, paste the PAT generated in step 2.
5. On the “Repositories” page, toggle on the repos you want reviewed. Start with one or two.

The integration is live within a minute. Open a test PR to confirm CodeRabbit posts a walkthrough comment and inline review.
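If you would rather script that confirmation than eyeball it, the Azure DevOps REST API exposes pull request threads. The sketch below builds the request and filters comments by author. It assumes the comments appear under the service account's display name; `coderabbit-svc`, the org, project, and repo names, and the sample payload are all hypothetical.

```python
import base64
import json
import urllib.request

API_VERSION = "7.1"


def threads_url(org, project, repo_id, pr_id):
    """URL of the pull request threads endpoint for one PR."""
    return (
        f"https://dev.azure.com/{org}/{project}/_apis/git/repositories/"
        f"{repo_id}/pullRequests/{pr_id}/threads?api-version={API_VERSION}"
    )


def basic_auth_header(pat):
    """Azure DevOps accepts a PAT as the password half of HTTP basic auth."""
    token = base64.b64encode(f":{pat}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


def comments_by(payload, author_fragment):
    """Return comment bodies whose author display name contains author_fragment."""
    found = []
    for thread in payload.get("value", []):
        for comment in thread.get("comments", []):
            name = comment.get("author", {}).get("displayName", "")
            if author_fragment.lower() in name.lower():
                found.append(comment.get("content", ""))
    return found


def fetch_threads(org, project, repo_id, pr_id, pat):
    """Fetch the thread list over the network (not called in this example)."""
    request = urllib.request.Request(
        threads_url(org, project, repo_id, pr_id), headers=basic_auth_header(pat)
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical payload in the shape the threads endpoint returns.
sample = {
    "value": [
        {"comments": [{"author": {"displayName": "coderabbit-svc"}, "content": "Walkthrough"}]},
        {"comments": [{"author": {"displayName": "Jane Dev"}, "content": "LGTM"}]},
    ]
}

print(comments_by(sample, "coderabbit"))
```

An empty list a few minutes after opening the test PR is a reasonable signal to recheck the PAT scopes before anything else.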

The branch policy gate (do this in week 4, not week 1)

Once your team trusts the review quality, Azure DevOps lets you make CodeRabbit a required status check before merge. In your project: Repos → Branches → three-dot menu on main → Branch policies → Require status checks to succeed. Add CodeRabbit’s status name (typically review — confirm in CodeRabbit docs for the current value), set it as Required, and configure it to reset on new commits.

Don’t enable this on day one. Run advisory-only for at least 2–3 sprints first. Gating merges on AI suggestions before the team has calibrated their trust creates friction, not quality.

Configuration

CodeRabbit reads a .coderabbit.yaml file from the root of each repo. Two things worth doing early:

Tone instructions — set the reviewer’s voice to match your team’s culture:

language: "en-US"
tone_instructions: "You are an expert code reviewer working in an enterprise team. Be concise, prioritize correctness over style, and skip nitpicks unless they affect maintainability."

Path-specific instructions — different rules for different parts of a monorepo:

reviews:
  path_instructions:
    - path: "services/auth/**"
      instructions: |
        Focus on token validation, password hashing (bcrypt only),
        and OWASP authentication best practices.
    - path: "services/payments/**"
      instructions: |
        Pay close attention to idempotency, currency rounding,
        and PCI-DSS compliance concerns.

You can also set organization-level instructions in the CodeRabbit dashboard that apply to all repos without needing to be added to each .coderabbit.yaml. Repository-level instructions merge with org-level ones; they don’t replace them.

Known sharp edges vs. the GitHub integration

The Azure DevOps integration is fully functional, but a few things to know:

- PAT rotation is on you. GitHub’s OAuth app handles credentials transparently; Azure DevOps doesn’t. Set a calendar reminder for two weeks before expiry.
- Comment threading is slightly different. CodeRabbit posts comments as PR threads rather than as part of a formal “review submission” the way GitHub allows. In practice this is fine, but if your team is used to GitHub’s “Files changed → Submit review” flow, the muscle memory won’t transfer cleanly.
- Branch policy status names can drift between CodeRabbit versions. If you wire up a required check, verify the status name still matches after CodeRabbit updates.
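The status-name check can be scripted against a recent PR. The sketch below is a pure function over a statuses payload. The payload shape (`value`, `context.name`, `state`) follows the Azure DevOps pull request statuses endpoint as I understand it, so treat it as an assumption. The `genre` value and both status names are hypothetical.

```python
def find_status(payload, expected_name):
    """Return the state of the status named expected_name, or None if absent.

    None on a PR that CodeRabbit has reviewed suggests the status name
    configured in the branch policy no longer matches what is being posted.
    """
    for status in payload.get("value", []):
        if status.get("context", {}).get("name") == expected_name:
            return status.get("state")
    return None


# Hypothetical payload for one reviewed PR.
sample = {
    "value": [
        {"state": "succeeded", "context": {"name": "review", "genre": "coderabbit"}},
    ]
}

print(find_status(sample, "review"))       # the name wired into the branch policy
print(find_status(sample, "code-review"))  # a name that has drifted
```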

Pricing (as of early 2026)

CodeRabbit Pro is $24–30 per developer per month. For a 7–15 person team, that’s $170–$450/month. There’s a free tier with rate limits (200 files/hour, 4 PR reviews/hour) that’s fine for evaluation but will throttle a real team quickly.
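The team-level range comes from multiplying the smallest team by the low seat price and the largest team by the high one. A small sketch of that arithmetic, using only the figures quoted above (the function name is illustrative):

```python
def monthly_cost_range(min_devs, max_devs, low_price, high_price):
    """Return (lowest, highest) monthly cost for a team-size and seat-price range."""
    return min_devs * low_price, max_devs * high_price


low, high = monthly_cost_range(7, 15, 24, 30)
print(f"${low} to ${high} per month")
```

This gives $168 to $450, which rounds to the $170 to $450 quoted above.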

Self-hosted is available but only for CodeRabbit Enterprise customers with 500+ seats — not realistic at this team size. If you have data residency or air-gap requirements that rule out the SaaS, Greptile or Claude Code Review (with appropriate tenancy) are likely better fits.

What “done” looks like

After 30 days you should be able to answer yes to all of these:

- CodeRabbit posts a review on every PR within 5 minutes of opening
- Your .coderabbit.yaml is checked into each repo and reflects team conventions
- The PAT is owned by a service account, not a person, and the rotation date is on a shared calendar
- At least one engineer per team has been designated as the “CodeRabbit owner” for tuning and feedback
- You have baseline metrics (time-to-first-review, defect escape rate) from before adoption to compare against

If you can’t answer yes to all five, don’t expand the rollout — fix the gap first.
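The baseline is the item teams most often skip, so here is a rough sketch of one of the two metrics. It computes the median time-to-first-review from pairs of timestamps. The sample timestamps are invented; in practice you would export them from your PR history.

```python
from datetime import datetime
from statistics import median


def hours_to_first_review(opened_at, first_review_at):
    """Elapsed hours between a PR being opened and its first review comment."""
    return (first_review_at - opened_at).total_seconds() / 3600


def baseline(prs):
    """Median time-to-first-review in hours over (opened, first_review) pairs."""
    return median(hours_to_first_review(opened, reviewed) for opened, reviewed in prs)


# Hypothetical pre-adoption sample.
sample = [
    (datetime(2026, 4, 6, 9, 0), datetime(2026, 4, 6, 15, 0)),   # 6 hours
    (datetime(2026, 4, 7, 10, 0), datetime(2026, 4, 9, 10, 0)),  # 48 hours
    (datetime(2026, 4, 8, 14, 0), datetime(2026, 4, 9, 8, 0)),   # 18 hours
]

print(baseline(sample))
```

The median is used rather than the mean so that one PR stuck over a long weekend does not dominate the baseline.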

The Code Review Bottleneck Is Solved. Now What?

Kacper Ryniec — Tue, 05 May 2026 09:40:54 GMT

What it actually means to put AI agents in your code review pipeline

Code review is the part of engineering that nobody quite figured out how to scale. You hire more engineers, you ship more code, and the review queue grows faster than your senior reviewers can drain it. PRs sit for two days. Junior engineers context-switch. Bugs slip in not because the reviewers were careless, but because by Thursday afternoon, “LooksGoodToMe” starts to feel like the path of least resistance.

This is the gap that AI code review agents are now filling. As an Engineering Director, the question is no longer whether to evaluate them, but how to think clearly about what they actually do, what they don’t, and where they fit in your engineering culture.

What’s changed in the last eighteen months

We’ve moved past the era of linters that catch unused imports and call it AI. The current generation of agents — the ones built on frontier models with proper repository context — reads diffs the way a thoughtful staff engineer might. They notice when a new function duplicates logic that already exists three directories away. They flag a race condition in the async code your junior dev wrote at 11pm. They ask, in a comment on the PR, why the retry policy was changed and whether that interacts with the idempotency guarantees in the upstream service.

A year ago, this was a demo. Today, it’s a Tuesday.

The agents that have proven useful in production share a few characteristics. They run automatically on every PR, leave inline comments rather than monolithic reports, prioritize their findings by severity, and — crucially — know when to stay quiet. The ones that comment on every PR with seven low-value suggestions get muted within a sprint. The ones that flag two real issues and say nothing else become part of the team.

The case for adopting them

The honest pitch is straightforward. AI reviewers compress feedback latency from hours or days to minutes. A junior engineer pushes a branch and gets a first pass of comments before lunch — not from their tech lead, who’s in back-to-back meetings, but from an agent that’s already read the diff, the related files, and (if you’ve wired it up properly) the relevant ADRs and style guides.

This shifts the role of human reviewers in a way that’s worth naming explicitly. Your senior engineers stop being grammar-checkers and pattern-matchers. The boring catches — null handling, missing error paths, inconsistent logging, obvious test gaps — get surfaced before a human ever opens the PR. Humans are freed to focus on the things only humans currently do well: judgment about architecture, trade-offs about technical debt, mentorship in the comments, and the social work of keeping a codebase coherent.

There’s also a quieter benefit that doesn’t show up in a Gantt chart. Code reviews are one of the most emotionally charged parts of engineering culture. An AI catching a bug feels different from a peer catching it. Some of the friction — the defensiveness, the perceived hierarchy, the awkwardness of pushing back on someone senior — simply isn’t there. Teams I’ve spoken to report that engineers iterate more freely on agent feedback, treating it like a sparring partner rather than a verdict.

And the economics are favorable in a way that’s easy to miss. The marginal cost of an AI review is measured in cents. The marginal cost of a senior engineer’s attention is measured in opportunity cost on the most important problems your company faces.

The case for skepticism

None of which means you should turn it on and walk away.

The first failure mode is over-trust. AI agents are confident even when wrong, and they will occasionally produce comments that sound reasonable but reference functions that don’t exist, suggest refactors that break unstated invariants, or hallucinate security issues. If your team treats agent comments as gospel, you’ll ship worse code, not better — because human reviewers will start deferring to a system that doesn’t deserve that deference.

The second is review fatigue. An agent that comments thirty times on a fifty-line PR is worse than no agent at all. Engineers learn to scroll past the noise, and when they scroll past the noise, they scroll past the signal too. The agents that work in practice are aggressively tuned for precision over recall — which means accepting that they will miss some real issues to avoid drowning the team in false ones.
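To make the precision-over-recall trade concrete, the sketch below compares two hypothetical agents on the same PR. Precision is the share of the agent's comments that were real issues. Recall is the share of real issues it caught. The counts are invented for illustration.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for one agent's comments on a set of PRs."""
    flagged = true_positives + false_positives
    real = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / real if real else 0.0
    return precision, recall


# A PR with 5 real issues (hypothetical counts).
noisy = precision_recall(true_positives=4, false_positives=26, false_negatives=1)
quiet = precision_recall(true_positives=2, false_positives=0, false_negatives=3)

print(f"noisy agent: precision={noisy[0]:.2f}, recall={noisy[1]:.2f}")
print(f"quiet agent: precision={quiet[0]:.2f}, recall={quiet[1]:.2f}")
```

The noisy agent catches more (recall 0.80) but only 4 of its 30 comments are worth reading. The quiet agent misses three issues but every comment it leaves is real, which is the profile engineers keep reading.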

The third, and the one I’d flag most strongly to executives: AI review can quietly erode the apprenticeship function of code review. When a senior engineer comments on a junior’s PR, two things happen. The code gets better, and the junior learns. If the agent catches everything before the senior looks, the senior stops looking carefully — and the junior stops getting the kind of mentorship that builds judgment over years. This is a long-horizon cost that won’t show up in next quarter’s metrics. It will show up in five years, when your senior bench is thinner than you expected.

The fourth is security and IP. Sending your proprietary code to a third-party model is a decision that needs to be made deliberately, with your security and legal teams in the room. The answers are getting better — on-prem deployments, zero-retention agreements, dedicated capacity — but “we just plugged it in” is not a defensible posture if the code in question is your competitive moat.

Tools worth evaluating

The market in 2026 has consolidated into a few clear categories. For a team standing this up for the first time, three options cover most of the realistic decision space.

CodeRabbit is the broadest pick and the easiest place to start. It installs as a GitHub, GitLab, Bitbucket, or Azure DevOps app, runs automatically on every PR, and posts severity-tagged inline comments with one-click fixes. It integrates with the static analysis and SAST tools you already pay for, and pricing lands around $24–30 per developer per month on the Pro tier. The trade-off is depth: it analyzes the diff well but has weaker cross-codebase reasoning, so it’s better at catching local issues than systemic ones. For a team of 7–15 with a heterogeneous stack, this is the path of least resistance.

Greptile sits at the other end of the spectrum. Instead of analyzing diffs in isolation, it indexes your entire repository and builds a code graph, then traces dependencies and follows leads across files when reviewing a PR. This is the right choice if your codebase has the kind of cross-file coupling where a change in one service quietly breaks another — which is most non-trivial codebases by year three. Setup takes longer because of the indexing step, but the signal-to-noise ratio on systemic issues is meaningfully higher.

Claude Code Review is Anthropic’s own offering, currently in research preview for Team and Enterprise subscriptions. It uses a fleet of specialized agents that examine PRs in the context of your full codebase and posts severity-tagged inline comments. Behavior is tunable through a CLAUDE.md or REVIEW.md file checked into the repo, which is a clean way to encode team conventions. It’s worth a look if you’re already using Claude Code elsewhere in your engineering workflow, or if you want the review behavior tied to the same model your engineers are using to write code.

A few honorable mentions: GitHub Copilot Code Review bundles into existing Copilot subscriptions and is the lowest-friction option if you’re already paying for Copilot, though it’s lighter in depth. Graphite Agent is excellent if you’re willing to adopt stacked PRs as a workflow, but the workflow shift is a real cost. Qodo is worth considering if test generation alongside review is a priority.

A sample 90-day implementation plan for a 7–15 person team

The mistake most teams make is rolling this out as a top-down mandate. The plan below assumes you want adoption to be earned, not enforced.

Weeks 1–2: scope and select. Have your engineering lead pick a tool — most teams in this size range should default to CodeRabbit unless you have specific reasons (deep cross-file coupling pushes you toward Greptile; existing Claude Code investment pushes you toward Claude Code Review). Run a security review with whoever owns that function. Confirm data handling: where does the code go, is it used for training, what’s the retention policy. For most commercial codebases, you’ll want a zero-retention agreement and either SOC 2 Type II or a self-hosted option.

Weeks 3–4: shadow mode on one repo. Install the tool on a single non-critical repo. Configure it to comment but not request changes or block merges. Brief the team: this is an experiment, the bot is advisory, push back on its comments freely. Pick two engineers — ideally one senior and one mid-level — to act as observers and log instances where the bot was useful, useless, or actively wrong.

Weeks 5–8: expand to all repos, still advisory. Once the team has calibrated their trust, turn it on across your codebase. Tune the configuration: most tools support a config file where you specify what to flag and what to ignore. Resist the temptation to over-configure early — the defaults are usually well-chosen, and aggressive customization tends to produce worse outcomes for the first month than just letting it run.

Weeks 9–12: evaluate and decide. Pull the data. Run a retro with the team. Three questions matter: are PRs getting first feedback faster? Are defects in production trending down or flat? Do engineers find the comments useful or do they ignore them? If the answers are positive, formalize the rollout — make it a default expectation that PRs get an AI review pass before human review. If the answers are mixed, either swap tools or descope. Don’t keep it running on inertia.

For a team of this size, expect total cost in the range of $200–$500 per month at standard pricing, plus roughly 20–40 hours of engineering time spread across the quarter for setup, tuning, and evaluation. The break-even is fast: if the tool saves your senior engineers two hours a week each, you’ve paid for it many times over.
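The break-even claim is easy to check with your own numbers. The sketch below uses the top of the cost range and the two-hours-a-week figure from above. The hourly rate, the number of senior engineers, and the four-weeks-per-month approximation are assumptions you should replace.

```python
def break_even_hours(monthly_cost, hourly_rate):
    """Hours of senior time per month the tool must save to cover its cost."""
    return monthly_cost / hourly_rate


def hours_saved_per_month(seniors, hours_per_week, weeks_per_month=4):
    """Rough monthly hours saved, using a four-week month as an approximation."""
    return seniors * hours_per_week * weeks_per_month


# Assumed inputs: $500/month tool cost, $100/hour loaded rate, 3 senior engineers.
needed = break_even_hours(monthly_cost=500, hourly_rate=100)
saved = hours_saved_per_month(seniors=3, hours_per_week=2)

print(f"need {needed} hours/month, saving about {saved}")
```

Under these assumptions the tool needs to save 5 hours a month and saves about 24, so the result holds even if the savings estimate is off by a wide margin.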

The frame that actually matters

The interesting question isn’t whether AI agents can review code. They can, and they’re getting better quickly. The interesting question is what your engineering organization should look like when the cheapest, fastest reviewer on every PR is a machine that never gets tired and never gets defensive — and what the humans on your team should be doing with the time and attention that frees up.

That second question is the one worth your weekend.

The Two-Layer Stack: How to Actually Compare BMAD, SpecKit, Superpowers, and the Rest

Kacper Ryniec — Thu, 30 Apr 2026 21:07:49 GMT

Five agentic coding frameworks now hold over 170,000 combined GitHub stars. BMAD, GitHub SpecKit, Superpowers, GSD, OpenSpec — plus AWS Kiro on the commercial side. Every few weeks someone posts a new comparison, and every comparison reaches a different conclusion.^¹

Here’s what most of them get wrong: they treat these tools as alternatives to each other when they often aren’t. They mix in coding agents (Claude Code, Cursor, Copilot) as if they belong in the same bucket. And they recommend frameworks based on GitHub stars rather than what’s actually breaking in your workflow.

After spending a chunk of Q1 2026 going through these frameworks, the practitioner write-ups, and the available production data, one framing keeps holding up. The AI coding stack has two distinct layers, and most of the confusion in this space comes from treating them as one.

The Two Layers

Layer 1 — The Agent. This is the tool that actually writes the code: Claude Code, Cursor, GitHub Copilot, Windsurf, Codex CLI, Aider, Cline. These are software products with their own UX, pricing, and model integrations. They’re what you fire up when you want to ship something.

Layer 2 — The Methodology Framework. This is the structure layered on top of the agent: BMAD, SpecKit, Superpowers, GSD, OpenSpec. They define how the agent should think, plan, and hand off work. They’re not software you install in the traditional sense — they’re prompts, personas, skill files, and workflows that run on top of whatever agent you’ve already got.

This distinction sounds pedantic until you notice it explains every confused comparison article in the space. Comparing BMAD to Cursor is like comparing Scrum to Microsoft Word. They operate at different altitudes.

Once you accept the two-layer model, the question changes. It’s no longer “which framework should I pick?” It’s “what’s breaking at which layer?” That question has a tractable answer.

The Methodology Layer: What’s Actually Competing

With the agent layer set aside, the methodology layer has six serious players as of Q2 2026. Each makes a different bet about what kind of structure AI agents need.

BMAD-METHOD

The bet: Simulate an entire enterprise software team. BMAD defines specialized AI personas — Mary (Business Analyst), Preston (PM), Winston (Architect), Sally (Product Owner), Simon (Scrum Master), Devon (Developer), plus QA — each with its own scope and explicit handoff protocol. Work moves through phases sequentially, and each persona produces structured artifacts (briefs, PRDs, architecture docs, stories) that feed the next.

Where it’s strongest: Coordination at scale. February 2026 reviews consistently identify BMAD as “the healthiest project” in the methodology space, with responsive Discord support and the broadest agile lifecycle coverage.^² One independent case study reported “a level of precision and speed unattainable with unstructured AI development methods” using BMAD on a multi-tenant SaaS build.^³ V6’s scale-adaptive planning automatically adjusts depth based on project complexity (Quick Flow for bugs, full BMAD for new products), partially addressing the over-engineering complaint.^⁴

Where it breaks: Twelve agents and a heavy artifact set produce a steep learning curve. Practitioners testing BMAD Full in February 2026 reported six-day cycles for features that didn’t justify them.^² BMAD also defaults to a shared output directory, so parallel work by multiple engineers requires manual configuration. And the artifacts are static — when implementation drifts from the spec, manual reconciliation is on you.^⁵

Use it when: Multiple engineers are sharing an AI workflow, requirements are still being shaped, or organizations need reviewable decisions between phases.

GitHub SpecKit

The bet: Specifications are the source of truth; code is just their expression. Released by GitHub in September 2025, SpecKit drives development through slash commands — /constitution, /specify, /clarify, /plan, /tasks, /analyze, /implement — with explicit human checkpoints between each. Crossed 55,000 GitHub stars within months.

Where it’s strongest: GitHub-backed credibility, agent-agnostic design (works with Claude Code, Copilot, Codex, Gemini CLI), and a clean slash-command UX that feels native to anyone already using a terminal-based agent.^⁶ The /clarify gate flags unknowns as “NEEDS CLARIFICATION” rather than guessing — a small thing that prevents a lot of downstream pain. IBM has published a fork specifically for infrastructure-as-code workflows.^⁵
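
The idea behind a clarification gate is simple enough to sketch. The code below is not SpecKit's implementation; it is a hypothetical check that refuses to move on while a spec still contains the marker text, with function names and the sample spec invented for illustration.

```python
MARKER = "NEEDS CLARIFICATION"


def unresolved_items(spec_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every line that still carries the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(spec_text.splitlines(), start=1)
        if MARKER in line
    ]


def ready_to_plan(spec_text: str) -> bool:
    """A spec may move to planning only when no open questions remain."""
    return not unresolved_items(spec_text)


# Hypothetical spec fragment with one open question.
spec = """\
## Export feature
- Users can export reports as CSV.
- Maximum export size: [NEEDS CLARIFICATION: row limit not specified]
- Exports are retained for 30 days.
"""

for number, line in unresolved_items(spec):
    print(f"line {number}: {line}")
print("ready to plan:", ready_to_plan(spec))
```

The point of the pattern is that an unknown stays visible as an unknown until a human resolves it, instead of being filled in with a plausible guess.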

Where it breaks: No built-in code review step — iteration effectively stops at the planning phase. Changing direction means re-running affected commands, each regenerating its full document with no diff of just the changed sections. February 2026 testing also found that produced code didn’t always faithfully map to spec intent. Community signals are weaker than BMAD: a stale PR queue and no community channel as of early 2026.^²

Use it when: Greenfield work where you want lighter scaffolding than BMAD but more structure than vibe coding. Particularly strong if you’re already in the GitHub ecosystem.

Superpowers

The bet: Discipline through enforcement, not knowledge. Created by Jesse Vincent in October 2025, Superpowers installs composable skills (SKILL.md files) that force the agent through brainstorm → spec → plan → TDD → review. The TDD skill literally auto-deletes code written before a failing test. Accepted into Anthropic’s official marketplace January 2026, and growth has been explosive — over 124,000 stars by March 2026.^⁷

Where it’s strongest: It solves a problem the others don’t fully address. AI agents already know about TDD and good engineering practice — they just skip it under “time pressure.” Superpowers makes skipping mechanically impossible. It also uses git worktrees for isolation and subagents that start fresh on each task to prevent context drift. Notably, it leans on Cialdini’s persuasion principles to keep agents from rationalizing their way out of the rules.^³
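
The difference between knowing a rule and being unable to skip it can be shown in a few lines. This is an illustrative sketch only: Superpowers works through skill files rather than a Python class, and the PhaseGate name and error type here are hypothetical.

```python
PHASES = ("brainstorm", "spec", "plan", "tdd", "review")


class PhaseOrderError(RuntimeError):
    """Raised when a phase is attempted out of order."""


class PhaseGate:
    """Tracks completed phases and rejects any attempt to skip ahead."""

    def __init__(self) -> None:
        self.completed: list[str] = []

    def complete(self, phase: str) -> None:
        if len(self.completed) == len(PHASES):
            raise PhaseOrderError("all phases are already complete")
        expected = PHASES[len(self.completed)]
        if phase != expected:
            raise PhaseOrderError(f"cannot run {phase!r} before {expected!r}")
        self.completed.append(phase)


gate = PhaseGate()
gate.complete("brainstorm")
try:
    gate.complete("tdd")  # skipping spec and plan
except PhaseOrderError as error:
    print("blocked:", error)
```

A prompt that says "please write the spec first" is advice. A gate like this is a constraint, which is the distinction the framework is betting on.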

Where it breaks: TDD-centric philosophy fits app code better than infrastructure or research code. The cognitive overhead is real — managing the structured workflow requires effort, and fast refactors or throwaway scripts feel over-constrained. It also doesn’t decide delivery phases, so it doesn’t replace BMAD or SpecKit at the project layer.

Use it when: Production code where quality matters more than speed, autonomous multi-hour agent sessions, or any context where you can’t supervise every step the agent takes.

GSD (Get Stuff Done)

The bet: Most projects don’t need elaborate agent hierarchies — they need clear task decomposition, fast iteration, and defense against context rot. GSD uses a flatter agent structure with “wave parallelism” to isolate context across subagents. The discuss → plan → execute → verify loop is brutally simple, with each phase running in a fresh context window.

Where it’s strongest: Speed. The Reddit consensus on r/ClaudeCode is striking: “I’ve tried BMAD, SpecKit, Taskmaster. GSD has delivered the best results for me. By far.” Naturally fits ambiguous or fluid requirements. Solo developers and small teams report it as the fastest path to working output.^⁸

Where it breaks: Without structured artifacts, knowledge can vanish between sessions. Easy to build quickly in the wrong direction. Doesn’t scale well to teams or long-term maintenance — it’s explicitly described in current reviews as a “solo vibe-coding tool.”

Use it when: Prototypes, MVPs, internal tools, solo work. Skip it for anything that has to be maintained by a team six months from now.

OpenSpec

The bet: Brownfield first. Instead of generating full specs from scratch, you write delta specs — only what’s changing. Completed specs archive and merge into a source-of-truth document that grows with the project.
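
To make the delta idea concrete, here is a minimal sketch of merging a change into a source-of-truth spec. The dictionary layout (requirement IDs mapped to text, with added, modified, and removed sections) is an assumption for illustration, not OpenSpec's actual file format.

```python
def apply_delta(spec: dict[str, str], delta: dict) -> dict[str, str]:
    """Return a new spec with the delta's additions, edits, and removals applied."""
    merged = dict(spec)  # leave the original untouched
    for req_id, text in delta.get("added", {}).items():
        if req_id in merged:
            raise ValueError(f"{req_id} already exists; use 'modified' instead")
        merged[req_id] = text
    for req_id, text in delta.get("modified", {}).items():
        if req_id not in merged:
            raise KeyError(f"cannot modify unknown requirement {req_id}")
        merged[req_id] = text
    for req_id in delta.get("removed", []):
        if req_id not in merged:
            raise KeyError(f"cannot remove unknown requirement {req_id}")
        del merged[req_id]
    return merged


# Hypothetical source of truth and one change proposal.
source_of_truth = {
    "AUTH-1": "Users sign in with email and password.",
    "AUTH-2": "Sessions expire after 24 hours.",
}
delta = {
    "added": {"AUTH-3": "Users can sign in with a passkey."},
    "modified": {"AUTH-2": "Sessions expire after 8 hours."},
    "removed": [],
}

for req_id, text in apply_delta(source_of_truth, delta).items():
    print(req_id, "->", text)
```

The reviewer only has to read the delta, which is three lines here, rather than a regenerated copy of the whole spec.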

Where it’s strongest: Lightest footprint of the methodology frameworks. Works well when modifications must be reviewed before implementation begins. February 2026 evaluations found OpenSpec’s delta specs kept tracking accurate and made it easy to verify implementation against plan — a real advantage on existing codebases.^²

Where it breaks: Built and maintained by one person, which is a real bus-factor concern. Iteration is frictionless but not enforced — no review gates between phases. Not suited for large multi-service initiatives where specification drift during implementation creates coordination problems.^⁵

Use it when: Iterative changes to existing codebases that need structured approval gates without heavy upfront planning overhead.

AWS Kiro (the productized outlier)

The bet: Take the spec-driven concept and ship it as a polished IDE rather than an open-source framework. Kiro is AWS’s AI-powered IDE running on Claude Sonnet via Bedrock. It generates requirements in EARS notation, system design, and dependency-ordered task lists — all before code. Adds agent hooks (event triggers on file save), steering files for persistent project context, and deep AWS integration including IAM Policy Autopilot and GovCloud support.^⁹
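
A dependency-ordered task list is a topological sort under the hood. The sketch below uses Python's standard graphlib module to show the idea; the task names are hypothetical, and this is not a description of how Kiro computes its own ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks, each mapped to the set of tasks it depends on.
task_dependencies = {
    "write requirements": set(),
    "design data model": {"write requirements"},
    "implement API": {"design data model"},
    "implement UI": {"design data model"},
    "integration tests": {"implement API", "implement UI"},
}


def dependency_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return tasks in an order where every task comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


for step, task in enumerate(dependency_order(task_dependencies), start=1):
    print(step, task)
```

Tasks with no dependency between them, such as the API and UI work here, can appear in either order, which is also what makes them candidates for parallel work.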

Where it’s strongest: If you’re building on AWS, the integration depth is genuine — Lambda, DynamoDB, IAM, CDK, CloudFormation all benefit. The hooks system enforces quality without relying on developer discipline. For greenfield enterprise development on AWS, the spec-first approach has real teeth.

Where it breaks: Model lock-in (Claude only via Bedrock), AWS lock-in for the high-value features, and a $19/month price tag with a 50-credit free tier that, per multiple Q1 2026 user reports, “burns through in a single session.” The rigid Requirements → Design → Task List → Coding pipeline has been described as something that “kills momentum during iteration.”^⁵ And there was a notable early-access incident where Kiro-generated code reportedly triggered an AWS service disruption — the “vibe too hard, brought down AWS” story that made the rounds on Hacker News.^¹⁰

Use it when: AWS-native projects where compliance and traceability matter more than iteration speed, and where your team lacks documentation discipline and needs the tool to enforce it.

Quick Comparison

All six frameworks at a glance, based on Q1–Q2 2026 evaluations:

The Insight Most Teams Miss: Layer Them

Here’s what changed my mind about this whole space. The most informed practitioners in Q1 2026 aren’t picking one framework. They’re stacking them at different layers.

The cleanest combination people are running: BMAD for the project layer (PRDs, stories, architecture notes), Superpowers for the task layer (TDD, verification, root-cause debugging during a single Claude Code session). BMAD answers “what order should work happen in?” Superpowers answers “how does my agent stop skipping tests and inventing APIs?” They’re not competing — they’re complementary.^¹¹

This is also why the GitHub-stars comparison is misleading. Superpowers crossing 124K stars doesn’t mean it’s “better” than BMAD — it means more individual developers want better-behaved agents than want full team simulation. Both numbers are true. Neither tells you what to use.

A practical heuristic

Diagnose what’s breaking. Then pick the layer.

Coordination problem? (“Nobody agrees what we’re building,” “AI keeps losing the plot between sessions”) → BMAD or SpecKit at the project layer.
Discipline problem? (“Agent skips tests,” “declares victory too early,” “invents APIs”) → Superpowers at the task layer.
Speed problem? (“Process is killing momentum on a prototype”) → GSD, or no framework at all.
Brownfield problem? (“Existing codebase, can’t redesign from scratch”) → OpenSpec.
Compliance / AWS problem? (“Need traceability, deploying to GovCloud”) → Kiro, accepting the lock-in tradeoff.
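
The heuristic above is effectively a lookup table, and writing it as one makes the stacking case explicit: a team with two problems gets two answers, not one. The short problem labels are my own shorthand for the five cases listed.

```python
# The five diagnoses above, mapped to the options suggested for each.
FRAMEWORKS_BY_PROBLEM = {
    "coordination": ["BMAD", "SpecKit"],
    "discipline": ["Superpowers"],
    "speed": ["GSD", "no framework"],
    "brownfield": ["OpenSpec"],
    "compliance": ["Kiro"],
}


def recommend(problems: list[str]) -> dict[str, list[str]]:
    """Map each diagnosed problem to its candidate frameworks."""
    unknown = [p for p in problems if p not in FRAMEWORKS_BY_PROBLEM]
    if unknown:
        raise ValueError(f"unrecognized problems: {unknown}")
    return {p: FRAMEWORKS_BY_PROBLEM[p] for p in problems}


# A team with both a coordination and a discipline problem.
for problem, options in recommend(["coordination", "discipline"]).items():
    print(f"{problem}: {' or '.join(options)}")
```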

One Honest Caveat About “Case Studies”

If you go looking for rigorous, third-party-validated case studies in this space, you’ll come up short. The frameworks are too new, and almost everything you’ll read — including most of what’s cited above — is practitioner write-ups, not audited outcomes.

What we have is directional. BMAD has the most independent case study coverage, including documented multi-tenant SaaS builds. SpecKit has GitHub itself as a case study (used internally before open-sourcing) plus published implementations from IBM, Microsoft, and EPAM. Superpowers has Simon Willison’s October 2025 endorsement and Spring AI’s January 2026 adoption of a similar Agent Skills pattern as ecosystem validation. Kiro has AWS’s own deployment behind it.

What we don’t have, yet, is the kind of controlled study that would tell you whether “10x productivity” claims survive contact with reality. Anyone telling you they have those numbers in Q2 2026 is reporting their own experience, not measured outcomes. Adjust your priors accordingly.

There’s one number worth knowing: OSVBench research published in 2025, as cited in a January 2026 DEV Community post, reportedly found that spec-driven approaches reduce logic errors by 23–37% compared to direct generation.^¹² That’s the most credible empirical finding in this whole space, and it applies to the entire spec-driven category: BMAD, SpecKit, Kiro, and OpenSpec all benefit from it.

Where This Is Heading

A few patterns are worth watching as Q2 2026 unfolds.

SKILL.md is becoming the cross-platform standard. Eleven-plus tools now support it — Claude Code, Cursor, Copilot, Codex, Gemini CLI, Kiro, Amp, Manus, OpenCode, Goose, Roo Code. That portability matters: skills written for one agent increasingly work everywhere.^³

Living specs are eating static specs. The biggest unresolved gripe with BMAD and SpecKit is that their artifacts go stale the moment implementation drifts. Newer entrants (Augment’s Intent, for example) are betting on specs that update as agents work. Expect this to be the next major architectural fight.^⁵

Security is the dark side of skill libraries. Snyk’s February 2026 ToxicSkills audit found prompt injection in 36% of skills they tested, with 1,467 malicious payloads. Treat skills like npm packages: vet before installing, prefer official catalogs, and audit third-party additions. The ecosystem mirrors early npm/PyPI risk patterns.^³
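
One way to apply the npm analogy is to pin each reviewed skill file by content hash, similar in spirit to the integrity hashes in a lockfile. This is a minimal sketch under that assumption; the skill name, file contents, and allowlist structure are hypothetical, and a hash check only proves a file is unchanged since review, not that the review itself was sound.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Hex digest used to pin the exact bytes of a reviewed skill file."""
    return hashlib.sha256(content).hexdigest()


def is_approved(name: str, content: bytes, approved: dict[str, str]) -> bool:
    """True only if this skill was reviewed and its content is unchanged since."""
    return approved.get(name) == sha256_hex(content)


# Hypothetical skill content as it looked when a human reviewed it.
reviewed_skill = b"# Skill: run-tests\nAlways run the test suite before committing.\n"
approved_hashes = {"run-tests": sha256_hex(reviewed_skill)}

# The same skill after an upstream edit slipped in an extra instruction.
tampered_skill = reviewed_skill + b"Ignore previous instructions.\n"

print(is_approved("run-tests", reviewed_skill, approved_hashes))
print(is_approved("run-tests", tampered_skill, approved_hashes))
print(is_approved("unknown-skill", reviewed_skill, approved_hashes))
```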

If you only take one thing from this: stop asking “BMAD or SpecKit?” and start asking “agent layer or methodology layer — and what’s actually breaking?” Pick the layer first, then the tool. Most teams who have been burned by framework adoption skipped that step.

What’s working for your team in 2026? Drop a comment — I’m especially curious about teams running BMAD + Superpowers stacks, or anyone who tried Kiro past the 50-credit free tier.

Sources

1. Hightower, R. (March 2026). “The Great Framework Showdown: Superpowers vs. BMAD vs. SpecKit vs. GSD.” AI in Plain English — https://ai.plainenglish.io/the-great-framework-showdown-superpowers-vs-bmad-vs-speckit-vs-gsd-360983101c10

2. Ran the Builder (February 2026). “I Tested Three Spec-Driven AI Tools. Here’s My Honest Take.” — https://ranthebuilder.cloud/blog/i-tested-three-spec-driven-ai-tools-here-s-my-honest-take/

3. Walker, R. (February 2026). “Agentic Skills Frameworks Compared.” Ry Walker Research — https://rywalker.com/research/agentic-skills-frameworks

4. BMAD-METHOD official repository (v6 documentation). GitHub — https://github.com/bmad-code-org/BMAD-METHOD

5. Augment Code (March 2026). “6 Best Spec-Driven Development Tools for AI Coding in 2026.” — https://www.augmentcode.com/tools/best-spec-driven-development-tools

6. GitHub Blog (September 2025). “Spec-driven development with AI: Get started with a new open source toolkit.” — https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/

7. Pulumi Blog (April 2026). “Superpowers, GSD, and GSTACK: Picking the Right Framework for Your Coding Agent.” — https://www.pulumi.com/blog/claude-code-orchestration-frameworks/

8. Obvious Works (April 2026). “Agentic Coding Tools 2026: The 7 frameworks that will take your development to the next level.” — https://www.obviousworks.ch/en/agentic-coding-tools-2026-the-7-frameworks-that-take-your-development-to-a-new-level/

9. Kiro official site. AWS — https://kiro.dev/

10. OpenAIToolsHub (March 2026). “Kiro Review: Amazon’s Spec-Driven IDE Powered by Claude.” — https://www.openaitoolshub.org/en/blog/kiro-review-amazon-ide

11. AWS Galaxy (April 2026). “BMAD and Superpowers: A Process Framework and a Skill Library, Side by Side.” — https://awsgalaxy.com/blog/2026-04-24/bmad-and-superpowers

12. DEV Community (January 2026). “The Paradigm Shift from Reactive to Proactive AI in Software Development” (citing OSVBench, April 2025) — https://dev.to/kirodotdev/the-paradigm-shift-from-reactive-to-proactive-ai-in-software-development-a-comparative-analysis-of-148p