One Agent Kept Recommending the Same 5 Themes

I pointed one Claude agent at 120 GTM articles and asked it to pick the best 8 for a newsletter. It worked for two weeks. By week three, every issue led with AI SDRs, mentioned Gartner twice, and ignored infrastructure, pricing, and org design entirely. Recency bias, hype amplification, no way to catch its own blind spots.

The obvious next move was prompt engineering. Scoring rubrics, pillar weighting, diversity constraints. The single agent improved from terrible to mediocre. But it couldn't hold multiple perspectives simultaneously -- one persona can't be both the AI-skeptic VP Sales and the tool-obsessed founder. A single context window doesn't argue with itself well.

So this is a buildlog. The system I actually run in production every week: 7 Claude sub-agents with distinct GTM perspectives, a moderator running 3-round debates, and a human signal from daily Slack curation. 120+ articles from 65 RSS feeds down to 8 newsletter picks. I'll walk through the architecture, the failures, and what I'd do differently if I started over.

When I say "evaluate," I mean full-text analysis with persona-specific reasoning. Not metadata scanning or headline matching. Each agent reads the article content, scores through their professional lens, and writes 100+ words of reasoning per article. That produces roughly 840 evaluation judgments per week across the full team, though agents agree about 60% of the time. The interesting stuff lives in the other 40% -- the disagreements are where editorial angles emerge. The output isn't a ranked list. It's a moderated conversation that produces a thesis.

The 7 Agents: A Team, Not a Leaderboard

Most multi-agent setups treat agents as independent voters -- parallel scorers who never talk to each other. That's an expensive leaderboard. The STEEPWORKS pipeline is a Claude team: agents with distinct roles, a moderator, and a human signal that shapes evaluation before the agents start.

The Builder (AI-native founder). Tools, build-vs-buy, startup strategies. When a new orchestration framework drops, the Builder reads the docs and asks whether it actually works or just demos well. Last month the Builder was the only agent to flag a workflow automation tool as "interesting architecture, zero production evidence" -- the kind of catch that keeps the newsletter from turning into a hype amplifier.

The Revenue Leader (CRO). Org-wide adoption, ROI at scale, board metrics. Filters anything that doesn't survive a CFO conversation. If content can't connect to pipeline, retention, or expansion revenue, it gets flagged as noise.

The Scrapper (Series A marketer). Channel alternatives, efficiency plays, lean execution. The Revenue Leader would never recommend a scrappy outbound sequence built on a free tool. The Scrapper would, and that's the point.

The Closer (Enterprise AE). Deal intelligence, buyer psychology, competitive edge. Evaluates whether content helps someone close a deal, not just understand a trend. The Closer cares about objection handling, champion enablement, and timing -- not thought leadership abstractions.

The Architect (RevOps Director). Stack fit, integration patterns, data flows. Asks "does this connect to anything?" not just "is this interesting?" Catches the infrastructure articles every other agent dismisses as boring but that RevOps teams desperately need.

The Contrarian (AI-skeptic VP Sales). This is the most important design decision in the system. The Contrarian has inverted scoring: a STRONG_PICK means the article exposes a problem nobody's discussing. When six agents love a piece about AI SDRs, the Contrarian asks what the failure rate looks like after month three. The incentive structure is inverted, not just the persona -- that distinction matters, and I'll explain why in the failures section.

The Editor (editorial voice). Doesn't score articles independently. Moderates the debate, synthesizes the weekly thesis, forces advocacy when agents agree too quickly, prevents premature convergence. The team's moderator, not its boss.

How the Orchestration Works

These aren't seven prompts in a loop. The evaluation phase is a single Claude Code session spawning 7 Agent tool calls in parallel. Each gets the article batch plus its persona prompt. Outputs return as structured JSON -- speech acts, confidence scores, reasoning. The debate phase is sequential: the Editor's prompt includes all 7 evaluations and runs 3 turns with specific agents pulled in by name.

The parallelism is an isolation mechanism. The Builder's excitement about a new tool can't bias the Contrarian's skepticism if they never share context until debate.
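For readers who want the shape of that orchestration, here is a minimal sketch. It assumes the Anthropic TypeScript SDK rather than Claude Code's Agent tool, a placeholder model ID, and placeholder persona prompts. The six scoring personas run in parallel; the Editor is driven afterward with every evaluation in its context.

```typescript
// Sketch only: the production system uses Claude Code's Agent tool to spawn
// sub-agents. This approximates the same fan-out/fan-in shape with the
// Anthropic SDK directly; prompts and the model ID are placeholders.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SCORING_PERSONAS = [
  "builder", "revenue_leader", "scrapper", "closer", "architect", "contrarian",
];

async function evaluate(persona: string, input: string): Promise<string> {
  const res = await client.messages.create({
    model: "claude-sonnet-4-5", // assumption: substitute your model ID
    max_tokens: 4096,
    system: `You are the ${persona} persona. Return structured JSON only.`,
    messages: [{ role: "user", content: input }],
  });
  const block = res.content[0];
  return block.type === "text" ? block.text : "";
}

export async function runEvaluationAndDebate(articleBatch: string) {
  // Phase 1: parallel, isolated evaluations. No agent sees another's output.
  const evaluations = await Promise.all(
    SCORING_PERSONAS.map((p) => evaluate(p, articleBatch)),
  );

  // Phase 2: sequential debate. The Editor sees all evaluations at once and
  // runs the 3-round protocol described later in this post.
  return evaluate("editor", JSON.stringify({ articleBatch, evaluations }));
}
```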

Why 7 and not 3? I tried 3 first -- Builder, Revenue Leader, Contrarian. The Contrarian broke the worst of the groupthink, but three agents still converged too easily: on most articles, whatever two agents liked, the third liked too, and the coverage gaps were consistent. The Architect caught integration angles the generalists missed. The Scrapper found lean-execution content the enterprise-focused agents dismissed. Going from 3 to 7 wasn't only about more coverage. It was about generating genuine editorial friction. Those additions didn't just fill gaps -- they created arguments, and the arguments produced better newsletters.

After building the debate protocol through trial and error, I found Du et al.'s research on multi-agent debate describing the same dynamics in controlled experiments -- across math, strategic reasoning, and factual verification. Heterogeneous debate reduced factual errors by 30%+ over single-agent approaches. I'd already built the pattern before I had the paper to explain why it worked.

The Pipeline: 120 Articles to 8 Picks

Stage 1: Daily Ingestion

The ingestion is boring by design. A Cloudflare Worker pulls 65 RSS feeds through Inoreader every weekday morning, covering GTM tools, AI infrastructure, revenue operations, sales methodology, and marketing technology.

The worker runs two-pass AI classification:

  • Pass 1: Scores every article on relevance (1-10)
  • Pass 2: Deep-classifies anything scoring 5+, tagging content pillar, article type, urgency, and source credibility

Everything lands in Supabase with structured metadata. A daily Slack notification summarizes the haul: "42 articles classified. 18 scored 7+. 3 flagged as breaking."
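A minimal sketch of what that daily run might look like as a scheduled Worker, assuming a hypothetical classify() helper around the Claude call, a hypothetical Inoreader fetch, and placeholder table and column names rather than the production schema:

```typescript
import { createClient } from "@supabase/supabase-js";

// Hypothetical helpers: an Inoreader fetch and a Claude-backed classifier.
declare function fetchFromInoreader(): Promise<Array<{ url: string; title: string; text: string }>>;
declare function classify(pass: "relevance" | "deep", article: object): Promise<any>;

interface Env { SUPABASE_URL: string; SUPABASE_KEY: string; }

export default {
  async scheduled(_event: unknown, env: Env): Promise<void> {
    const supabase = createClient(env.SUPABASE_URL, env.SUPABASE_KEY);
    const articles = await fetchFromInoreader();

    for (const article of articles) {
      // Pass 1: quick relevance score, 1-10.
      const { relevance } = await classify("relevance", article);

      // Pass 2: deep classification only for articles scoring 5+.
      const detail = relevance >= 5 ? await classify("deep", article) : null;

      await supabase.from("articles").insert({
        url: article.url,
        title: article.title,
        relevance,
        pillar: detail?.pillar ?? null,
        article_type: detail?.article_type ?? null,
        urgency: detail?.urgency ?? null,
        source_credibility: detail?.source_credibility ?? null,
      });
    }
  },
};
```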

About $7/month total. Cloudflare free tier, Supabase free tier, Claude API for classification.

Stage 2: My Daily Curation

Every morning, classified articles land in Slack. I spend 10-15 minutes marking them: thumbs-up, thumbs-down, or annotated -- "good contrarian angle," "this is hype," "pair with the Segment piece from Tuesday."

Here's the design choice that matters most: my picks start at 60/100 in the consensus algorithm. Agent-only discoveries cap at 70/100. The human signal is architecturally dominant -- not just a filter, but a weight that the math respects.

This isn't just a gate. The agents receive my picks and annotations as input context during evaluation. An article I flagged as "hype" gets scrutinized differently than one I tagged "underrated." The Contrarian specifically weights against articles I marked as strong consensus, which is a structural check on my own blind spots.

I tried removing this step once. The result was newsletters that were technically competent but editorially flat -- well-structured, well-sourced, and completely lacking a point of view. The agents handle scale. I handle judgment. Turns out you need both.

Stage 3: Parallel Evaluation

Claude Code's Agent tool spawns all 7 as sub-agents evaluating in parallel. No agent sees another's output at this stage -- the isolation is deliberate. Each returns structured JSON (a representative shape is sketched after the list):

  • Speech act: STRONG_PICK, RECOMMEND, CONDITIONAL, or SKIP
  • Confidence score
  • 100+ words of reasoning
  • Newsletter angle suggestion
  • Tension flags (where this view likely conflicts with others)
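A representative shape of that JSON, with field names as assumptions rather than the production schema:

```typescript
// Hypothetical shape of one agent's evaluation output. The fields mirror the
// list above; the exact names in production may differ.
type SpeechAct = "STRONG_PICK" | "RECOMMEND" | "CONDITIONAL" | "SKIP";

interface AgentEvaluation {
  agent: string;            // e.g. "contrarian"
  article_id: string;
  speech_act: SpeechAct;
  confidence: number;       // 0-100
  reasoning: string;        // 100+ words of persona-specific reasoning
  newsletter_angle: string; // suggested framing if the article is picked
  tension_flags: string[];  // where this view likely conflicts with others
}
```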

Consensus scoring runs on two tracks:

  • Track 1 (my picks): Base 60 + bonuses for annotations, agent agreement, STRONG_PICKs, and flagged disagreements
  • Track 2 (agent-only discoveries): Base 15 + per-agent bonuses, hard-capped at 70

Top 15 by consensus score advance to debate.
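A sketch of the two-track math, assuming illustrative bonus weights; only the base values and the 70-point cap come from the description above.

```typescript
// Two-track consensus sketch. The bases (60 and 15) and the 70-point cap are
// from the text; the individual bonus weights are illustrative guesses.
interface ArticleSignals {
  humanPick: boolean;            // thumbs-up in the daily Slack curation
  annotated: boolean;            // carries a written note ("good contrarian angle", ...)
  recommendCount: number;        // agents at RECOMMEND or better
  strongPickCount: number;       // agents at STRONG_PICK
  flaggedDisagreement: boolean;  // 2+ agents conflict at high confidence
}

function consensusScore(s: ArticleSignals): number {
  if (s.humanPick) {
    // Track 1: human picks start architecturally dominant at a base of 60.
    const raw = 60
      + (s.annotated ? 5 : 0)
      + s.recommendCount * 3
      + s.strongPickCount * 4
      + (s.flaggedDisagreement ? 5 : 0);
    return Math.min(raw, 100);
  }
  // Track 2: agent-only discoveries build from 15 and are hard-capped at 70.
  return Math.min(15 + s.recommendCount * 5 + s.strongPickCount * 7, 70);
}
```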

Stage 4: The 3-Round Debate

Round 1: Quick takes. Each agent states their position on the top 15. Consensus articles get locked. Contentious ones (2+ agents disagree above 75 confidence) get flagged.

Round 2: Forced advocacy. The Editor frames 3 specific disagreements and makes agents argue:

  • "Builder sees an adoption accelerant, Contrarian sees vendor lock-in. Defend your positions."
  • Agents argue with structural reasoning about what matters for GTM operators this week
  • The Editor controls turn-taking and reframes when arguments stall -- preventing both premature convergence and endless cycling

Round 3: Thesis. The Editor proposes a weekly thesis. Agents validate, push back, refine. What survives becomes the newsletter's editorial frame.
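A rough sketch of how that loop could be driven, assuming hypothetical askEditor() and askAgent() wrappers around the moderator and persona prompts; the round structure follows the protocol above, everything else is illustrative.

```typescript
// Hypothetical wrappers around the Editor (moderator) and persona prompts.
declare function askEditor(instruction: string, context: unknown): Promise<any>;
declare function askAgent(agent: string, instruction: string, context: unknown): Promise<string>;

async function runDebate(top15Evaluations: unknown[]): Promise<any> {
  // Round 1: quick takes. Lock consensus picks, flag contentious articles
  // (2+ agents disagreeing above 75 confidence).
  const round1 = await askEditor(
    "Summarize each agent's position, lock consensus picks, flag contention",
    { top15Evaluations },
  );

  // Round 2: forced advocacy on the 3 sharpest disagreements.
  const advocacy: string[] = [];
  for (const d of round1.disagreements.slice(0, 3)) {
    for (const agent of d.agents) {
      advocacy.push(
        await askAgent(agent, `Defend your Round 1 position on "${d.articleTitle}"`, d),
      );
    }
  }

  // Round 3: the Editor proposes a weekly thesis; agents validate, push back, refine.
  return askEditor("Propose a weekly thesis and refine it against the pushback", {
    round1,
    advocacy,
  });
}
```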

Real output from the March 11 run: "The GTM stack is splitting into two speeds. Most teams can only see one. The ones who sequence them pull away." That emerged from a Builder/Contrarian argument over whether cheap leads are a structural advantage or a vanity metric. Neither was wrong. The thesis distilled the productive middle.

After building this, I found Bao et al.'s work on Adaptive Heterogeneous Multi-Agent Debate -- distinct roles, dynamic coordination, consensus weighted by reliability. Different vocabulary, same mechanism I'd stumbled into.

What Broke

Four production failures the prototype never hinted at.

Scoring drift. By week 3, the Builder rated everything above 80. The persona instructions defined what "interesting" meant but not what "85 confidence" meant. Without calibration anchors, agents inflated their own scores until every article was a STRONG_PICK.

Fix: calibration examples directly in each evaluation prompt. A STRONG_PICK at 90 confidence means "this changes how I work." A RECOMMEND at 72 means "useful if it fits a theme." Specific anchors, not vibes.
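An illustrative calibration block of the kind that could be appended to each persona prompt; the first two anchors restate the definitions above, the rest is hypothetical wording.

```typescript
// Illustrative calibration anchors appended to each evaluation prompt.
// The STRONG_PICK and RECOMMEND anchors come from the text; the remaining
// lines are hypothetical examples of the same idea.
const CALIBRATION_ANCHORS = `
Calibrate your confidence against these anchors before scoring:
- STRONG_PICK, confidence ~90: "this changes how I work."
- RECOMMEND, confidence ~72: "useful if it fits a theme."
- CONDITIONAL, confidence ~55: "only worth it alongside a stronger companion piece."
- SKIP, confidence ~30: "nothing here an operator can act on."
`;
```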

Consensus gaming. When agents saw the debate-round structure, they started front-loading agreement in Round 1 to avoid being the dissenter in Round 2. Bland consensus on everything, no friction, boring newsletters.

Fix: the Contrarian's inverted incentive structure. Not just a skeptical persona, but scoring that rewards finding problems. One deliberately adversarial agent breaks the politeness loop. This is why inverted incentives matter more than inverted personality -- a skeptic who's merely contrarian will still drift toward agreement over time. One whose scoring function rewards dissent won't.

Debate degeneracy. Round 3 sometimes collapsed into "I agree with the Editor's thesis" from all agents. That's not a debate. That's a rubber stamp.

Fix: forced advocacy. The Editor now asks specific agents to defend positions they held in Round 1 even if they've shifted. "Architect, you flagged integration risk earlier. Builder says it's overblown. Make your case." Forcing agents to advocate prevents premature convergence.

Hallucination. Agents occasionally invented article details -- specific metrics, company names, feature descriptions that weren't in the source. Fix: evidence fields in the output schema require agents to cite "from_article" (direct quote or data point) separately from "from_experience" (persona-based reasoning). The separation makes fabrication visible and auditable.
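One way that split might look in the schema, with the two field names from the fix and an illustrative surrounding structure:

```typescript
// Evidence split added to the evaluation output after the hallucination fix.
// The from_article / from_experience names follow the text; the rest is illustrative.
interface EvidenceEntry {
  claim: string;                   // the assertion the agent is making
  from_article: string | null;     // direct quote or data point from the source
  from_experience: string | null;  // persona-based reasoning, kept separate
}

interface EvaluationWithEvidence {
  speech_act: "STRONG_PICK" | "RECOMMEND" | "CONDITIONAL" | "SKIP";
  confidence: number;
  evidence: EvidenceEntry[];       // fabricated details surface as claims with no from_article
}
```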

I haven't fully solved hallucination. The evidence field catches maybe 80% of it. The remaining 20% is subtle -- a real statistic from the article applied to the wrong context, or a reasonable inference stated as fact. That's where the human review pass earns its 30 minutes per week.

Research on debate scaling documents these same failure modes. The difference between a demo and a production system is whether you've hit these and built around them.

The Evolution: 1 Agent to 7

V0 (Q4 2025). Single prompt. "Pick the best 8." Readable newsletters. The same readable newsletter every week.

V1 (January 2026). Three agents: Builder, Revenue Leader, Contrarian. Added parallel evaluation and basic consensus scoring. Quality jumped immediately -- the Contrarian broke groupthink on AI hype articles. But the 3-agent system had a coverage gap: lean execution content and integration architecture were consistently missed. No debate yet. Just parallel scoring and a weighted average. The agents never talked to each other.

V2 (February 2026). Seven agents, two-track scoring, 3-round debate. Added the Scrapper, Closer, Architect, and Editor. This was the shift from parallel scorers to a team. The sub-agent architecture emerged here -- Claude Code's Agent tool spawning persona-specific sub-agents that evaluate independently then converge under a moderator. First version that produced newsletters with genuine editorial friction.

V2.1 (March 2026). Added the Slack-based human curation layer. My daily picks and annotations now flow into the agent context as a signal. Added the agent memory system for cross-issue learning and a freshness check to prevent the same themes from dominating consecutive issues.

The "almost stopped at 3" moment was real. Three agents produced 80% of the value of seven. The question was whether the remaining 20% justified the complexity. It did, because the 20% was editorial differentiation -- the articles only the Architect caught, the disagreements only the Contrarian surfaced, the lean-execution finds only the Scrapper flagged. Those turned out to be the newsletter sections readers responded to most. The 80% that three agents covered was table stakes -- any competent curation gets you there. The 20% was what made people forward the email.

Real numbers from the March 11 run: 85 articles evaluated across 7 agents. 15 advanced to debate. 3 disagreements explored in Round 2. Final output: 3 deep dives, 1 contrarian corner, 1 stack section, 5 reading stack picks.

Where This Pattern Transfers

The ingredients -- distinct agent roles, parallel independent evaluation, moderated convergence -- apply wherever a single prompt can't hold multiple perspectives.

Deal review is the clearest application. You'd run the same 3-agent minimum: a Revenue Leader scoring pipeline impact, a Contrarian scoring deal risk, a Closer evaluating champion strength. The moderator frames the disagreement: "Revenue Leader sees $400K in Q2, Contrarian sees a single-threaded deal with no exec sponsor. Argue." Two-track scoring weights the rep's own confidence (they know the account) while letting agents flag risks the rep is too close to see. The rep's signal stays dominant -- same architecture as my Slack curation, different domain.

The Contrarian pattern ports to any adversarial review -- vendor evaluation, content scoring, campaign post-mortems. The key insight is inverted incentive structure, not just a skeptical persona. A Contrarian that's merely skeptical will still drift toward consensus. One whose scoring rewards dissent won't.

Two-track human-agent scoring maps anywhere expert judgment needs to scale without being overridden. Lead scoring with rep input. Competitive intel with analyst weighting. Content performance prediction with editor judgment. The template: find the place your expert already works, pipe the signal in, feed their judgment back to the agents. The human stays dominant by design. Agents handle the volume humans can't.

The 3-round debate protocol works for any multi-perspective evaluation where you need a thesis, not a leaderboard. Quarterly business reviews where department heads evaluate the same metrics through different lenses. Product prioritization where engineering feasibility competes with market demand. Hiring decisions where interviewers weight different dimensions.

Subscribe to the STEEPWORKS Weekly newsletter to see what this pipeline produces. The STEEPWORKS Products page covers the Knowledge OS system behind it.

Should You Build This?

Probably not all of it.

A single agent is enough when: You're curating fewer than 30 items, criteria are stable, editorial diversity isn't critical. This covers most content curation tasks.

You need multi-agent when: Evaluation requires perspectives that genuinely conflict in one prompt. You're processing 50+ items where human review alone is a bottleneck. You need auditable reasoning for each decision.

You need the full team pattern when: False consensus is expensive -- bad picks mean unsubscribes, bad deal reviews mean lost revenue. You need editorial friction, not just rankings. You have a human expert whose judgment should dominate but who can't review everything.

Total cost: About $7/month for daily infrastructure. $20-40 per weekly evaluation run depending on volume and model -- evaluating 85 articles across 7 agents plus 3 debate rounds consumes substantial context. 10-15 minutes of my time daily for Slack curation, 30 minutes weekly reviewing output.

What I'd do differently:

  • Start with 3 agents. Add agents only when you can name the gap.
  • Build the debate protocol from day one. Parallel evaluation without debate is expensive voting.
  • Add the human signal early. A thumbs-up in Slack changes output quality more than you'd expect.
  • Build the Contrarian first. Unanimous agreement from AI agents is the most dangerous output in any multi-agent system. It usually means they're all wrong in the same way.

CrewAI is a reasonable starting point if you want a framework -- I outgrew it because the debate protocol needed custom orchestration. GitHub's engineering team published a useful guide on multi-agent failure patterns that covers similar ground.

If you're running any multi-agent workflow where agents score in parallel but never talk to each other, you're leaving the most valuable output on the table. The disagreements are where the insight lives. Give each agent a real perspective. Make them argue under moderation. Keep a human in the loop. Don't trust unanimity.


Victor Sowers is the founder of STEEPWORKS, where he builds AI-native GTM systems and writes about what actually works in production. 15 years scaling B2B SaaS go-to-market at CB Insights and BurnAlong.