Slop Theater and the 37,000 Lines

Someone shipped 37,000 lines of code in a single day. They posted about it. People applauded.

Sentry’s entire Python backend — the one processing billions of events for paying customers — is 550,000 lines. A solo developer with 25 agent windows claimed to match 7% of that in an afternoon. The crowd cheered.

Nobody asked what those 37,000 lines did.

The performance

Armin Ronacher has a name for this. He calls it “slop theater.”

Picture the setup: 25 terminal windows tiled across an ultrawide monitor. Each runs an AI coding agent. The screenshot goes on Twitter. Replies fill with fire emojis. “The future of software development.” Thousands of likes.

Look closer. Most windows run Haiku — Anthropic’s smallest, cheapest model — on generic research tasks. Copy-paste prompts. Boilerplate generation. The kind of work a bash script handles in four lines. But four lines don’t make a good screenshot. Twenty-five windows do.
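
What would the four-line version look like? A hypothetical sketch, assuming a topics.txt of prompts, a notes/ directory, and Claude Code’s non-interactive -p mode:

    # The whole 25-window performance, minus the theater.
    while read -r topic; do
      claude -p "Summarize current research on $topic" > "notes/$topic.md" &
    done < topics.txt
    wait

Same tokens. Same output. No screenshot.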

A cottage industry formed around this performance. Frameworks for parallelizing agents. Dashboards tracking how many agents run simultaneously. Leaderboards of token consumption. The metric shifted from “what did you build” to “how many agents did you burn.”

NVIDIA’s CEO said it plainly: an engineer earning $500,000 should spend at least $250,000 a year on tokens. FAANG companies built internal leaderboards tracking token spend per engineer. The person who burns the most tokens wins.

Not ships the most value. Burns the most tokens.

GStack and the machine-written machine

GStack has 60,000 GitHub stars. It bills itself as a coding productivity framework for Claude Code. Open the repo. What’s inside?

Markdown files. Bash scripts. A review command that costs 16,000 tokens per invocation. An “office hours” feature — a virtual YC advisor — that burns 21,000 tokens per session.

The skills weren’t written by Gary, the creator. The machine wrote them. Armin’s observation: “The machine wrote the markdown. The machine doesn’t care.” It doesn’t care about token efficiency. It doesn’t care about redundancy. It generates what you ask for in the shape you describe. If you describe a 16,000-token review process, it builds a 16,000-token review process. A human would notice the waste. The machine has no concept of waste.

This pattern repeats everywhere. People don’t write skill files. They have agents write skill files. The agent produces something that works — technically correct, structurally bloated, impossible to maintain. Nobody reads it. Nobody trims it. It ships.

Could it be a bash script? Almost always. But bash scripts don’t feel like progress. A 400-line skill file generated by Claude feels like you built something. You didn’t. The machine printed something. There’s a difference.

Lines of code as loss

Here’s the number that matters.

Beats, an open-source issue tracker, contains 380,000 lines of code. Town, a related project, sits at 400,000 lines. Together, nearly 800,000 lines across two projects.

Sentry’s Python backend: 550,000 lines. Sentry processes billions of error events. It runs a business generating hundreds of millions in revenue. That codebase serves millions of developers daily.

Beats and Town don’t.

Every line of code is a liability. It needs to be read. Understood. Tested. Maintained. Debugged when it breaks. Migrated when dependencies change. Every line carries a cost that compounds over time.

Fred Brooks wrote this in 1975. “More code” has never meant “more progress.” It has always meant “more to maintain.” The Mythical Man-Month is fifty years old. The lesson still hasn’t landed.

When generating code costs nothing, the instinct to minimize disappears. The feedback loop that made experienced developers write less — the pain of debugging, the tedium of testing, the shame of reading your own bloat six months later — all of it vanishes. The agent generates. You accept. The line count climbs.

Nobody celebrates deleting 10,000 lines. Nobody screenshots a codebase that shrank. The incentive structure rewards addition. Always addition.

The drift

Humans reuse code because writing hurts.

You spend 45 minutes building a utility function. Next week, you need something similar. You search for the function. You find it. You adapt it. Not because reuse is a principle you believe in. Because writing the same thing twice feels like a waste of your limited, painful effort.

Agents don’t feel pain. Generation is instant. Searching an existing codebase takes longer than generating fresh code. So agents generate. Every time.

Ben Vinegar calls this “drift.” The codebase grows duplicate functions, near-identical utilities, three implementations of the same pattern in three different files. Each works. None knows about the others. The codebase develops a kind of amnesia — every new task starts from scratch because starting from scratch is cheaper than remembering.

For a human maintaining this code later, the cost is real. Three functions that do almost the same thing. Which is canonical? Which handles the edge case? Which was written last? The agent doesn’t know. The agent doesn’t have to know. The agent will generate a fourth one tomorrow.
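
If you want to see drift in your own tree, one rough heuristic is to count duplicate function definitions. A sketch, assuming a Python codebase; the normalize_path output is hypothetical:

    # List function names defined more than once.
    grep -rhoE "def [a-z_]+\(" --include="*.py" . \
      | sort | uniq -c | sort -rn | awk '$1 > 1'
    # A line like "3 def normalize_path(" means three implementations,
    # none of which knows about the others.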

Drift isn’t a bug in the model. It’s a structural incentive. When generating costs zero effort and searching costs some effort, generation wins every time. The codebase expands. Entropy increases. Nobody notices until a human tries to understand the system — and discovers there is no system. Just layers of generation.

The Halo problem

Armin tells a story about the launch of Halo: The Master Chief Collection. The matchmaking system had fifteen boolean flags controlling its behavior. Fifteen booleans create 2^15 = 32,768 possible states. Most were invalid. The team couldn’t reason about which combinations were legal. The system broke on launch in ways nobody predicted.

LLMs love this pattern. They never feel the pain of complexity.

A human developer hits fifteen boolean flags and thinks: this is insane. I need to refactor this into an enum, a state machine, something with fewer valid states. The complexity creates cognitive pain. That pain drives simplification.
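
In shell terms, the two shapes look something like this. A hypothetical sketch, not the Halo code:

    # Fifteen independent booleans span 2^15 combinations:
    echo $((2 ** 15))            # 32768 states, most of them invalid

    # One enumerated mode spans exactly the states you name:
    mode="ranked"                # hypothetical values: ranked | social | custom
    case "$mode" in
      ranked|social|custom) ;;   # three legal states, checkable at a glance
      *) echo "invalid mode: $mode" >&2; exit 1 ;;
    esac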

An LLM hits fifteen boolean flags and handles them. All 32,768 states. It generates code that checks each combination, handles each case, maintains backward compatibility with every possible path. It doesn’t escalate. It doesn’t simplify. It doesn’t say “this is too complex.” It treats the problem as stated and solves it as stated.

The code works. Every test passes. The system is technically correct.

It’s also unmaintainable. A human reading it can’t hold the state space in their head. But no human needs to read it — until something breaks. Then a human stares at fifteen boolean flags and 32,768 states and realizes the LLM built a house of cards that only the LLM can navigate.

The pain of complexity is a feature, not a bug. It’s the signal that tells you to simplify. Remove the signal and complexity grows without limit.

The refactor test

Ben Vinegar offers a diagnostic. Simple. Brutal.

Go to any repository. Search the pull request history. Count how many PRs contain the word “refactor” in the title.
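
With the GitHub CLI installed, the whole test is two commands. A sketch; adjust the search to taste:

    # Merged PRs, then merged PRs with "refactor" in the title.
    gh pr list --search "is:merged" --limit 1000 --json number --jq 'length'
    gh pr list --search "is:merged refactor in:title" --limit 1000 \
      --json number --jq 'length'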

GStack, the 60,000-star framework: 5 out of 600 PRs mention refactoring. Less than 1%.

That number tells you everything. A codebase that grows by 595 addition PRs and 5 refactoring PRs is a codebase that only accumulates. It never consolidates. It never simplifies. It never looks back at what it built and asks: is this still the right shape?

Refactoring is the developer’s immune system. It detects bloat, redundancy, unnecessary complexity — and removes it. A codebase with no refactoring is a codebase with no immune system. It accepts everything. It rejects nothing. It grows until it collapses under its own weight.

Check your own repos. The ratio won’t lie.

Foundations first

Here’s the part that nobody wants to hear.

Armin says it directly: “Well-engineered pre-agentic library foundation plus agents equals bliss. Chaos plus agents equals more chaos.”

If your codebase was clean before you added AI agents — clear abstractions, consistent patterns, well-defined boundaries — agents amplify that clarity. They generate code that fits the existing structure. They follow the conventions because the conventions are legible. The foundation guides the generation.

If your codebase was messy — inconsistent patterns, unclear boundaries, technical debt in every corner — agents amplify that mess. They generate code that matches the existing chaos. They follow no conventions because there are none to follow. The foundation is noise. The generation is noise on top of noise.

You cannot fix this with a better model. “If you started with a sloppy codebase with GPT-3.5 last May and put a SOTA model on it now — make it perfect — it won’t work if the problem was already too big.” You can’t outslop yourself with future intelligence. The garbage compounds faster than the models improve.

The work that matters is the work before the agents start. Handcraft your foundation. Define your patterns. Write the abstractions yourself, with your own understanding of why each boundary exists. Then let the agents fill in the implementation. This order is non-negotiable. Reverse it and you get a 380,000-line issue tracker next to Sentry’s 550,000-line business. Sentry generates revenue. Yours generates screenshots.

The mirror

Slop theater is session escalation made public. The same mechanism that drives the 3 AM refactor — “one more improvement, one more file, one more agent” — except now it has an audience. The audience applauds. The applause feeds the escalation. The escalation produces more slop. The slop gets more applause.

It’s also operational dependency at scale. When your workflow requires 25 parallel agents to feel productive, you’ve outsourced not just the coding but the judgment about what to code. The agent doesn’t ask “should this exist?” It asks “how should this be generated?” Different question. Critical difference.

The uncomfortable part: the code works. The tests pass. The features ship. You can build a lot of software this way. You just can’t maintain it. And maintenance — reading, understanding, debugging, simplifying — is where software actually lives.

37,000 lines in a day. How many survive the month?

Notice the ratio. That’s the diagnostic.

Take the self-check. 14 questions. 3 minutes. It measures session escalation and operational dependency — the two patterns behind the theater. Unlike 37,000 lines of code, it might tell you something you can use.


Sources:

  • Ronacher, A. & Vinegar, B. (2026). State of Agentic Coding, Episode 5. Podcast.
  • Brooks, F. (1975). The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley.
  • Ronacher, A. (2026). “Some things just take time.” Blog post.
  • 343 Industries. (2014). Halo: The Master Chief Collection launch postmortem. Public documentation.

OnTilt is a research project studying behavioral patterns in AI-assisted work. The quiz is a self-check tool, not a diagnostic instrument. Read more on our About page.