Push the Knowledge Down the Stack: Cheaper Agents Through Skills

Greg Heffner June 5th, 2026

Connect a fifty-tool MCP server to your agent and you can burn twenty thousand tokens before you ask it a single question. You never see it as an error. You see it later, as a bill you didn't expect, or as a context window that filled up and made the agent forget what you told it ten minutes ago. I've paid that tax more than once running agents against my home lab, so this is the post about where the money actually goes and how to keep it.

The short version: MCP servers are some of the most useful things you can plug into an agent, and they're also one of the fastest ways to drain a token budget if you wire them up wrong. The fix isn't to stop using them. It's to push the expensive, repetitive knowledge down the stack, into skills and subagents, so your top-level model pays for answers instead of paying to rediscover the same facts over and over.

Tokens are money. Start acting like it.

Here is the mental model that fixed this for me. A token is a coin. Every word in your prompt, every line of every tool result, every schema the model has to read before it can act, all of it costs coins. And context is sticky: once something lands in the window, you keep paying to re-read it on every single turn for the rest of the session. A 4,000-token tool result isn't a one-time charge. It's a subscription you pay every time the model thinks after that.
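The subscription math is easy to check. This is a minimal sketch, assuming the result is re-read in full on every model call while it sits in context, and ignoring any prompt-caching discount your provider may apply. The function name and the twenty-call session are made up for illustration.

```python
def carried_cost(result_tokens: int, model_calls: int) -> int:
    """Input tokens spent on one tool result that stays in context.

    Assumes the full result is re-read on every model call made while
    it is in the window, with no caching discount.
    """
    return result_tokens * model_calls


# A 4,000-token result read once versus carried through 20 more model calls.
print(carried_cost(4_000, model_calls=1))   # 4000
print(carried_cost(4_000, model_calls=20))  # 80000
```

Same result, same 4,000 tokens, twenty times the spend, just because it stayed in the window.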

So the goal isn't "use fewer tools." The goal is to make each coin buy as much judgment as possible, and to keep the cheap mechanical stuff out of the expensive model entirely.

MCP servers are great, and that's exactly the trap

MCP (Model Context Protocol) is fantastic. It lets an agent talk to your real systems: your SIEM, your scanner, your network controller, your Google Drive. I run one that reaches into my security stack over SSH, and it has saved me hours. But MCP has two costs that are easy to miss because nobody puts them on an invoice.

The first cost is the tool schemas. When you connect a chatty MCP server, every one of its tools ships a name, a description, and a full JSON schema for its arguments, and a lot of clients load all of that into context up front. Connect a server with fifty tools and you can burn tens of thousands of tokens before the model has done a single useful thing. You are paying rent on a toolbox you mostly aren't using on this turn.
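If you want a rough feel for the rent, you can size the tool definitions yourself. This is a sketch under stated assumptions: the tools are invented, and the four-characters-per-token ratio is a crude rule of thumb, not a real tokenizer. The `name`, `description`, and `inputSchema` fields follow the shape of an MCP tool definition. Real tools usually carry much longer descriptions and schemas than this toy one, so treat the number it prints as a floor.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Rough heuristic only; swap in your model's tokenizer for real numbers.
    return round(len(text) / chars_per_token)


def make_tool(i: int) -> dict:
    # Hypothetical tool definition, invented for this example.
    return {
        "name": f"scanner_tool_{i:02d}",
        "description": "Hypothetical scanner tool. Returns findings filtered by host and severity.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "host": {"type": "string", "description": "Hostname or IP to query"},
                "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                "limit": {"type": "integer", "minimum": 1, "maximum": 500},
            },
            "required": ["host"],
        },
    }


def schema_overhead(tools: list[dict]) -> int:
    """Estimated tokens needed just to describe the tools to the model."""
    return sum(estimate_tokens(json.dumps(t)) for t in tools)


tools = [make_tool(i) for i in range(50)]
print(f"{len(tools)} tools, roughly {schema_overhead(tools)} tokens of schema")
```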

The second cost is the results. MCP calls love to hand back big blobs. Ask a scanner for recent findings and it might dump 300 rows of JSON, most of which you don't care about. That blob lands in context, and per the sticky rule above, you re-read it on every subsequent turn until the session ends or gets compacted.

The part that compounds: None of this is a one-time charge. Make ten calls in a session, let each fat result sit in context, and you keep re-reading all of them on every turn after that. You can end up spending more tokens carrying the toolbox around than you ever spend answering the actual question.

So how do you stop paying it?

Two problems, two different owners. The schema bloat is mostly your client's job to fix: turn on deferred tools (Claude Code calls it tool search) and the agent loads tool names up front but pulls a full schema only for the handful it needs that turn. That one switch can shrink a fat MCP connection down to a rounding error.
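To make the idea concrete, here is a toy version of deferred loading. It is not how Claude Code or any other client implements tool search; `Tool` and `DeferredCatalog` are hypothetical names I made up to show the shape of it: names are cheap and always visible, full definitions are fetched only on request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    input_schema: dict


class DeferredCatalog:
    """Hypothetical catalog: names up front, full definitions on demand."""

    def __init__(self, tools: list[Tool]) -> None:
        self._tools = {t.name: t for t in tools}

    def names(self) -> list[str]:
        # Cheap: this is all that goes into context at the start of a turn.
        return sorted(self._tools)

    def load(self, wanted: list[str]) -> list[Tool]:
        # Expensive: full definitions, but only for the tools asked for.
        return [self._tools[n] for n in wanted if n in self._tools]


catalog = DeferredCatalog(
    [Tool(f"tool_{i:02d}", f"Hypothetical tool {i}", {"type": "object"}) for i in range(50)]
)
print(len(catalog.names()))                           # 50
print([t.name for t in catalog.load(["tool_07"])])    # ['tool_07']
```

The model still knows all fifty tools exist, but it only pays for the one schema it actually asked to see.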

The sticky-result problem is yours to design around, and the rule is simple: never let a raw MCP dump reach your top-level model if something cheaper can read it first. That "something cheaper" is exactly what skills and subagents are for.

Push the knowledge down the stack

Here's the core move. Your top-level agent is the most expensive context you have, because it's the one carrying the whole conversation. So you want that context to stay small and full of judgment, not full of raw data and rediscovered procedure. Anything that's repeatable, mechanical, or bulky should live lower in the stack: in a skill, or behind a subagent. Push the knowledge down so the top only sees conclusions.

Skills: bake the procedure in once

A skill is just a file of instructions (and sometimes scripts) that the model loads only when it's relevant. The win is that it replaces exploration with a recipe. Without a skill, an agent that needs to do some multi-step task will poke around, list directories, read three wrong files, call an MCP tool twice to figure out the shape of the data, and generally flail. Every one of those flails is tokens.

With a skill, you front-load the hard-won knowledge: "the config lives here, the highest article number is found this way, patch these three blocks in this order." The model reads one tight procedure and executes it. You paid the discovery cost once, when you wrote the skill, instead of paying it on every run. A good skill is a cache for everything you already learned the hard way.

Agents: a token firewall

Subagents are the other half. A subagent runs in its own separate context window, does the messy work in there, and hands back only the conclusion. That makes it a firewall between expensive raw data and your main thread.

This is the right home for a chatty MCP server. Instead of letting the top-level model call the scanner and swallow 300 rows of JSON, you hand the job to a collector agent. It makes the MCP call, sifts the blob in its own context, and returns three lines: "two findings matter, here they are, the rest are noise." The main conversation pays for those three lines, not the 300 rows. The raw dump gets thrown away with the subagent's context when it finishes. You got the answer and none of the weight.
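Here is the firewall pattern as a small sketch. A real subagent is a model with its own context window; in this example a plain function with a severity filter stands in for it, and `fake_scanner_call` stands in for the MCP call. The rows, severities, and finding titles are all invented. The point is only where the data lives: the 300 rows never leave `collector`, and the main thread stores three lines.

```python
def fake_scanner_call() -> list[dict]:
    """Stand-in for an MCP call that returns a big blob (300 invented rows)."""
    rows = [{"id": i, "severity": "info", "title": f"routine event {i}"} for i in range(298)]
    rows.append({"id": 298, "severity": "critical", "title": "hypothetical finding A"})
    rows.append({"id": 299, "severity": "high", "title": "hypothetical finding B"})
    return rows


def collector(fetch) -> str:
    """Stand-in for a subagent: read the blob here, return only the conclusion."""
    rows = fetch()  # the raw dump exists only inside this call
    important = [r for r in rows if r["severity"] in {"high", "critical"}]
    lines = [f"{len(important)} of {len(rows)} findings matter:"]
    lines += [f"- [{r['severity']}] {r['title']}" for r in important]
    return "\n".join(lines)


# The main thread only ever stores the summary, never the rows.
main_context: list[str] = []
main_context.append(collector(fake_scanner_call))
print(main_context[0])
```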

Put the two together and the wins compound: the subagent keeps the noisy MCP connection out of your main context, and a skill tells that subagent exactly what to do so it doesn't burn tokens rediscovering the procedure every time.

A few rules I actually follow

Don't attach a fat MCP server to your main agent if a subagent can hold it. In Claude Code, give a dedicated collector subagent its own narrow tool list (just that one MCP server) so the raw dump lives and dies in the subagent's context and never lands in your main thread.
Turn on tool search (deferred schemas) before you connect anything with more than a few tools. Loading fifty schemas you won't use this turn is pure overhead.
Make tool results return the smallest useful shape. If you own the server, add a filter argument or a summary mode instead of dumping everything. If you don't own it, wrap the call in a subagent whose only job is to read the blob and return the two or three lines that matter.
Put repeated discovery in writing. When you catch the model relearning the same facts every session (where a config lives, what the latest ID is, which order to patch things), that's a skill or a CLAUDE.md note waiting to be written. Write it once, stop paying for it.
Use the expensive model for judgment only. Routing, formatting, error codes, anything an if statement or a checklist could do belongs in plain code or a skill, not in tokens.
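
For the "smallest useful shape" rule, this is roughly what a filter argument and a summary mode look like on a tool you own. It is a sketch, not a real server: `get_findings`, its parameters, and the sample rows are all hypothetical, and the handler is a plain function so it can run on its own.

```python
from collections import Counter

SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def get_findings(rows: list[dict], min_severity: str = "info",
                 summary: bool = False, limit: int = 20):
    """Hypothetical tool handler: filter first, then return the smallest shape."""
    floor = SEVERITY_RANK[min_severity]
    kept = [r for r in rows if SEVERITY_RANK[r["severity"]] >= floor]
    if summary:
        # Summary mode: counts only, no rows at all.
        return {"total": len(kept), "by_severity": dict(Counter(r["severity"] for r in kept))}
    # Default mode: matching rows, capped so one call cannot flood the context.
    return kept[:limit]


# Invented sample data.
rows = [
    {"id": 1, "severity": "info"},
    {"id": 2, "severity": "high"},
    {"id": 3, "severity": "critical"},
    {"id": 4, "severity": "low"},
]
print(get_findings(rows, min_severity="high"))
print(get_findings(rows, min_severity="high", summary=True))
```

The filtering is an if statement and a list comprehension. None of it needs a model, which is the last rule in action.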

Final Thoughts

MCP servers aren't the problem, and the answer is never "use less power." The answer is to put the power in the right place. Keep your top-level agent lean and pointed at the actual decision, and push the bulk, the boilerplate, and the noisy connections down into skills and subagents where they're cheap. Do that and the same budget that used to get you a couple of confused turns will carry a whole afternoon of real work. Tokens are money. The teams that win the next couple of years are the ones who spend them like it.

About Me

I served in the U.S. Army, specializing in Network Switching Systems, and was attached to a Patriot Missile System Battalion. After my deployment and honorable discharge, I went to college in Jacksonville, FL for Computer Science. I have two beautiful and very intelligent daughters. I have more than 20 years of professional IT experience. This page is made to learn and have fun. If it's messed up, let me know. I'm still learning :)
