Why I Started Building RuleKiln
A lot of projects start with a grand vision.
RuleKiln started because I was frustrated with a summary.
If you want to follow the project, the repository is here: RuleKiln on GitHub.
The Problem: Correct but Not Focused
At work, I was experimenting with using LLMs to summarize call transcripts. The goal was straightforward: produce consistent summaries that highlighted what the business actually cared about.
Like most people, I started with prompts.
I wrote instructions.
Then I rewrote them.
Then I added examples.
Then I added more examples.
The summaries got better, but something still felt off.
The output was often technically correct, but not focused enough. The model spent too much time on details that did not matter and not enough on the themes that did.
The prompt kept growing.
Every new edge case became another instruction.
Eventually, I had something that worked reasonably well, but it felt fragile.
The First Breakthrough
The process made me wonder whether I was solving the wrong problem.
Instead of writing even more instructions, what if I gave a stronger model examples of what I actually wanted?
I took around twenty call transcripts and converted them into structured JSON. Then I asked Claude Opus to analyze those examples, identify common themes, and generate a prompt that would consistently produce summaries in the format I wanted.
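As a rough illustration of that step, here is a minimal sketch of how the examples might be packed into a single request for the stronger model. The field names (`transcript`, `summary`), the wording of the instructions, and the function name `build_teacher_request` are all assumptions made for illustration, not the request I actually sent.

```python
import json


def build_teacher_request(examples: list[dict]) -> str:
    """Assemble one request asking a stronger model to write a summarization prompt."""
    header = (
        "Below are call transcripts paired with the summaries we want.\n"
        "Identify the themes the summaries consistently emphasize, then write "
        "a prompt that would make a model produce summaries in the same format."
    )
    # Each example is rendered as pretty-printed JSON so the structure stays visible.
    blocks = [
        f"Example {i}:\n{json.dumps(example, indent=2, sort_keys=True)}"
        for i, example in enumerate(examples, start=1)
    ]
    return "\n\n".join([header, *blocks])


# Hypothetical examples; the real set was around twenty transcripts.
examples = [
    {"transcript": "Caller asked why a refund was late.", "summary": {"topic": "refund"}},
    {"transcript": "Caller could not reset a password.", "summary": {"topic": "account"}},
]
request = build_teacher_request(examples)
```

The resulting string would then be sent to the stronger model through whatever client you already use; that call is left out here on purpose.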
The result surprised me.
It worked immediately.
The summaries were noticeably better than the ones my hand-crafted prompts had produced.
After a few rounds of refinement, the generated prompt was producing exactly what I had been trying to achieve manually.
That experience planted a question in my head:
If a stronger model can generate a better prompt than I can, could that process be automated?
From Model Distillation to Prompt Distillation
Around the same time, I kept seeing discussions about model distillation.
Researchers and practitioners were using powerful frontier models to improve smaller models through fine-tuning, synthetic data generation, and reasoning traces.
The underlying idea was always similar:
Use a smarter model during development to improve a cheaper model during deployment.
That made me wonder if there was an equivalent process for prompts.
Not fine-tuning.
Not retraining.
Prompt distillation.
Could a stronger model analyze examples, extract important rules, remove unnecessary reasoning, and compile those insights into a better instruction set for a smaller model?
Finding the Missing Piece
I started looking around to see what already existed.
There were prompt optimization tools.
There were evaluation frameworks.
There were fine-tuning platforms.
But I could not find many tools focused specifically on turning examples into distilled prompts and then proving whether those prompts actually improved a target model.
Eventually I found a research paper from Google engineers discussing prompt-level distillation.
Reading that paper felt like finding the missing piece.
It described many of the same ideas I had been thinking about:
- stronger teacher models
- weaker student models
- extracting useful behavior
- compiling knowledge into prompts instead of weights
That paper became the inspiration for what eventually became RuleKiln.
The Original RuleKiln Loop
The original idea was simple:
- Start with labeled examples.
- Use a teacher model to extract task-specific rules.
- Turn those rules into candidate prompts.
- Test those prompts against a baseline.
- Keep only the prompts that actually improve the student model.
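The steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not RuleKiln's actual code: the teacher and student are plain functions standing in for model calls, the task is simplified to classification with exact-match scoring, and every name (`distill`, `accuracy`, `toy_teacher`, `toy_student`) is hypothetical.

```python
from typing import Callable

Example = tuple[str, str]  # (input text, expected label)
Student = Callable[[str, str], str]  # (prompt, text) -> predicted label


def accuracy(prompt: str, student: Student, examples: list[Example]) -> float:
    """Fraction of examples where the student's output matches the label."""
    hits = sum(student(prompt, text) == label for text, label in examples)
    return hits / len(examples)


def distill(
    baseline: str,
    train: list[Example],
    held_out: list[Example],
    extract_rules: Callable[[list[Example]], list[str]],
    student: Student,
) -> list[tuple[float, str]]:
    """Return (score, prompt) pairs for candidates that beat the baseline."""
    rules = extract_rules(train)  # teacher extracts task-specific rules
    # One candidate per rule prefix: first rule, first two rules, and so on.
    candidates = [
        baseline + "\nRules:\n" + "\n".join(f"- {r}" for r in rules[:k])
        for k in range(1, len(rules) + 1)
    ]
    baseline_score = accuracy(baseline, student, held_out)
    scored = [(accuracy(c, student, held_out), c) for c in candidates]
    # Keep only prompts that actually improve the student.
    return [(score, c) for score, c in scored if score > baseline_score]


def toy_teacher(examples: list[Example]) -> list[str]:
    # Stand-in for a teacher model: one "keyword => label" rule per example.
    return [f"{text.split()[0]} => {label}" for text, label in examples]


def toy_student(prompt: str, text: str) -> str:
    # Stand-in for a student model: applies the first rule whose keyword matches.
    for line in prompt.splitlines():
        if line.startswith("- ") and " => " in line:
            keyword, label = line[2:].split(" => ")
            if keyword in text:
                return label
    return "unknown"


train = [("refund not received", "billing"), ("password reset fails", "account")]
held_out = [("refund still pending", "billing"), ("password locked out", "account")]
kept = distill("Classify the message.", train, held_out, toy_teacher, toy_student)
```

Scoring on examples the teacher never saw is a design choice in this sketch: it keeps a candidate from passing simply because its rules memorized the training set.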
Over time, the project grew into something much larger than prompt generation.
It needed evaluations.
It needed quality gates.
It needed cost tracking.
It needed durable workflows.
It needed ways to prove when a generated prompt was actually better and, just as importantly, when it was not.
But all of that came later.
Why RuleKiln Exists
The project started with a much smaller observation:
A powerful model generated a better prompt than I could.
RuleKiln grew out of the desire to understand why that happened and whether it could become a repeatable system.
In future posts I will cover:
- the research paper that inspired the project
- the first RuleKiln architecture
- benchmarking on BANKING77
- prompt hardening for local and edge models
- why durability became a major design requirement
But this is where it started.
With a summary that was not quite good enough.