Code optimization with LLM-imagined ideas and evolutionary search
You’ve got a slow endpoint. You sit down with the code and a list of candidate changes — cache the database queries, batch the N+1 lookups, parallelize the independent API calls, add an index on the search column. The usual approach is to try each against the baseline, keep what helps, drop what doesn’t.
Baseline: p99 latency 850 ms
Try cache DB queries: p99 720 ms → keep
Try batch lookups: p99 690 ms → keep
Try parallelize API calls: p99 700 ms → reject
Try add index: p99 640 ms → keep
Final: p99 640 ms
You settle at 640 ms. Done.
But the configuration that reaches 480 ms in this search space is [cache, parallelize, index] — together. Parallelizing alone hurts because the un-cached calls dogpile the database; with the cache warm, the parallel paths don’t conflict and the latency falls off a cliff. Greedy rejected parallelize at the third step because it failed in isolation, and never tried it again. The synergy was invisible.
I built a small tool called cevolve to chase down problems with this shape. It applies to any code-optimization task where ideas interact: hot code paths, training scripts, compiler flags, infrastructure config, agent prompts. This post is about why that shape is more common than people assume, why I’ve been here before, and what falls out when you replace sequential testing with population search.
The shape of the problem
Greedy search has one job: maximize fitness by accepting changes that improve it. When the search space decomposes neatly — each parameter independent of the others — greedy is fine. Better than fine. It’s optimal in compute and simple to reason about.
Most interesting optimization problems don’t decompose. The endpoint example above is the everyday version — caching, parallelism, and database structure all interact, and the right answer is some combination of changes rather than any one of them. Compiler flags interact: inlining helps until it busts the instruction cache, and the answer depends on which other flags are set. ML hyperparameters interact: learning rate, batch size, and warmup behave together in ways that make single-variable sweeps a poor reflection of the loss surface.
Once interactions are in play, the global maximum lives in a configuration that no greedy path will visit. You need a search procedure that can hold multiple incomplete answers in mind at once and recombine them.
That’s what genetic algorithms do. A population of candidate configurations evaluated together; selection keeps the fit ones; crossover combines genes from different parents into new individuals; the next generation is tested. Crossover is the mechanism that matters here. It produces children with combinations of traits that no path-based search would assemble.
Inspired by autoresearch
The immediate trigger for building cevolve was autoresearch, Andrej Karpathy’s framework for running an LLM agent overnight to iteratively edit a training script. The setup is elegant: an agent has access to train.py, modifies it, runs the model for a fixed wall-clock budget, reads the validation metric, then either keeps or reverts the change. Humans steer through a program.md file. You wake up to a log of around a hundred experiments and, with luck, a slightly better model.
What autoresearch gets right is the basic shape: put the LLM in the loop, make experiments cheap and time-bounded, treat the running notebook as research lineage. What I wanted to push on is the search strategy. autoresearch’s agent works sequentially — one experiment at a time, conditioned on what came before. That’s structurally close to greedy hill-climbing, and inherits the same blind spot: when ideas interact, you can’t find the best combination by trying them one at a time and keeping winners.
cevolve takes the same architectural decisions — LLM proposes, run benchmark, accumulate — and swaps in population-based search. The LLM still reads code and suggests ideas. The benchmark still scores configurations. The difference is that instead of a single conditioning trace, you have a population of trial combinations evaluated in parallel, and crossover routinely produces children no sequential trace would assemble.
A familiar wedge
I keep ending up at this problem, going back further than I’d care to admit.
In 2010 I co-authored a paper at GECCO on using GPUs to accelerate genetic programming fitness evaluation. Different domain — symbolic regression and program synthesis, not LLM tuning — but the wedge was the same: evolutionary search does a lot of work, and the bottleneck is how cheaply you can score an individual. If you can evaluate ten thousand candidates in the time you used to evaluate ten, you stop apologizing for the population-based approach. You just use it.
The bottleneck isn’t fitness evaluation anymore. GPUs solved that one well enough. The new bottleneck is imagining the right ideas to evolve over. That’s what LLMs are unusually good at: reading code or a config and proposing what to try. They are, however, bad at testing combinations of their own proposals. They tend to recommend the obvious next thing and stop.
So the structure is: let the LLM imagine, let evolution discover. Hence the tagline on the project.
What cevolve does
The setup is straightforward. The LLM reads the target — a file, a config, a glob of source — and proposes specific changes worth trying. For a slow service it might propose caching, batching, query rewrites, parallelization, an index. For a training script it might propose flash attention, mixed precision, gradient checkpointing, different batch sizes. For a compiler invocation it might propose inlining thresholds, vectorization flags, link-time optimization.
Each idea is a discrete optimization choice: either binary (use_flash_attention: on/off) or variant (depth: 4 | 6 | 8 | 10). An individual is a particular combination of idea settings. The population is a few of these, evaluated against the benchmark in parallel.
Individual:
genes: {
"depth": "6",
"flash_attn": "on",
"batch_size": "large",
"dropout": null // idea is off
}
fitness: 2.32
Each generation, the GA selects the fittest, performs crossover between parents to form children, mutates with some small probability, and evaluates again. Standard stuff.
The piece I spent the most time on is the rethink loop. Every N evaluations, the system commits the best configuration as the new baseline, analyzes which ideas have been pulling weight and which haven’t, and asks the LLM to add new hypotheses informed by what’s worked. The population resets and starts exploring on top of the committed wins. This creates eras: discrete chunks of search, each building on the last.
Era 1: baseline 0.72 → best 0.80 → commit [chain_of_thought]
Era 2: baseline 0.80 → best 0.88 → commit [chain_of_thought, max_iterations=10]
Era 3: baseline 0.88 → best 0.92 → commit [chain_of_thought, max_iterations=10, history]
Without eras, accumulating wins on top of each other tends to be brittle — late-stage individuals carry too much baggage from earlier generations and the population stops moving. With eras, the genome shrinks back to “what the LLM has on its mind right now” plus the locked-in winners.
The benchmark
I needed a testbed where the search space was small enough to characterize fully, each evaluation was cheap enough to run hundreds of times, and there were known synergies in the space. Agent tool-use ended up being a clean fit. The agent solves 25 math and fact-lookup tasks via Ollama, with five tunable parameters and 72 possible configurations. The point isn’t agent tuning specifically — it’s a controlled environment where the wins and losses of greedy versus population search are visible.
| Method | Avg final | Evals to ≥0.82 |
|---|---|---|
| Baseline | 0.720 | — |
| Greedy | 0.896 | 8.2 |
| cevolve | 0.896 | 3.2 |
Same final performance — given enough evaluations, both find a good configuration. The difference is convergence speed: cevolve gets to a “good” threshold in roughly a third of the evaluations. When each evaluation is nine seconds, the absolute savings are small. When each evaluation is a 10-minute training run, the savings become real:
| Eval cost | Greedy to ≥0.82 | cevolve to ≥0.82 | Saved |
|---|---|---|---|
| 9 s | 74 s | 29 s | ~45 s |
| 2 min | 16 min | 6 min | 10 min |
| 10 min | 82 min | 32 min | 50 min |
The more interesting number is synergy discovery. The benchmark contains two plausible synergies in the search space:
include_history=true+retry_with_feedback=true(history makes error feedback useful)max_iterations=10+use_chain_of_thought=true(more iterations to actually walk through the reasoning)
| Synergy | Greedy | cevolve |
|---|---|---|
| history + retry_with_feedback | 2/5 | 2/5 |
| max_iterations=10 + CoT | 0/5 | 2/5 |
Both methods find the first synergy because each component helps in isolation; greedy can just keep both. The second one is the interesting one. max_iterations=10 is worse than the baseline alone. Greedy rejects it on contact. cevolve’s crossover stitches it together with chain-of-thought a few generations in:
max_iterations=10 + chain_of_thought synergy. Neither parent scores well alone; the child jumps to 0.92.The greedy and cevolve “best” configurations are interestingly different. Greedy finishes at max_iterations=5, history=on, retry=on, cot=on, tool_instructions=minimal. cevolve finishes at max_iterations=10, history=off, retry=off, cot=on, tool_instructions=minimal. Both score 0.92. They got there by different paths, and the path matters: cevolve’s configuration is simpler, and it found it via a route greedy can’t reach by construction.
When this earns its keep
Evolutionary search isn’t free. It uses more total evaluations than greedy on simple problems, and it has well-known failure modes (loss of diversity, premature convergence, the rest of the bestiary). cevolve makes sense when the conditions for evolution actually obtain:
- Parameters interact. If the search space decomposes, greedy is faster and the GA buys you nothing.
- Evaluation is expensive. If you can run a thousand evaluations in a minute, faster convergence is rounding error. If each evaluation is a long training run or a costly API sweep, the savings compound.
- You have parallel compute. Greedy is sequential by construction. Population-based methods parallelize naturally across individuals.
- You’d take “good enough” sooner. Saving 60% of evaluations to reach a useful threshold matters when you don’t have the patience to wait for the absolute best.
For independent parameters, narrow search spaces, or cases where one knob dominates the others, you don’t need any of this. Run a sweep and ship.
Why this generalizes
The pattern under cevolve — imagine, then search over combinations — isn’t really about agents or training. Any time you have an idea generator that’s smart but path-dependent, paired with a combinatorial space that needs exploring, the same shape applies. LLMs are very good idea generators with one specific blind spot: they don’t naturally test their own combinations. Evolutionary search is a very effective combinatorial explorer with one specific blind spot: it can’t propose ideas it’s never seen.
Sticking them together fixes both. The LLM hands evolution a smaller, smarter search space than evolution would assemble from scratch. Evolution finds combinations the LLM wouldn’t have proposed in sequence. The combined system goes further than either does alone, and it gets there with fewer evaluations than the same search using either alone.
In practice the targets I’m most interested in are not benchmarks. They’re the messy real things: a hot service that needs to be faster, a training run that takes forever, a build that’s outgrown its flags, an old code path nobody wants to touch. Those have search spaces nobody can enumerate by hand and interactions nobody can hold in their head. They’re exactly the shape cevolve was built for. Code is at github.com/jnormore/cevolve.