<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://michelecampi.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://michelecampi.github.io/" rel="alternate" type="text/html" hreflang="en-US" /><updated>2026-05-04T18:14:03+00:00</updated><id>https://michelecampi.github.io/feed.xml</id><title type="html">Michele Campi</title><subtitle>Technical notes on mathematical optimization, MCP servers, and operations engineering — building decision-grade infrastructure for AI-native operations.</subtitle><author><name>Michele Campi</name><email>michele.campi@outlook.com</email></author><entry><title type="html">How fragile is your weekly plan? A risk-premium framework for mid-market manufacturers</title><link href="https://michelecampi.github.io/2026/05/04/risk-premium-mid-market-manufacturing.html" rel="alternate" type="text/html" title="How fragile is your weekly plan? A risk-premium framework for mid-market manufacturers" /><published>2026-05-04T00:00:00+00:00</published><updated>2026-05-04T00:00:00+00:00</updated><id>https://michelecampi.github.io/2026/05/04/risk-premium-mid-market-manufacturing</id><content type="html" xml:base="https://michelecampi.github.io/2026/05/04/risk-premium-mid-market-manufacturing.html"><![CDATA[<p><strong>TL;DR.</strong> A deterministic schedule promises a single number — say, 161 quarter-hours, about 40 hours of plant time. That number assumes every task takes exactly as long as the planner wrote down. Real plants don’t behave that way. Using Monte Carlo with CVaR 95% on a real OR-Tools schedule for a mid-market Italian contract packager, I show that nearly doubling the input volatility (±20% → ±35% on filling tasks) raises the weekly <em>risk premium</em> from <strong>4.2% to 7.2%</strong> — not the catastrophic explosion most planners fear. The plan is structurally robust. The framework is reproducible. The calculation is a single API call. <em>2,300 words, 8 minutes.</em></p>

<hr />

<p>In the last article I walked through one synthetic week at <em>Lombarda Confezionamenti SRL</em>, a fictional contract packager in northern Italy. OR-Tools returned an optimal weekly schedule with a makespan of 161 quarter-hours — about 40 hours of plant time, distributed across six lines and 26 production tasks. Roughly 15% better than the manual baseline, with zero late orders.</p>

<p>But there was a hidden assumption underneath that 161-quarter-hour number, and it’s the assumption every deterministic schedule makes. It assumes that the duration of every task is exactly what the planner wrote in the spreadsheet. The face-cream filling lasts exactly 16 quarters. The shampoo run is exactly 28. The labelling on the body wash is exactly 14. No surprises. No variation.</p>

<p>In a real plant, of course, this is never true. Filling lines run a little faster on a good day and a little slower on a hard one. Validation sometimes takes an extra cycle when a new product comes in. Sanitization between formats is usually two hours but occasionally three. The week’s actual makespan is some distribution around 161, not the number itself.</p>

<p>The question every operations manager intuitively asks is: <strong>how robust is my plan?</strong> Most of them answer it with stories — “last March we had a bad week, took us until Friday night,” “line 3 always runs late” — rather than numbers. Operations research can do better than that. It can quantify it.</p>

<h2 id="what-robust-actually-means-mathematically">What “robust” actually means, mathematically</h2>

<p>Two concepts from finance translate directly to scheduling under uncertainty: <strong>Monte Carlo simulation</strong> and <strong>Conditional Value at Risk (CVaR)</strong>.</p>

<p>The first is straightforward. Instead of assuming each task duration is a single number, we treat it as a probability distribution — say, triangular, with min/expected/max. We then ask the solver to evaluate the schedule against, say, 100 random samples drawn from those distributions. Each sample is a realistic “alternative week.” The output is no longer a single makespan but a distribution of makespans: a histogram of how the week could actually play out.</p>

<p>The second concept matters more. <strong>CVaR 95%</strong> answers the question: <em>across the worst 5% of weeks, what is the average outcome?</em> Not the absolute worst case (which is dominated by tail events that may never happen), but the expected outcome conditional on being in the bad tail. If CVaR 95% of your weekly makespan is 168 quarter-hours when the expected value is 161, you can plan around 168 with reasonable confidence — not around 161.</p>

<p>Reframed in language a CFO understands: <strong>the risk premium is the gap between the expected makespan and the CVaR 95%, expressed as a percentage of the expected.</strong> It is the cost, in hours, of buying protection against bad weeks. A risk premium of 4% says: <em>to be safe in 95% of weeks, you need to budget 4% more time than the optimal plan suggests</em>. That number is concrete. It can be defended in front of a board. It can be priced into contracts.</p>
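<p>For concreteness, the whole calculation fits in a few lines of Python. A minimal sketch, assuming you already have a sample of simulated makespans (the triangular sample below is a stand-in for real solver output, not the numbers from this article):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def cvar(samples, alpha=0.95):
    """Mean of the worst (1 - alpha) share of outcomes (here: longest weeks)."""
    s = np.sort(np.asarray(samples))
    return s[int(np.ceil(alpha * len(s))):].mean()

rng = np.random.default_rng(0)
makespans = rng.triangular(129, 161, 193, size=100)  # stand-in for solver output
expected = makespans.mean()
cvar95 = cvar(makespans)
premium = (cvar95 - expected) / expected
print(f"expected {expected:.1f}, CVaR95 {cvar95:.1f}, risk premium {premium:.1%}")
</code></pre></div></div>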

<h2 id="scenario-1-a-normal-week-20-volatility-on-filling-tasks">Scenario 1: a normal week (±20% volatility on filling tasks)</h2>

<p>I ran the optimal schedule from the previous article through OptimEngine’s stochastic scheduling endpoint. The setup: 100 Monte Carlo scenarios, triangular distribution on the four filling task durations (the structural bottleneck of cosmetics packaging), ±20% variation around the planned mean. Filling, not mixing or labelling, because filling is what happens on the shared bottleneck line and is most exposed to viscosity and product-changeover variability.</p>

<p>The output:</p>

<ul>
  <li><strong>Mean makespan</strong>: 161.38 quarter-hours</li>
  <li><strong>CVaR 95%</strong>: 168.17 quarter-hours</li>
  <li><strong>Coefficient of variation</strong>: 2.5%</li>
  <li><strong>Range</strong> (min–max across 100 simulations): 152 to 169 quarters</li>
  <li><strong>Risk premium</strong>: <strong>4.2%</strong></li>
</ul>

<p>What does this say? The plan is structurally robust. Even when the four filling tasks vary by ±20% — which is a generous estimate of normal-week volatility for a medium-complexity cosmetics line — the worst 5% of weeks land at 168 quarter-hours instead of 161. To buy 95% reliability you pay 4.2% extra time. That’s the price of robustness.</p>

<p>Crucially, no week in the 100-simulation pool exploded. No catastrophic delay. No order missed. The schedule degrades gracefully — which is what we want, but rarely measure.</p>

<h2 id="why-is-the-plan-this-robust-the-structural-answer">Why is the plan this robust? The structural answer</h2>

<p>Not all schedules degrade gracefully. Some shatter. The reason this one doesn’t is structural, and worth explaining because it tells you when to expect different behavior.</p>

<p>The optimal plan from OR-Tools placed the four filling tasks across <strong>two parallel filling lines</strong>. When task A on line 1 runs 20% longer than expected, it doesn’t ripple through tasks B, C, D on line 2 — they’re on a separate machine. The week’s makespan is determined by the <em>slower</em> of the two parallel paths, not the sum.</p>

<p>In other words, parallelism absorbs variance. A schedule that piles all four filling tasks on one line would have a much larger risk premium for the same input volatility — possibly 8-10% instead of 4.2%, because the variances compound with no parallel path to absorb them.</p>
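<p>The structural claim is easy to check with a toy simulation. This is not the real schedule, just four noisy filling tasks with invented durations, split across two lines versus piled onto one:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
plan = np.array([16.0, 28.0, 18.0, 22.0])  # four filling tasks, quarters (invented)

def sample_week(vol):
    d = rng.triangular(plan * (1 - vol), plan, plan * (1 + vol))
    serial = d.sum()                          # all four tasks on one line
    parallel = max(d[:2].sum(), d[2:].sum())  # two tasks per line, two lines
    return serial, parallel

weeks = np.array([sample_week(0.20) for _ in range(10_000)])
for name, col in [("one line", 0), ("two lines", 1)]:
    s = np.sort(weeks[:, col])
    mean, cvar95 = s.mean(), s[int(0.95 * len(s)):].mean()
    print(f"{name}: risk premium {(cvar95 - mean) / mean:.1%}")
</code></pre></div></div>

<p>On this toy setup the single-line variant pays a visibly larger premium at the same input volatility: the maximum of two short sums grows more slowly than one long sum.</p>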

<p>This is the kind of insight that’s invisible without quantification. A planner looking at the schedule manually would say “looks fine.” A solver looking at the schedule under stochastic perturbation says “looks fine, <em>and here’s why</em>.” That’s what robustness analysis adds.</p>

<h2 id="scenario-2-a-difficult-week-35-volatility">Scenario 2: a difficult week (±35% volatility)</h2>

<p>Now I nearly doubled the volatility: ±35% on the four filling tasks — what you might see during a product-mix change, a new SKU introduction, or post-sanitization commissioning. Same schedule, same constraints, same 100 Monte Carlo simulations.</p>

<ul>
  <li><strong>Mean makespan</strong>: 161.28 quarter-hours</li>
  <li><strong>CVaR 95%</strong>: 172.83 quarter-hours</li>
  <li><strong>Coefficient of variation</strong>: 4.1%</li>
  <li><strong>Range</strong> (min–max across 100 simulations): 144 to 173 quarters</li>
  <li><strong>Risk premium</strong>: <strong>7.2%</strong></li>
</ul>

<p>Here is the surprising finding. <strong>Volatility almost doubled, but the risk premium did not.</strong> It went from 4.2% to 7.2% — a 71% relative increase, but in absolute terms still a manageable buffer. The mean makespan barely moved (161.28 vs 161.38). The structural robustness held.</p>

<p>This is not luck. It’s the same parallelism story playing out. With two parallel lines, the worst-case outcome is determined by the longer of two largely independent random paths — and the expected maximum of two such variables grows much more slowly with input volatility than their serial sum would.</p>

<p>Compare this to a schedule that put all filling on one line: there, doubling input variance would roughly double the risk premium too, because the variances accumulate without offset. The plan would shatter.</p>

<h2 id="what-this-changes-for-the-planner-and-the-cfo">What this changes for the planner and the CFO</h2>

<p>For the planner, this is a tool to <strong>defend the optimal schedule against intuition</strong>. If a senior operations manager says “I don’t trust this plan, last March we had a terrible week” — the answer is no longer “trust me” or “it’s optimal.” The answer is “the framework projects a 4.2% risk premium under normal volatility, 7.2% under elevated volatility. Here are the numbers. Here is the assumption set. Here is what changes if you disagree with that assumption set.”</p>

<p>That’s a much harder conversation for the senior manager to win on intuition alone, because the framework is reproducible and falsifiable. If they disagree with the volatility input, they can change it. If they disagree with the parallelism modeling, they can override it. What they can’t do is say “your plan is fragile” without engaging with the math.</p>

<p>For the CFO, the framing is different but equally concrete. The risk premium is a <strong>cost</strong>. It’s the cost of buying delivery reliability. If the firm prices contracts on the deterministic plan (161 quarter-hours) and then absorbs the 4.2% slippage internally, that slippage shows up as overtime, expedited freight, or compressed margin on rush jobs. If the firm prices contracts on the CVaR plan (168 quarter-hours) and the week ends up at 161, that’s bonus margin. Either way, the number is real and the conversation is honest. Without measurement, the firm pays the premium without knowing it exists.</p>

<h2 id="when-this-analysis-does-not-apply">When this analysis does NOT apply</h2>

<p>Three caveats matter for honest framing.</p>

<p><strong>First</strong>, this analysis assumed parallelism in the bottleneck. If your plant has a single filling line, the framework still works but the risk premium will be larger and grow faster with volatility. The structural protection comes from the schedule’s topology, not from the framework itself.</p>

<p><strong>Second</strong>, only filling task durations were perturbed. In a real plant other parameters vary too — yields, setup times, downtime events. A complete robustness analysis would add stochastic distributions for those too. The ones we used are the dominant ones for cosmetics filling, but in another industry (precision machining, food processing, semiconductor assembly) the dominant variance source might be elsewhere.</p>

<p><strong>Third</strong>, the analysis is conditional on the schedule produced by OR-Tools being itself near-optimal. If the input scheduling is poor, no risk analysis on top of it will save it. Robustness analysis is a layer over solid optimization, not a substitute for it.</p>

<h2 id="the-framework-in-five-steps">The framework, in five steps</h2>

<p>This is the operational synthesis. Any mid-market manufacturer with weekly scheduling decisions can apply this:</p>

<ol>
  <li><strong>Compute the deterministic optimum</strong> with OR-Tools or equivalent, using mean durations.</li>
  <li><strong>Identify the dominant variance sources</strong> — usually 2-4 task types where empirical historical variation is largest.</li>
  <li><strong>Define triangular distributions</strong> around each (min, expected, max) based on either historical data or expert estimate.</li>
  <li><strong>Run 100 Monte Carlo simulations</strong> with the same scheduler, and extract the makespan distribution. Compute CVaR 95% and the risk premium.</li>
  <li><strong>Defend the schedule with the risk premium</strong>, not the deterministic number. Price contracts, allocate buffer time, and have a real conversation with stakeholders.</li>
</ol>

<p>What stays constant across industries, plant sizes, and product categories is the framework — and the fact that without measuring, the conversation about plan robustness is the same one production managers have been having for fifty years: based on intuition, on bad weeks they remember, on stories.</p>

<p>Operations research doesn’t replace that intuition. It quantifies it.</p>

<hr />

<p><em>The analysis used OptimEngine’s stochastic scheduling endpoint with 100 Monte Carlo scenarios. The two volatility profiles (±20% and ±35% on filling tasks, triangular distribution) were chosen to represent a normal week and a difficult week respectively. The full schedule, parameter distributions, and the resulting risk metrics are reproducible — the input was a single JSON request to a public endpoint, the output is what the solver returned.</em></p>

<p><em>If you’re applying this framework to your own operation and want a second pair of eyes on the setup or the assumptions, the contact is on the profile.</em></p>]]></content><author><name>Michele Campi</name></author><category term="or-tools" /><category term="scheduling" /><category term="manufacturing" /><category term="optimization" /><category term="risk-management" /><category term="monte-carlo" /><category term="cvar" /><summary type="html"><![CDATA[The optimal weekly plan from the last article promised a makespan of 161 quarter-hours. But that number assumes durations are exactly what the planner says they are. What happens when reality varies by 20%? By 35%? Monte Carlo with CVaR turns the question 'is my plan robust?' into a quantitative one — and reveals that the right metric isn't fragility, it's the risk premium you pay for protection.]]></summary></entry><entry><title type="html">What an OR-Tools solver finds in a week of contract packaging — and what the planner usually misses</title><link href="https://michelecampi.github.io/2026/04/29/or-tools-week-contract-packaging.html" rel="alternate" type="text/html" title="What an OR-Tools solver finds in a week of contract packaging — and what the planner usually misses" /><published>2026-04-29T00:00:00+00:00</published><updated>2026-04-29T00:00:00+00:00</updated><id>https://michelecampi.github.io/2026/04/29/or-tools-week-contract-packaging</id><content type="html" xml:base="https://michelecampi.github.io/2026/04/29/or-tools-week-contract-packaging.html"><![CDATA[<p>Most arguments in favor of optimization software focus on the obvious benefit: a solver finds a better schedule than a human planner, faster. That part is true and uninteresting. The interesting part is what the solver shows you about your own operation that you couldn’t see before.</p>

<p>This article walks through one synthetic but realistic week at <em>Lombarda Confezionamenti SRL</em>, a fictional contract packager in northern Italy. The company is fictional; the operational pattern is one I’ve seen repeat across European mid-market contract packagers in seven years of operations work. The numerical input is deliberately ordinary. The numerical output — the schedule, the metrics, the bottleneck analysis — is what OptimEngine actually returned when I fed the input into the solver. No invented numbers.</p>

<p>The takeaway isn’t “automate scheduling.” The takeaway is closer to: <em>your manual planner is doing the visible part of the job correctly, but the spreadsheet hides what the solver makes obvious — that four of your six lines are running half-empty most of the time.</em></p>

<h2 id="the-setup">The setup</h2>

<p>Lombarda Confezionamenti is a contract packager for personal care brands: shampoos, body washes, creams, lotions, fragrances. €18M revenue, ~85 employees, two-shift operation (16 hours per day, five days per week). Six production lines, each specialized in different parts of the packaging flow:</p>

<ul>
  <li><strong>L1</strong> — Heavy filling line (200-500ml bottles)</li>
  <li><strong>L2</strong> — Medium filling line (50-150ml jars and tubes)</li>
  <li><strong>L3</strong> — Automatic cartoning line</li>
  <li><strong>L4</strong> — Labelling line</li>
  <li><strong>L5</strong> — Bundling and multipack line</li>
  <li><strong>L6</strong> — QC station with batch release</li>
</ul>

<p>Format changeovers on these lines aren’t trivial. On the heavy filling line, a switch between products requires roughly two hours of cleaning, machine adjustment, and validation. On the medium filling line, it’s around 90 minutes. On cartoning and labelling, an hour each. Bundling is the cheapest at 30 minutes. QC has no setup. These are representative figures for this kind of plant.</p>

<p>The week in question has eight orders from five different customers. Each order requires a different sequence of operations, depending on the product. Here’s the load:</p>

<table>
  <thead>
    <tr>
      <th>Order</th>
      <th>Customer</th>
      <th>Product</th>
      <th>Sequence</th>
      <th>Volume</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>J1</td>
      <td>Customer A</td>
      <td>Face cream 50ml</td>
      <td>filling → cartoning → labelling → QC</td>
      <td>8,000 units (rush, 48h deadline)</td>
    </tr>
    <tr>
      <td>J2</td>
      <td>Customer B</td>
      <td>Shampoo 250ml</td>
      <td>filling → labelling → QC</td>
      <td>12,000 units</td>
    </tr>
    <tr>
      <td>J3</td>
      <td>Customer C</td>
      <td>Detergent 500ml</td>
      <td>filling → labelling → bundling → QC</td>
      <td>6,000 units</td>
    </tr>
    <tr>
      <td>J4</td>
      <td>Customer A</td>
      <td>Body wash 200ml</td>
      <td>filling → labelling → QC</td>
      <td>10,000 units</td>
    </tr>
    <tr>
      <td>J5</td>
      <td>Customer D</td>
      <td>Body lotion 150ml</td>
      <td>filling → cartoning → QC</td>
      <td>4,000 units</td>
    </tr>
    <tr>
      <td>J6</td>
      <td>Customer B</td>
      <td>Conditioner 250ml</td>
      <td>filling → labelling → bundling → QC</td>
      <td>8,000 units</td>
    </tr>
    <tr>
      <td>J7</td>
      <td>Customer E</td>
      <td>Perfume 100ml</td>
      <td>cartoning → QC</td>
      <td>3,000 units (no filling)</td>
    </tr>
    <tr>
      <td>J8</td>
      <td>Customer C</td>
      <td>Liquid soap 300ml</td>
      <td>filling → labelling → QC</td>
      <td>7,000 units</td>
    </tr>
  </tbody>
</table>

<p>Total processing time across all operations, ignoring setups: roughly 79 hours of machine work distributed across six lines. With perfect parallelization and zero setup, the absolute lower bound on makespan would be around 13 hours. Reality is much further from that.</p>

<h2 id="what-the-experienced-planner-does">What the experienced planner does</h2>

<p>On Monday morning at 7am, the production manager opens the Excel file. He has done this for fifteen years. He reads the rush order from Customer A, marks it as priority one. Customer A is a strategic account — they also have order J4 — so he pencils in J4 second. Then he groups the remaining orders by customer to minimize “context switching” mentally: Customer B (J2 and J6 together), Customer C (J3 and J8 together), then D and E.</p>

<p>Within each block, he assigns tasks to lines using rules he doesn’t articulate but follows consistently:</p>

<ul>
  <li>Long fillings on L1 because that’s the heavy line</li>
  <li>Small fillings on L2</li>
  <li>One job at a time on the bottleneck line, mostly — this is the heuristic he trusts most</li>
  <li>Setup planned at the start of each job, never overlapped with anything</li>
</ul>

<p>He blocks out the week. The schedule he produces, when I trace it through the same constraints I gave the solver, comes out at around <strong>190 quarter-hours of makespan</strong>. That’s about 47.5 hours of plant time, which means he closes the week sometime late Wednesday or early Thursday — roughly three working days, given the two-shift schedule.</p>

<p>This is a defensible, professional schedule. The rush is delivered on time. No customer is forgotten. Setup costs are managed. He’s been doing this competently for fifteen years.</p>

<h2 id="what-optimengine-does">What OptimEngine does</h2>

<p>I fed the same inputs — eight jobs with their task sequences, six lines with their setup times, the priority ranking, the rush deadline — into OptimEngine’s CP-SAT scheduler.</p>
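<p>For readers who want the flavor of the model underneath, here is a minimal CP-SAT sketch with toy data (two jobs, two lines). It is a sketch of the technique, not OptimEngine’s actual internals:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from ortools.sat.python import cp_model

model = cp_model.CpModel()
horizon = 200  # planning horizon, in quarter-hours

# (job, op) -&gt; (line, duration in quarters); toy data, not the real week
ops = {("J1", 0): ("L2", 16), ("J1", 1): ("L4", 14),
       ("J2", 0): ("L1", 28), ("J2", 1): ("L4", 12)}

starts, ends, by_line = {}, {}, {}
for key, (line, dur) in ops.items():
    s = model.NewIntVar(0, horizon, f"s{key}")
    e = model.NewIntVar(0, horizon, f"e{key}")
    by_line.setdefault(line, []).append(model.NewIntervalVar(s, dur, e, f"iv{key}"))
    starts[key], ends[key] = s, e

# Precedence inside each job, and one task at a time per line
model.Add(starts[("J1", 1)] &gt;= ends[("J1", 0)])
model.Add(starts[("J2", 1)] &gt;= ends[("J2", 0)])
for intervals in by_line.values():
    model.AddNoOverlap(intervals)

makespan = model.NewIntVar(0, horizon, "makespan")
model.AddMaxEquality(makespan, list(ends.values()))
model.Minimize(makespan)

solver = cp_model.CpSolver()
status = solver.Solve(model)
print(status == cp_model.OPTIMAL, solver.Value(makespan))
</code></pre></div></div>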

<p>The solver returned a status of <code class="language-plaintext highlighter-rouge">optimal</code> in <strong>10 milliseconds</strong>.</p>

<p>The makespan it found: <strong>161 quarter-hours</strong>. That’s 40.25 hours, or roughly 2.5 working days at two shifts. About <strong>15% better than the manual schedule</strong>.</p>

<p>Zero orders late. The rush J1 finishes at quarter-hour 57, which is 7 quarters before its 64-quarter deadline — comfortable margin without overcommitting capacity to the rush.</p>

<p>The solver’s gain over the manual baseline is not coming from any single brilliant move. It’s coming from many small parallelizations the human eye doesn’t easily see. At time zero, three things start in parallel: J6 begins filling on L1, J1 begins filling on L2, J7 begins cartoning on L3. The manual planner usually starts L1 first and then thinks about L2 once L1 is “moving.” The solver doesn’t think; it just maps the constraint graph and moves everything that can move.</p>

<p>Format changeovers on L1 are also handled aggressively. Five different products run on L1 across the week: J6 → J2 → J3 → J4 → J8. Each transition costs eight quarter-hours of setup. The solver sequences them in the order that doesn’t force any other line to wait. The manual planner often runs setups during the night shift “to keep the day shift productive,” which sounds smart but actually adds idle time elsewhere.</p>

<h2 id="the-interesting-finding-average-machine-utilization-is-313">The interesting finding: average machine utilization is 31.3%</h2>

<p>Here’s where the article would normally end with “and that’s why you should buy optimization software.” But the solver returns more than a schedule. It returns metrics. And one of them is uncomfortable:</p>

<p><strong>Average machine utilization across the week: 31.3%</strong>.</p>

<p>Let me break that down by line:</p>

<table>
  <thead>
    <tr>
      <th>Line</th>
      <th>Utilization</th>
      <th>Tasks</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>L1 (heavy filling)</td>
      <td>69.6%</td>
      <td>5</td>
      <td>Bottleneck</td>
    </tr>
    <tr>
      <td>L2 (medium filling)</td>
      <td>17.4%</td>
      <td>2</td>
      <td>Underused</td>
    </tr>
    <tr>
      <td>L3 (cartoning)</td>
      <td>18.6%</td>
      <td>3</td>
      <td>Underused</td>
    </tr>
    <tr>
      <td>L4 (labelling)</td>
      <td>50.3%</td>
      <td>6</td>
      <td>Secondary bottleneck</td>
    </tr>
    <tr>
      <td>L5 (bundling)</td>
      <td>11.2%</td>
      <td>2</td>
      <td>Severely underused</td>
    </tr>
    <tr>
      <td>L6 (QC)</td>
      <td>20.5%</td>
      <td>8</td>
      <td>Underused</td>
    </tr>
  </tbody>
</table>
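<p>The arithmetic behind the table is simple: busy time per line divided by the week’s span. A sketch (I am assuming makespan as the denominator; the solver may use the availability window instead):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># (line, start, end) in quarters — toy intervals, not the real solver output
tasks = [("L1", 0, 28), ("L1", 36, 100), ("L4", 30, 58), ("L4", 70, 84)]
makespan = 161

busy = {}
for line, start, end in tasks:
    busy[line] = busy.get(line, 0) + (end - start)

for line, b in sorted(busy.items()):
    print(f"{line}: {b / makespan:.1%} utilization")
</code></pre></div></div>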

<p>L1 is running roughly 70% of the available time. L4 is at half capacity. The other four lines are sitting idle most of the week. This is true <em>under the optimal schedule</em> — there’s no sequencing improvement that would change this picture. The reason these lines are underused is structural: the order mix this week happens not to need much medium filling, much cartoning, much bundling, or much QC throughput.</p>

<p>The manual planner can’t see this. He sees that the week “got done.” He sees that the rush was delivered on time. He doesn’t see that L2, L3, L5 sat idle for over 30 hours each. They were never on his dashboard because they weren’t constraining his completion date. The bottleneck has all the visibility; the slack has none.</p>

<p>For a plant manager, this is the most actionable insight in the entire schedule. It’s not “your planner could be better.” It’s “this week, you have roughly 100 hours of free capacity on four lines that nobody is selling.” Those four lines have an industrial cost — depreciation, energy in standby mode, maintenance contracts, operator availability — that runs whether they’re packaging product or not.</p>

<p>For a CFO, the question becomes: <em>what additional orders, with what setup profile, would absorb the slack on L2, L3, L5, and L6 without overloading L1?</em> That’s a commercial question, not a scheduling question. But the scheduling output is what makes the question even visible.</p>

<h2 id="what-this-case-shows-and-what-it-doesnt">What this case shows, and what it doesn’t</h2>

<p>I want to be careful with what this analysis proves and what it doesn’t.</p>

<p>It does prove that, on a realistic week of contract packaging, an OR-Tools solver finds a schedule about 15% shorter than what an experienced human planner would produce in 30 minutes of paper-and-spreadsheet work. That gain is real and consistent across most of the FJSP scheduling problems I’ve tested. It comes from parallelism that humans don’t naturally compute.</p>

<p>It also shows that the solver surfaces structural information — line utilization, bottleneck analysis, idle capacity — that doesn’t appear on the planner’s Monday-morning whiteboard. This information is more valuable than the 15% scheduling gain in most plants I’ve worked with, because it points at commercial decisions, not just operational ones.</p>

<p>What the analysis does <em>not</em> prove is that automating scheduling alone fixes anything. The 15% gain only matters if the plant can absorb it: if the order book grows, if the planner uses the time saved on something else, if the customer accepts faster delivery. Plenty of plants would just see the 15% as a softer week and do nothing differently. That’s not a software problem; that’s a management problem.</p>

<p>It also doesn’t prove that this particular plant should immediately invest in optimization software. The 15% scheduling gain at this volume is worth, very roughly, €40-60K per year of recovered capacity at typical mid-market industrial costs. That’s not life-changing, and it has to be weighed against the cost of integration, training, and the change management work of asking a fifteen-year-veteran planner to trust a black box.</p>

<p>Where I see optimization tools actually pay off in mid-market is two situations:</p>

<p><strong>First</strong>, when the plant has a real growth ceiling that could be lifted. If management is looking at L1 utilization and thinking “we should buy a second heavy filling line,” but L2, L3, L5 are at 17%, the better question is whether sales mix can be rebalanced before capex. The solver makes that question quantitative.</p>

<p><strong>Second</strong>, when the plant has visible service problems — rushes accepted at high cost, deadlines slipping under load, last-minute customer changes producing chaos. The same solver run with robust optimization extensions can quantify how brittle the current schedule is to disruption, and how much slack would buy how much resilience. That’s a different article.</p>

<h2 id="a-note-on-whats-underneath">A note on what’s underneath</h2>

<p>The solver used in this analysis is OptimEngine, built on Google OR-Tools CP-SAT. The math is mature: constraint programming applied to flexible job-shop scheduling is a well-developed field with decades of research behind it. What’s new in 2026 is mostly the accessibility — the same kind of math that enterprise APS vendors charge mid-market companies €100-300K per year to access can now be packaged as a service that fits the operational and financial profile of an Italian PMI.</p>

<p>The endpoint that returned the schedule for this case study is the same one I exposed publicly through MCP and x402 payment infrastructure earlier this month, and that I’ve been writing about in the rest of this blog. If you’re a mid-market manufacturer or contract packager wondering whether your week looks like Lombarda Confezionamenti’s, the question I’d start with is the one this case ends on: not “is my planner producing the optimal schedule?” but “what does my plant’s average utilization actually look like, and how much of my weekly idleness is structural versus something I could sell into?”</p>

<p>That answer doesn’t come out of a spreadsheet. It comes out of a model.</p>

<hr />

<p><em>Built with OptimEngine v9.0.0 on Google OR-Tools CP-SAT. The schedule and metrics presented here are the actual solver output for the input described, not retrospective approximations. The company name and customer specifics are synthetic; the operational pattern is drawn from years inside European mid-market contract manufacturing.</em></p>]]></content><author><name>Michele Campi</name></author><category term="or-tools" /><category term="scheduling" /><category term="manufacturing" /><category term="optimization" /><category term="contract-packaging" /><summary type="html"><![CDATA[A synthetic but realistic case study on a mid-market contract packager: eight customer orders, six production lines, sequence-dependent setup times. The expert manual schedule lands around 190 quarter-hours of makespan. OptimEngine returns the proven optimum in ten milliseconds: 161 quarters. The interesting finding isn't the 15% gain — it's what the solver shows about hidden capacity that the manual planner can't see.]]></summary></entry><entry><title type="html">Three Production Scheduling Failures I’ve Seen, and the Math That Would Have Caught Them</title><link href="https://michelecampi.github.io/2026/04/26/three-scheduling-failures-and-the-math-that-would-have-caught-them.html" rel="alternate" type="text/html" title="Three Production Scheduling Failures I’ve Seen, and the Math That Would Have Caught Them" /><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><id>https://michelecampi.github.io/2026/04/26/three-scheduling-failures-and-the-math-that-would-have-caught-them</id><content type="html" xml:base="https://michelecampi.github.io/2026/04/26/three-scheduling-failures-and-the-math-that-would-have-caught-them.html"><![CDATA[<p>In seven years of operations controlling inside mid-market manufacturers, I watched the same scheduling failures repeat across plants, across product lines, across teams. Different people, different machines, identical patterns.</p>

<p>The interesting thing about chronic scheduling failures isn’t that they happen. It’s that everyone in the plant knows they happen, and almost nobody knows why. Management explains them away with operational folklore — <em>“that machine is always troubled,” “Friday afternoons are bad,” “this client is impossible.”</em> Engineers blame planners. Planners blame the ERP. The ERP blames the master data. Nothing changes.</p>

<p>What changed for me, slowly, was realizing that almost every chronic scheduling failure I’d seen had the same shape. There was a visible symptom that everyone agreed on. There was a hidden cause that nobody was tracking. There was a management reaction that addressed the symptom and not the cause. And there was, quietly waiting in textbooks nobody read, a piece of operations research math that would have solved it.</p>

<p>Three of those failures, anonymized but real, with the math that would have caught them.</p>

<hr />

<h2 id="failure-1--the-oee-that-wouldnt-move">Failure 1 — The OEE that wouldn’t move</h2>

<p>Every contract manufacturer I worked with had at least one line where OEE was stubbornly stuck somewhere between 60% and 70%, far below the 85% management kept asking for. Production stoppages were frequent. Deadlines slipped every month, sometimes by hours, sometimes by days.</p>

<p>The dashboard showed the symptoms in red. The reaction was always the same: overtime shifts to recover, occasionally a maintenance contractor brought in to “fix the machine.”</p>

<p>Neither solved anything, because neither addressed what was actually happening.</p>

<p><strong>The hidden cause was double.</strong> First, micro-stoppages under five minutes weren’t being tracked at all. The MES rounded everything below that threshold to zero. When we eventually instrumented the line and counted them properly, those untracked micro-stops were <em>roughly half</em> of the total downtime. The line wasn’t failing dramatically. It was failing constantly, in tiny invisible bursts, and the cumulative effect was catastrophic.</p>

<p>Second, the preventive maintenance plan existed only on paper. Maintenance was scheduled every 200 hours in the manual, but in practice intervention happened reactively, after a failure had already caused a production stop. The plan was theoretical; the operation was firefighting.</p>

<p><strong>What the math would have done.</strong> This is exactly the situation stochastic optimization was built for. Instead of treating the line’s OEE as a deterministic input — <em>“machine X runs at 75% efficiency”</em> — you model micro-stops as a stochastic variable with the empirical distribution you observe. The Monte Carlo simulation generates thousands of possible production days, accounting for the real shape of the failure distribution, not its average.</p>

<p>The output isn’t a number. It’s a Conditional Value at Risk: <em>given the way this line actually behaves, what’s the worst 5% of production days going to look like?</em> That single metric — CVaR 95% — would have ended the conversation about averages. Management would have seen, in numbers, that the line was vulnerable to compounding micro-stops in a way no average could capture.</p>
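<p>A sketch of what that looks like in code. The distributions below are invented placeholders for the empirical ones you would fit from instrumented line data:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(1)
SHIFT_MIN = 960   # two 8-hour shifts, in minutes (assumption)
RATE = 50         # units per minute when running (invented)

def one_day():
    n_stops = rng.poisson(40)                     # many small stops per day
    stop_minutes = rng.exponential(2.0, n_stops)  # mostly under five minutes
    uptime = max(SHIFT_MIN - stop_minutes.sum(), 0.0)
    return uptime * RATE

days = np.sort([one_day() for _ in range(5000)])
cvar95 = days[: int(0.05 * len(days))].mean()  # worst 5% of days = lowest output
print(f"mean daily output {days.mean():,.0f} units, CVaR 95% {cvar95:,.0f} units")
</code></pre></div></div>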

<p>For maintenance, the same engine answers a different question: when should preventive intervention happen to maximize expected uptime, given the empirical distribution of failure intervals? Not “every 200 hours” because that’s what the manual says. The optimal interval, computed from data, often turns out to be very different.</p>
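<p>The maintenance question yields to the same kind of simulation. A renewal-style sketch, with an invented Weibull standing in for the plant’s real failure history:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(2)
failure_gaps = rng.weibull(1.5, 10_000) * 180  # hours between failures (invented)

PM_HOURS, BREAKDOWN_HOURS = 2.0, 8.0  # line time lost per intervention (invented)

def downtime_rate(interval):
    fails_first = failure_gaps &lt; interval
    cycle = np.where(fails_first, failure_gaps, interval)   # what ends the cycle
    loss = np.where(fails_first, BREAKDOWN_HOURS, PM_HOURS)
    return loss.mean() / cycle.mean()  # expected downtime per operating hour

best = min(range(50, 400, 10), key=downtime_rate)
print(f"best preventive interval: about {best} h (the manual said 200 h)")
</code></pre></div></div>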

<p>The total infrastructure required to do this exists in OR-Tools and a Monte Carlo wrapper. It costs nothing in licenses. What it costs is recognizing that the problem isn’t the machine — it’s the model of the machine that everyone is using to make decisions.</p>

<hr />

<h2 id="failure-2--the-format-change-that-took-twice-as-long">Failure 2 — The format change that took twice as long</h2>

<p>Theoretical changeover time: two hours. Actual changeover time: four hours. Every single time, on every line, for years.</p>

<p>This wasn’t a technical mystery. Everyone knew the SMED methodology existed. Some of the plants had even sent operators to training courses on it. The problem was that the management response to the gap between theoretical and actual time was always the same kind of response: structural and physical.</p>

<p>I saw companies dedicate specific lines to specific product families, sacrificing flexibility, to avoid the cross-format change problem entirely. I saw machine vendors brought back in to redesign accessibility on equipment that was perfectly accessible already. I saw line layouts modified, costing hundreds of thousands of euros, to shorten transport distances during changeovers.</p>

<p>What I never saw was anyone treating the changeover as what it actually is: <strong>a scheduling and parallelization problem disguised as a hardware problem.</strong></p>

<p><strong>The hidden cause</strong> was the framing itself. Management saw long changeovers as a sign that the machine was the bottleneck, when actually the bottleneck was the sequence in which changeover activities were performed and the fact that they were almost always serialized when they didn’t need to be. Sanitization and tool retrieval and quality validation and operator briefing — these can all happen in parallel, with the right planning. Almost none of them did, in practice.</p>

<p><strong>The math that would have caught it.</strong> Changeover sequencing is a classical Constraint Programming problem. CP-SAT solves it in milliseconds for problems up to a few hundred activities. You define each sub-activity (sanitize, retrieve tools, validate quality, brief operator, swap die, calibrate, run-in), the resource each consumes (operator A, operator B, the machine itself, the QC technician), and the precedence constraints (you can’t run-in before calibration). The solver finds the minimum total time given these constraints, automatically discovering which activities can run in parallel and which must wait.</p>
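<p>A sketch of the formulation, with invented sub-activities and durations; the point is the shape of the model, not the specific numbers:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from ortools.sat.python import cp_model

# (activity, minutes, resource, predecessors): invented changeover breakdown
acts = [("sanitize",    45, "operator_A", []),
        ("fetch_tools", 20, "operator_B", []),
        ("swap_die",    30, "operator_B", ["fetch_tools"]),
        ("calibrate",   25, "operator_A", ["sanitize", "swap_die"]),
        ("qc_validate", 20, "qc_tech",    ["calibrate"])]

model = cp_model.CpModel()
horizon = sum(d for _, d, _, _ in acts)  # fully serial time = worst case

starts, ends, intervals = {}, {}, {}
for name, dur, _, _ in acts:
    starts[name] = model.NewIntVar(0, horizon, f"s_{name}")
    ends[name] = model.NewIntVar(0, horizon, f"e_{name}")
    intervals[name] = model.NewIntervalVar(starts[name], dur, ends[name], name)

for name, _, _, preds in acts:           # precedence constraints
    for p in preds:
        model.Add(starts[name] &gt;= ends[p])

for res in {r for _, _, r, _ in acts}:   # each resource does one thing at a time
    model.AddNoOverlap([intervals[n] for n, _, r, _ in acts if r == res])

total = model.NewIntVar(0, horizon, "total")
model.AddMaxEquality(total, list(ends.values()))
model.Minimize(total)

solver = cp_model.CpSolver()
solver.Solve(model)
print(f"parallelized changeover: {solver.Value(total)} min vs {horizon} min serial")
</code></pre></div></div>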

<p>For a typical contract manufacturer running roughly five format changes per month per machine, with two hours of avoidable extra time per change and an industrial cost of line idleness in the €80–€120 per hour range, the financial impact of leaving this unsolved tends to land somewhere between €40K and €60K per year per plant. Not catastrophic. Not invisible either. The kind of number that, once quantified, makes the conversation about “should we invest in optimization” finally productive.</p>

<p>The other piece of math — sequence-dependent setup times — addresses what comes before. Two consecutive products with similar fragrances need a brief sanitization. Two products with very different formulas need a deep cleaning that takes triple the time. Treating all sanitizations as uniform in the schedule is one of the most expensive simplifications I’ve seen in consumer goods manufacturing. CP-SAT handles asymmetric setup times natively. Spreadsheets don’t.</p>

<hr />

<h2 id="failure-3--the-customer-who-could-move-anything">Failure 3 — The customer who could move anything</h2>

<p>The third pattern is the one that’s most specific to contract manufacturing, and I think the least discussed publicly. It has to do with the fact that contract manufacturers don’t really plan their own production. Their customers do.</p>

<p>In every plant I worked with, somewhere around 15% of orders were modified in the seven days before production. Quantities changed. Formulas were tweaked. Delivery dates moved up because of a marketing campaign nobody had mentioned earlier. Sometimes the customer would call on Thursday asking to bring forward a batch scheduled for the following Monday, and the answer was always yes, because the customer was big and the relationship mattered.</p>

<p><strong>The hidden cause was triple, and worth unpacking carefully.</strong></p>

<p>Large customers knew they could move deadlines and they did so systematically. There was an asymmetry of leverage that nobody had quantified, so nobody had pushed back on it. Smaller customers pulled batches forward for marketing reasons without understanding the impact on line saturation — they weren’t being malicious, they simply lacked the information that would have made them think twice. And internally, <em>nobody calculated the real cost of a last-minute change</em>. The commercial team accepted modifications because to them, it was free. The cost showed up later, in overtime hours and missed deadlines on other clients’ orders, and by then it was diffused enough that you couldn’t trace it back to the original “yes.”</p>

<p><strong>The management response</strong> was almost always to schedule a planning-commercial meeting where rules of engagement were established for accepting last-minute changes. Those rules then proceeded to be ignored within a month. I don’t say this with cynicism — the rules were genuinely well-designed, in some cases. They just couldn’t survive contact with the reality that the commercial team’s incentive structure rewarded saying yes and the planning team had no quantitative argument for saying no.</p>

<p><strong>The math is two pieces.</strong> First, robust optimization under demand uncertainty. Instead of producing a single deterministic schedule and reacting to changes one by one, you produce a schedule that’s robust against the typical pattern of last-minute modifications. The solver knows that 15% of orders will change, and it finds the schedule that minimizes worst-case impact across plausible modification scenarios. The schedule itself becomes more resilient — built-in slack on the right resources, batch ordering that survives reordering, buffer placement informed by which customers historically modify and which don’t.</p>

<p>Second, and this matters more than the first: <strong>automated cost-of-change quantification</strong>. When the customer calls on Thursday asking to bring a Monday batch forward, the planner shouldn’t have to estimate the impact in their head. The system should respond with a number: <em>“This change costs us roughly €X in disrupted capacity and pushes three other deliveries by Y hours. Are we accepting this, and what’s the customer paying for it?”</em> The cost stops being invisible. The conversation shifts.</p>

<p>This is exactly what sensitivity analysis on top of a scheduling solver does. Perturb one parameter (the deadline of order #347), re-solve, observe the delta in the objective function. The technology to do this in real time is mature. What’s missing is almost never the technology. It’s the recognition that the planning team needs to bring numbers to the meeting, not opinions.</p>
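<p>The loop itself is small. A sketch, where <code class="language-plaintext highlighter-rouge">solve_schedule</code> is a stand-in for whatever solver call the plant actually uses and the field names are placeholders:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def cost_of_change(orders, solve_schedule, order_id, new_due):
    """Re-solve with one perturbed deadline and report the delta."""
    base = solve_schedule(orders)
    changed = {**orders, order_id: {**orders[order_id], "due": new_due}}
    after = solve_schedule(changed)
    return {
        "delta_tardiness_h": after["tardiness"] - base["tardiness"],
        "newly_late": [o for o in after["late"] if o not in base["late"]],
    }

# The Thursday phone call becomes a number instead of a guess:
# impact = cost_of_change(week, solve_schedule, "order_347", "monday_06:00")
</code></pre></div></div>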

<hr />

<h2 id="what-ties-these-three-failures-together">What ties these three failures together</h2>

<p>I want to be careful not to oversimplify, because each of these problems is real and complicated and deserves more than a paragraph of treatment. But there’s a pattern worth naming.</p>

<p>In all three cases, the planning team was operating with a model of reality that was missing something. Micro-stops were missing from the OEE model. Parallelizable activities were missing from the changeover model. Customer modification probability was missing from the demand model. Each missing piece led to decisions that looked reasonable on the dashboard and were quietly destructive in production.</p>

<p>Operations research math doesn’t fix these failures by being magical. It fixes them by forcing you to write down the model — explicitly, in code — and then revealing the parts of reality the model is ignoring. The discipline isn’t the optimization. It’s the modeling.</p>

<p>The tools to do this kind of modeling are mature, free, and well-documented. OR-Tools handles all three of the math problems I described. Python or any language with a CP-SAT binding is sufficient. The hard part isn’t the technology.</p>

<p>The hard part is having someone in the organization who has spent enough time on the production floor to know which parts of reality the spreadsheet is hiding, and enough time with the math to know which solver to point at the problem.</p>

<p>That intersection is rare. It’s also where most of the value is.</p>

<hr />

<p><em>I build production optimization tooling and consult with European mid-market manufacturers who want to move past spreadsheet-based planning. If any of these patterns sound familiar, the contact links are in the site header.</em></p>]]></content><author><name>Michele Campi</name></author><category term="scheduling" /><category term="or-tools" /><category term="manufacturing" /><category term="optimization" /><summary type="html"><![CDATA[Three chronic scheduling failures I watched repeat across mid-market manufacturers in seven years of operations controlling. Each had the same shape: a visible symptom in the dashboards, a hidden cause nobody was tracking, a management reaction that addressed the symptom, and an operations research method that would have caught the cause. Notes for plant managers and operations directors who suspect their planning is leaving money on the floor.]]></summary></entry><entry><title type="html">Why your AI assistant can’t actually plan your factory</title><link href="https://michelecampi.github.io/2026/04/25/why-ai-assistants-cant-plan-your-factory.html" rel="alternate" type="text/html" title="Why your AI assistant can’t actually plan your factory" /><published>2026-04-25T00:00:00+00:00</published><updated>2026-04-25T00:00:00+00:00</updated><id>https://michelecampi.github.io/2026/04/25/why-ai-assistants-cant-plan-your-factory</id><content type="html" xml:base="https://michelecampi.github.io/2026/04/25/why-ai-assistants-cant-plan-your-factory.html"><![CDATA[<p>Last week I sat down with a synthetic but realistic problem. <strong>MetallbauTech GmbH</strong>, a 45-person precision manufacturer in the Stuttgart area, has fifteen CNC machines and six automotive orders to deliver this week. Mix of brake calipers for BMW, transmission gears for Porsche (with a yield-rate constraint, because precision), Audi steering parts, a Tesla rush order due in 15 hours, MAN truck components, and a standard machining batch.</p>

<p>A Monday-morning question for the production manager: what’s the optimal weekly plan?</p>

<p>I wanted to test two things, both relevant to anyone running an SME factory in 2026:</p>

<ol>
  <li><strong>Can a mathematical solver actually do better than spreadsheets and intuition?</strong> (Spoiler: yes, dramatically.)</li>
  <li><strong>Can a generic AI assistant — like ChatGPT or Claude — do the same job?</strong> (Spoiler: no, and not for the reasons you’d expect.)</li>
</ol>

<p>This post walks through the actual tests, with real outputs and real timings. If you run a factory with 10–50 machines and you’re tired of Excel-Gantt week planning, the patterns here will be familiar. If you’re evaluating “AI for manufacturing” tools, you’ll see exactly where the line is between marketing and reality.</p>

<h2 id="the-problem-in-plain-numbers">The problem, in plain numbers</h2>

<p>Here’s MetallbauTech’s week:</p>

<ul>
  <li><strong>15 machines</strong>: 8 CNC milling stations (two of them on night shift, available 80h instead of 45h), 4 CNC lathes, 2 precision grinders, 1 quality measurement station</li>
  <li><strong>6 orders</strong>: priorities 3 to 10, due dates ranging 15h to 42h, total 19 production tasks across all orders</li>
  <li><strong>Constraints</strong>: each task can only run on certain machines (you can’t do precision grinding on a lathe), Porsche gears require a machine with ≥98% yield rate, the QC station bottlenecks at the end</li>
  <li><strong>Objective</strong>: minimize total tardiness — the sum of how late each order is past its due date</li>
</ul>

<p>A human production manager — and I’ve watched several do this — typically takes half a day to lay out a plan like this on a whiteboard or in a spreadsheet, and the result is rarely optimal. They prioritize by gut, parallelize where they remember they can, and accept that some orders slip.</p>

<p>This is exactly the kind of problem <strong>Flexible Job Shop Scheduling (FJSP)</strong> was designed to solve mathematically. Google’s OR-Tools CP-SAT solver, wrapped in a service layer I built called <strong>OptimEngine</strong>, eats this for breakfast.</p>

<h2 id="test-1-deterministic-schedule">Test 1: deterministic schedule</h2>

<p>I wrote the scenario as JSON — 6 jobs with their tasks, 15 machines with their availability and yield rates, the optimization objective — and POSTed it to OptimEngine’s <code class="language-plaintext highlighter-rouge">/schedule-robust</code> endpoint. This is a composite endpoint that, given a scheduling problem, returns the optimal plan.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> POST https://optim-core-gateway-production.up.railway.app/schedule-robust <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"X-Core-Key: ***"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> @metallbautech-scenario.json
</code></pre></div></div>

<p>The response came back in <strong>1.28 seconds</strong>. Solver status: <code class="language-plaintext highlighter-rouge">optimal</code> (not heuristic — provably optimal under the constraints). Result:</p>

<ul>
  <li><strong>Makespan: 18 hours.</strong> The entire week’s production fits in 18 hours of wall-clock time.</li>
  <li><strong>6 of 6 orders on time.</strong> Zero tardiness. Including the Tesla URGENT order (due in 15h), which the solver completed in 5h flat.</li>
  <li><strong>Plan structure</strong>: Tesla used CNC-M-06 for rough milling and CNC-M-01 for finishing — not the “obvious” CNC-M-01 first. The solver figured out that running Tesla and BMW in parallel on different machines was faster than serializing them. Porsche correctly went to GRIND-01 (yield 0.99, satisfying the ≥0.98 constraint).</li>
</ul>

<p>The schedule wasn’t intuitive. A production manager working manually would likely have started Tesla on the “best” machine and serialized everything else around it. The solver found a parallel plan where Tesla finishes in 5h on the second-tier mill while the first-tier mill handles BMW concurrently.</p>

<p>This alone is the value proposition: <strong>a solver finds non-obvious optimal plans in seconds</strong>. But there’s a catch, and it’s the reason most schedule outputs don’t survive contact with reality.</p>

<h2 id="test-2-what-happens-when-reality-bites">Test 2: what happens when reality bites</h2>

<p>The 18-hour plan above assumes all task durations are exact. They aren’t. In a real precision shop, the gear-cutting operation on Porsche transmissions might take 5 hours on a good day, 7 nominally, 10 if a tool needs replacement. The Tesla rough milling might run 1.5h to 4h depending on material variability and operator experience.</p>

<p>If you give your production manager an “18 hour optimal plan” and don’t tell them the underlying assumptions, they’ll commit to it. Then on Wednesday, when Porsche’s gear-cutting runs over, the cascading delays push Tesla off its 15h deadline. By Thursday afternoon, you’re calling the customer.</p>

<p>OptimEngine’s <code class="language-plaintext highlighter-rouge">/schedule-robust</code> endpoint accepts an optional <code class="language-plaintext highlighter-rouge">stochastic_parameters</code> array — a way to declare which parameters are uncertain and what distribution they follow. I added three (a payload sketch follows the list):</p>

<ul>
  <li>Tesla rough-milling: triangular distribution, min 1.5h, mode 2.5h, max 4h</li>
  <li>Porsche gear-cutting: triangular, min 5h, mode 7h, max 10h</li>
  <li>Porsche grinding: triangular, min 3h, mode 5h, max 8h</li>
</ul>

<p>Re-ran the call. Total time: <strong>1.83 seconds</strong> (147ms scheduling + 793ms running 30 Monte Carlo scenarios). The response now contained three strategies:</p>

<p><strong>Strategy A — Nominal Optimistic</strong>: same 18-hour plan as before. Risk level: high.</p>

<p><strong>Strategy B — CVaR-Protected</strong>: makespan 28 hours under the 95% CVaR worst case. Expected value 23 hours. Feasibility rate 100%. The recommendation field reads:</p>

<blockquote>
  <p>“High variability: CV=15.5%. The makespan ranges from 18.0 to 29.0. Risk-aware planning strongly recommended. The gap between expected value (23.0) and CVaR 95% (28.0) is 21.7%. This is the ‘risk premium’ — what you pay for 95% protection.”</p>
</blockquote>

<p><strong>Strategy C — Balanced</strong>: middle ground. Use the nominal schedule but add buffer time on the parameters identified as uncertain.</p>

<p>This is the part that matters for a CEO making delivery commitments. The nominal plan said 18 hours. The probabilistic analysis says: <strong>the realistic expected outcome is 23 hours, and there’s a 5% chance you blow past 28 hours</strong>. If you committed to the customer based on Strategy A, you’re now exposed.</p>

<p>The cost of running this analysis is <strong>0.79 seconds of computation</strong>. The cost of not running it is the phone call to Tesla on Thursday.</p>

<h2 id="and-what-about-asking-chatgpt-or-claude">And what about asking ChatGPT or Claude?</h2>

<p>Here’s where it gets interesting. While testing OptimEngine, I asked Claude (in a separate conversation, no MCP integration) to plan the same MetallbauTech week using natural language. The query was straightforward:</p>

<blockquote>
  <p>“I work at MetallbauTech GmbH, a German manufacturer with 15 CNC machines. I have 6 automotive orders this week (BMW brake calipers, Porsche transmission gears, Audi steering, standard batch, Tesla motor mounts URGENT due in 15h, MAN truck parts). Each order has multiple machining stages. Can you build me an optimized weekly schedule that minimizes total tardiness?”</p>
</blockquote>

<p>Claude responded thoughtfully. It flagged its assumptions explicitly (synthetic operation data, generic machine pool), built an FJSP model, then ran CP-SAT in its sandbox. <strong>It produced a schedule in 0.06 seconds.</strong> It even generated a Gantt chart.</p>

<p>Impressive. Until I asked the follow-up: “Did you call the OptimEngine MCP server I have configured, or did you run CP-SAT locally?”</p>

<p>The honest answer:</p>

<blockquote>
  <p>“I ran CP-SAT locally in my sandbox. I pip install’d ortools and called the Python solver directly — I never called your OptimEngine MCP server. […] My local model is a reasonable FJSP with weighted tardiness, but I have no guarantee it matches OptimEngine’s actual /optimize_schedule contract — field names, objective options, how alternatives are encoded, whether setup_times and availability_windows are passed the way I’d assume from memory. The solution is mathematically valid for the model I built; it is not a verified OptimEngine output.”</p>
</blockquote>

<p>This is the line that matters. <strong>A generic AI assistant can improvise CP-SAT for a small problem and produce something plausible. But “plausible” is not “production-grade.”</strong></p>

<p>Here’s what the AI’s local reconstruction was missing, and what your factory will care about:</p>

<ol>
  <li>
    <p><strong>Reproducibility.</strong> The AI’s solver runs in a sandbox that’s recreated each session. No two runs are identical environments. Production scheduling needs runs that are byte-identical given the same inputs, for audit and rollback.</p>
  </li>
  <li>
    <p><strong>Custom logic.</strong> OptimEngine’s v9 solver has nine years of hand-tuning for manufacturing scenarios — sequence-dependent setup times, multi-window availability, yield-rate filtering, four optimization objectives, four uncertainty modes. The AI improvised “a reasonable FJSP” — generic, but missing every detail that makes a real shop’s plans actually executable.</p>
  </li>
  <li>
    <p><strong>Performance at scale.</strong> AI sandboxes run for one conversation. They don’t run 24/7 with SLAs, can’t be called by automated agents at 1000 requests per hour, don’t return responses in &lt;2 seconds under load. Your factory’s MES integration needs an actual API endpoint with uptime guarantees, not a chat tab.</p>
  </li>
  <li>
    <p><strong>Specialization beyond scheduling.</strong> OptimEngine exposes ten composite endpoints — not just <code class="language-plaintext highlighter-rouge">/schedule-robust</code>, but <code class="language-plaintext highlighter-rouge">/risk-analysis</code>, <code class="language-plaintext highlighter-rouge">/full-intel</code>, <code class="language-plaintext highlighter-rouge">/pack-resources</code>, <code class="language-plaintext highlighter-rouge">/forecast-basic</code>, etc. Each is pre-orchestrated for a specific class of decision. A generic AI would have to invent the orchestration each time, with no memory of last week’s decisions, no understanding of your specific shop’s bottlenecks.</p>
  </li>
  <li>
    <p><strong>Auditability.</strong> When the auditor asks why you made a specific scheduling choice in February, “because I asked Claude” doesn’t pass ISO 9001. “Because OptimEngine <code class="language-plaintext highlighter-rouge">/schedule-robust</code> v9.0.2 returned this strategy with these parameters and these inputs, logged with timestamp, signature, and CVaR analysis” does.</p>
  </li>
</ol>

<p>The AI assistant was honest about this. Most won’t be. The current generation of “AI for X” tools either avoids these questions or hides behind vague capability claims. <strong>Your AI assistant can sketch a Gantt chart. It cannot run your factory.</strong></p>

<h2 id="what-this-means-if-youre-an-sme-manufacturer">What this means if you’re an SME manufacturer</h2>

<p>If you have 10–50 machines and you’re juggling multi-client orders weekly, three things follow from this:</p>

<p><strong>One: a real solver is now within reach.</strong> You don’t need an SAP/Siemens enterprise contract anymore. OptimEngine is an HTTP API — anyone with a backend developer can integrate it in a day. Send your jobs and machines, get back an optimal plan with risk analysis. That’s the whole story.</p>

<p><strong>Two: probabilistic planning is a competitive advantage.</strong> Every competitor still using deterministic Excel plans is silently exposed to the variance you’re now measuring. The 21.7% risk premium between nominal and CVaR-95 is the gap they don’t see. You will.</p>

<p><strong>Three: AI assistants are not the answer for production decisions, but they’re a great front-end.</strong> A natural-language interface that converts a CTO’s question — <em>“can we squeeze in one more order this week?”</em> — into an OptimEngine call, runs the solver, and returns the answer in business terms is exactly the right architecture. The AI handles the conversation; the solver handles the math.</p>

<h2 id="try-it">Try it</h2>

<p>If your factory’s weekly planning sounds like the MetallbauTech scenario above, OptimEngine’s <code class="language-plaintext highlighter-rouge">/schedule-robust</code> endpoint is live. The full request schema is in the <a href="https://optim-core-gateway-production.up.railway.app/">public OptimEngine documentation</a> (open to inspection — no signup needed to read).</p>

<p>The 6-job, 15-machine, 19-task scenario from this article runs in under 2 seconds and returns three strategies. Your real shop floor — likely 30–80 jobs across 10–50 machines — runs in 5–30 seconds.</p>
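
<p>For orientation, here is roughly what a call looks like from Python. The payload below is illustrative only: field names like <code class="language-plaintext highlighter-rouge">jobs</code> and <code class="language-plaintext highlighter-rouge">objective</code> are my placeholders, not the verified contract, and authentication headers are omitted. The linked documentation is authoritative.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch: payload fields are assumptions, not the verified
# /schedule-robust contract. Auth headers omitted; see the public docs.
import requests

BASE = "https://optim-core-gateway-production.up.railway.app"

payload = {
    "jobs": [
        {"id": "motor-mounts", "due_hours": 15, "priority": "urgent"},
        {"id": "brake-calipers", "due_hours": 72, "priority": "normal"},
    ],
    "machines": 15,
    "objective": "minimize_weighted_tardiness",
}

resp = requests.post(f"{BASE}/schedule-robust", json=payload, timeout=30)
resp.raise_for_status()
for strategy in resp.json().get("strategies", []):  # key name assumed
    print(strategy)
</code></pre></div></div>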

<p>If you’re an SME manufacturer in Italy, Germany, or Europe more broadly, and you’d like to discuss a pilot integration — connecting OptimEngine to your existing MES, ERP, or planning workflow — reach out. I work with manufacturing operations specifically: my background is in controlling for contract manufacturers, before I transitioned to engineering.</p>

<p>The math is solved. The integration is the part that matters.</p>

<hr />

<p><em>OptimEngine is a mathematical optimization service built on Google OR-Tools CP-SAT (v9.0.0), exposing 11 solver capabilities including FJSP scheduling, CVRPTW routing, bin packing, Pareto multi-objective analysis, Monte Carlo risk simulation with CVaR metrics, parametric sensitivity analysis, and prescriptive intelligence. The engine is currently deployed for industrial use cases and accessible via REST and (forthcoming, OAuth-gated) MCP protocols.</em></p>]]></content><author><name>Michele Campi</name></author><category term="optimization" /><category term="manufacturing" /><category term="industry-4-0" /><category term="or-tools" /><category term="ai-agents" /><category term="scheduling" /><summary type="html"><![CDATA[I gave a synthetic but realistic factory scheduling problem — fifteen machines, six automotive orders, real constraints — to a frontier AI assistant and to a constraint-programming solver. The assistant produced confident answers that were measurably suboptimal. The solver produced the proven optimum in milliseconds. The gap is structural, not incremental, and it has direct cost implications for any manufacturer considering AI-powered planning tools.]]></summary></entry><entry><title type="html">Exposing a math solver as Circle Nanopayments: what I learned forking arc-nanopayments</title><link href="https://michelecampi.github.io/2026/04/24/exposing-math-solver-circle-nanopayments.html" rel="alternate" type="text/html" title="Exposing a math solver as Circle Nanopayments: what I learned forking arc-nanopayments" /><published>2026-04-24T00:00:00+00:00</published><updated>2026-04-24T00:00:00+00:00</updated><id>https://michelecampi.github.io/2026/04/24/exposing-math-solver-circle-nanopayments</id><content type="html" xml:base="https://michelecampi.github.io/2026/04/24/exposing-math-solver-circle-nanopayments.html"><![CDATA[<p>I spent the last three weeks wiring a constraint-programming solver into Circle’s Nanopayments stack on Arc testnet. The result is <a href="https://github.com/MicheleCampi/optim-arc-v3"><code class="language-plaintext highlighter-rouge">optim-arc-v3</code></a>, a Next.js gateway that exposes ten optimization endpoints — scheduling, routing, Pareto frontiers, stochastic CVaR analysis — each behind a <code class="language-plaintext highlighter-rouge">402 Payment Required</code> response that accepts gasless USDC micropayments.</p>

<p>It’s live at <a href="https://optim-arc-v3.vercel.app">optim-arc-v3.vercel.app</a>. You can hit any endpoint with <code class="language-plaintext highlighter-rouge">curl</code> right now and get back a valid x402 v2 payment challenge.</p>

<p>This post is about how I got there. Not the Circle marketing pitch — the actual friction, the choices I didn’t expect to face, and the patterns I’d reach for next time.</p>

<h2 id="why-this-why-now">Why this, why now</h2>

<p>In March 2026 Circle published a blog post on Nanopayments. I was already tracking them for stablecoin infrastructure reasons, but something about that post clicked. Sub-cent, gasless USDC payments batched off-chain and settled on-chain periodically — this wasn’t another “blockchain primitive in search of a use case.” It was plumbing for a thing I’d been thinking about for months: <strong>markets of calculations and decisions</strong>.</p>

<p>Here’s what I mean. I run OptimEngine, a mathematical optimization service built on Google OR-Tools. It solves things like factory scheduling, logistics routing, resource packing — problems that LLMs cannot solve, because they require constraint satisfaction and provable optimality, not pattern matching. My hunch is that agentic systems in 2026-2027 will increasingly need to call into specialized solvers for decisions that matter. Not every problem is a text generation problem.</p>

<p>But for an AI agent to <em>pay</em> my solver for each call — autonomously, without human approval loops, at sub-cent granularity — the payment layer has to disappear. Traditional APIs with credit card billing and monthly invoices don’t work when the caller is a machine making ten thousand decisions per hour. That’s the gap Nanopayments targets.</p>

<p>So the plan crystallized: make OptimEngine speak x402 natively on Arc, Circle’s payment-native chain. An agent sees a <code class="language-plaintext highlighter-rouge">402</code>, signs an EIP-3009 authorization off-chain, retries, gets the solver response. Zero gas. Settlement batched by Circle in the background. For the agent, it’s just an HTTP call with a payment header.</p>

<h2 id="the-decision-fork-dont-build-from-scratch">The decision: fork, don’t build from scratch</h2>

<p>I already had a working gateway on Arc testnet — an Express app I’d hand-rolled months earlier, implementing a custom x402 flow against the same network. One transaction had even been processed through it. Extending that code was the obvious path.</p>

<p>I chose the opposite: fork Circle’s own <a href="https://github.com/circlefin/arc-nanopayments"><code class="language-plaintext highlighter-rouge">arc-nanopayments</code></a> sample and adapt it.</p>

<p>The reasoning, which I worked through in iterative discussion with an AI assistant (a workflow I’ve come to rely on for significant architectural choices): Circle’s sample is the reference implementation. It’s the code their DevRel team points to, built on their own SDKs, validated against their own infrastructure. Starting from something they promote means inheriting their assumptions about batching, signature verification, and settlement — assumptions I’d otherwise have to reverse-engineer.</p>

<p>The trade-off was real. Their sample is Next.js + Supabase + Tailwind, while my existing gateway was Express + ethers. Different stack. Different deployment story. Adopting it meant throwing away a working codebase.</p>

<p>I estimated the fork would be “90% compatible” with what I already had. That was optimistic. When I actually walked through the code, it was closer to 50-60%. The payment flow patterns matched at a high level, but every implementation detail — how the <code class="language-plaintext highlighter-rouge">payment-required</code> header is structured, how signatures are verified, how settlement is recorded — was different enough to require new code.</p>

<p>What saved the approach wasn’t the estimate being accurate. It was the ability to course-correct quickly. The principle that emerged, which I’ll keep reusing: <strong>what matters is moving quickly from misalignment to alignment</strong>. An imprecise initial estimate isn’t a failure mode if the process catches it early and adjusts.</p>

<h2 id="the-implementation-thin-proxy-hoc-pattern">The implementation: thin proxy, HOC pattern</h2>

<p>The core insight, once I’d read Circle’s sample carefully, is that the x402/Nanopayments flow is encapsulated in a single Higher-Order Function: <code class="language-plaintext highlighter-rouge">withGateway</code>. You write a normal Next.js route handler that returns JSON, then wrap it:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="p">{</span> <span class="nx">NextRequest</span><span class="p">,</span> <span class="nx">NextResponse</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">next/server</span><span class="dl">"</span><span class="p">;</span>
<span class="k">import</span> <span class="p">{</span> <span class="nx">withGateway</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">@/lib/x402</span><span class="dl">"</span><span class="p">;</span>

<span class="kd">const</span> <span class="nx">CORE_GATEWAY_URL</span> <span class="o">=</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">CORE_GATEWAY_URL</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">CORE_GATEWAY_KEY</span> <span class="o">=</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">CORE_GATEWAY_KEY</span><span class="p">;</span>

<span class="kd">const</span> <span class="nx">handler</span> <span class="o">=</span> <span class="k">async</span> <span class="p">(</span><span class="nx">req</span><span class="p">:</span> <span class="nx">NextRequest</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="kd">const</span> <span class="nx">body</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">req</span><span class="p">.</span><span class="nx">json</span><span class="p">();</span>

  <span class="kd">const</span> <span class="nx">coreRes</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">fetch</span><span class="p">(</span><span class="s2">`</span><span class="p">${</span><span class="nx">CORE_GATEWAY_URL</span><span class="p">}</span><span class="s2">/pack-resources`</span><span class="p">,</span> <span class="p">{</span>
    <span class="na">method</span><span class="p">:</span> <span class="dl">"</span><span class="s2">POST</span><span class="dl">"</span><span class="p">,</span>
    <span class="na">headers</span><span class="p">:</span> <span class="p">{</span>
      <span class="dl">"</span><span class="s2">Content-Type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">application/json</span><span class="dl">"</span><span class="p">,</span>
      <span class="dl">"</span><span class="s2">X-Core-Key</span><span class="dl">"</span><span class="p">:</span> <span class="nx">CORE_GATEWAY_KEY</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="na">body</span><span class="p">:</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">body</span><span class="p">),</span>
  <span class="p">});</span>

  <span class="kd">const</span> <span class="nx">responseBody</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">coreRes</span><span class="p">.</span><span class="nx">text</span><span class="p">();</span>
  <span class="k">return</span> <span class="k">new</span> <span class="nx">NextResponse</span><span class="p">(</span><span class="nx">responseBody</span><span class="p">,</span> <span class="p">{</span>
    <span class="na">status</span><span class="p">:</span> <span class="nx">coreRes</span><span class="p">.</span><span class="nx">status</span><span class="p">,</span>
    <span class="na">headers</span><span class="p">:</span> <span class="p">{</span> <span class="dl">"</span><span class="s2">Content-Type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">application/json</span><span class="dl">"</span> <span class="p">},</span>
  <span class="p">});</span>
<span class="p">};</span>

<span class="k">export</span> <span class="kd">const</span> <span class="nx">POST</span> <span class="o">=</span> <span class="nx">withGateway</span><span class="p">(</span><span class="nx">handler</span><span class="p">,</span> <span class="dl">"</span><span class="s2">$0.25</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">/api/solve/pack-resources</span><span class="dl">"</span><span class="p">);</span>
</code></pre></div></div>

<p>That’s it. The handler is a thin proxy to my existing orchestration layer (the OptimEngine Core Gateway, running separately). It doesn’t know anything about x402, Nanopayments, Arc, or USDC. Payment verification, settlement, and event logging to Supabase all happen inside <code class="language-plaintext highlighter-rouge">withGateway</code>.</p>

<p>When a client hits this route without a payment header, the response is <code class="language-plaintext highlighter-rouge">HTTP 402 Payment Required</code> with a base64-encoded <code class="language-plaintext highlighter-rouge">payment-required</code> header. Decoded, the JSON looks like this (shortened):</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"x402Version"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w">
  </span><span class="nl">"accepts"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w">
    </span><span class="nl">"scheme"</span><span class="p">:</span><span class="w"> </span><span class="s2">"exact"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"network"</span><span class="p">:</span><span class="w"> </span><span class="s2">"eip155:5042002"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"asset"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0x3600000000000000000000000000000000000000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"amount"</span><span class="p">:</span><span class="w"> </span><span class="s2">"250000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"payTo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0x389D73e1cAC5e4D4a7BB3C6c4Cf35aB36bF00712"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"extra"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"GatewayWalletBatched"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"verifyingContract"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0x0077777d7EBA4688BDeF3E311b846F25870A19B9"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">network</code> field identifies Arc testnet in CAIP-2 format. The <code class="language-plaintext highlighter-rouge">amount</code> is in USDC atomic units (6 decimals, so 250000 = $0.25). The <code class="language-plaintext highlighter-rouge">extra.name: "GatewayWalletBatched"</code> is the flag that tells a compatible client to use Circle’s batched settlement path via the Gateway Wallet contract at <code class="language-plaintext highlighter-rouge">0x0077777d7EBA...</code> — this is what makes the payment gasless and nanopayment-scale.</p>

<p>A client that understands this responds by signing an EIP-3009 authorization off-chain, base64-encoding it into a <code class="language-plaintext highlighter-rouge">payment-signature</code> header, and retrying. The seller’s <code class="language-plaintext highlighter-rouge">withGateway</code> wrapper forwards the signature to Circle’s <code class="language-plaintext highlighter-rouge">BatchFacilitatorClient.verify()</code> and <code class="language-plaintext highlighter-rouge">BatchFacilitatorClient.settle()</code>, records the payment event in Supabase, then calls my handler and streams the response back.</p>

<p>I did this pattern once for <code class="language-plaintext highlighter-rouge">/api/solve/pack-resources</code>, then generated the remaining nine endpoints with a shell script that templates the same ~30-line proxy. Each endpoint has a different price tier and proxies to a different route on my Core Gateway, but the middleware logic is identical.</p>
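
<p>Mine was a shell script, but the idea fits in a few lines of any language. A Python equivalent, with hypothetical endpoint names and price tiers, would look something like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Generate one ~30-line proxy route per endpoint.
# Endpoint names and prices here are hypothetical; the real tiers differ.
from pathlib import Path
from string import Template

ENDPOINTS = {"schedule-robust": "$0.40", "risk-analysis": "$0.30"}

TPL = Template('''import { NextRequest, NextResponse } from "next/server";
import { withGateway } from "@/lib/x402";

// handler body: the same ~30-line proxy as the pack-resources example above
const handler = async (req: NextRequest) =&gt; { /* proxy to /$name */ };

export const POST = withGateway(handler, "$price", "/api/solve/$name");
''')

for name, price in ENDPOINTS.items():
    route = Path(f"app/api/solve/{name}/route.ts")
    route.parent.mkdir(parents=True, exist_ok=True)
    route.write_text(TPL.substitute(name=name, price=price))
    print("generated", route)
</code></pre></div></div>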

<h2 id="the-fork-minimally">The fork, minimally</h2>

<p>The upstream sample includes a dashboard UI, a LangChain-powered agent buyer, and four demo endpoints that return things like motivational quotes and toy JSON datasets. I kept none of it.</p>

<p>Removed:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">agent.mts</code> and the LangChain/DeepAgents/OpenAI dependencies. I’m building a seller, not a buyer. The sample’s buyer agent is instructive but it’s infrastructure I don’t need to own.</li>
  <li><code class="language-plaintext highlighter-rouge">app/dashboard/</code> and the related React components. If I ever want a seller dashboard, I’ll build one tailored to OptimEngine metrics. The sample’s dashboard was for demonstrating Circle’s balance and withdrawal APIs, which aren’t the story I want to tell.</li>
  <li>The four demo paywalled routes.</li>
</ul>

<p>Kept:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">lib/x402.ts</code> — the core middleware. Circle’s implementation is well-structured, uses the official SDK, and handles edge cases I’d otherwise forget. Apache-2.0 licensed.</li>
  <li><code class="language-plaintext highlighter-rouge">supabase/migrations/</code> — the schema for <code class="language-plaintext highlighter-rouge">payment_events</code> and <code class="language-plaintext highlighter-rouge">withdrawals</code> tables.</li>
  <li>The <code class="language-plaintext highlighter-rouge">app/api/gateway/balance</code> and <code class="language-plaintext highlighter-rouge">app/api/gateway/withdraw</code> endpoints. I don’t use them today, but leaving them costs nothing and saves work later.</li>
</ul>

<p>The result was a clean install of 189 packages (down from 327 in the upstream sample), zero vulnerabilities, TypeScript passing with no errors. First commit on April 23, production deploy on Vercel two days later.</p>

<h2 id="deploy-vercel-and-one-thing-id-redo">Deploy: Vercel, and one thing I’d redo</h2>

<p>Vercel was the right choice for me, even though my other infrastructure is on Railway. Next.js 16 with Turbopack is Vercel-native — zero configuration, push to <code class="language-plaintext highlighter-rouge">main</code>, live in ninety seconds. The free tier’s 10-second function timeout is a real limit on my heaviest endpoint (a composite analysis that can take seven seconds of solver time), but it’s a constraint I’ll manage rather than architect around for now.</p>

<p>The one thing I’d do differently: iterate on the Vercel setup more carefully. I accidentally created two parallel Vercel projects pointing to the same GitHub repo — one where I’d configured environment variables correctly, one left empty. Builds on the empty project kept failing with <code class="language-plaintext highlighter-rouge">supabaseUrl is required</code> errors that looked identical to missing configuration. It took a screenshot comparison to realize I was looking at two different projects in my dashboard. The lesson wasn’t technical. It was: when an error repeats after you think you’ve fixed it, check that you’re fixing the right instance.</p>

<p>Once sorted, the smoke test from my terminal was clean: all ten endpoints returning <code class="language-plaintext highlighter-rouge">HTTP 402</code> with valid payload, response times between 370 and 730 milliseconds, first call warm-up aside.</p>
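
<p>The smoke test itself is a loop anyone can replicate. A sketch (the endpoint names here are illustrative, not the full list of ten):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Every paywalled route should answer 402 with a payment-required header.
# Endpoint names are illustrative; the real service exposes ten of them.
import requests

BASE = "https://optim-arc-v3.vercel.app/api/solve"

for name in ["pack-resources", "schedule-robust", "risk-analysis"]:
    r = requests.post(f"{BASE}/{name}", json={}, timeout=15)
    assert r.status_code == 402, f"{name}: expected 402, got {r.status_code}"
    assert "payment-required" in r.headers, f"{name}: missing challenge header"
    print(f"{name}: 402 OK in {r.elapsed.total_seconds() * 1000:.0f} ms")
</code></pre></div></div>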

<h2 id="two-things-id-want-you-to-take-from-this">Two things I’d want you to take from this</h2>

<p><strong>First, on the practical side: don’t be afraid to fork enterprise code.</strong> Circle’s sample isn’t sacred. It’s Apache-2.0 licensed, it’s designed to be adapted, and the maintainers at Circle would rather see external builders fork it into real applications than admire it untouched. Strip what you don’t need, keep what you do, credit the upstream in your README, and move on. The sample is a starting point, not a template to preserve.</p>

<p><strong>Second, on the broader bet: x402 + Nanopayments is foundational infrastructure for agent economies.</strong> Not because it’s fashionable, but because the alternative — traditional payment rails, human-approved transactions, monthly billing cycles — doesn’t scale to the kind of economic activity that autonomous agents will generate. If an agent wants to call my solver ten thousand times to evaluate a decision space, the payment layer has to be invisible and nearly free. That’s Nanopayments. On Arc, with USDC, through Circle’s Gateway, it works.</p>

<p>The piece that convinced me this was worth three weeks of effort wasn’t the technology itself. It was the realization that <strong>markets of calculations and decisions</strong> — where agents pay specialized services for mathematical work at micropayment scale — become economically viable once the payment friction collapses. Solvers, predictors, validators, oracles: all of them become addressable by autonomous clients who can pay per call without accounting overhead. The solver is just the first primitive I happened to have ready.</p>

<h2 id="try-it-or-reach-out">Try it, or reach out</h2>

<p>The endpoint is live. A single command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-i</span> <span class="nt">-X</span> POST https://optim-arc-v3.vercel.app/api/solve/pack-resources <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{}'</span>
</code></pre></div></div>

<p>You’ll get back a <code class="language-plaintext highlighter-rouge">402</code> with a decodable <code class="language-plaintext highlighter-rouge">payment-required</code> header. Everything to implement a client is in Circle’s <a href="https://developers.circle.com/gateway/nanopayments">Nanopayments documentation</a>.</p>
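
<p>Inspecting the challenge takes a few lines of Python: the standard library for decoding, plus an HTTP client.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Fetch the 402 challenge and decode the payment-required header
# (base64-encoded JSON, as shown earlier in this post).
import base64
import json

import requests

r = requests.post(
    "https://optim-arc-v3.vercel.app/api/solve/pack-resources", json={}
)
assert r.status_code == 402

challenge = json.loads(base64.b64decode(r.headers["payment-required"]))
offer = challenge["accepts"][0]
print(offer["network"])                       # eip155:5042002, Arc testnet
print(int(offer["amount"]) / 10**6, "USDC")   # atomic units, 6 decimals
</code></pre></div></div>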

<p>If you’re building an agent that needs to make optimization decisions — scheduling, routing, portfolio construction, resource allocation — and you want to integrate OptimEngine as a solver in your pipeline, reach out. The solver is production-grade, the payment rail is now Circle-native, and I’m interested in hearing what kinds of problems you’re trying to make agent-solvable.</p>

<hr />

<p><em>Built on top of <a href="https://github.com/circlefin/arc-nanopayments">circlefin/arc-nanopayments</a> (Apache-2.0). Thanks to the Circle team for the sample and to Arc network for making sub-cent USDC payments practical on testnet. The OptimEngine Core Gateway is private infrastructure that orchestrates Google OR-Tools CP-SAT solvers behind this payment layer.</em></p>]]></content><author><name>Michele Campi</name></author><category term="x402" /><category term="circle" /><category term="nanopayments" /><category term="agent-economy" /><category term="nextjs" /><category term="or-tools" /><summary type="html"><![CDATA[Why autonomous AI agents will start buying optimization decisions the way SaaS platforms today buy API calls — and what it means for the next generation of B2B integrations. Three weeks at the intersection of operations research and machine-to-machine commerce, with the technical detail of forking Circle's reference implementation.]]></summary></entry><entry><title type="html">How I exposed OR-Tools as a production MCP server</title><link href="https://michelecampi.github.io/2026/04/21/how-i-exposed-or-tools-as-mcp.html" rel="alternate" type="text/html" title="How I exposed OR-Tools as a production MCP server" /><published>2026-04-21T00:00:00+00:00</published><updated>2026-04-21T00:00:00+00:00</updated><id>https://michelecampi.github.io/2026/04/21/how-i-exposed-or-tools-as-mcp</id><content type="html" xml:base="https://michelecampi.github.io/2026/04/21/how-i-exposed-or-tools-as-mcp.html"><![CDATA[<p>In April 2026, the tooling gap is starting to show.</p>

<p>AI agents can generate text, call APIs, navigate browsers, write and execute code. They can do a lot. What they can’t do reliably is decide how to schedule ten manufacturing jobs across three machines so that lateness is minimized. Ask a language model to do it and you’ll watch it reason for forty-five seconds, then return a plausible-looking sequence that’s noticeably worse than optimal. Or worse, confidently wrong.</p>

<p>OR-Tools CP-SAT, Google’s open-source constraint solver, returns the provably optimal schedule for the same problem in under fifty milliseconds. No hallucination. No “let me reason about this step by step.” Just a deterministic answer with a mathematical guarantee.</p>
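
<p>To make that concrete, here is a minimal sketch of the problem class: a handful of jobs on one machine, minimizing total lateness. The durations and due dates are invented; the point is the shape of the model, and that the answer comes back with an optimality proof rather than a plausibility argument.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal CP-SAT sketch: jobs on a single machine, minimize total lateness.
# Durations and due dates are made up for illustration.
from ortools.sat.python import cp_model

jobs = [(4, 6), (3, 5), (5, 14), (2, 4)]  # (duration, due date) per job
horizon = sum(d for d, _ in jobs)

model = cp_model.CpModel()
intervals, lateness, starts = [], [], []
for i, (dur, due) in enumerate(jobs):
    start = model.NewIntVar(0, horizon, f"start_{i}")
    end = model.NewIntVar(0, horizon, f"end_{i}")
    intervals.append(model.NewIntervalVar(start, dur, end, f"job_{i}"))
    late = model.NewIntVar(0, horizon, f"late_{i}")
    model.Add(late &gt;= end - due)  # under minimization, late == max(0, end - due)
    starts.append(start)
    lateness.append(late)

model.AddNoOverlap(intervals)  # single machine: at most one job at a time
model.Minimize(sum(lateness))

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status == cp_model.OPTIMAL:
    print("starts:", [solver.Value(s) for s in starts])
    print("total lateness:", solver.ObjectiveValue())
</code></pre></div></div>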

<p>So I spent the last two months wrapping it as a production MCP server that autonomous agents can call. This is how it went, what I chose, what I’d do differently, and the bugs I hit — all documented in a git log that I later had to clean up.</p>

<h2 id="the-problem-i-came-from">The problem I came from</h2>

<p>I’m not a “tech guy” by origin. I’ve spent seven years as an operations controller in mid-market manufacturing in northern Italy — the kind of work where you watch production managers rebuild weekly schedules by hand every Monday morning because the MES system doesn’t do real scheduling, it only tracks events after the fact.</p>

<p>The scheduling problems I saw were always the same shape. Five to fifteen machines. Ten to twenty active jobs with different customer deadlines and setup dependencies. A plant manager with an Excel sheet and twenty years of gut feeling. When something went wrong — a line didn’t start, a machine broke down — the replan was done by eye, because formally recalculating would take hours. The cost of a late order translates into penalties, lost customer goodwill, and sometimes lost customers.</p>

<p>Commercial alternatives existed but weren’t accessible. Enterprise MES and APS vendors start in the low six figures per year in license fees, often multiples more once you include integration and consulting. For a small-to-medium manufacturer in Emilia or Lombardy doing ten to thirty million in revenue, those numbers don’t pencil out.</p>

<p>Before OptimEngine, I built smaller things: an order forecast scheduler for my controller work (Gantt visualization, multi-job parallelism, economics tracking), then a routing demo. Neither went anywhere, but they taught me the pattern: <strong>the math that enterprise software sells for six figures runs on open-source solvers that anyone can use, if they know how</strong>.</p>

<p>Then MCP landed. Anthropic released the spec, and I started noticing that agents building on top of frontier LLMs could now <em>discover</em> tools via a standardized protocol, not hard-coded integrations. The question answered itself: what if I took a world-class solver and made it an MCP tool that any autonomous agent could call without knowing anything about operations research?</p>

<p>That’s what OptimEngine is now.</p>

<h2 id="why-mcp-not-just-a-rest-api">Why MCP, not just a REST API</h2>

<p>The first architectural decision was whether to build this as a conventional REST service or as an MCP server. I ended up doing both. Here’s the reasoning.</p>

<p>REST APIs are universal. They work from anywhere. But they have one structural limitation in the agent era: <strong>the agent has to know you exist before it can call you</strong>. Every REST integration requires someone — a developer, a prompt engineer — to manually wire the API into the agent’s context. You are a service that needs to be introduced.</p>

<p>MCP flips that. An MCP server exposes a tool manifest: a machine-readable description of what it does, what inputs it takes, and what outputs it returns. Agents can discover MCP servers through registries like Smithery, read the manifest, and decide on the fly whether to invoke the tool. No human integration step.</p>
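
<p>With the official Python SDK, the manifest falls out of the tool definitions themselves: the function signature and docstring become the machine-readable schema an agent reads. A minimal sketch (the tool name and signature are illustrative, not OptimEngine’s actual interface):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal MCP server: the decorated function's signature and docstring
# become the tool manifest that agents discover. Illustrative names only.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("optimengine-demo")

@mcp.tool()
def solve_schedule(jobs: list[dict], machines: int) -&gt; dict:
    """Schedule jobs across machines, minimizing total lateness."""
    # ...dispatch to the CP-SAT solver layer here...
    return {"status": "optimal", "makespan": 0}

if __name__ == "__main__":
    mcp.run(transport="sse")
</code></pre></div></div>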

<p>For a solver service, this matters. If the target user is “every AI agent that might ever need to solve a scheduling problem,” the ratio of developers-who-will-write-custom-integrations to agents-that-will-find-me-via-MCP-and-call-me tips hard toward MCP over a three-year horizon. MCP is where the audience is going.</p>

<p>But REST still earns its place. My own use cases — a Next.js SaaS for Italian SMEs called PMI Scheduler, server-to-server workflows, Stripe-based billing — all speak REST natively. Forcing them through MCP would be silly.</p>

<p>The solution is dual-stack. Same FastAPI backend, same OR-Tools core. REST for traditional integrations, MCP for agent-native consumers. Routes split by transport, logic shared by the solver layer underneath.</p>

<h2 id="architecture">Architecture</h2>

<p>Here’s the shape of OptimEngine at a high level:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>REST client             MCP client
(Next.js proxy,        (Claude Desktop,
 cURL, Python SDK)      Cursor, autonomous agent)
       |                    |
       v                    v
  FastAPI app          MCP transport
  (api/server.py)      (/mcp SSE +
                        /mcp/v2 Streamable HTTP)
       |                    |
       +---------+----------+
                 v
         Solver dispatch
         (solver/, routing/, packing/,
          pareto/, robust/, stochastic/)
                 |
                 v
         OR-Tools CP-SAT
                 |
                 v
            JSON response
</code></pre></div></div>

<p>The application is FastAPI. I went with it because it’s what I already knew from previous projects — Python async, auto-generated OpenAPI docs, dependency injection that fits naturally with middleware. Flask would have worked; Starlette alone might have been lighter. I didn’t benchmark the alternatives, and I don’t regret it.</p>

<p>Under the FastAPI layer sits a modular solver tree. Eight domains: scheduling, routing, packing, pareto frontier, prescriptive analytics, robust optimization, sensitivity analysis, stochastic scheduling. That breadth emerged incrementally. I started with scheduling, iterated with AI pair-programming, kept adding domains as I understood the space better.</p>

<p>In retrospect, this is both the strength and the weakness of the project. Scheduling and routing are the most mature and most used. Pareto and sensitivity are production-grade. The robust and stochastic modules exist but haven’t been stress-tested by real traffic. If I were starting again, I’d build depth in one domain first and expand only after the first one had paying users. Shipping eight domains in two months looks impressive but makes each one thinner than it could be.</p>

<p>The solver choice — CP-SAT specifically — was deliberate. OR-Tools offers multiple solvers (linear programming, mixed-integer, constraint programming), but CP-SAT is the sweet spot for the problems I care about: combinatorial scheduling with hard constraints, vehicle routing with capacity windows, bin packing with rules. It’s open source, it’s been battle-tested by Google internally, and it handles disjunctive constraints natively — something MIP solvers tend to struggle with. I didn’t run head-to-head benchmarks against Gurobi or CPLEX. I chose CP-SAT on the recommendation that it fits this problem class, and it has.</p>

<h2 id="dual-mcp-transport-sse-and-streamable-http">Dual MCP transport: SSE and Streamable HTTP</h2>

<p>MCP supports multiple transports. The older one is Server-Sent Events (SSE): long-lived HTTP connections where the server streams tool invocations to the client. The newer one is Streamable HTTP, which moves toward a more conventional request/response model with first-class support for auth.</p>

<p>OptimEngine deploys both.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Open, rate-limited, no auth — for demos and compat
</span><span class="n">mcp</span><span class="p">.</span><span class="n">mount_sse</span><span class="p">(</span><span class="n">mount_path</span><span class="o">=</span><span class="s">"/mcp"</span><span class="p">)</span>

<span class="c1">### Streamable HTTP + OAuth 2.1 — for production agents
</span><span class="k">if</span> <span class="n">_SCALEKIT_CONFIGURED</span><span class="p">:</span>
    <span class="n">mcp</span><span class="p">.</span><span class="n">mount_http</span><span class="p">(</span><span class="n">mount_path</span><span class="o">=</span><span class="s">"/mcp/v2"</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">/mcp</code> endpoint is open, rate-limited at ten tool calls per hour per IP via a custom middleware. Good enough for Claude Desktop, Cursor, and anyone kicking the tires on the service. No authentication, no billing, no friction.</p>
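
<p>The rate limiter is nothing exotic. A simplified sketch of the idea (the production middleware differs in bookkeeping details, and this version counts requests rather than tool calls):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simplified per-IP sliding-window limiter for the open /mcp tier.
# Counts requests rather than tool calls, for brevity.
import time
from collections import defaultdict, deque

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()  # stand-in for the existing application

WINDOW_S = 3600  # one hour
MAX_CALLS = 10   # ten calls per hour per IP

_calls: dict[str, deque] = defaultdict(deque)

@app.middleware("http")
async def mcp_rate_limit(request: Request, call_next):
    if not request.url.path.startswith("/mcp"):
        return await call_next(request)  # only the open tier is limited
    ip = request.client.host if request.client else "unknown"
    now = time.time()
    q = _calls[ip]
    while q and now - q[0] &gt; WINDOW_S:   # drop timestamps outside the window
        q.popleft()
    if len(q) &gt;= MAX_CALLS:
        return JSONResponse({"error": "rate limit exceeded"}, status_code=429)
    q.append(now)
    return await call_next(request)
</code></pre></div></div>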

<p>The <code class="language-plaintext highlighter-rouge">/mcp/v2</code> endpoint requires a valid OAuth 2.1 bearer token, validated against a ScaleKit JWKS endpoint. This is the tier where production agents with real traffic live.</p>

<p><strong>A gotcha I spent time on</strong>: the <code class="language-plaintext highlighter-rouge">mcp</code> Python library’s <code class="language-plaintext highlighter-rouge">mount_sse()</code> method defaults to mounting at <code class="language-plaintext highlighter-rouge">/sse</code>, not <code class="language-plaintext highlighter-rouge">/mcp</code>. The older <code class="language-plaintext highlighter-rouge">mount()</code> method used <code class="language-plaintext highlighter-rouge">/mcp</code> as its default. If you’re upgrading from an earlier version and don’t pass <code class="language-plaintext highlighter-rouge">mount_path="/mcp"</code> explicitly, every existing client breaks silently — the SSE handshake succeeds at the wrong path, and nothing in the logs makes this obvious. The fix is one argument, but the debugging time to find it is real. I learned this the hard way during post-deploy verification.</p>

<p><strong>A second gotcha, related to auth</strong>: the <code class="language-plaintext highlighter-rouge">/.well-known/oauth-protected-resource</code> endpoint — the standard OAuth 2.1 discovery path that MCP clients call before authentication — was being blocked by my API key middleware because it didn’t match any of the public paths. Clients would get a 403 during discovery, fail to initialize OAuth, and give up. The fix is to add <code class="language-plaintext highlighter-rouge">/.well-known</code> to the middleware’s bypass list. Trivial in hindsight, but it cost a deploy cycle to figure out.</p>

<p>If you’re building your first MCP server: start with SSE, keep it open, add auth and the streamable transport only when you have a concrete reason. My mistake early on was adding OAuth before I had any user who needed it.</p>

<h2 id="authentication-without-building-an-oauth-server">Authentication without building an OAuth server</h2>

<p>Implementing OAuth 2.1 from scratch is something nobody sensible does anymore. The RFCs are dense, the attack surface is large, and the table stakes for correctness are high. The pragmatic choice is to delegate to an identity provider.</p>

<p>I used ScaleKit. Its free tier covers development; it supports OAuth 2.1 natively; it handles Dynamic Client Registration, which MCP discovery platforms like Smithery require; and the dashboard is usable. Auth0 would have worked too, Clerk probably, plenty of others.</p>

<p>The integration is lean: four environment variables, a JWT validator using PyJWT directly, and a FastAPI dependency that checks the bearer token on every <code class="language-plaintext highlighter-rouge">/mcp/v2</code> request.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">jwt</span> <span class="kn">import</span> <span class="n">PyJWKClient</span><span class="p">,</span> <span class="n">decode</span> <span class="k">as</span> <span class="n">jwt_decode</span>

<span class="n">_JWKS_CLIENT</span> <span class="o">=</span> <span class="n">PyJWKClient</span><span class="p">(</span>
    <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">_SCALEKIT_ENV_URL</span><span class="si">}</span><span class="s">/.well-known/jwks.json"</span><span class="p">,</span>
    <span class="n">cache_keys</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">lifespan</span><span class="o">=</span><span class="mi">3600</span><span class="p">,</span>
<span class="p">)</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">validate_mcp_bearer</span><span class="p">(</span><span class="n">request</span><span class="p">:</span> <span class="n">Request</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="n">auth</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="n">headers</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"Authorization"</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">auth</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"Bearer "</span><span class="p">):</span>
        <span class="k">raise</span> <span class="n">HTTPException</span><span class="p">(</span><span class="mi">401</span><span class="p">,</span> <span class="s">"Missing bearer token"</span><span class="p">)</span>
    <span class="n">token</span> <span class="o">=</span> <span class="n">auth</span><span class="p">[</span><span class="mi">7</span><span class="p">:]</span>
    <span class="n">signing_key</span> <span class="o">=</span> <span class="n">_JWKS_CLIENT</span><span class="p">.</span><span class="n">get_signing_key_from_jwt</span><span class="p">(</span><span class="n">token</span><span class="p">).</span><span class="n">key</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">claims</span> <span class="o">=</span> <span class="n">jwt_decode</span><span class="p">(</span>
            <span class="n">token</span><span class="p">,</span>
            <span class="n">signing_key</span><span class="p">,</span>
            <span class="n">algorithms</span><span class="o">=</span><span class="p">[</span><span class="s">"RS256"</span><span class="p">],</span>
            <span class="n">audience</span><span class="o">=</span><span class="n">_SCALEKIT_RESOURCE_ID</span><span class="p">,</span>
        <span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">raise</span> <span class="n">HTTPException</span><span class="p">(</span><span class="mi">401</span><span class="p">,</span> <span class="sa">f</span><span class="s">"Invalid token: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">claims</span>
</code></pre></div></div>

<p><strong>A nontrivial gotcha here</strong>: the official <code class="language-plaintext highlighter-rouge">scalekit-sdk-python</code> package depends on protobuf 5.x. OR-Tools, the entire reason this service exists, depends on protobuf 6.33+. The two cannot coexist in the same Python environment. I wasted a deploy figuring this out before switching to PyJWT directly against the JWKS endpoint. Public-key validation is the only thing I needed from the SDK anyway — no need for the full client library.</p>

<p>This is a recurring pattern in modern Python backends: SDKs from auth providers pull heavy dependency trees that conflict with whatever scientific or ML libraries you’re also using. When it happens, drop the SDK and call the provider’s raw HTTP endpoints. OAuth 2.1 with JWKS is simple enough to do in thirty lines of code.</p>

<h2 id="the-smithery-issuer-match-saga">The Smithery issuer match saga</h2>

<p>The last piece — making OptimEngine discoverable on Smithery, the de-facto MCP directory — turned out to be the hardest. It’s worth describing because it illustrates how young this ecosystem still is.</p>

<p>Smithery tries to connect to an MCP server, complete the OAuth handshake, and introspect its tools. When it couldn’t scan OptimEngine, I went looking for the cause: an RFC 8414 compliance mismatch between how ScaleKit advertises itself and how Smithery expects authorization servers to be described.</p>

<p>The short version: ScaleKit’s resource-scoped OAuth metadata correctly supports DCR (what Smithery needs), but its <code class="language-plaintext highlighter-rouge">issuer</code> claim points to the base environment URL rather than the resource-scoped URL. Smithery, strictly following RFC 8414, validates that <code class="language-plaintext highlighter-rouge">issuer</code> equals the <code class="language-plaintext highlighter-rouge">authorization_servers</code> URL declared in <code class="language-plaintext highlighter-rouge">/.well-known/oauth-protected-resource</code>. The mismatch fails the handshake.</p>

<p>I tried three fixes over a single Saturday evening, in three successive branches.</p>

<ol>
  <li>
    <p><strong>Use the base URL as authorization server.</strong> This matches the issuer, but the base URL doesn’t expose DCR — Smithery still fails, now with a different error: “does not support dynamic client registration.”</p>
  </li>
  <li>
    <p><strong>Use the resource-scoped URL.</strong> This has DCR but mismatched issuer. Back to square one.</p>
  </li>
  <li>
    <p><strong>Serve the metadata myself.</strong> Implement <code class="language-plaintext highlighter-rouge">/.well-known/oauth-authorization-server</code> as a proxy endpoint: fetch ScaleKit’s real metadata, override the <code class="language-plaintext highlighter-rouge">issuer</code> field to match my own <code class="language-plaintext highlighter-rouge">BASE_URL</code>, return the rewritten document. All the OAuth flows still terminate at ScaleKit (authorize, token, register, jwks) — I only rewrite the one field Smithery validates. Works. RFC-compliant. Both compliance and DCR satisfied. A sketch of this proxy follows after the list.</p>
  </li>
</ol>
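
<p>The proxy from fix 3, sketched (the environment variable names and the HTTP client choice are mine, not the production code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Serve ScaleKit's RFC 8414 metadata with the issuer rewritten to match
# this server's own base URL. Env var names are assumptions.
import os

import httpx
from fastapi import FastAPI

app = FastAPI()  # stand-in for the existing application

SCALEKIT_METADATA_URL = os.environ["SCALEKIT_METADATA_URL"]
BASE_URL = os.environ["BASE_URL"]

@app.get("/.well-known/oauth-authorization-server")
async def oauth_authorization_server() -&gt; dict:
    async with httpx.AsyncClient() as client:
        upstream = await client.get(SCALEKIT_METADATA_URL)
    metadata = upstream.json()
    # Only the one field Smithery validates changes; authorize, token,
    # register, and jwks URLs still point at ScaleKit.
    metadata["issuer"] = BASE_URL
    return metadata
</code></pre></div></div>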

<p>Even after this, Smithery’s MCP scanner was still returning 401 during the handshake for unrelated reasons. Rather than keep debugging, I implemented a <code class="language-plaintext highlighter-rouge">/.well-known/mcp/server-card.json</code> endpoint — a static JSON document describing my tools — which Smithery docs explicitly support as a fallback for servers whose dynamic discovery fails. Paste the tool manifest, done, scanned.</p>

<p><strong>The lesson, if there is one</strong>: in a young protocol ecosystem, working around integration quirks is sometimes faster than fixing them. Both OAuth spec compliance and Smithery’s scanner are reasonable in isolation, but they didn’t meet cleanly at the time. A metadata proxy plus a static fallback cost a few hours. Solving it “properly” at the ScaleKit layer would have required either them changing their issuer convention or me running my own OAuth provider — not an acceptable trade.</p>

<h2 id="what-id-do-differently">What I’d do differently</h2>

<p>Writing this up forces a kind of retrospective honesty. A few things I’d change if I started again:</p>

<p><strong>I’d build one domain deep before eight domains wide.</strong> Eight optimization verticals in two months is impressive on paper. In practice, scheduling and routing are the only ones with real traffic; the others are options I left on the table. If the first use case is SME manufacturing, then the first version should do scheduling with deep configuration (shift windows, setup times, quality yield) and nothing else. Breadth was a hedge that slowed everything down.</p>

<p><strong>I’d wait on OAuth.</strong> I built the OAuth 2.1 integration before I had a single paying customer who needed it. The open <code class="language-plaintext highlighter-rouge">/mcp</code> tier, with rate limiting, would have covered all real usage for the first ninety days. Adding auth earlier meant debugging ScaleKit-Smithery integration when I should have been writing articles or talking to users.</p>

<p><strong>I’d use one branch per problem, not per attempt.</strong> The Smithery saga left orphan branches in my repo — each representing one attempted fix. Iterating on a single branch with <code class="language-plaintext highlighter-rouge">git commit --amend</code> or <code class="language-plaintext highlighter-rouge">git rebase -i</code> produces the same outcome with a cleaner history. The reason I didn’t is psychological: new branches felt like “resets” during stressful debugging. Useful insight, next time.</p>

<p><strong>I’d start writing the same day I started the product.</strong> The technical work is only half of what gets a developer tool adopted. Without an article trail describing decisions and lessons, nobody finds the work. I’m writing this article two months late. If you’re shipping something similar, start writing about it from commit one.</p>

<h2 id="try-it-yourself">Try it yourself</h2>

<p>OptimEngine is live. For Claude Desktop or any MCP client, add this to your config:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"optimengine"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"transport"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sse"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://optim-engine-production.up.railway.app/mcp"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Restart your client and the solver tools should appear in the tool list, ready to be called for scheduling, routing, packing and the other optimization domains. Free tier: ten calls per hour per IP, no authentication.</p>

<p>REST endpoints exist for server-to-server integration, but they’re behind an API key — this article focuses on the MCP side, which is the open surface. Production tier with OAuth and higher limits is billed via x402 on Base — I’ll cover the monetization layer in a separate article.</p>

<p>Source on GitHub at <a href="https://github.com/MicheleCampi/optim-engine">github.com/MicheleCampi/optim-engine</a>. If you’re working on something similar — wrapping a scientific or domain-specific tool as an agent-accessible service — I’d genuinely like to hear about it. Reach out on X at <a href="https://twitter.com/MicheleC54474">@MicheleC54474</a>.</p>]]></content><author><name>Michele Campi</name></author><category term="mcp" /><category term="ortools" /><category term="oauth" /><category term="fastapi" /><summary type="html"><![CDATA[How operations research methods solve scheduling problems that ERP and APS vendors charge mid-six-figure annual licenses to address. Written from seven years inside mid-market manufacturing in northern Italy. Technical postmortem on wrapping Google's OR-Tools as an MCP server — for readers who want the architecture and the integration trade-offs.]]></summary></entry></feed>