The Cost of Machine Learning

ML projects rarely fail because the model was wrong. They fail because the budget ran out before the model had good enough data to work with, or because nobody had accounted for the cost of labelling, or because the infrastructure bill arrived as a surprise three months in. The technical risk is real but it is usually the second problem. The first problem is that nobody broke down what each component actually costs before the project started.

Machine learning reduces to a simple expression: F(X) = Y. A model F takes input data X and produces output Y. Every cost in an ML project maps to one of these three components. Understanding which part is expensive, and why, is how you estimate a project before building it.

The three cost components

Each of the three components carries its own costs, and every role on the team maps to at least one of them.

X is the data. Acquiring it, cleaning it, storing it, and maintaining its quality over time. Data costs are the most underestimated in early project estimates, partly because raw data often already exists inside the organisation and looks free. It is not free. Two roles make it usable: the domain expert, who understands what the data means and whether it is trustworthy, and the data scientist, who understands what methods it supports and what it cannot do.

F is the model. Building it, training it, deploying it, and keeping it running. This is where most budget conversations start, usually because it is the most visible part of the work. The ML engineer implements and deploys the model to infrastructure. The developer integrates it into the application. Compute costs live here too – training runs, inference endpoints, monitoring infrastructure, and so on.

Y is the labels. The desired outputs that tell the model what correct looks like. Labelling is frequently treated as a minor line item and frequently turns out to be the largest single cost in a production ML project. The labeller does the tagging. The domain expert defines what correct means in the first place – without that definition, the labeller has nothing to work from and the model has nothing to learn.

A project that is missing any of these roles is missing a component of the system. The domain expert who understands the data and the output is not a peripheral stakeholder – they own two thirds of the equation. Engaging them early, and budgeting for their time, is the difference between a project that produces a working system and one that produces a model nobody can evaluate.

A worked example: CakeyBakey

To make the cost structure concrete, consider a simple pipeline. CakeyBakey is a hypothetical service that takes a tweet describing a cake, generates an image of it, rates it as delicious or not, and posts the result back. No human involvement. Three model calls per event: language processing, image generation, classification.

Each event generates £0.50 in revenue. The first build decision is whether to use hosted model APIs or run self-hosted models on rented compute.

Hosted APIs (OpenAI equivalents for language and image generation) cost roughly £0.015 per event. Profit per event: £0.485.

Self-hosted models (open-weight equivalents on AWS GPU instances) cost roughly £0.12 per event at current compute rates. Profit per event: £0.38.

At these rates the hosted option costs one eighth as much per event (£0.015 against £0.12). Barring regulatory constraints or data sensitivity requirements, it is the obvious starting point for a new system. The self-hosted option becomes competitive at volume, or where data cannot leave the organisation. (The private inference articles cover that case.)

The point of the calculation is not the specific numbers – pricing changes and usage patterns vary – but the method. Each component of F(X)=Y has an identifiable cost per event. Once you have that, you can model the project.
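As a minimal sketch of that method, the per-event comparison can be written down in a few lines of Python. The class and variable names here are illustrative assumptions, not part of any real pricing API, and the figures are the rough CakeyBakey numbers quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildOption:
    """One way of serving the model calls (hypothetical helper for illustration)."""
    name: str
    cost_per_event: float  # GBP spent on model calls for one event

    def profit_per_event(self, revenue_per_event: float) -> float:
        # Profit is simply revenue minus the per-event model cost.
        return revenue_per_event - self.cost_per_event


REVENUE_PER_EVENT = 0.50  # GBP, from the CakeyBakey example

hosted = BuildOption("hosted APIs", cost_per_event=0.015)
self_hosted = BuildOption("self-hosted", cost_per_event=0.12)

for option in (hosted, self_hosted):
    profit = option.profit_per_event(REVENUE_PER_EVENT)
    print(f"{option.name}: £{profit:.3f} profit per event")

# How many times more expensive self-hosting is per event at these rates.
cost_ratio = self_hosted.cost_per_event / hosted.cost_per_event
print(f"self-hosted costs about {cost_ratio:.0f}x as much per event")
```

Swapping in current prices, or adding a third option, only means changing the numbers passed to `BuildOption`. The structure of the comparison stays the same.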

The cost of getting better

A working model is not a finished model. Performance degrades as the world it was trained on recedes. New patterns emerge in the data. Errors accumulate in the output. At some point, retraining becomes necessary.

Retraining costs reduce to two questions: how much will new labels cost, and how much improvement will they produce?

The label cost formula is straightforward:

label_cost = (n_labels / labels_per_day) x labeller_daily_cost
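
The same formula as a small Python function, as a sketch. The argument names mirror the formula; the input check is an added assumption, and fractional labeller days are allowed, which may not match how labellers are actually billed.

```python
def label_cost(n_labels: int, labels_per_day: float, labeller_daily_cost: float) -> float:
    """Cost of producing n_labels, given labeller throughput and day rate."""
    if labels_per_day <= 0:
        raise ValueError("labels_per_day must be positive")
    labeller_days = n_labels / labels_per_day  # may be fractional
    return labeller_days * labeller_daily_cost


# 100 labels at 20 labels per day and £500 per labeller day.
print(label_cost(100, labels_per_day=20, labeller_daily_cost=500))  # 2500.0
```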

The expected improvement can be estimated too, at least roughly. As a first-order rule of thumb, adding new labels to an existing dataset produces a relative gain proportional to the ratio of new examples to total examples. Take a model currently at 70% accuracy, trained on 500 examples and retrained with 100 new ones:

gain = 100 / (500 + 100) = 0.167
new_accuracy = (1 + 0.167) x 70% = 81.7%
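
The gain calculation above can be sketched as a function. This encodes only the article's rule of thumb, not a general law of how models improve. The function name is hypothetical, and the cap at 100% is an added assumption so the estimate never exceeds a valid accuracy.

```python
def estimated_accuracy(current_accuracy: float, n_existing: int, n_new: int) -> float:
    """Rule-of-thumb accuracy after retraining with n_new extra labelled examples."""
    if n_existing < 0 or n_new < 0 or n_existing + n_new == 0:
        raise ValueError("example counts must be non-negative and not both zero")
    gain = n_new / (n_existing + n_new)  # share of new examples in the enlarged dataset
    # Assumption: cap at 1.0, since the heuristic itself has no upper bound.
    return min(1.0, current_accuracy * (1 + gain))


estimate = estimated_accuracy(0.70, n_existing=500, n_new=100)
print(f"estimated accuracy after retraining: {estimate:.1%}")
```

Treat the result as a planning figure. The real gain depends on how informative the new examples are, and only evaluation on held-out data will confirm it.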

Returning to CakeyBakey: 100 new labels, at 20 labels per labeller day and £500 per day, take five labeller days and cost £2,500. The accuracy improvement from 70% to roughly 82% reduces the break-even point from around 7,800 events to around 6,700. At 500 events per day, that is roughly two days recovered.
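
The text does not spell out how the break-even point is computed, so the following is only a sketch under two stated assumptions: an event earns its profit only when the classification is correct, and there is a fixed cost to recover. The £2,650 fixed cost is a hypothetical value chosen for illustration because it gives figures of the same order as those above. It does not appear in the original example.

```python
import math


def break_even_events(fixed_cost: float, profit_per_event: float, accuracy: float) -> int:
    """Events needed to recover fixed_cost, assuming only correct events earn profit."""
    expected_profit = profit_per_event * accuracy
    if expected_profit <= 0:
        raise ValueError("expected profit per event must be positive")
    return math.ceil(fixed_cost / expected_profit)


FIXED_COST = 2650.0       # GBP, hypothetical: not stated in the example
PROFIT_PER_EVENT = 0.485  # GBP, hosted-API profit per event
EVENTS_PER_DAY = 500

before = break_even_events(FIXED_COST, PROFIT_PER_EVENT, accuracy=0.70)
after = break_even_events(FIXED_COST, PROFIT_PER_EVENT, accuracy=0.817)
days_recovered = (before - after) / EVENTS_PER_DAY

print(f"break-even before retraining: {before} events")
print(f"break-even after retraining:  {after} events")
print(f"days recovered at {EVENTS_PER_DAY} events/day: {days_recovered:.1f}")
```

A fuller model would also add the £2,500 label cost and the labelling delay to the amount to be recovered. The sketch leaves that out to keep the comparison to a single variable.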

Whether that investment makes sense depends on the business. The point is that it is a calculable decision, not a gut feeling. The same framework applies at any scale: what does a label cost, what does the improvement buy, and can the business afford the delay while the labelling happens.

What “good enough” means

Not every ML system needs to be maximally accurate. A classification model that is right 85% of the time may be entirely adequate if the fallback for the remaining 15% is a lightweight human review. A model that is right 95% of the time may be inadequate if the 5% failure rate occurs on high-value or high-risk cases.

Defining “good enough” before the project starts is the most valuable thing a business stakeholder can do. It sets the label budget, determines the evaluation criteria, and gives the technical team a target that is neither moving nor vague. Without it, projects tend to optimise indefinitely toward a standard nobody defined.

F(X)=Y is a framework for that conversation as much as a cost model. X tells you what data investment is required. Y tells you what correct looks like and what it costs to specify. F tells you what engineering work sits in between. The project is estimable once all three are defined.

(If you are assessing whether your organisation has the data foundation to start an ML project, the data maturity article covers the infrastructure side of that question.)

