A simple OpenAI model upgrade that was not so simple
In the age of services running on AI and LLM-generated content, the value of end-to-end tests cannot be overstated
So, I have been using gpt-4.1-mini to generate short product descriptions in a product workflow, and while it worked fine, I still got some slop, hallucinations and inaccuracies.
After some recent errors and hallucinations in the generated output, it sounded like a good time to switch to a newer model. At the time this looked like an easy change, as all I needed to do was swap the versions and move on to other things.
The goal was simple. I wanted:
fewer hallucinations and errors
more accurate descriptions
reduced cost from fewer retries, since this upgrade should give me fewer validation errors from the generated output.
So I went from gpt-4.1-mini to gpt-5-mini. I updated the package, updated the code. Small change. Then the end-to-end tests started failing.
First Error Encountered
max_tokens is deprecated in favor of max_completion_tokens.
At first glance, this looked fine. Small issue.
max_tokens was added in the first place because at one point the models would generate way too much text for a product image that only needed one short description. So the limit was there for a reason.
I needed to keep it short and straight to the point, and also save cost on output tokens. This also helps in cases where the model generates gibberish, as it limits how much gibberish I have to pay for in output token costs.
So I changed it to max_completion_tokens, updated the code again and deployed.
Ran the tests.
New error.
Second Error Encountered
completion.choices[0]?.message?.content came back empty.
I know this because I treat empty content as an error. This is an important validation, as the whole point of the feature is to return a usable description. If there is no content, there is nothing useful to validate or use on the product page.
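A minimal sketch of that kind of check, assuming a response shaped like the Chat Completions output. The `CompletionLike` type, the `DescriptionResult` type and the `extractDescription` helper are hypothetical names for illustration, not part of the openai package.

```typescript
// Hypothetical structural type: only the fields this check reads.
// The real response type in the openai package has many more fields.
interface CompletionLike {
  choices: Array<{
    message?: { content?: string | null };
    finish_reason?: string;
  }>;
}

// Hypothetical result type so callers can branch without try/catch.
type DescriptionResult =
  | { ok: true; description: string }
  | { ok: false; reason: string };

// Treat missing, null, empty or whitespace-only content as a failure.
function extractDescription(completion: CompletionLike): DescriptionResult {
  const choice = completion.choices[0];
  const content = (choice?.message?.content ?? "").trim();
  if (content.length === 0) {
    const finishReason = choice?.finish_reason ?? "unknown";
    return { ok: false, reason: `empty content (finish_reason: ${finishReason})` };
  }
  return { ok: true, description: content };
}
```

Including the finish reason in the failure message is a design choice: it makes a test failure point at the cause instead of just saying that the text was empty.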
So I logged the completion and saw this:
{
  "model": "gpt-5-mini-2025-08-07",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": ""
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 1204,
    "completion_tokens": 350,
    "completion_tokens_details": {
      "reasoning_tokens": 350
    }
  }
}
Here you will see:
content is empty
finish_reason is “length”
All the allocated max_completion_tokens were used for reasoning.
So, what is going on?
Going back to the doc comments in the openai package, you see:
/**
* An upper bound for the number of tokens that can be generated for a completion,
* including visible output tokens and
* [reasoning tokens](https://platform.openai.com/docs/guides/reasoning).
*/
Visiting the guide on reasoning models, it was clear that gpt-5-mini is a reasoning model, and that this new parameter max_completion_tokens has a different meaning than the previously used max_tokens.
max_completion_tokens is not just “how much text can the model return” anymore; it now includes the internal reasoning tokens the model uses before it gives the final answer. Basically, the same limit now behaves differently depending on the model you are using.
With a reasoning model, that max_completion_tokens limit is now shared between:
reasoning
actual visible text output
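To make that split concrete, here is a small sketch that works out how a completion's tokens were divided, using the `usage` fields from the logged response above. The `UsageLike` type and both helper names are hypothetical. It assumes `completion_tokens` already includes `reasoning_tokens`, which is how the logged numbers read.

```typescript
// Hypothetical structural type: only the usage fields read below.
interface UsageLike {
  completion_tokens: number;
  completion_tokens_details?: { reasoning_tokens?: number };
}

interface TokenSplit {
  reasoning: number;      // tokens spent on internal reasoning
  visible: number;        // tokens left for text the caller can actually see
  reasoningShare: number; // fraction of completion tokens spent on reasoning
}

// Assumes completion_tokens is the total of reasoning and visible tokens.
function splitCompletionTokens(usage: UsageLike): TokenSplit {
  const total = usage.completion_tokens;
  const reasoning = usage.completion_tokens_details?.reasoning_tokens ?? 0;
  const visible = Math.max(0, total - reasoning);
  const reasoningShare = total > 0 ? reasoning / total : 0;
  return { reasoning, visible, reasoningShare };
}

// True when the limit was hit and nothing was left for visible text.
function budgetExhaustedByReasoning(finishReason: string, usage: UsageLike): boolean {
  return finishReason === "length" && splitCompletionTokens(usage).visible === 0;
}
```

For the logged response (350 completion tokens, 350 reasoning tokens, finish_reason "length"), the visible share is 0 and `budgetExhaustedByReasoning` returns true.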
Why this is a problem
The challenge here is not that reasoning tokens exist; that is fine. The challenge is that the parameter becomes more ambiguous from an engineering point of view, at least in this particular use case.
Now it is not straightforward to predict how much of that token budget will be used for reasoning and how much will be used for the final text I actually need. That matters because I still want the same behavior I had before updating the model:
I do not want the model generating a full book when I asked for one short description
I want to cap output token costs in case the model decides to generate gibberish anyway
I want retries to happen only when necessary
I want somewhat predictable behavior. I know, this is LLM-generated output and it is non-deterministic, but this is one of many levers that help handle that non-deterministic nature
Also, this is not some deep reasoning task. I just needed short, accurate, controlled text output. With this setup, you might spend tokens, hit the limit, and still get nothing visible back. So yes, this can increase cost in ways you may not be expecting.
As I write this though, I wonder if this parameter could have been split into two parts, such as max_output_tokens and max_reasoning_tokens. Together with reasoning.effort, this might help users budget tokens better and get more “predictable” results based on their use case.
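Since that split does not exist, one possible workaround is to add reasoning headroom on top of the visible budget and lower the reasoning effort. The sketch below only builds the request parameters and makes no API call. Several things in it are assumptions: `buildDescriptionRequest` and both budget arguments are hypothetical, the numbers in the usage note are illustrative rather than measured, and I am assuming the Chat Completions API takes the effort setting as a top-level `reasoning_effort` field with these values, so check the current API reference for your model. The combined limit is still a single cap, so it does not guarantee how the model divides it.

```typescript
// Assumed set of effort values; verify against the API reference for your model.
type ReasoningEffort = "minimal" | "low" | "medium" | "high";

// Hypothetical shape holding only the request fields discussed in this post.
interface DescriptionRequest {
  model: string;
  messages: Array<{ role: "system" | "user"; content: string }>;
  max_completion_tokens: number;
  reasoning_effort: ReasoningEffort;
}

// visibleBudget: tokens wanted for the description itself.
// reasoningHeadroom: extra tokens the model may spend thinking first.
function buildDescriptionRequest(
  prompt: string,
  visibleBudget: number,
  reasoningHeadroom: number,
  effort: ReasoningEffort = "low",
): DescriptionRequest {
  if (visibleBudget <= 0 || reasoningHeadroom < 0) {
    throw new RangeError("visibleBudget must be positive and reasoningHeadroom non-negative");
  }
  return {
    model: "gpt-5-mini",
    messages: [{ role: "user", content: prompt }],
    // One shared cap: reasoning and visible text both draw from this total.
    max_completion_tokens: visibleBudget + reasoningHeadroom,
    reasoning_effort: effort,
  };
}
```

For example, `buildDescriptionRequest("Describe this mug", 120, 400)` yields a cap of 520 tokens with low effort. The worst-case cost per call goes up, but the call is less likely to end with nothing visible.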
Why end-to-end tests mattered here
This is one of those cases where end-to-end tests actually did their job. If one only looked at the code change, this would have seemed fine.
The package updated.
The model changed.
The request still went through.
But the actual output behavior changed, and that is the part that matters when building with LLMs. Their non-deterministic nature makes a lot of things difficult to test, so it is important to set some tight validations on the generated output. These can include text length validation, text pattern checks and matching, output accuracy, and flagging emojis that were generated when none were present in the input, among others.
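A few of those validations can be sketched as a single function that returns a list of problems. Everything here is illustrative: `validateDescription`, the option names and the thresholds used in the usage note are hypothetical, and output accuracy is left out because it needs domain-specific checks.

```typescript
// Hypothetical options; tune the limits and patterns to your own product copy.
interface ValidationOptions {
  minLength: number;
  maxLength: number;
  forbiddenPatterns: RegExp[];
}

// Returns human-readable problems; an empty array means the output passed.
function validateDescription(input: string, output: string, opts: ValidationOptions): string[] {
  const problems: string[] = [];
  const text = output.trim();

  if (text.length < opts.minLength) problems.push("too short");
  if (text.length > opts.maxLength) problems.push("too long");

  for (const pattern of opts.forbiddenPatterns) {
    pattern.lastIndex = 0; // avoid stale state if the pattern uses the g flag
    if (pattern.test(text)) problems.push(`matches forbidden pattern: ${pattern.source}`);
  }

  // Built at runtime so it does not depend on the compiler's regex target.
  // Note: Extended_Pictographic also covers symbols such as the trademark sign.
  const emoji = new RegExp("\\p{Extended_Pictographic}", "u");
  if (emoji.test(text) && !emoji.test(input)) problems.push("emoji not present in input");

  return problems;
}
```

With `{ minLength: 10, maxLength: 200, forbiddenPatterns: [/as an ai/i] }`, a plain sentence passes, while an output that adds an emoji the input did not contain is flagged. An end-to-end test can then assert that the returned list is empty.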
TL;DR
Upgrading an LLM can change more than output quality. It can also quietly change token budgeting in ways that increase cost and break assumptions, even for a change as small as a model version update.
This is why due diligence with end-to-end testing matters in LLM systems: it is often the only way to catch generated output inconsistency, validation issues, and cost regressions before they reach production.
