ChatGPT, a chatbot developed by OpenAI. [Photo: Shutterstock]

OpenAI engineers have reportedly shared an optimisation technique internally that can cut artificial intelligence (AI) inference costs to less than half their previous level.

Foreign media outlets, including Gigazine and The Information, reported on Tuesday that the technique was discussed inside OpenAI in early June and is already being applied to some ChatGPT processing for guest users.

Inference costs are the operating expenses incurred each time an AI model generates a response to a user’s input. Because they recur every time the service is used, whether for chatbot conversations, coding assistance or API calls, they are seen as a key factor in the profitability of companies running large-scale AI services.

OpenAI engineers reportedly told colleagues they had found a way to hold inference costs to less than half their previous level with the new optimisation technique. The specific method was not disclosed. Applying it to ChatGPT guest users allowed the number of Nvidia GPUs used to be reduced to about 200, the reports said.

The development is drawing attention because inference costs, unlike training costs, accrue continuously while a service is running. Training a cutting-edge AI model is a large one-time investment, but inference adds cost with every conversational response, API request and agent task. If software optimisation alone can sharply cut the GPU usage needed for free access, it could deliver operating cost savings that are difficult to achieve by renegotiating hardware contracts.

OpenAI’s cost burden has long been a topic of discussion in the market. Industry analyst Edward Zitron estimated that OpenAI likely spent more than $5 billion on inference in the first half of 2025. That amount was said to far exceed the company’s projected revenue at the time.

The industry is also focused on what kind of optimisation was applied. Improved server utilisation has been cited as a driver of the cost cuts, with possible candidates including more efficient batch processing, better cache reuse, quantisation and routing simpler queries to cheaper models. These are estimates based on external observation, and it has not been confirmed which combination of techniques was actually used.
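
To make two of those candidates concrete, the sketch below shows in toy form how cache reuse and routing simple queries to a cheaper model can lower the total cost of serving the same requests. It is purely illustrative and is not OpenAI's method, which has not been disclosed. The class `ToyServer`, the word-count heuristic and the cost figures are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical per-request costs in arbitrary units; not real OpenAI figures.
LARGE_MODEL_COST = 10.0
SMALL_MODEL_COST = 1.0


@dataclass
class ToyServer:
    """Toy inference front end with an exact-match cache and a simple router."""
    cache: dict = field(default_factory=dict)
    total_cost: float = 0.0

    def is_simple(self, prompt: str) -> bool:
        # Hypothetical heuristic: prompts of eight words or fewer are "simple".
        return len(prompt.split()) <= 8

    def answer(self, prompt: str) -> str:
        key = prompt.strip().lower()
        if key in self.cache:
            # Cache hit: reuse the stored reply, no model call and no added cost.
            return self.cache[key]
        if self.is_simple(prompt):
            model, cost = "small", SMALL_MODEL_COST
        else:
            model, cost = "large", LARGE_MODEL_COST
        self.total_cost += cost
        reply = f"[{model}] reply to: {key}"
        self.cache[key] = reply
        return reply


def baseline_cost(prompts: list[str]) -> float:
    """Cost if every request goes to the large model with no cache."""
    return LARGE_MODEL_COST * len(prompts)


prompts = [
    "What is the capital of France?",
    "what is the capital of france?",  # repeat of the first request
    "Write a detailed migration plan for moving a monolith "
    "to microservices step by step",
    "Hello",
]
server = ToyServer()
for p in prompts:
    server.answer(p)

print("baseline:", baseline_cost(prompts))
print("with cache and routing:", server.total_cost)
```

In this toy run the four requests cost 12 units against a baseline of 40, because one request is served from the cache and two go to the cheaper model. The numbers only reflect the assumed costs above and say nothing about the savings OpenAI actually achieved.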

The scope of application also appears limited so far. The only confirmed target is some processing for ChatGPT guest users. It is unclear whether the same approach is being applied to users with free or paid accounts. The next point of interest is whether OpenAI can extend the technique to the full service or its API product lineup.

If broader application is possible, OpenAI would have more options. The industry has speculated that OpenAI could cut prices or handle more agent tasks without buying additional chips. With competition intensifying to secure more data centres and AI chips, defending margins by making existing servers more efficient is seen as meaningful from a cost perspective.

The key point is that OpenAI has shown it may be possible to lower service operating expenses without significantly adding new hardware. If the actual technique and its scope of application are confirmed, changes could follow in ChatGPT pricing policy, free usage levels and AI infrastructure investment strategy.

Copyright © DigitalToday. All rights reserved. Unauthorized reproduction and redistribution are prohibited.