As AI use rises and costs become a burden, “token engineering” is coming

From left, Coinbase CEO Brian Armstrong and Microsoft’s Nicolas Bustamante. [Photo: edited with Gemini]

As AI use grows, so does the cost burden, and companies are showing greater interest in tips for using AI as efficiently as possible.

Some companies have recently drawn attention by sharing their know-how.

Coinbase, the largest cryptocurrency exchange in the United States, disclosed how it cut AI costs by nearly half while continuing to increase token usage. According to a post shared by Coinbase CEO Brian Armstrong on the social media platform X (formerly Twitter), the key was not usage limits but better default model settings, routing and caching.

Rather than lowering the usage cap, Coinbase is experimenting with setting open-weight models such as GLM 5.2 and Kimi 2.7 as the defaults in its LLM gateway.

The company says lowering the cap is ineffective because 91 percent of employees have never hit the usage limit. Engineers can still freely choose the model they want even after the default model is changed.

Improving routing is also an important element.

According to Armstrong, Coinbase built a system that first analyzes each prompt, checks cache status and model pricing, and then automatically routes the request to the most suitable model. Armstrong said, “High-performance models are needed for tasks like drawing up complex plans, but cheaper models are sufficient for simple execution. Ultimately, the goal is to have AI automate the choice of model itself.”
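The routing idea can be illustrated with a minimal Python sketch. It is not Coinbase's system: the model names, prices, keyword list and cache discount are all hypothetical, and a real router would analyze prompts with something far more capable than substring matching. The sketch only shows the three inputs the post mentions (prompt analysis, cache status, model pricing) feeding one decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str                   # hypothetical model name
    price_per_1k_tokens: float  # hypothetical price, for illustration only
    handles_planning: bool      # whether we trust it with complex planning


# Hypothetical catalogue: one expensive planner, one cheap executor.
MODELS = [
    ModelOption("large-model", 10.0, True),
    ModelOption("small-model", 1.0, False),
]

# Crude stand-in for prompt analysis (assumption: keyword matching).
PLANNING_HINTS = ("plan", "design", "architecture", "strategy")


def needs_planning(prompt):
    text = prompt.lower()
    return any(hint in text for hint in PLANNING_HINTS)


def route(prompt, warm_cache_models=frozenset(), cache_discount=0.1):
    """Return the name of the cheapest model able to handle the prompt.

    warm_cache_models: names of models that already hold a usable cache
    for this request; their price is multiplied by cache_discount.
    """
    planning = needs_planning(prompt)
    candidates = [m for m in MODELS if m.handles_planning or not planning]

    def effective_price(model):
        discount = cache_discount if model.name in warm_cache_models else 1.0
        return model.price_per_1k_tokens * discount

    return min(candidates, key=effective_price).name
```

Under these assumptions, `route("Rename this variable")` picks the cheap model, while a prompt that asks for a migration plan is sent to the larger one. A warm cache can tip the decision back toward a model that would otherwise cost more.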

Coinbase is also making every request cache-aware so that existing caches are reused as much as possible. A cache is a temporary store that keeps previously processed prompts and answers and returns the stored answer immediately for identical or similar requests, without recomputing it.

After properly applying caching to LibreChat, an open-source AI chat interface, Coinbase was able to serve 60 percent of all requests from stored answers, up from only 5 percent previously.
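A minimal Python sketch of such an answer cache follows. It assumes exact matching after case and whitespace normalisation, which is only one way to treat requests as “identical or similar”; the article does not describe how Coinbase or LibreChat implement caching, and the class and method names here are hypothetical.

```python
import hashlib


class ResponseCache:
    """Stores answers keyed by a normalised prompt and tracks the hit rate."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.requests = 0

    @staticmethod
    def _key(prompt):
        # Assumption: prompts that differ only in case or spacing are "the same".
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def answer(self, prompt, call_model):
        """Return a stored answer if one exists, else call the model and store it."""
        self.requests += 1
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = call_model(prompt)  # the expensive step a hit avoids
        self._store[key] = result
        return result

    @property
    def hit_rate(self):
        return self.hits / self.requests if self.requests else 0.0
```

The `hit_rate` property is the kind of figure the 5 percent and 60 percent numbers refer to: the share of requests answered without calling a model.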

To minimise context, the company focused on starting a new session when switching tasks, narrowing the scope of file context, and disconnecting tools that are not in use.

Armstrong said, “The goal is not the number of tokens but reducing wasted tokens,” adding, “The key is not to curb usage but to build infrastructure that makes exponential growth sustainable.”

Microsoft’s Nicolas Bustamante also agreed with Armstrong’s post and emphasised background agents as the next key theme in AI cost optimisation.

He said the current situation marks the start of an era of “token engineering”. If the first generation was “let’s use more AI,” the next stage is “use the right tokens, with the right model, at the right time, with the right cache.”

The next optimisation tool he is watching is background agents. Tasks such as code review, evaluation, refactoring, data extraction, document updates, security scans, inbox organisation, CRM enhancement, test generation and migration planning do not necessarily need to be handled immediately. They can be done 30 minutes, 2 hours, or even a day later.

He predicted, “The fixed token pricing system will change. GPU capacity alternates between peaks and slack depending on the time of day. As interactive use concentrates during working hours, background tasks can be run more cheaply when there is spare capacity. There will be a shift from fixed token pricing to ‘delay-tolerant token pricing’.” He said that would mean a structure in which immediate needs use real-time pricing, waiting an hour costs less, and waiting 24 hours costs much less.
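As a rough illustration of what delay-tolerant pricing could look like, the Python sketch below applies a price multiplier based on how long a job can wait. The tier boundaries follow the real-time, one-hour and 24-hour example, but the multipliers (1.0, 0.7, 0.4) and function names are invented for illustration; no provider's actual price list is implied.

```python
# (minimum tolerated wait in hours, price multiplier) -- hypothetical values.
TIERS = [
    (0.0, 1.0),   # needed immediately: real-time price
    (1.0, 0.7),   # can wait an hour: cheaper
    (24.0, 0.4),  # can wait a day: much cheaper
]


def price_multiplier(max_wait_hours):
    """Return the multiplier of the deepest tier the job's wait tolerance reaches."""
    if max_wait_hours < 0:
        raise ValueError("max_wait_hours must be non-negative")
    multiplier = TIERS[0][1]
    for min_wait, tier_multiplier in TIERS:
        if max_wait_hours >= min_wait:
            multiplier = tier_multiplier
    return multiplier


def job_cost(tokens, base_price_per_1k, max_wait_hours):
    """Cost of a job given its token count, base price and wait tolerance."""
    return tokens / 1000 * base_price_per_1k * price_multiplier(max_wait_hours)
```

With these made-up numbers, a 10,000-token background job at a base price of 2.0 per 1,000 tokens would cost 20.0 if run immediately and 8.0 if it can wait a full day.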

He said, “The future AI stack will evolve toward simultaneously optimising model quality, cache status, delay tolerance, GPU capacity and business value, and agents will decide not only which model to use but also when to run tasks.”

Chi-gyu Hwang delight@d-today.co.kr
