Cactus Compute has unveiled Needle, a 26-million-parameter tool-calling AI model that can run locally on small devices such as smartphones. On May 14, online outlet Gigazine reported that Needle was developed by distilling the tool-calling function of Google's AI model Gemini-3.1-Flash-Lite.
Needle is designed to run directly on devices owned by general users. Its prefill speed is 6,000 tokens per second, and its decode speed is 1,200 tokens per second.
Pre-training took 27 hours on 16 TPU v6e chips. Post-training, which used a tool-calling dataset generated by Gemini, was completed in 45 minutes.
Developer Henry Ndubuaku said there had been almost no attempts to build AI agents that run even on low-cost smartphones. He also explained that since AI agents are built on tool calling, he judged large models to be excessive for the task. Needle was therefore designed as a lightweight model specialized in tool calling, with a structure small enough to run on edge devices such as smartphones.
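To illustrate what "built on tool calling" means, here is a minimal sketch of the pattern: the model emits a structured call instead of prose, and the runtime dispatches it to a registered function. The tool name, JSON schema, and function below are invented for illustration and are not Needle's actual API.

```python
import json

def get_weather(city: str) -> str:
    # Stub tool; a real agent would query a weather service here.
    return f"Sunny in {city}"

# Registry mapping tool names to callables (hypothetical example).
TOOLS = {"get_weather": get_weather}

def run_tool_call(model_output: str) -> str:
    """Parse a JSON tool call like
    {"tool": "get_weather", "args": {"city": "Seoul"}}
    and dispatch it to the registered function."""
    call = json.loads(model_output)
    fn = TOOLS[call["tool"]]
    return fn(**call["args"])

# A small tool-calling model only needs to produce this short JSON string,
# which is why a 26M-parameter model can be enough for the job.
print(run_tool_call('{"tool": "get_weather", "args": {"city": "Seoul"}}'))
```

The key point is that the model's entire output is a short structured call, so the generation task is far narrower than open-ended text, which is the rationale for a very small specialized model.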
Cactus Compute distributes Needle through GitHub and Hugging Face under the MIT license. The company is also developing Cactus Chat, an app for running AI models on smartphones.
Needle is openly described as having been developed by distilling Gemini-3.1-Flash-Lite, even though Google's terms prohibit extraction and distillation from Gemini.