Friday, May 15, 2026

5 compact language models for agentic tool calling

# Introduction

Agentic AI systems depend on a model’s ability to reliably invoke tools: selecting the right function, formatting arguments correctly, and integrating results into multi-step workflows. Large frontier models such as ChatGPT, Claude, and Gemini do this well, but their cost, latency, and hardware requirements make them impractical for many real-world deployments. Small language models have stepped in to fill this gap, and several compact, lightweight options now offer first-class tool-calling support without requiring a data center to host them.

Here, in no particular order, are 5 small language models for agentic tool calling. Note that for convenience and consistency, all model links point to models hosted on Hugging Face.

# 1. SmolLM3-3B

| Technical aspect | Details |
|---|---|
| Parameters | 3B |
| Architecture | Decoder-only transformer (GQA + NoPE, 3:1 ratio) |
| Context length | 64K native; up to 128K with YaRN extrapolation |
| Training tokens | 11.2T |
| Multilingual support | 6 languages (EN, FR, ES, DE, IT, PT) |
| Reasoning mode | Dual mode (think/no-think switch) |
| Tool calling | Yes: JSON/XML (`xml_tools`) and Python (`python_tools`) |
| License | Apache 2.0 |

SmolLM3 is a 3B-parameter language model designed to push the limits of compact models, supporting dual-mode reasoning, 6 languages, and long context. It is a decoder-only transformer using Grouped Query Attention (GQA) and No Positional Embedding (NoPE) layers at a 3:1 ratio, pretrained on 11.2T tokens with a staged curriculum covering web, code, math, and reasoning data. Post-training included a mid-training phase on 140B reasoning tokens, followed by supervised fine-tuning and preference alignment with Anchored Preference Optimization (APO), a variant of Direct Preference Optimization. The model supports two distinct tool-calling interfaces, JSON/XML tool specs via `xml_tools` and Python-style function signatures via `python_tools`, making it highly adaptable to agent pipelines and RAG systems. As a fully open release, including weights, datasets, and training code, SmolLM3 is well suited for chatbots, RAG systems, and coding assistants on constrained hardware such as edge devices or low-VRAM machines.
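Whichever interface is used, the core loop is the same: describe the tools to the model, then parse the structured call it returns and dispatch it. As a minimal sketch (the `get_weather` tool and its schema are invented for illustration, and the JSON call format is assumed rather than SmolLM3's exact output), the dispatch side might look like:

```python
import json

# Hypothetical tool; invented for illustration.
def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"

TOOLS = {"get_weather": get_weather}

# Tool description of the kind passed to the chat template (xml_tools style).
tool_schema = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Simulated model output, for demonstration only:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)  # 22C and sunny in Paris
```

In a real pipeline, `tool_schema` would be handed to the tokenizer's chat template and `model_output` would come from generation; the parsing and dispatch logic stays the same.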

# 2. Qwen3-4B-Instruct-2507

| Technical aspect | Details |
|---|---|
| Parameters | 4.0B (3.6B non-embedding) |
| Architecture | Causal LM, 36 layers, GQA (32 query heads / 8 KV heads) |
| Context length | 262,144 tokens (native) |
| Reasoning mode | Non-thinking only (no `<think>` blocks) |
| Multilingual | 100+ languages |
| Tool calling | Yes: native, via Qwen-Agent/MCP |
| License | Apache 2.0 |

Qwen3-4B-Instruct-2507 is an updated version of Qwen3-4B in non-thinking mode, with significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, math, science, coding, and tool use. It also makes substantial gains in long-tail knowledge coverage across many languages. Both the Instruct and Thinking variants have 4 billion total parameters (3.6B non-embedding) across 36 transformer layers, using GQA with 32 query heads and 8 key/value heads, enabling efficient memory use over very long contexts. This non-thinking variant is optimized for direct, fast-response applications, delivering concise answers without visible chain-of-thought traces, which makes it well suited to chatbots, customer service, and tool-calling agents where low latency is critical. Qwen3 excels at tool calling, and Alibaba recommends the Qwen-Agent framework, which bundles tool-calling templates and parsers internally, reducing coding complexity, and supports MCP server configuration files.
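Qwen-Agent handles this parsing for you, but a rough sketch of what it involves is instructive. Assuming the Hermes-style format Qwen's chat template uses, where each call is wrapped in `<tool_call>...</tool_call>` tags containing a JSON object, extracting calls from raw output can look like:

```python
import json
import re

# Matches one JSON object inside a <tool_call> ... </tool_call> wrapper.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list:
    """Pull every <tool_call> JSON blob out of raw model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

# Simulated model output containing one call:
output = (
    "Sure, let me check that.\n"
    '<tool_call>\n{"name": "search", "arguments": {"query": "GQA"}}\n</tool_call>'
)

calls = extract_tool_calls(output)
print(calls[0]["name"], calls[0]["arguments"]["query"])  # search GQA
```

In practice you would let Qwen-Agent (or your inference server's tool-call parser) do this, since the framework also renders the matching prompt side of the template.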

# 3. Phi-3-mini-4k-instruct

| Technical aspect | Details |
|---|---|
| Parameters | 3.8B |
| Architecture | Decoder-only transformer |
| Context length | 4K tokens |
| Vocabulary size | 32,064 tokens |
| Training data | Synthetic + filtered public web data |
| Post-training | SFT + DPO |
| Tool calling | Yes: via chat template (requires HF transformers ≥ 4.41.2) |
| License | MIT |

Phi-3-mini-4k-instruct is a lightweight, state-of-the-art 3.8B open model trained on the Phi-3 datasets, which combine synthetic data with filtered publicly available web data, with an emphasis on high-quality, reasoning-dense content. The model went through a post-training process of both supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for instruction following and safety. Microsoft’s flagship “small but mighty” model, Phi-3-mini stood out at launch for its ability to run on devices as small as smartphones while rivaling GPT-3.5 on benchmarks. It is primarily intended for memory- and compute-constrained environments, latency-sensitive scenarios, and tasks requiring strong reasoning, especially math and logic. Although it is older than the other models on this list and limited to a 4K context window, its MIT license makes it one of the most permissively licensed options available, and its strong general reasoning has made it a popular base for fine-tuning in commercial applications.
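Because Phi-3-mini routes tool use through its chat template rather than dedicated tool-call tokens, tools are typically described in the prompt itself. A generic sketch (the `calculator` schema and the prompt wording are illustrative assumptions, not Microsoft's official format):

```python
import json

# Hypothetical tool schema, invented for illustration.
tools = [{
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression",
    "parameters": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

def build_system_prompt(tools: list) -> str:
    """Embed tool schemas into a system prompt and ask for a JSON-only reply."""
    return (
        "You have access to these tools:\n"
        + json.dumps(tools, indent=2)
        + '\nTo call one, reply with only: {"name": ..., "arguments": ...}'
    )

prompt = build_system_prompt(tools)
print("calculator" in prompt)  # True
```

The resulting string would go into the system turn of the chat template; with no reserved tokens involved, reliability depends on the model following the output-format instruction, which is why the newer models above ship structured formats instead.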

# 4. Gemma-4-E2B-it

| Technical aspect | Details |
|---|---|
| Effective parameters | 2.3B (5.1B total with per-layer embeddings) |
| Architecture | Hybrid attention (sliding window + global) + PLE |
| Layers | 35 |
| Sliding window | 512 tokens |
| Context length | 128K tokens |
| Vocabulary size | 262K |
| Modality | Text, image, audio (≤30 s), video (as frames) |
| Multilingual | 35+ languages natively; trained on over 140 |
| Tool calling | Yes: native function calling |
| License | Apache 2.0 |

Gemma-4-E2B-it is part of Google DeepMind’s Gemma 4 family, which features a hybrid attention mechanism interleaving local sliding-window attention with full global attention. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the long-range awareness needed for elaborate, long-context tasks. The “E” in E2B refers to “effective” parameters, made possible by a key architectural innovation called per-layer embeddings (PLE), which adds a dedicated conditioning vector to each decoder layer. This mechanism allows E2B to run in under 1.5 GB of memory when quantized while still producing useful results. The model supports native function calling, enabling agentic workflows, and is optimized for deployment on mobile and IoT devices, accepting text, image, audio, and video input. Released under Apache 2.0 (a change from the more restrictive custom license of earlier Gemma generations), Gemma 4 E2B is an attractive option for developers building multimodal agent applications that run entirely at the edge.
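The sub-1.5 GB figure is easy to sanity-check with back-of-the-envelope arithmetic: under 4-bit quantization each parameter costs half a byte, and only the effective parameters need to sit in fast memory:

```python
# Rough memory estimate for the 2.3B effective parameters at 4-bit quantization.
params = 2.3e9          # effective parameter count from the spec table
bytes_per_param = 0.5   # 4 bits = 0.5 bytes per weight
gib = params * bytes_per_param / 1024**3

print(f"{gib:.2f} GiB")  # ~1.07 GiB, consistent with the <1.5 GB claim
```

The remainder of the 1.5 GB budget goes to activations and the KV cache, which grow with context length rather than parameter count.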

# 5. Mistral-7B-Instruct-v0.3

| Technical aspect | Details |
|---|---|
| Parameters | 7.25B |
| Architecture | Transformer, GQA + SWA |
| Context length | 32,768 tokens |
| Vocabulary size | 32,768 tokens (expanded from v0.2) |
| Tokenizer | Mistral v3 tokenizer |
| Function calling | Yes: via `[TOOL_CALLS]` / `[AVAILABLE_TOOLS]` / `[TOOL_RESULTS]` tokens |
| License | Apache 2.0 |

Mistral-7B-Instruct-v0.3 is an instruction-tuned version of Mistral-7B-v0.3, which introduced three key changes over v0.2: a vocabulary expanded to 32,768 tokens, support for the v3 tokenizer, and function calling support. The model uses Grouped Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle long sequences efficiently, and its function calling is enabled by the extended vocabulary, which includes dedicated control tokens such as `[TOOL_CALLS]`, `[AVAILABLE_TOOLS]`, and `[TOOL_RESULTS]`. As the largest model in this lineup at 7B parameters, Mistral-7B-Instruct-v0.3 offers the best overall instruction-following performance of the group and has become an industry-standard workhorse, widely available through Ollama, vLLM, and most inference platforms.
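A simplified sketch of how those control tokens structure a conversation (the official Mistral tokenizer emits them as reserved token IDs rather than plain text, so treat this as an illustration of the wire format, not a drop-in implementation):

```python
import json

def render_prompt(tools: list, user_msg: str) -> str:
    """Assemble a simplified v3-style prompt: tools first, then the user turn.
    The real tokenizer encodes the brackets as special control tokens."""
    return (
        f"[AVAILABLE_TOOLS]{json.dumps(tools)}[/AVAILABLE_TOOLS]"
        f"[INST] {user_msg} [/INST]"
    )

def parse_tool_calls(completion: str) -> list:
    """Extract the JSON list the model emits after its [TOOL_CALLS] token."""
    marker = "[TOOL_CALLS]"
    if marker not in completion:
        return []
    return json.loads(completion.split(marker, 1)[1].strip())

# Simulated completion, for demonstration only:
calls = parse_tool_calls('[TOOL_CALLS][{"name": "lookup", "arguments": {"id": 7}}]')
print(calls[0]["name"])  # lookup
```

After executing a call, the result would be fed back inside a `[TOOL_RESULTS]...[/TOOL_RESULTS]` span so the model can compose its final answer.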

# Summary

The five models discussed here – SmolLM3-3B, Qwen3-4B-Instruct-2507, Phi-3-mini-4k-instruct, Gemma-4-E2B-it, and Mistral-7B-Instruct-v0.3 – cover a range of architectures, parameter counts, context windows, and release dates, but they share one critical feature: they all support structured tool calling in a compact, open package.

From Hugging Face’s fully open SmolLM3 to Google DeepMind’s multimodal, edge-optimized Gemma 4 E2B, this selection shows that capable agentic models no longer require massive infrastructure or frontier models to deploy. Whether your priority is on-device inference, long-context support, multilingual coverage, or the most permissive license possible, there’s a model on this list worth exploring.

Note that these are not the only compact language models capable of tool calling, but they are a representative set of those I have direct experience with and feel comfortable including based on my own results.

Matthew Mayo (@mattmayo13) holds a master’s degree in computer science and a graduate diploma in data mining. As editor-in-chief of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since age 6.
