Friday, May 15, 2026

5 compact language models for agentic tool calling

# Introduction

Agentic AI systems depend on a model’s ability to reliably invoke tools: selecting the right function, formatting arguments correctly, and integrating results into multi-step workflows. Large frontier models such as ChatGPT, Claude, and Gemini do this well, but their cost, latency, and hardware requirements make them impractical for many real-world deployments. Small language models have stepped in to fill this gap, and several compact, lightweight options now offer first-class tool-calling support without requiring a data center to host them.

Here, in no particular order, are 5 small language models for agentic tool calling. Note that for convenience and consistency, all model links point to models hosted on Hugging Face.

# 1. SmolLM3-3B

| Technical aspect | Details |
|---|---|
| Parameters | 3B |
| Architecture | Decoder-only transformer (GQA + NoPE, 3:1 ratio) |
| Context length | 64K native; up to 128K with YaRN extrapolation |
| Training tokens | 11.2T |
| Multilingual support | 6 languages (EN, FR, ES, DE, IT, PT) |
| Reasoning mode | Dual mode (think/no-think switch) |
| Tool calling | Yes: JSON/XML (`xml_tools`) and Python (`python_tools`) |
| License | Apache 2.0 |

SmolLM3 is a 3B-parameter language model designed to push the limits of compact models, supporting dual-mode reasoning, 6 languages, and long context. It is a decoder-only transformer using Grouped Query Attention (GQA) and No Positional Embedding (NoPE) layers at a 3:1 ratio, pretrained on 11.2T tokens with a staged curriculum covering web, code, math, and reasoning data. Post-training included a mid-training phase on 140B reasoning tokens, followed by supervised fine-tuning and preference alignment with Anchored Preference Optimization (APO), a variant of Direct Preference Optimization. The model supports two distinct tool-calling interfaces, JSON/XML tool specs via `xml_tools` and Python-style function signatures via `python_tools`, making it highly adaptable to agent pipelines and RAG systems. As a fully open release, including weights, datasets, and training code, SmolLM3 is well suited for chatbots, RAG systems, and coding assistants on constrained hardware such as edge devices or low-VRAM machines.
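Whichever interface is used, the core loop is the same: describe the tools to the model, then parse the structured call it returns and dispatch it. As a minimal sketch (the `get_weather` tool and its schema are invented for illustration, and the JSON call format is assumed rather than SmolLM3's exact output), the dispatch side might look like:

```python
import json

# Hypothetical tool; invented for illustration.
def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"

TOOLS = {"get_weather": get_weather}

# Tool description of the kind passed to the chat template (xml_tools style).
tool_schema = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Simulated model output, for demonstration only:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)  # 22C and sunny in Paris
```

In a real pipeline, `tool_schema` would be handed to the tokenizer's chat template and `model_output` would come from generation; the parsing and dispatch logic stays the same.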

# 2. Qwen3-4B-Instruct-2507

| Technical aspect | Details |
|---|---|
| Parameters | 4.0B (3.6B non-embedding) |
| Architecture | Causal LM, 36 layers, GQA (32 query heads / 8 KV heads) |
| Context length | 262,144 tokens (native) |
| Reasoning mode | Non-thinking only (no `<think>` blocks) |
| Multilingual | 100+ languages |
| Tool calling | Yes: native, via Qwen-Agent/MCP |
| License | Apache 2.0 |

Qwen3-4B-Instruct-2507 is an updated version of Qwen3-4B in non-thinking mode, with significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, math, science, coding, and tool use. It also makes substantial gains in long-tail knowledge coverage across many languages. Both the Instruct and Thinking variants have 4 billion total parameters (3.6B non-embedding) across 36 transformer layers, using GQA with 32 query heads and 8 key/value heads, enabling efficient memory use over very long contexts. This non-thinking variant is optimized for direct, fast-response applications, delivering concise answers without visible chain-of-thought traces, which makes it well suited to chatbots, customer service, and tool-calling agents where low latency is critical. Qwen3 excels at tool calling, and Alibaba recommends the Qwen-Agent framework, which bundles tool-calling templates and parsers internally, reducing coding complexity, and supports MCP server configuration files.
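Qwen-Agent handles this parsing for you, but a rough sketch of what it involves is instructive. Assuming the Hermes-style format Qwen's chat template uses, where each call is wrapped in `<tool_call>...</tool_call>` tags containing a JSON object, extracting calls from raw output can look like:

```python
import json
import re

# Matches one JSON object inside a <tool_call> ... </tool_call> wrapper.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list:
    """Pull every <tool_call> JSON blob out of raw model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

# Simulated model output containing one call:
output = (
    "Sure, let me check that.\n"
    '<tool_call>\n{"name": "search", "arguments": {"query": "GQA"}}\n</tool_call>'
)

calls = extract_tool_calls(output)
print(calls[0]["name"], calls[0]["arguments"]["query"])  # search GQA
```

In practice you would let Qwen-Agent (or your inference server's tool-call parser) do this, since the framework also renders the matching prompt side of the template.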

# 3. Phi-3-mini-4k-instruct

| Technical aspect | Details |
|---|---|
| Parameters | 3.8B |
| Architecture | Decoder-only transformer |
| Context length | 4K tokens |
| Vocabulary size | 32,064 tokens |
| Training data | Synthetic + filtered public web data |
| Post-training | SFT + DPO |
| Tool calling | Yes: via chat template (requires HF transformers ≥ 4.41.2) |
| License | MIT |

Phi-3-mini-4k-instruct is a lightweight, state-of-the-art 3.8B open model trained on the Phi-3 datasets, which combine synthetic data with filtered publicly available web data, with an emphasis on high-quality, reasoning-dense content. The model went through a post-training process of both supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for instruction following and safety. Microsoft’s flagship “small but mighty” model, Phi-3-mini stood out at launch for its ability to run on devices as small as smartphones while rivaling GPT-3.5 on benchmarks. It is primarily intended for memory- and compute-constrained environments, latency-sensitive scenarios, and tasks requiring strong reasoning, especially math and logic. Although it is older than the other models on this list and limited to a 4K context window, its MIT license makes it one of the most permissively licensed options available, and its strong general reasoning has made it a popular base for fine-tuning in commercial applications.
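Because Phi-3-mini routes tool use through its chat template rather than dedicated tool-call tokens, tools are typically described in the prompt itself. A generic sketch (the `calculator` schema and the prompt wording are illustrative assumptions, not Microsoft's official format):

```python
import json

# Hypothetical tool schema, invented for illustration.
tools = [{
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression",
    "parameters": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

def build_system_prompt(tools: list) -> str:
    """Embed tool schemas into a system prompt and ask for a JSON-only reply."""
    return (
        "You have access to these tools:\n"
        + json.dumps(tools, indent=2)
        + '\nTo call one, reply with only: {"name": ..., "arguments": ...}'
    )

prompt = build_system_prompt(tools)
print("calculator" in prompt)  # True
```

The resulting string would go into the system turn of the chat template; with no reserved tokens involved, reliability depends on the model following the output-format instruction, which is why the newer models above ship structured formats instead.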

# 4. Gemma-4-E2B-it

| Technical aspect | Details |
|---|---|
| Effective parameters | 2.3B (5.1B total with per-layer embeddings) |
| Architecture | Hybrid attention (sliding window + global) + PLE |
| Layers | 35 |
| Sliding window | 512 tokens |
| Context length | 128K tokens |
| Vocabulary size | 262K |
| Modality | Text, image, audio (≤30 s), video (as frames) |
| Multilingual | 35+ languages natively; trained on over 140 |
| Tool calling | Yes: native function calling |
| License | Apache 2.0 |

Gemma-4-E2B-it is part of Google DeepMind’s Gemma 4 family, which features a hybrid attention mechanism interleaving local sliding-window attention with full global attention. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the long-range awareness needed for elaborate, long-context tasks. The “E” in E2B refers to “effective” parameters, made possible by a key architectural innovation called per-layer embeddings (PLE), which adds a dedicated conditioning vector to each decoder layer. This mechanism allows E2B to run in under 1.5 GB of memory when quantized while still producing useful results. The model supports native function calling, enabling agentic workflows, and is optimized for deployment on mobile and IoT devices, accepting text, image, audio, and video input. Released under Apache 2.0 (a change from the more restrictive custom license of earlier Gemma generations), Gemma 4 E2B is an attractive option for developers building multimodal agent applications that run entirely at the edge.
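The sub-1.5 GB figure is easy to sanity-check with back-of-the-envelope arithmetic: under 4-bit quantization each parameter costs half a byte, and only the effective parameters need to sit in fast memory:

```python
# Rough memory estimate for the 2.3B effective parameters at 4-bit quantization.
params = 2.3e9          # effective parameter count from the spec table
bytes_per_param = 0.5   # 4 bits = 0.5 bytes per weight
gib = params * bytes_per_param / 1024**3

print(f"{gib:.2f} GiB")  # ~1.07 GiB, consistent with the <1.5 GB claim
```

The remainder of the 1.5 GB budget goes to activations and the KV cache, which grow with context length rather than parameter count.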

# 5. Mistral-7B-Instruct-v0.3

| Technical aspect | Details |
|---|---|
| Parameters | 7.25B |
| Architecture | Transformer, GQA + SWA |
| Context length | 32,768 tokens |
| Vocabulary size | 32,768 tokens (expanded from v0.2) |
| Tokenizer | Mistral v3 tokenizer |
| Function calling | Yes: via `[TOOL_CALLS]` / `[AVAILABLE_TOOLS]` / `[TOOL_RESULTS]` tokens |
| License | Apache 2.0 |

Mistral-7B-Instruct-v0.3 is an instruction-tuned version of Mistral-7B-v0.3, which introduced three key changes over v0.2: a vocabulary expanded to 32,768 tokens, support for the v3 tokenizer, and function calling support. The model uses Grouped Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle long sequences efficiently, and its function calling is enabled by the extended vocabulary, which includes dedicated control tokens such as `[TOOL_CALLS]`, `[AVAILABLE_TOOLS]`, and `[TOOL_RESULTS]`. As the largest model in this lineup at 7B parameters, Mistral-7B-Instruct-v0.3 offers the best overall instruction-following performance of the group and has become an industry-standard workhorse, widely available through Ollama, vLLM, and most inference platforms.
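A simplified sketch of how those control tokens structure a conversation (the official Mistral tokenizer emits them as reserved token IDs rather than plain text, so treat this as an illustration of the wire format, not a drop-in implementation):

```python
import json

def render_prompt(tools: list, user_msg: str) -> str:
    """Assemble a simplified v3-style prompt: tools first, then the user turn.
    The real tokenizer encodes the brackets as special control tokens."""
    return (
        f"[AVAILABLE_TOOLS]{json.dumps(tools)}[/AVAILABLE_TOOLS]"
        f"[INST] {user_msg} [/INST]"
    )

def parse_tool_calls(completion: str) -> list:
    """Extract the JSON list the model emits after its [TOOL_CALLS] token."""
    marker = "[TOOL_CALLS]"
    if marker not in completion:
        return []
    return json.loads(completion.split(marker, 1)[1].strip())

# Simulated completion, for demonstration only:
calls = parse_tool_calls('[TOOL_CALLS][{"name": "lookup", "arguments": {"id": 7}}]')
print(calls[0]["name"])  # lookup
```

After executing a call, the result would be fed back inside a `[TOOL_RESULTS]...[/TOOL_RESULTS]` span so the model can compose its final answer.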

# Summary

The five models discussed here – SmolLM3-3B, Qwen3-4B-Instruct-2507, Phi-3-mini-4k-instruct, Gemma-4-E2B-it, and Mistral-7B-Instruct-v0.3 – cover a range of architectures, parameter counts, context windows, and release dates, but they share one critical feature: they all support structured tool calling in a compact, open package.

From Hugging Face’s fully open SmolLM3 to Google DeepMind’s multimodal, edge-optimized Gemma 4 E2B, this selection shows that capable agentic models no longer require massive infrastructure or frontier models to deploy. Whether your priority is on-device inference, long-context support, multilingual coverage, or the most permissive license possible, there’s a model on this list worth exploring.

Note that these are not the only compact language models capable of tool calling, but they are a representative set of those I have direct experience with and feel comfortable including based on my own results.

Matthew Mayo (@mattmayo13) holds a master’s degree in computer science and a graduate diploma in data mining. As editor-in-chief of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since age 6.
