SourceScore

Concept · 2026-05-17

Function calling — how LLMs invoke tools

Function calling is how modern LLMs invoke external tools (APIs, databases, code execution). OpenAI launched it in June 2023; the pattern is now table stakes across every major vendor.

Definition

Function calling (also called "tool use" in Anthropic terminology) is an LLM capability where the model emits a structured tool invocation — function name + JSON-shape arguments — instead of free-text. The runtime executes the function and feeds the result back; the model continues with the result in context.

Concretely: you give the model a JSON schema of available tools. The model decides when to call which tool. It emits a message with a tool_calls field. You execute the function. You append the result as a tool message. You ask the model to continue. The model produces the final user-facing response (or calls another tool).
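That sequence can be sketched with OpenAI-style message dicts (field names follow OpenAI's Chat Completions format; the model call itself is stubbed, and the id, tool name, and result values here are illustrative):

```python
import json

# One tool-call round trip, shown as the message list the runtime maintains.
messages = [
    {"role": "user", "content": "Is 'GPT-4 has 1.8T parameters' a verified claim?"},
]

# 1. The model answers with a tool call instead of text (arguments is a JSON string).
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "verify_claim",
            "arguments": json.dumps({"claim": "GPT-4 has 1.8T parameters"}),
        },
    }],
}
messages.append(assistant_turn)

# 2. The runtime executes the function and appends the result as a tool message,
#    echoing tool_call_id so the model can match result to request.
result = {"verified": False, "confidence": 0.2}  # stand-in for real execution
messages.append({
    "role": "tool",
    "tool_call_id": "call_1",
    "content": json.dumps(result),
})

# 3. Send `messages` back to the model; it now answers in plain text.
print([m["role"] for m in messages])  # ['user', 'assistant', 'tool']
```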

Why it matters

Function calling closed the loop between the LLM and the deterministic world. Before it, you had to parse free-text responses to extract structured intent (fragile). With it, the model emits structured output that's directly machine-readable.

Every modern agent loop, RAG retriever, code-running assistant, and tool-augmented chatbot is built on function calling. It's the primitive.

A short history

  • 2023-06 — OpenAI introduces function calling in the Chat Completions API via the functions + function_call parameters. Initial release on GPT-3.5 Turbo and GPT-4.
  • 2023-11 — OpenAI renames the parameters to tools + tool_choice and adds parallel tool calling (the model can request multiple tool calls in a single turn).
  • 2023-11 — Anthropic adds tool use to Claude 2.1 (later refined for Claude 3 family).
  • 2024 — Google Gemini, Meta Llama 3.1+, Mistral, Cohere, and most open-weight frontier models add function calling.
  • 2024-11 — Anthropic releases Model Context Protocol (MCP) — open standard for tool exposure across vendors. Adopted by OpenAI and most agent frameworks within ~6 months.

The JSON schema

Function calling specifies tools via JSON Schema (OpenAI) or a similar structure. Example:

{
  "type": "function",
  "function": {
    "name": "verify_claim",
    "description": "Verify a natural-language factual claim about AI/ML.",
    "parameters": {
      "type": "object",
      "properties": {
        "claim": {
          "type": "string",
          "description": "Natural-language claim to verify"
        },
        "min_confidence": {
          "type": "number",
          "description": "Minimum confidence threshold (0.0-1.0)",
          "default": 0.85
        }
      },
      "required": ["claim"]
    }
  }
}

The model sees this schema, decides when to call the function, fills in the arguments per the schema, and emits a structured tool_calls block.
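For the verify_claim schema above, an emitted tool call might look like the following (OpenAI-style shape, hypothetical values). Note that arguments arrives as a JSON string that must be parsed, and optional fields like min_confidence may be omitted, so the runtime applies schema defaults itself:

```python
import json

# Hypothetical tool_calls entry as the model might emit it.
tool_call = {
    "id": "call_abc",
    "type": "function",
    "function": {
        "name": "verify_claim",
        "arguments": '{"claim": "Transformers were introduced in 2017"}',
    },
}

# Parse the JSON-string arguments into a dict.
args = json.loads(tool_call["function"]["arguments"])

# The model omitted the optional field; apply the schema default runtime-side.
args.setdefault("min_confidence", 0.85)

print(args)  # {'claim': 'Transformers were introduced in 2017', 'min_confidence': 0.85}
```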

The agent loop

  1. Send messages + tool schemas to the model
  2. If model emits tool_calls: execute each tool, append results as tool messages, loop back to step 1
  3. If model emits a final text message: done

Parallel tool calls let the runtime execute multiple tools concurrently before looping back. Streaming lets the client receive a tool call incrementally (name first, then arguments token by token) instead of waiting for the model's full turn.
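A minimal sketch of that loop in Python, with the model stubbed out (call_model stands in for a real API client, tools is a plain name-to-function registry; both names are assumptions for illustration):

```python
import json

def run_agent(call_model, tools, messages, max_turns=8):
    """Generic agent loop: call model, execute any requested tools, repeat."""
    for _ in range(max_turns):
        reply = call_model(messages)          # one model turn
        messages.append(reply)
        calls = reply.get("tool_calls")
        if not calls:                         # plain text: we're done
            return reply["content"]
        for call in calls:                    # could run concurrently
            fn = tools[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(fn(**args)),
            })
    raise RuntimeError("agent did not finish within max_turns")

# Stub model: first turn requests a tool, second turn answers in text.
def fake_model(messages):
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "2 + 2 = 4"}
    return {"role": "assistant", "content": None, "tool_calls": [{
        "id": "c1", "type": "function",
        "function": {"name": "add", "arguments": '{"a": 2, "b": 2}'},
    }]}

answer = run_agent(fake_model, {"add": lambda a, b: a + b},
                   [{"role": "user", "content": "What is 2 + 2?"}])
print(answer)  # 2 + 2 = 4
```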

Vendor flavors

Vendor          Field name                           Multi-call       Streaming
OpenAI          tool_calls                           Yes (parallel)   Yes
Anthropic       tool_use blocks                      Yes              Yes
Google Gemini   functionCall                         Yes              Yes
Mistral         tool_calls (OpenAI-compatible)       Yes              Yes
Llama 3.1+      via tool-token format (Llama-Stack)  Yes              Yes

Cross-vendor portability — MCP

Each vendor's tool-call format is slightly different, which has historically forced per-vendor adapter code. Anthropic's Model Context Protocol (November 2024) is the cross-vendor standard:

  • Tool servers expose tools via MCP
  • Clients (any LLM vendor) connect to MCP servers
  • One tool definition works across OpenAI, Anthropic, Gemini, Mistral, etc.

As of 2025, MCP adoption is high enough that new agent frameworks default to it. The vendor-specific tool formats remain but are increasingly thin wrappers over MCP-compatible tool definitions.

Common production patterns

Retrieval-then-cite

The model has a search_knowledge_base tool. When the user asks a question, the model calls the search tool, gets back relevant chunks, then composes an answer citing them. Compare with RAG vs VERITAS.

Verify-before-asserting

The model has a verify_claim tool. When the model is about to assert a factual claim, it calls verify first; only asserts confirmed claims. This is the AI agent grounding use case.
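A sketch of that gate, with verify_claim stubbed locally (a real deployment would route the call through the model's tool-use loop to a verification backend; the known-claims table here is a placeholder):

```python
# Stand-in for the real verify_claim tool.
def verify_claim(claim, min_confidence=0.85):
    known = {"The Transformer paper was published in 2017": 0.99}  # placeholder data
    conf = known.get(claim, 0.0)
    return {"verified": conf >= min_confidence, "confidence": conf}

# Gate: only assert claims the tool confirms; otherwise hedge.
def assert_or_hedge(claim):
    result = verify_claim(claim)
    if result["verified"]:
        return claim
    return f"Unverified (confidence {result['confidence']}): {claim}"

print(assert_or_hedge("The Transformer paper was published in 2017"))
```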

Code execution

The model has a run_python tool. For math, calculation, and data-manipulation queries, the model writes code and runs it instead of computing in its head.
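The tool side can be sketched in a few lines. Bare exec() is shown only to illustrate the shape of such a tool; real deployments sandbox execution (subprocess with timeouts, containers, or a remote sandbox):

```python
import contextlib
import io

def run_python(code: str) -> str:
    """Execute model-written Python and return its captured stdout.
    UNSAFE as-is: illustration only, no sandboxing."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # fresh globals per call
    return buf.getvalue()

# The code string is what the model would emit as the tool argument.
print(run_python("print(sum(i * i for i in range(10)))").strip())  # 285
```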

Multi-tool composition

The model has 10+ tools (search, calendar, email, database, calculator, etc.) and chains tool calls to accomplish complex tasks. OpenAI's Assistants API and Anthropic's computer use both target this shape.

Anti-patterns

  • Tool descriptions too vague. The model decides when to call a tool based on the description. Vague description = mis-called or under-called tool. Write descriptions like API docs.
  • Too many tools. 5-10 tools per agent is generally fine. 50+ tools degrades model performance — consider sub-agent decomposition or hierarchical tool routing.
  • Trusting model arguments unconditionally. Models hallucinate JSON. Validate arguments before executing — Pydantic, JSON Schema validators, Instructor, Pydantic AI.
  • No tool-call observability. Log every tool call + arguments + result. You'll need it when debugging why the agent did something weird.
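For the third anti-pattern, here is a minimal stdlib-only argument guard for the verify_claim schema from earlier (a sketch; production code would more typically reach for Pydantic or a JSON Schema validator):

```python
import json

def parse_verify_claim_args(raw: str) -> dict:
    """Validate model-emitted arguments before executing verify_claim."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"arguments are not valid JSON: {e}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    if not isinstance(args.get("claim"), str) or not args["claim"]:
        raise ValueError("'claim' must be a non-empty string")
    conf = args.setdefault("min_confidence", 0.85)  # schema default
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError("'min_confidence' must be a number in [0, 1]")
    return args

print(parse_verify_claim_args('{"claim": "MoE models route tokens to experts"}'))
```

Rejecting bad arguments with a clear error message also helps the model: feed the ValueError text back as the tool result and it will usually retry with corrected arguments.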