SourceScore

Concept · 2026-05-17

Function calling — how LLMs invoke tools

Function calling is how modern LLMs invoke external tools (APIs, databases, code execution). OpenAI launched it in June 2023; the pattern is now table stakes across every major vendor.

Definition

Function calling (also called "tool use" in Anthropic terminology) is an LLM capability where the model emits a structured tool invocation — function name + JSON-shape arguments — instead of free-text. The runtime executes the function and feeds the result back; the model continues with the result in context.

Concretely: you give the model a JSON schema of available tools. The model decides when to call which tool. It emits a message with a tool_calls field. You execute the function. You append the result as a tool message. You ask the model to continue. The model produces the final user-facing response (or calls another tool).
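That sequence can be sketched with OpenAI-style message dicts (field names follow OpenAI's Chat Completions format; the model call itself is stubbed, and the id, tool name, and result values here are illustrative):

```python
import json

# One tool-call round trip, shown as the message list the runtime maintains.
messages = [
    {"role": "user", "content": "Is 'GPT-4 has 1.8T parameters' a verified claim?"},
]

# 1. The model answers with a tool call instead of text (arguments is a JSON string).
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "verify_claim",
            "arguments": json.dumps({"claim": "GPT-4 has 1.8T parameters"}),
        },
    }],
}
messages.append(assistant_turn)

# 2. The runtime executes the function and appends the result as a tool message,
#    echoing tool_call_id so the model can match result to request.
result = {"verified": False, "confidence": 0.2}  # stand-in for real execution
messages.append({
    "role": "tool",
    "tool_call_id": "call_1",
    "content": json.dumps(result),
})

# 3. Send `messages` back to the model; it now answers in plain text.
print([m["role"] for m in messages])  # ['user', 'assistant', 'tool']
```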

Why it matters

Function calling closed the loop between the LLM and the deterministic world. Before it, you had to parse free-text responses to extract structured intent (fragile). With it, the model emits structured output that's directly machine-readable.

Every modern agent loop, RAG retriever, code-running assistant, and tool-augmented chatbot is built on function calling. It's the primitive.

A short history

  • 2023-06 — OpenAI introduces function calling in the Chat Completions API via the functions + function_call parameters. Initial release on GPT-3.5 Turbo and GPT-4.
  • 2023-11 — OpenAI renames the parameters to tools + tool_choice and adds parallel tool calling (the model can request multiple tool calls in a single turn).
  • 2023-11 — Anthropic adds tool use to Claude 2.1 (later refined for Claude 3 family).
  • 2024 — Google Gemini, Meta Llama 3.1+, Mistral, Cohere, and most open-weight frontier models add function calling.
  • 2024-11 — Anthropic releases Model Context Protocol (MCP) — open standard for tool exposure across vendors. Adopted by OpenAI and most agent frameworks within ~6 months.

The JSON schema

Function calling specifies tools via JSON Schema (OpenAI) or a similar structure. Example:

{
  "type": "function",
  "function": {
    "name": "verify_claim",
    "description": "Verify a natural-language factual claim about AI/ML.",
    "parameters": {
      "type": "object",
      "properties": {
        "claim": {
          "type": "string",
          "description": "Natural-language claim to verify"
        },
        "min_confidence": {
          "type": "number",
          "description": "Minimum confidence threshold (0.0-1.0)",
          "default": 0.85
        }
      },
      "required": ["claim"]
    }
  }
}

The model sees this schema, decides when to call the function, fills in the arguments per the schema, and emits a structured tool_calls block.
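For the verify_claim schema above, an emitted tool call might look like the following (OpenAI-style shape, hypothetical values). Note that arguments arrives as a JSON string that must be parsed, and optional fields like min_confidence may be omitted, so the runtime applies schema defaults itself:

```python
import json

# Hypothetical tool_calls entry as the model might emit it.
tool_call = {
    "id": "call_abc",
    "type": "function",
    "function": {
        "name": "verify_claim",
        "arguments": '{"claim": "Transformers were introduced in 2017"}',
    },
}

# Parse the JSON-string arguments into a dict.
args = json.loads(tool_call["function"]["arguments"])

# The model omitted the optional field; apply the schema default runtime-side.
args.setdefault("min_confidence", 0.85)

print(args)  # {'claim': 'Transformers were introduced in 2017', 'min_confidence': 0.85}
```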

The agent loop

  1. Send messages + tool schemas to the model
  2. If model emits tool_calls: execute each tool, append results as tool messages, loop back to step 1
  3. If model emits a final text message: done

Parallel tool calls let the runtime execute multiple tools concurrently before looping back. Streaming lets the client receive a tool call incrementally (name first, then arguments token by token) instead of waiting for the model's full turn.
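A minimal sketch of that loop in Python, with the model stubbed out (call_model stands in for a real API client, tools is a plain name-to-function registry; both names are assumptions for illustration):

```python
import json

def run_agent(call_model, tools, messages, max_turns=8):
    """Generic agent loop: call model, execute any requested tools, repeat."""
    for _ in range(max_turns):
        reply = call_model(messages)          # one model turn
        messages.append(reply)
        calls = reply.get("tool_calls")
        if not calls:                         # plain text: we're done
            return reply["content"]
        for call in calls:                    # could run concurrently
            fn = tools[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(fn(**args)),
            })
    raise RuntimeError("agent did not finish within max_turns")

# Stub model: first turn requests a tool, second turn answers in text.
def fake_model(messages):
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "2 + 2 = 4"}
    return {"role": "assistant", "content": None, "tool_calls": [{
        "id": "c1", "type": "function",
        "function": {"name": "add", "arguments": '{"a": 2, "b": 2}'},
    }]}

answer = run_agent(fake_model, {"add": lambda a, b: a + b},
                   [{"role": "user", "content": "What is 2 + 2?"}])
print(answer)  # 2 + 2 = 4
```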

Vendor flavors

Vendor          Field name                           Multi-call       Streaming
OpenAI          tool_calls                           Yes (parallel)   Yes
Anthropic       tool_use blocks                      Yes              Yes
Google Gemini   functionCall                         Yes              Yes
Mistral         tool_calls (OpenAI-compatible)       Yes              Yes
Llama 3.1+      via tool-token format (Llama-Stack)  Yes              Yes

Cross-vendor portability — MCP

Each vendor's tool-call format is slightly different, which has historically forced per-vendor adapter code. Anthropic's Model Context Protocol (November 2024) is the cross-vendor standard:

  • Tool servers expose tools via MCP
  • Clients (any LLM vendor) connect to MCP servers
  • One tool definition works across OpenAI, Anthropic, Gemini, Mistral, etc.

As of 2025, MCP adoption is high enough that new agent frameworks default to it. The vendor-specific tool formats remain but are increasingly thin wrappers over MCP-compatible tool definitions.

Common production patterns

Retrieval-then-cite

The model has a search_knowledge_base tool. When the user asks a question, the model calls the search tool, gets back relevant chunks, then composes an answer citing them. Compare with RAG vs VERITAS.

Verify-before-asserting

The model has a verify_claim tool. When the model is about to assert a factual claim, it calls verify first; only asserts confirmed claims. This is the AI agent grounding use case.
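A sketch of that gate, with verify_claim stubbed locally (a real deployment would route the call through the model's tool-use loop to a verification backend; the known-claims table here is a placeholder):

```python
# Stand-in for the real verify_claim tool.
def verify_claim(claim, min_confidence=0.85):
    known = {"The Transformer paper was published in 2017": 0.99}  # placeholder data
    conf = known.get(claim, 0.0)
    return {"verified": conf >= min_confidence, "confidence": conf}

# Gate: only assert claims the tool confirms; otherwise hedge.
def assert_or_hedge(claim):
    result = verify_claim(claim)
    if result["verified"]:
        return claim
    return f"Unverified (confidence {result['confidence']}): {claim}"

print(assert_or_hedge("The Transformer paper was published in 2017"))
```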

Code execution

The model has a run_python tool. For math, calculation, and data-manipulation queries, the model writes code and runs it instead of computing in its head.
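The tool side can be sketched in a few lines. Bare exec() is shown only to illustrate the shape of such a tool; real deployments sandbox execution (subprocess with timeouts, containers, or a remote sandbox):

```python
import contextlib
import io

def run_python(code: str) -> str:
    """Execute model-written Python and return its captured stdout.
    UNSAFE as-is: illustration only, no sandboxing."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # fresh globals per call
    return buf.getvalue()

# The code string is what the model would emit as the tool argument.
print(run_python("print(sum(i * i for i in range(10)))").strip())  # 285
```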

Multi-tool composition

The model has 10+ tools (search, calendar, email, database, calculator, etc.) and chains tool calls to accomplish complex tasks. OpenAI's Assistants API and Anthropic's computer use both target this shape.

Anti-patterns

  • Tool descriptions too vague. The model decides when to call a tool based on the description. Vague description = mis-called or under-called tool. Write descriptions like API docs.
  • Too many tools. 5-10 tools per agent is generally fine. 50+ tools degrades model performance — consider sub-agent decomposition or hierarchical tool routing.
  • Trusting model arguments unconditionally. Models hallucinate JSON. Validate arguments before executing — Pydantic, JSON Schema validators, Instructor, Pydantic AI.
  • No tool-call observability. Log every tool call + arguments + result. You'll need it when debugging why the agent did something weird.
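For the third anti-pattern, here is a minimal stdlib-only argument guard for the verify_claim schema from earlier (a sketch; production code would more typically reach for Pydantic or a JSON Schema validator):

```python
import json

def parse_verify_claim_args(raw: str) -> dict:
    """Validate model-emitted arguments before executing verify_claim."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"arguments are not valid JSON: {e}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    if not isinstance(args.get("claim"), str) or not args["claim"]:
        raise ValueError("'claim' must be a non-empty string")
    conf = args.setdefault("min_confidence", 0.85)  # schema default
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError("'min_confidence' must be a number in [0, 1]")
    return args

print(parse_verify_claim_args('{"claim": "MoE models route tokens to experts"}'))
```

Rejecting bad arguments with a clear error message also helps the model: feed the ValueError text back as the tool result and it will usually retry with corrected arguments.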