Source URL: https://www.docker.com/blog/local-llm-tool-calling-a-practical-evaluation/
Source: Docker
Title: Tool Calling with Local LLMs: A Practical Evaluation
Feedly Summary: Which local model should I use for tool calling? When building GenAI and agentic applications, one of the most pressing and persistent questions is: “Which local model should I use for tool calling?” We kept hearing again and again, from colleagues within Docker and the developer community, ever since we started working on Docker Model…
AI Summary and Description: Yes
Summary: The text discusses the evaluation of various local LLM models when implementing tool-calling in generative AI applications. It highlights the challenges faced during manual testing and introduces a framework for scalable testing, yielding insights into the accuracy and latency of different models in handling tool calls.
Detailed Description: The document presents an in-depth exploration of the selection and performance evaluation of local LLM models in the context of tool calling for generative AI applications. Key points include:
- **Local Inference Challenge**: The main question addressed is: “Which local model should I use for tool calling?” The authors emphasize the importance of local models for control, cost-efficiency, and privacy.
- **Initial Manual Testing**:
  - A practical application, “chat2cart,” was tested with both hosted and local models to assess performance.
  - Challenges with local models included (illustrated by the sketch below):
    - Eager invocation of tools for trivial exchanges.
    - Incorrect tool selection for basic actions.
    - Missing or malformed parameters.
    - Ignored tool responses, leading to incomplete conversations.
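
  To make those failure modes concrete, here is a minimal sketch of the kinds of tool-call payloads involved. The `add_to_cart` and `search_products` tools and their parameters are hypothetical illustrations, not taken from the chat2cart application:

  ```python
  # Hypothetical OpenAI-style tool-call payloads illustrating the failure modes above.
  # Tool names and parameters are assumptions for illustration only.

  well_formed = {
      "name": "add_to_cart",                          # correct tool for "add milk to my cart"
      "arguments": '{"item": "milk", "quantity": 1}',
  }

  missing_parameter = {
      "name": "add_to_cart",
      "arguments": '{"item": "milk"}',                # required "quantity" field omitted
  }

  malformed_arguments = {
      "name": "add_to_cart",
      "arguments": '{"item": "milk", "quantity": }',  # invalid JSON the application must reject
  }

  wrong_tool = {
      "name": "search_products",                      # user asked to add an item, not search for one
      "arguments": '{"query": "milk"}',
  }
  ```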
- **Iterative Improvement**: The authors moved from manual testing to a scalable testing framework called “model-test” (a minimal test-case sketch follows this item).
  - The framework simulates realistic conversations and measures performance on tool-calling accuracy, tool selection, and latency.
  - It incorporates multiple test cases and allows custom suite creation to evaluate diverse scenarios.
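
  The post does not reproduce model-test’s internals here, so the following is only a sketch, assuming each test case pairs a simulated conversation with the tool calls the model is expected to make (including cases where no tool should be invoked):

  ```python
  from dataclasses import dataclass, field

  @dataclass
  class ExpectedToolCall:
      name: str        # tool the model should select
      arguments: dict  # parameters it should supply

  @dataclass
  class ToolCallTestCase:
      """One simulated conversation with its expected tool-calling behaviour."""
      messages: list                                        # chat history fed to the model
      expected_calls: list = field(default_factory=list)    # empty => no tool should be invoked

  # A case where a tool call is required ...
  needs_tool = ToolCallTestCase(
      messages=[{"role": "user", "content": "Add two cartons of milk to my cart"}],
      expected_calls=[ExpectedToolCall("add_to_cart", {"item": "milk", "quantity": 2})],
  )

  # ... and a trivial exchange where invoking any tool counts as a failure.
  small_talk = ToolCallTestCase(
      messages=[{"role": "user", "content": "Thanks, that's all for now!"}],
      expected_calls=[],
  )
  ```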
- **Performance Metrics** (an F1 sketch follows this list):
  - Final outputs were measured using the F1 score across three dimensions:
    - Tool Invocation: whether a tool was deemed necessary.
    - Tool Selection: correctness of the chosen tools.
    - Parameter Accuracy: correctness of tool-call parameters.
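
  As a rough illustration of the metric (the article does not spell out model-test’s exact scoring rules, so this is only a sketch), an F1 score for the tool-invocation dimension can be computed from per-turn binary decisions; the same function applies to tool selection and parameter accuracy once each turn is marked correct or incorrect:

  ```python
  def f1_score(predictions: list[bool], expectations: list[bool]) -> float:
      """F1 over binary decisions, e.g. 'did the model invoke a tool when it should have?'"""
      tp = sum(p and e for p, e in zip(predictions, expectations))       # correct invocations
      fp = sum(p and not e for p, e in zip(predictions, expectations))   # spurious invocations
      fn = sum(not p and e for p, e in zip(predictions, expectations))   # missed invocations
      if tp == 0:
          return 0.0
      precision = tp / (tp + fp)
      recall = tp / (tp + fn)
      return 2 * precision * recall / (precision + recall)

  # Example: the model invoked a tool on 4 turns, 3 of which actually required one,
  # and missed 1 turn that did require a tool.
  print(f1_score(
      predictions=[True, True, True, True, False],
      expectations=[True, True, True, False, True],
  ))  # 0.75
  ```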
- **Testing Outcomes**:
  - Tested 21 models across 3,570 test cases, utilizing a robust hardware setup.
  - Rankings revealed that OpenAI’s GPT-4 led in tool-calling accuracy, closely followed by Qwen 3 (14B), with significant differences in latency noted.
- **Insights on Underperformance**: Some models, particularly quantized variants, showed poor tool-calling performance, indicating that while they may perform well in other contexts, they falter under structured testing.
- **Recommendations** (a usage sketch follows this list):
  - Suggested models for high accuracy include Qwen 3 in its 14B and 8B variants.
  - Highlighted trade-offs between speed and accuracy for various applications, aiding developers in model selection.
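
  For context on how such a model would be exercised, here is a minimal sketch of sending a tool-calling request to a locally served model through an OpenAI-compatible endpoint. The `base_url`, the `ai/qwen3` model tag, and the `add_to_cart` tool are placeholder assumptions, not values taken from the article; substitute whatever your local runtime (e.g., Docker Model Runner) exposes:

  ```python
  from openai import OpenAI

  # Local OpenAI-compatible endpoint; URL is an assumption for this sketch.
  client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

  tools = [{
      "type": "function",
      "function": {
          "name": "add_to_cart",
          "description": "Add an item to the user's shopping cart",
          "parameters": {
              "type": "object",
              "properties": {
                  "item": {"type": "string"},
                  "quantity": {"type": "integer"},
              },
              "required": ["item", "quantity"],
          },
      },
  }]

  response = client.chat.completions.create(
      model="ai/qwen3",  # placeholder tag; use the model you actually pulled locally
      messages=[{"role": "user", "content": "Add two cartons of milk to my cart"}],
      tools=tools,
  )

  # A well-behaved model returns a structured tool call here rather than a plain text reply.
  print(response.choices[0].message.tool_calls)
  ```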
- **Implications for the Industry**: The findings underscore the crucial role of effective tool calling in real-world generative AI applications. The introduction of a systematic testing framework enables developers to make informed decisions on model selection, thereby enhancing the efficacy of agentic workflows.
This text serves as a valuable resource for professionals in AI and cloud security, offering insights into model performance and the importance of adequate testing in ensuring responsive and accurate tool use in AI applications.