Source URL: https://www.docker.com/blog/evaluate-models-and-mcp-with-promptfoo-docker/
Source: Docker
Title: Run, Test, and Evaluate Models and MCP Locally with Docker + Promptfoo
Feedly Summary: Promptfoo is an open-source CLI and library for evaluating LLM apps. Docker Model Runner makes it easy to manage, run, and deploy AI models using Docker. The Docker MCP Toolkit is a local gateway that lets you set up, manage, and run containerized MCP servers and connect them to AI agents. Together, these tools let…
AI Summary and Description: Yes
Summary: The text introduces Promptfoo, a tool designed for evaluating large language model (LLM) applications, alongside Docker Model Runner and Docker MCP Toolkit, which facilitate managing and deploying AI models. This combination allows for the assessment and red-teaming of LLMs, enhancing the security and compliance posture of AI applications by comparing model outputs and testing for vulnerabilities in a streamlined manner.
Detailed Description:
The provided text details a suite of tools—Promptfoo, Docker Model Runner, and Docker MCP Toolkit—designed to help developers and security professionals evaluate and manage LLM applications effectively. Here are the key points:
- **Promptfoo Overview**:
  - An open-source command-line interface (CLI) and library for evaluating LLM applications.
  - Supports assessing both local and cloud models, helping teams balance efficiency and cost when deploying AI solutions.
- **Docker Model Runner**:
  - Simplifies the management, execution, and deployment of AI models through Docker.
  - Lets users pull models easily and integrate them into their testing workflows.
- **Docker MCP Toolkit**:
  - A local gateway for setting up, managing, and running containerized MCP servers and connecting them to AI agents.
  - Backed by a centralized registry (the Docker MCP Catalog) for discovering and sharing Model Context Protocol (MCP) servers.
- **Evaluation Capabilities**:
  - Users can compare model performance across metrics to assess whether local models meet production needs without incurring cloud token costs.
  - Integration with Promptfoo streamlines red-teaming and testing for security flaws (e.g., authentication and authorization weaknesses).
- **Security Assessments**:
  - The text outlines methodologies for red-teaming AI applications, evaluating them for privacy, safety, and operational integrity.
  - Direct testing of MCP tools validates their functionality and security, helping defend against vulnerabilities.
- **Practical Workflow**:
  - The text includes command-line examples for setting up the tools, pulling models, running evaluations, and viewing results, giving professionals a hands-on guide to executing these processes themselves.
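The pull-evaluate-view loop described above can be sketched with the Promptfoo and Docker Model Runner CLIs. This is a minimal, environment-dependent sketch (it requires Docker Desktop with Model Runner enabled and Node.js); the model name `ai/llama3.2` is an illustrative choice, not taken from the article:

```shell
# Pull a local model with Docker Model Runner (model name is an example)
docker model pull ai/llama3.2

# Scaffold a Promptfoo project; this creates promptfooconfig.yaml
npx promptfoo@latest init

# Run the evaluation defined in promptfooconfig.yaml, then browse results
npx promptfoo@latest eval
npx promptfoo@latest view
```

Running `eval` repeatedly after editing the config is the typical iteration loop; `view` opens a local web UI for comparing outputs side by side.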
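To compare a local Model Runner model against a cloud model, the Promptfoo config can point one provider at Model Runner's OpenAI-compatible API. A sketch of such a `promptfooconfig.yaml` follows; the endpoint URL, port, and model names are assumptions about a typical setup, so check them against your own environment:

```yaml
# promptfooconfig.yaml (sketch; endpoint and model names are assumptions)
prompts:
  - "Summarize this text: {{input}}"

providers:
  # Local model served via Docker Model Runner's OpenAI-compatible API
  - id: openai:chat:ai/llama3.2
    config:
      apiBaseUrl: http://localhost:12434/engines/v1
      apiKey: not-needed   # local endpoint; no real key required
  # Cloud model for comparison (requires OPENAI_API_KEY in the environment)
  - id: openai:chat:gpt-4o-mini

tests:
  - vars:
      input: "Docker Model Runner runs AI models locally."
    assert:
      - type: contains
        value: "Docker"
```

Each provider runs the same prompts and assertions, which is what makes the local-vs-cloud cost/quality comparison possible.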
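The red-teaming step can be sketched the same way. Plugin selection (e.g., prompt injection, PII leakage) happens interactively, and the commands below assume a current Promptfoo release and a target already defined in your config:

```shell
# Generate a red-team configuration interactively
npx promptfoo@latest redteam init

# Generate adversarial test cases and run them against the target
npx promptfoo@latest redteam run

# Review the findings in the local report UI
npx promptfoo@latest redteam report
```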
Overall, the walkthrough structure of the text not only informs but also equips AI and cybersecurity professionals to improve the security and efficacy of their applications. The integration of these tools underscores the importance of proactive testing and evaluation in maintaining compliance and securing AI systems against emerging threats.