ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation

1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
2 Fuxi AI Lab, Netease Inc.
* Equal Contribution Project Leader Corresponding Author
Image0

ComfyGPT's workflow generation across various task instructions. By leveraging its strong alignment capabilities, ComfyGPT enables users to generate diverse workflows in response to task instructions, supporting various visual tasks.

Abstract

ComfyUI provides a widely adopted, workflow-based interface that lets users customize image generation tasks through an intuitive node-based architecture. However, the intricate connections between nodes and the diversity of modules often present a steep learning curve. In this paper, we introduce ComfyGPT, the first self-optimizing multi-agent system designed to automatically generate ComfyUI workflows from task descriptions. ComfyGPT comprises four specialized agents: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent. Its core innovation lies in two key aspects. First, it generates individual node links rather than entire workflows, significantly improving generation precision. Second, we propose FlowAgent, an LLM-based workflow generation agent that uses both supervised fine-tuning (SFT) and reinforcement learning (RL) to improve workflow generation accuracy. Moreover, we introduce FlowDataset, a large-scale dataset containing 13,571 workflow-description pairs, and FlowBench, a comprehensive benchmark for evaluating workflow generation systems. We also propose four novel evaluation metrics: Format Validation (FV), Pass Accuracy (PA), Pass Instruct Alignment (PIA), and Pass Node Diversity (PND). Experimental results demonstrate that ComfyGPT significantly outperforms existing LLM-based methods in workflow generation.

Method

Image0

Overview of the ComfyGPT pipeline for automated ComfyUI workflow generation. Given a user instruction, ComfyGPT sequentially executes four specialized agents to construct and refine workflows for diverse visual tasks. ReformatAgent evaluates whether user queries require conversion during few-shot learning or FlowAgent training. FlowAgent, an LLM-based model trained with supervised fine-tuning (SFT) and optimized via GRPO, generates workflows and autonomously corrects errors. RefineAgent enhances workflow quality by integrating LLMs with knowledge retrieval for validation and topological consistency. Finally, ExecuteAgent converts the optimized workflow into a ComfyUI-compatible JSON format and executes it within the ComfyUI environment.
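
The caption mentions that RefineAgent checks topological consistency but does not say how. One plausible reading, offered here only as an assumption, is that the node graph must be acyclic so that ComfyUI can execute nodes in dependency order. The sketch below tests that property with Python's standard `graphlib` module; the function name `execution_order` and the `(source, target)` link layout are hypothetical and not taken from ComfyGPT.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(links):
    """Return nodes in a valid execution order, or None if the graph has a cycle.

    `links` is an iterable of (source_node, target_node) pairs. This layout is
    an illustrative assumption, not ComfyGPT's actual link format.
    """
    # Map every node to the set of nodes that must run before it.
    predecessors = {}
    for source, target in links:
        predecessors.setdefault(source, set())
        predecessors.setdefault(target, set()).add(source)
    try:
        # static_order() raises CycleError lazily, so consume it inside the try.
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None


valid = [("loader", "encoder"), ("encoder", "sampler"), ("loader", "sampler")]
cyclic = [("a", "b"), ("b", "a")]

print(execution_order(valid))   # one valid order, loader first and sampler last
print(execution_order(cyclic))  # None
```

A workflow that fails this check cannot be scheduled at all, so rejecting it before execution is cheaper than discovering the problem inside the ComfyUI runtime.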

Image0

Illustration of different representations of ComfyUI workflows. Instead of generating an entire ComfyUI workflow in JSON format, we introduce a new workflow diagram representation and generate the individual links between processing nodes (C).
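
To make the link-based idea concrete, the following minimal sketch expands a flat list of links into a nested dictionary shaped like ComfyUI's API-format workflow JSON, where each node has a `class_type` and an `inputs` mapping and a linked input is written as `[source_node_id, output_index]`. The `Link` tuple and the `links_to_prompt` helper are hypothetical names chosen for illustration; the exact link encoding ComfyGPT uses is not given here, and widget values such as prompts or seeds are omitted.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One edge of the workflow graph (hypothetical layout, for illustration)."""
    src_id: str      # id of the node producing the value
    src_class: str   # node type of the producer, e.g. "CheckpointLoaderSimple"
    src_slot: int    # index of the producer's output
    dst_id: str      # id of the node consuming the value
    dst_class: str   # node type of the consumer
    dst_input: str   # name of the consumer's input


def links_to_prompt(links):
    """Expand a list of links into a ComfyUI API-style workflow dictionary."""
    prompt = {}
    for link in links:
        # Register both endpoints the first time they are seen.
        prompt.setdefault(link.src_id, {"class_type": link.src_class, "inputs": {}})
        target = prompt.setdefault(link.dst_id, {"class_type": link.dst_class, "inputs": {}})
        # A linked input is encoded as [source node id, output index].
        target["inputs"][link.dst_input] = [link.src_id, link.src_slot]
    return prompt


links = [
    Link("1", "CheckpointLoaderSimple", 1, "2", "CLIPTextEncode", "clip"),
    Link("1", "CheckpointLoaderSimple", 0, "3", "KSampler", "model"),
    Link("2", "CLIPTextEncode", 0, "3", "KSampler", "positive"),
]
prompt = links_to_prompt(links)
print(prompt["3"]["inputs"])
```

Each link is a short, self-contained unit, which is one way to see why generating links can be less error-prone than emitting a deeply nested JSON document in a single pass.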

FlowDataset and FlowBench

To train FlowAgent, we develop a large dataset called FlowDataset, which contains 13,571 workflows and corresponding instructions, organized into six core categories and six subcategories, making it the most comprehensive workflow dataset available.

Image0

Illustration of the categories included in FlowDataset and FlowBench. The left figure shows the proportion of each of the six categories, while the right figure shows the subcategories.

To better evaluate ComfyGPT, we partition 1,000 samples from FlowDataset to create a test set and develop FlowBench. Compared with other benchmarks, FlowBench covers a wider range of categories and contains more data. Beyond evaluating ComfyGPT, FlowBench can also serve as a task benchmark for evaluating LLMs.
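
The partitioning procedure is not described in detail here. As a minimal sketch, assuming a uniformly random, seeded hold-out in which the 1,000 test samples are removed from the training pool, the split could look like the following. The helper name `split_test_set`, the seed, and the placeholder data are all hypothetical; a stratified split by task category would be an equally plausible design.

```python
import random


def split_test_set(pairs, test_size=1000, seed=0):
    """Hold out `test_size` randomly chosen samples; return (train, test).

    Random sampling is an assumption: the source only states that 1,000
    samples are partitioned from FlowDataset.
    """
    if test_size > len(pairs):
        raise ValueError("test_size exceeds dataset size")
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    test_indices = set(indices[:test_size])
    train = [p for i, p in enumerate(pairs) if i not in test_indices]
    test = [p for i, p in enumerate(pairs) if i in test_indices]
    return train, test


# Placeholder stand-ins for the 13,571 workflow-description pairs.
pairs = [(f"instruction {i}", f"workflow {i}") for i in range(13571)]
train, test = split_test_set(pairs)
print(len(train), len(test))  # 12571 1000
```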

Image0

Illustration of length (left) and task categories (right) distribution in FlowBench. The length is calculated by the number of nodes contained in each workflow.
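
The length measure in this caption is simply a node count per workflow. The sketch below computes it and bins the results into a histogram, assuming each workflow is stored as a dictionary keyed by node id (as in ComfyUI's API-format JSON). The storage format, the function names, and the bin width of 5 are illustrative assumptions, not values from the figure.

```python
from collections import Counter


def workflow_length(workflow):
    """Length of a workflow = number of nodes it contains."""
    return len(workflow)


def length_histogram(workflows, bin_width=5):
    """Count workflows per length bin; keys are inclusive (low, high) ranges."""
    counts = Counter()
    for workflow in workflows:
        n = workflow_length(workflow)
        low = (n // bin_width) * bin_width
        counts[(low, low + bin_width - 1)] += 1
    return dict(sorted(counts.items()))


# Toy workflows with 3, 4, and 7 nodes; node contents are irrelevant here.
toy = [
    {str(i): {"class_type": "Node", "inputs": {}} for i in range(n)}
    for n in (3, 4, 7)
]
print(length_histogram(toy))  # {(0, 4): 2, (5, 9): 1}
```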

Image0

Comparison of task categories between our proposed dataset and others in text-to-image generation (T2I), image editing (IE), style transfer (ST), 3D generation (3DG), video generation (VG) and others (O). The Dataset Scale column quantifies the number of instructions, workflows, and unique node types. ComfyGPT achieves the most extensive task support and dataset coverage across both the training set and benchmark.

Workflow Generated by ComfyGPT

Visualization Result

Image0
Image0
Image0

Quantitative Result

Image0

Quantitative results of ComfyGPT and few-shot learning across different baselines. The evaluation is conducted on FlowBench and focuses on four metrics: Format Validation (FV), Pass Accuracy (PA), Pass Instruct Alignment (PIA), and Pass Node Diversity (PND). Here we compare our standard ComfyGPT pipeline with few-shot learning across various baselines. ComfyGPT achieves substantial improvements over these baselines on all metrics.

Image0

Quantitative comparison of ComfyGPT and ComfyAgent on FlowBench and ComfyBench.

BibTeX

@misc{huang2025comfygptselfoptimizingmultiagentcomprehensive,
  title={ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation},
  author={Oucheng Huang and Yuhang Ma and Zeng Zhao and Mingrui Wu and Jiayi Ji and Rongsheng Zhang and Zhipeng Hu and Xiaoshuai Sun and Rongrong Ji},
  year={2025},
  eprint={2503.17671},
  archivePrefix={arXiv},
  primaryClass={cs.MA},
  url={https://arxiv.org/abs/2503.17671},
}