How to Install Sage Attention in ComfyUI: A Complete Setup Guide
ComfyUI has become one of the most flexible frontends for running Stable Diffusion and related image generation models locally. Sage Attention is a memory-efficient attention mechanism that can significantly speed up inference — but installing it alongside ComfyUI requires a few deliberate steps that differ depending on your hardware and environment. Here's what you need to know.
What Is Sage Attention?
Sage Attention is an optimized attention kernel designed to reduce VRAM usage and improve throughput during diffusion model inference. It works by replacing the standard scaled dot-product attention with a more efficient implementation — particularly useful when running large models like FLUX, SD3, or Wan 2.1 on consumer GPUs.
In practical terms, Sage Attention can:
- Reduce peak VRAM consumption during generation
- Speed up steps-per-second on compatible hardware
- Allow larger batch sizes or higher resolutions on mid-range GPUs
It's not a ComfyUI plugin in the traditional sense — it's a Python library that ComfyUI nodes and workflows call upon when the option is enabled in supported custom nodes.
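To make the "replacing scaled dot-product attention" idea concrete, here is a toy, pure-Python version of the operation Sage Attention accelerates. This is an illustration of the math only, not Sage Attention's actual fused, quantized CUDA kernel:

```python
import math

# Toy scaled dot-product attention on plain lists. This is the exact
# computation (QK^T / sqrt(d), softmax, weighted sum of V) that Sage
# Attention replaces with a fused, lower-precision GPU kernel.
def sdpa(q, k, v):
    d = len(q[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d) for kr in k]
        for qr in q
    ]
    out = []
    for row in scores:
        m = max(row)                       # subtract max for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))]
        )
    return out

# The query matches the first key, so the output leans toward v[0]
result = sdpa([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(result)
```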
Prerequisites Before You Begin
Before installing Sage Attention, confirm your environment meets the baseline requirements:
| Requirement | Details |
|---|---|
| GPU | NVIDIA GPU with CUDA support (Ampere architecture or newer recommended) |
| CUDA Version | CUDA 11.8 or 12.x |
| Python Version | Python 3.10 or 3.11 |
| ComfyUI | Up-to-date installation (portable or manual) |
| PyTorch | Version aligned with your CUDA build |
AMD GPUs and CPU-only setups are generally not compatible with Sage Attention as of current releases, since the library relies on CUDA-specific Triton kernels.
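If PyTorch is already installed, a quick check can tell you whether your GPU meets the Ampere-or-newer recommendation (compute capability 8.0+). This is a generic diagnostic sketch, not part of the Sage Attention installer:

```python
# Report whether the visible GPU is Ampere or newer (compute capability >= 8.0).
try:
    import torch
except ImportError:
    torch = None

if torch is None:
    status = "PyTorch not installed in this environment"
elif not torch.cuda.is_available():
    status = "No CUDA device visible"
else:
    major, minor = torch.cuda.get_device_capability(0)
    arch = "Ampere or newer" if major >= 8 else "pre-Ampere: gains not guaranteed"
    status = f"compute capability {major}.{minor} ({arch})"

print(status)
```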
Step 1 — Set Up Your Python Environment
If you're using the ComfyUI portable package, it ships with its own embedded Python interpreter. If you installed ComfyUI manually in a virtual environment (venv or conda), you'll work within that environment instead.
For portable ComfyUI users, run pip through the embedded interpreter in the python_embeded folder (e.g. `python_embeded\python.exe -m pip install <package>`), so packages land in the environment ComfyUI actually uses.
For manual/venv users, activate your environment first:
```
# Example for venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
```
Step 2 — Install Triton
Sage Attention depends on Triton, a GPU kernel language. On Linux, Triton installs cleanly via pip. On Windows, it requires a pre-built wheel because native compilation isn't straightforward.
Linux:
```
pip install triton
```
Windows: Download a compatible Triton wheel from a community repository (such as the triton-windows releases on GitHub) and install it manually:
```
pip install triton_windows-<version>-cp311-cp311-win_amd64.whl
```
Match the wheel version to your Python version (cp310 for Python 3.10, cp311 for Python 3.11) and your CUDA version.
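If you're unsure which cpXX tag your interpreter needs, this one-liner-style check reports it (run it with the same Python that ComfyUI uses):

```python
import sys

# The wheel's ABI tag must match this interpreter: cp310 for 3.10, cp311 for 3.11
tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(f"Wheel tag for this interpreter: {tag}")
```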
Step 3 — Install SageAttention
With Triton in place, install Sage Attention directly from pip:
```
pip install sageattention
```
Alternatively, for the latest development version:
```
pip install git+https://github.com/thu-ml/SageAttention.git
```
The build process compiles CUDA extensions and may take several minutes. If it fails, the most common causes are a CUDA version mismatch, missing build tools (like Visual Studio on Windows), or an incompatible PyTorch version.
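Before moving on, it's worth confirming that both packages are importable from the interpreter ComfyUI uses. A minimal check (importing the actual kernels requires a CUDA GPU, so this only verifies the packages are present):

```python
import importlib.util

# Check that triton and sageattention are resolvable from this interpreter
missing = [m for m in ("triton", "sageattention") if importlib.util.find_spec(m) is None]
if missing:
    print("Missing from this environment:", ", ".join(missing))
else:
    print("triton and sageattention are both installed")
```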
Step 4 — Enable Sage Attention in ComfyUI
Installing the library alone doesn't activate it. You need to tell ComfyUI to use it.
Via command-line argument:
Launch ComfyUI with the --use-sage-attention flag:
```
python main.py --use-sage-attention
```
Or for the portable version, edit your launch batch file to include --use-sage-attention.
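On current ComfyUI builds the launch flag is spelled --use-sage-attention. For a portable install, the edited batch file might look like this (a sketch assuming the stock run_nvidia_gpu.bat layout; adjust paths to your install):

```shell
.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --use-sage-attention
pause
```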
Via supported custom nodes:
Some custom nodes — particularly those built for FLUX or video diffusion pipelines — expose a dropdown or toggle to select the attention mode. In these cases, you select sage or sage_fast from within the node's parameters directly in the workflow.
The sage_fast variant trades a small amount of numerical precision for additional speed. For most generative image tasks, the difference in output quality is negligible.
Common Installation Issues
"No module named 'sageattention'" — The library installed into a different Python environment than ComfyUI is using. Verify you're installing into the correct interpreter.
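To diagnose this mismatch, run the following with the same Python that launches ComfyUI and compare the paths against where pip reported installing the package:

```python
import sys
import sysconfig

# Where this interpreter lives, and where its packages are installed.
# Both must belong to the environment that actually launches ComfyUI.
print("Interpreter:  ", sys.executable)
print("site-packages:", sysconfig.get_paths()["purelib"])
```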
CUDA kernel compilation errors on Windows — Usually caused by missing or mismatched Visual Studio Build Tools. Install the C++ build tools from the Visual Studio installer and ensure your CUDA toolkit version matches your PyTorch build.
Triton import errors — On Windows especially, using a pre-built wheel rather than compiling from source resolves most import failures.
Slower generation after enabling — On older GPU architectures (pre-Ampere, like GTX 10-series or RTX 20-series), the Triton kernels may not be optimally compiled, resulting in no improvement or a slight regression.
What Affects Your Results
The actual benefit you see from Sage Attention varies considerably based on several factors:
- GPU architecture — Ampere (RTX 30-series) and Ada Lovelace (RTX 40-series) GPUs see the most consistent gains
- Model type — Transformer-based diffusion models (FLUX, SD3, Wan) benefit more than older UNet architectures
- Resolution and batch size — Gains become more pronounced at higher resolutions where attention computation dominates
- Operating system — Linux environments generally have a smoother installation path and better Triton support than Windows
- VRAM capacity — On GPUs with 8GB or less, VRAM savings may matter more than raw speed
Someone running FLUX.1 on an RTX 4090 under Linux will have a meaningfully different experience than someone running SD 1.5 on a Windows machine with an RTX 3060. The installation steps are the same — but whether the effort is worthwhile depends entirely on the specifics of your setup and what you're generating.