Kolors is the open-source text-to-image diffusion model from Kuaishou Technology, built on a latent diffusion architecture with a ChatGLM text encoder for both Chinese and English prompts. This article breaks down how the model works under the hood, what separates it from SDXL and other frontier models, and why it produces such sharp, photorealistic results across portraits, product imagery, and bilingual use cases.
Kolors arrived in 2024 as one of the more technically interesting releases from China's AI research scene. Built by Kuaishou Technology, the company behind the short-video platform Kuaishou, this text-to-image model went open-source almost immediately and earned serious attention from researchers and developers who had grown accustomed to Western-dominated model releases.
What made people stop and look was not just the image quality, which is genuinely impressive, but the architectural decisions underneath. Kolors uses a transformer-based text encoder borrowed from ChatGLM rather than the CLIP encoder that most Western models rely on. That single choice has significant consequences for how prompts are interpreted and how well the model handles Chinese-language input across the full range of commercial and creative applications.
This article covers how Kolors works technically, what it produces, how it compares to other frontier models, and where it fits in today's rapidly shifting landscape of open-source image generation.
What Kolors Actually Is
Kolors is a large-scale text-to-image diffusion model developed by Kuaishou's Kolors team and released publicly in mid-2024. The Diffusers-format checkpoint is published as Kolors-diffusers on Hugging Face, with weights anyone can download and run. It is built on a latent diffusion framework, meaning images are generated and processed in a compressed latent space rather than at full pixel resolution, which is both faster and more memory-efficient.
The architecture traces its roots to the same family of ideas that produced SDXL. But Kolors departs significantly in one critical component: the text encoder. That departure is not a small tweak. It reshapes what kinds of prompts the model can handle and who can use it effectively.
Built on Latent Diffusion
In a latent diffusion model, the actual denoising process happens in a low-dimensional latent space. A Variational Autoencoder (VAE) compresses images into this space for training, and at inference time the model generates in that compressed space before the VAE decoder reconstructs the final pixel image.
This means the computationally expensive part, the iterative denoising, operates on a much smaller tensor than raw pixels would require. Kolors uses a 4-channel latent with a downscaling factor of 8, so a 1024x1024 image is processed as a 128x128 latent. The UNet inside Kolors is trained to predict the noise at each step of the diffusion process in this compressed space, then hands off to the VAE decoder for final rendering.
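To make the size difference concrete, here is a minimal sketch of the shape arithmetic described above. The helper names are illustrative, and the only inputs are the figures already given: 4 latent channels and a downscaling factor of 8.

```python
def latent_shape(height, width, channels=4, factor=8):
    """Return (channels, latent_height, latent_width) for a pixel image size."""
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the downscaling factor")
    return (channels, height // factor, width // factor)


def element_count(shape):
    """Number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixels = (3, 1024, 1024)             # RGB image at full resolution
latents = latent_shape(1024, 1024)   # compressed representation the UNet works on
ratio = element_count(pixels) / element_count(latents)
print(latents, ratio)                # (4, 128, 128) 48.0
```

In other words, each denoising step touches roughly 48 times fewer values than it would at pixel resolution.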
The ChatGLM Text Encoder
This is where Kolors really separates itself from the competition. Instead of CLIP, which encodes text into a 77-token fixed-length sequence with limited semantic depth, Kolors uses ChatGLM3-6B as its text encoder. ChatGLM is a bilingual Chinese-English language model from Tsinghua University and Zhipu AI, and it handles long, nuanced prompts far more gracefully than CLIP.
The practical difference is real and measurable. CLIP was trained on English-heavy web-scraped data and struggles with three categories of input:
Long prompts beyond 77 tokens
Chinese characters and their semantic nuance
Complex compositional instructions with multiple subjects and relationships
ChatGLM addresses all three. It supports up to 256 tokens of input, has native bilingual capability at the training level, and brings transformer depth that comes from a 6-billion-parameter language model trained specifically for Chinese-English understanding.
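As a rough illustration of what the two token limits mean for a long prompt, the sketch below treats both numbers as hard caps and counts how many tokens would be cut off. It is a simplification: real tokenizers also reserve slots for special tokens, and the prompt lengths are made-up examples.

```python
CLIP_MAX_TOKENS = 77       # limit cited above for CLIP
CHATGLM_MAX_TOKENS = 256   # limit cited above for Kolors' ChatGLM encoder


def tokens_dropped(prompt_tokens, max_tokens):
    """How many tokens fall off the end when a prompt is cut at max_tokens."""
    return max(0, prompt_tokens - max_tokens)


# Hypothetical prompt lengths, in tokens.
for length in (60, 120, 300):
    print(length, tokens_dropped(length, CLIP_MAX_TOKENS), tokens_dropped(length, CHATGLM_MAX_TOKENS))
```

A 120-token prompt loses 43 tokens under the 77-token cap and none under the 256-token cap.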
💡 The core difference: Kolors can accept a full prompt in Chinese and produce accurately described results, whereas SDXL would either mistranslate or ignore the semantic content entirely.
Why Kolors Stands Out
There are dozens of open-source text-to-image models available in 2024. What gives Kolors a reason to exist beyond another SDXL fine-tune?
Bilingual Text in Images
Most Western models produce garbled or inconsistent Chinese characters when asked to embed text in images. Kolors handles Chinese typography with noticeably higher accuracy because its text encoder understands the semantic weight of Chinese tokens at a deep level. This matters enormously for commercial use cases in Chinese-speaking markets: e-commerce product banners, poster design, social media graphics, and app store screenshots.
The same capability extends to English text rendering. Kolors is competitive with, though not consistently superior to, dedicated text-in-image models for Latin scripts. It is, however, the only major open-source model that handles both scripts reliably in a single architecture without requiring any translation preprocessing step.
Photorealistic Output Quality
Kolors produces images with a strong bias toward photorealism. Skin textures, fabric detail, ambient lighting, and environmental depth all render with a natural quality that competes with fine-tuned SDXL variants and FLUX.1 Dev. The model was trained on a large proprietary dataset curated by Kuaishou, which included high-resolution photography and commercial imagery sourced through the platform's media ecosystem.
In comparisons reported shortly after release, Kolors consistently scored higher than base SDXL across multiple measures:
| Metric | SDXL Base | Kolors |
|---|---|---|
| Human preference score | 75.2% | 85.6% |
| Skin texture realism | Medium | High |
| Chinese prompt accuracy | Poor | High |
| Prompt following (long) | Medium | High |
| Text rendering (Chinese) | Fail | Pass |
| Compositional accuracy | Medium | High |
These numbers come from the Kolors paper's evaluation section and represent Kuaishou's internal benchmark methodology. Independent community testing has broadly confirmed the trend, particularly for portrait subjects and product-on-surface imagery where photorealism is the primary measure.
Open-Source and Free to Run
Kuaishou released Kolors under an open license that permits commercial use. The full weights, VAE, and scheduler configurations are freely available on Hugging Face. Anyone with a suitable GPU can run it locally with no API costs. This was a deliberate strategic choice: releasing openly builds an ecosystem of fine-tuners, LoRA trainers, and downstream applications that extend the model's reach far beyond what Kuaishou could produce internally.
Kolors Architecture Deep Dive
Understanding why Kolors works requires a brief look at what is happening inside the model. If you are not interested in the technical specifics, you can skip this section without losing the thread of the article.
The UNet Backbone
Kolors uses a modified UNet as its denoising backbone, similar in structure to SDXL's UNet but trained from scratch on Kuaishou's data pipeline. The structural decisions that define its behavior:
Resolution support: Native 1024x1024 training with multi-aspect ratio conditioning support
Attention layers: Cross-attention conditions generation on text embeddings from ChatGLM at multiple UNet depths
Skip connections: Standard UNet skip connections preserve spatial structure during the upsampling path
Timestep conditioning: Sinusoidal timestep embeddings communicate denoising stage to the network throughout inference
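The timestep conditioning in the last item can be sketched in a few lines. This is the generic sinusoidal formulation, not code taken from Kolors: the max_period value and the sin-then-cos ordering are assumptions, and implementations differ on both.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep to a dim-length vector of sines and cosines."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies fall off geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


early = timestep_embedding(900, 8)   # a noisy, early denoising stage
late = timestep_embedding(50, 8)     # a nearly clean, late stage
print(len(early), early != late)
```

Because every timestep maps to a distinct pattern across the frequencies, the network can tell how much noise it should expect at each step.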
The UNet receives the noisy latent tensor and the projected text conditioning simultaneously, predicts the noise component to remove, and iterates. Common samplers including DDIM, DPM++ 2M Karras, and Euler are all compatible with Kolors. Typically 20 to 50 steps produce high-quality results, with 25 to 35 being the practical sweet spot for speed-quality balance.
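The predict-and-iterate loop can be shown with a toy deterministic DDIM sampler. This is a sketch under simplifying assumptions: latents are plain Python lists, the noise schedule is a made-up linear ramp, and predict_noise is a stand-in for the text-conditioned UNet. Here it is an oracle that already knows the true noise, which lets the loop recover the clean signal.

```python
import math


def ddim_step(x_t, eps_pred, alpha_t, alpha_prev):
    """One deterministic DDIM update (eta = 0), applied elementwise.

    alpha_t and alpha_prev are cumulative signal levels (alpha-bar values).
    """
    out = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean value implied by the current noise prediction.
        x0_pred = (x - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)
        # Re-noise that estimate to the next, less noisy level.
        out.append(math.sqrt(alpha_prev) * x0_pred + math.sqrt(1.0 - alpha_prev) * eps)
    return out


def sample(x_start, predict_noise, alphas):
    """Iterate from the noisiest level to a clean sample.

    alphas is ordered from noisiest to cleanest; the final update targets 1.0.
    """
    x = x_start
    for i, alpha_t in enumerate(alphas):
        alpha_prev = alphas[i + 1] if i + 1 < len(alphas) else 1.0
        x = ddim_step(x, predict_noise(x, i), alpha_t, alpha_prev)
    return x


clean = [0.5, -1.0, 2.0]     # toy "latent" we hope to recover
noise = [0.3, 0.1, -0.7]     # the noise that was mixed in
steps = 30
alphas = [0.02 + 0.96 * i / (steps - 1) for i in range(steps)]  # illustrative schedule
x_noisy = [math.sqrt(alphas[0]) * c + math.sqrt(1.0 - alphas[0]) * n
           for c, n in zip(clean, noise)]

result = sample(x_noisy, lambda x, i: noise, alphas)
print([round(v, 6) for v in result])
```

A real UNet only approximates the noise, which is why step count and sampler choice affect the final image.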
How Tokens Flow Through the Model
The text conditioning pipeline in Kolors is more complex than SDXL's two-encoder setup. Here is the processing sequence from raw prompt to image conditioning:
Input prompt arrives in Chinese or English, up to 256 tokens
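ChatGLM3-6B tokenizer encodes the text into token IDs with language-aware subword tokenization
ChatGLM3-6B encoder turns those token IDs into a sequence of 4096-dimensional contextual embeddings
Projection layer maps the 4096-dimensional embeddings to the UNet's internal embedding dimension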
Cross-attention in UNet layers at multiple depths conditions each generation step on these projections
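The last two stages of the sequence above can be sketched with toy dimensions. This is not Kolors code: the sizes are tiny stand-ins (8 in place of 4096, 4 in place of the UNet's conditioning width), the weights are random, and the separate query/key/value projections that a real attention layer learns are left out for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: image positions attend to text tokens."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [[scale * sum(q * k for q, k in zip(query, key)) for key in keys]
              for query in queries]
    weights = [softmax(row) for row in scores]   # one distribution over tokens per position
    return matmul(weights, values), weights


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
TEXT_DIM, UNET_DIM = 8, 4        # toy stand-ins, see lead-in
tokens, positions = 5, 3

text_states = random_matrix(tokens, TEXT_DIM, rng)    # stand-in for text encoder output
projection = random_matrix(TEXT_DIM, UNET_DIM, rng)   # stand-in for the projection layer
context = matmul(text_states, projection)             # (tokens x UNET_DIM)

image_queries = random_matrix(positions, UNET_DIM, rng)
attended, weights = cross_attention(image_queries, context, context)
print(len(context), len(context[0]), len(attended), len(attended[0]))
```

Each image position ends up with a weighted mix of the projected token embeddings, which is how the prompt steers every denoising step.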
The ChatGLM embeddings carry significantly more contextual information than CLIP embeddings because ChatGLM is a full autoregressive transformer, not a contrastive model trained on paired image-text data. This is why Kolors handles compositional prompts more reliably: "a woman in a red coat standing in front of a green building on a rainy evening" is understood as a compositional scene, not as a statistical cluster of image-text associations.
VAE and Latent Space
Kolors uses a standard KL-regularized VAE with 4 latent channels. The encoder maps pixel images to latents at 8x spatial compression for training, and the decoder reconstructs full images at inference time. The Kolors team used SDXL's released VAE weights as initialization, which is one reason the latent space characteristics are broadly compatible with some SDXL-adjacent tooling and communities.
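For readers unfamiliar with the term, "KL-regularized" refers to a penalty that keeps the encoder's latent distribution close to a standard normal. The sketch below is the textbook closed-form penalty for a diagonal Gaussian, not code from the Kolors repository.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))   # 0.0, already standard normal
print(kl_to_standard_normal([1.0], [0.0]))             # 0.5, mean shifted by one
```

During VAE training this term is added to the reconstruction loss with a small weight, so the latents stay well-behaved without blurring the decoded images too much.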
Kolors vs Other Models
Where does Kolors actually sit in the broader model landscape?
Kolors occupies a specific niche: photorealistic generation with native bilingual support at open-source access levels. No other model discussed here covers that combination. FLUX.1 Dev comes close on photorealism but lacks the Chinese language depth. GPT Image 2 handles Chinese better but is closed-source and API-only.
Where Kolors is relatively weaker: its LoRA fine-tune ecosystem is smaller than SDXL's, inference speed is slower than FLUX.1 Schnell, and the model has fewer community-trained stylistic checkpoint variations available for download.
When to Choose Kolors
Choose Kolors when:
Your audience is Chinese-speaking and prompts will primarily be in Chinese
You need photorealistic portraits or commercial product imagery at publication quality
You want open-source weights with a commercial-friendly license for production deployment
You are building a bilingual product and want a single model to serve both language audiences
Reach for something else when:
Maximum inference speed is the priority (FLUX.1 Schnell is meaningfully faster)
You need the largest fine-tune and LoRA ecosystem (SDXL still leads there)
Complex scene composition with exact instruction following is required (FLUX.1 Dev performs better)
Real-World Use Cases
Portrait and Fashion Photography
Kolors produces portrait-quality output that can substitute for stock photography in many commercial contexts. Skin tone rendering is natural across a wide range of ethnicities, lighting is coherent without artificial-looking specularity, and the model handles clothing fabric behavior with surprising accuracy. Fashion photographers and e-commerce production teams in China adopted it rapidly for generating product imagery and model reference frames at a scale that traditional photography cannot match economically.
Product Visualization
Chinese brands actively use Kolors to generate product-in-context imagery without expensive studio photo shoots. A skincare brand can produce a serum bottle on a marble bathroom shelf with accurate caustic lighting and shadow at a fraction of the cost of traditional photography. The native bilingual prompt support means the marketing team can work directly in Chinese without any translation preprocessing or language approximation.
The model's handling of reflective surfaces, material transparency, and surface-light interaction is particularly strong for product visualization work, which depends on material rendering accuracy.
Social Media and Content Creation
Kuaishou's own platform is a short-video product competing directly with Douyin, TikTok's Chinese counterpart. Kolors feeds directly into Kuaishou's in-platform creator tools. Content creators can generate thumbnail images, banner art, and promotional graphics entirely in Chinese, with results that match the visual expectations of Chinese social media aesthetics without any English-language intermediary step.
Running Kolors in Practice
Via API
The simplest way to run Kolors without local hardware is through the Replicate API. The model is hosted at kwai-kolors/kolors and accepts standard diffusion parameters: prompt, negative prompt, steps, guidance scale, and seed. A typical call costs a fraction of a cent per image and produces 1024x1024 output in roughly 10 to 20 seconds of inference time.
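A call along these lines might look like the sketch below. Treat it as an assumption-laden outline rather than the hosted model's documented schema: the input key names (such as num_inference_steps) are hypothetical, the default values are illustrative, and the model reference may need a ":version" suffix. The Chinese prompt is an illustrative rendering of "A woman in a white dress standing under cherry blossom trees."

```python
def build_kolors_input(prompt, negative_prompt="", steps=30, guidance_scale=5.0, seed=None):
    """Assemble the request payload. Key names are assumptions, check the model page."""
    payload = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,      # hypothetical key name
        "guidance_scale": guidance_scale,  # hypothetical key name
    }
    if seed is not None:
        payload["seed"] = seed
    return payload


def run_on_replicate(payload, model="kwai-kolors/kolors"):
    """Send the payload to Replicate. Needs the replicate package and REPLICATE_API_TOKEN."""
    import replicate  # imported lazily so the payload helper works without the client installed

    return replicate.run(model, input=payload)


payload = build_kolors_input("一位穿着白色连衣裙的女人站在樱花树下", seed=42)
print(payload)
# To actually generate an image: output = run_on_replicate(payload)
```

Fixing the seed makes a result reproducible, which helps when comparing prompt wording across languages.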
For example, a Chinese prompt meaning "A woman in a white dress standing under cherry blossom trees" is parsed natively, without any translation step.
Local Setup Requirements
For local deployment, the recommended configuration is:
GPU: RTX 3090 with 24GB VRAM for comfortable inference at 1024x1024
Framework: Diffusers 0.30.0 or later from Hugging Face
Checkpoint: Kwai-Kolors/Kolors-diffusers (the Diffusers-format weights), available directly on Hugging Face Hub
The model integrates with the KolorsPipeline class in Diffusers, which handles ChatGLM tokenization automatically. Loading with float16 precision fits within 18 to 20GB VRAM. On an RTX 4090, 30-step inference at 1024x1024 runs in approximately 8 to 12 seconds.
💡 Tip: Apply torch.compile() to the UNet for a 20 to 30% inference speed improvement on Ada and Ampere generation GPUs. Pair with SDPA attention for additional memory efficiency without quality loss.
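Putting the local setup together, a minimal loading sketch might look like this. It assumes a CUDA GPU, Diffusers 0.30.0 or later, and the Kwai-Kolors/Kolors-diffusers repository with an fp16 variant. The step count and guidance scale are illustrative defaults, not tuned recommendations.

```python
def generation_kwargs(prompt, negative_prompt="", steps=30, guidance_scale=5.0, size=1024):
    """Collect the call arguments for a square image at the model's native resolution."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "height": size,
        "width": size,
    }


def load_pipeline(repo_id="Kwai-Kolors/Kolors-diffusers", compile_unet=False):
    """Load Kolors in float16 on the GPU, optionally compiling the UNet."""
    # Imported lazily so generation_kwargs stays usable without a GPU stack installed.
    import torch
    from diffusers import DPMSolverMultistepScheduler, KolorsPipeline

    pipe = KolorsPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16")
    # Swap in a DPM++ 2M sampler with Karras sigmas.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config, use_karras_sigmas=True
    )
    pipe.to("cuda")
    if compile_unet:
        pipe.unet = torch.compile(pipe.unet)  # first call is slow while compilation runs
    return pipe


kwargs = generation_kwargs("a ceramic teapot on a wooden table, soft window light")
print(kwargs["num_inference_steps"], kwargs["height"], kwargs["width"])

# Usage on a machine with a suitable GPU:
#   pipe = load_pipeline(compile_unet=True)
#   image = pipe(**kwargs).images[0]
#   image.save("kolors.png")
```

Compilation pays off only when generating many images in one session, since the first call absorbs the compile time.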
LoRA Fine-Tuning
Kolors supports LoRA fine-tuning through the standard Diffusers training scripts, with minor modifications to accommodate the ChatGLM conditioning path. Training a style LoRA of reasonable quality requires roughly 20 to 50 reference images and 1,000 to 2,000 training steps on a 24GB GPU. Full DreamBooth-style fine-tuning is also supported for identity consistency tasks.
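To show why LoRA training fits on a single 24GB card, here is a minimal sketch of the low-rank update itself. It is the generic LoRA formulation in plain Python, not the Diffusers training script, and the 2048-wide layer and rank of 16 in the parameter count are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (nested lists) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, base_weight, lora_a, lora_b, alpha):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A and B trained."""
    rank = len(lora_a)                       # A has shape (r x d_in)
    base = matvec(base_weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return rank * (d_in + d_out)


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]      # frozen 2 x 3 base weight
A = [[0.5, -0.5, 1.0]]                       # rank-1 down-projection
B_zero = [[0.0], [0.0]]                      # B starts at zero, so training begins at the base model
x = [1.0, 2.0, 3.0]

print(lora_forward(x, W, A, B_zero, alpha=1.0) == matvec(W, x))   # True
print(trainable_params(2048, 2048, 16), 2048 * 2048)              # 65536 4194304
```

For that illustrative layer the adapter trains about 1.6% of the weights, which is where the memory savings come from.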
The Kuaishou team released companion models including Kolors-IP-Adapter for identity-consistent portrait generation and Kolors-ControlNet for pose, depth, and canny edge conditioning. These additions make the practical toolkit around Kolors considerably more complete than a standalone base model evaluation would suggest.
The Bigger Picture
Kolors represents something worth paying attention to beyond its individual benchmark numbers. It is part of a broader wave of high-quality AI models from Chinese research teams that have received relatively little visibility in Western discussions of frontier model development.
Kuaishou, like ByteDance and Alibaba's DAMO Academy, has the engineering depth, computational resources, and proprietary training data to produce models that compete with or exceed Western counterparts in specific domains. Kolors does not beat Midjourney v6 on raw aesthetic style. But it outperforms base SDXL on most photorealistic benchmarks while remaining open-source, supporting two languages natively, being commercially licensed, and being free to self-host.
The release also validated the approach of using large language models as text encoders for image generation. T5-XXL, used in Stable Diffusion 3 and FLUX.1 Dev, proved the concept from the Western side. Kolors proved it works at least as well, and likely better for non-English languages, using ChatGLM. The search for better text-conditioning architectures for diffusion models continues, and Kolors is a meaningful data point in that ongoing story.
For teams building products that serve Chinese-speaking users, Kolors is not a curiosity or an experimental alternative. It is currently the strongest open-source option for combining photorealistic image quality with native Chinese prompt support in a single commercially licensed model.