Kolors AI by Kuaishou: How It Works

Founder of Picasso IA

May 19, 2026 - 2:47 PM

Kolors arrived in 2024 as one of the more technically interesting releases from China's AI research scene. Built by Kuaishou Technology, the company behind the short-video platform Kuaishou, this text-to-image model went open-source almost immediately and earned serious attention from researchers and developers who had grown accustomed to Western-dominated model releases.

What made people stop and look was not just the image quality, which is genuinely impressive, but the architectural decisions underneath. Kolors uses ChatGLM, a large language model, as its text encoder rather than the CLIP encoders that most Western models rely on. That single choice has significant consequences for how prompts are interpreted and how well the model handles Chinese-language input in commercial and creative applications.

This article covers how Kolors works technically, what it produces, how it compares to other frontier models, and where it fits in today's rapidly shifting landscape of open-source image generation.

What Kolors Actually Is

Kolors is a large-scale text-to-image diffusion model developed by Kuaishou's Kolors team and released publicly in mid-2024. The weights are published on Hugging Face under the Kwai-Kolors organization, including a Diffusers-format checkpoint named Kolors-diffusers, and anyone can download and run them. It is built on a latent diffusion framework, meaning images are generated and processed in a compressed latent space rather than at full pixel resolution, which is both faster and more memory-efficient.

The architecture traces its roots to the same family of ideas that produced SDXL. But Kolors departs significantly in one critical component: the text encoder. That departure is not a small tweak. It reshapes what kinds of prompts the model can handle and who can use it effectively.

Built on Latent Diffusion

In a latent diffusion model, the actual denoising process happens in a low-dimensional latent space. A Variational Autoencoder (VAE) compresses images into this space for training, and at inference time the model generates in that compressed space before the VAE decoder reconstructs the final pixel image.

This means the computationally expensive part, the iterative denoising, operates on a much smaller tensor than raw pixels would require. Kolors uses a 4-channel latent with a downscaling factor of 8, so a 1024x1024 image is processed as a 128x128 latent. The UNet inside Kolors is trained to predict the noise at each step of the diffusion process in this compressed space, then hands off to the VAE decoder for final rendering.
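The size reduction described above is simple arithmetic. The sketch below works it through for the figures quoted in this section (4 latent channels, 8x downscaling); the function names are illustrative and not part of any Kolors API.

```python
def latent_shape(height, width, channels=4, downscale=8):
    """Return (channels, latent_height, latent_width) for a pixel-space image size."""
    if height % downscale or width % downscale:
        raise ValueError("image dimensions must be divisible by the downscale factor")
    return (channels, height // downscale, width // downscale)


def compression_ratio(height, width, channels=4, downscale=8, pixel_channels=3):
    """Ratio of values in the RGB image to values in the latent tensor."""
    c, h, w = latent_shape(height, width, channels, downscale)
    return (pixel_channels * height * width) / (c * h * w)


print(latent_shape(1024, 1024))       # (4, 128, 128)
print(compression_ratio(1024, 1024))  # 48.0
```

In other words, each denoising step touches about 48 times fewer values than it would in RGB pixel space.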

The ChatGLM Text Encoder

This is where Kolors really separates itself from the competition. Instead of CLIP, which encodes text into a 77-token fixed-length sequence with limited semantic depth, Kolors uses ChatGLM3-6B as its text encoder. ChatGLM is a bilingual Chinese-English language model from Tsinghua University and Zhipu AI, and it handles long, nuanced prompts far more gracefully than CLIP.

The practical difference is real and measurable. CLIP was trained on English-heavy web-scraped data and struggles with three categories of input:

Long prompts beyond 77 tokens
Chinese characters and their semantic nuance
Complex compositional instructions with multiple subjects and relationships

ChatGLM addresses all three. It supports up to 256 tokens of input, has native bilingual capability at the training level, and brings transformer depth that comes from a 6-billion-parameter language model trained specifically for Chinese-English understanding.
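To make the context-length difference concrete, here is a toy sketch of what happens to a long prompt under each limit. The whitespace tokenizer is a hypothetical stand-in: real CLIP and ChatGLM tokenizers split text into subwords and reserve slots for special tokens, so actual counts will differ.

```python
CLIP_CONTEXT = 77     # fixed CLIP text context length
KOLORS_CONTEXT = 256  # context length quoted for Kolors' ChatGLM encoder


def toy_tokenize(prompt):
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return prompt.split()


def fit_to_context(tokens, limit):
    """Split tokens into the part the encoder sees and the part that is dropped."""
    return tokens[:limit], tokens[limit:]


# A 120-"token" prompt, longer than CLIP's window but well inside 256.
tokens = toy_tokenize(" ".join(f"word{i}" for i in range(120)))

_, dropped_by_clip = fit_to_context(tokens, CLIP_CONTEXT)
_, dropped_by_kolors = fit_to_context(tokens, KOLORS_CONTEXT)

print(len(dropped_by_clip))    # 43
print(len(dropped_by_kolors))  # 0
```

Anything in the dropped tail never reaches the denoiser, which is why late details in long prompts tend to vanish with a 77-token encoder.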

💡 The core difference: Kolors can accept a full prompt in Chinese and produce accurately described results, whereas SDXL's CLIP encoders capture little of a Chinese prompt's meaning and the output tends to ignore most of its semantic content.

Why Kolors Stands Out

There are dozens of open-source text-to-image models available in 2024. What gives Kolors a reason to exist beyond another SDXL fine-tune?

Bilingual Text in Images

Most Western models produce garbled or inconsistent Chinese characters when asked to embed text in images. Kolors handles Chinese typography with noticeably higher accuracy because its text encoder understands the semantic weight of Chinese tokens at a deep level. This matters enormously for commercial use cases in Chinese-speaking markets: e-commerce product banners, poster design, social media graphics, and app store screenshots.

The same capability extends to English text rendering. Kolors is competitive with, though not consistently superior to, dedicated text-in-image models for Latin scripts. It is, however, the only major open-source model that handles both scripts reliably in a single architecture without requiring any translation preprocessing step.

Photorealistic Output Quality

Kolors produces images with a strong bias toward photorealism. Skin textures, fabric detail, ambient lighting, and environmental depth all render with a natural quality that competes with fine-tuned SDXL variants and FLUX.1 Dev. The model was trained on a large proprietary dataset curated by Kuaishou, which included high-resolution photography and commercial imagery sourced through the platform's media ecosystem.

In the evaluations Kuaishou reported at release, Kolors scored higher than base SDXL across several measures:

Metric	SDXL Base	Kolors
Human preference score	75.2%	85.6%
Skin texture realism	Medium	High
Chinese prompt accuracy	Poor	High
Prompt following (long)	Medium	High
Text rendering (Chinese)	Fail	Pass
Compositional accuracy	Medium	High

These numbers come from the Kolors paper's evaluation section and represent Kuaishou's internal benchmark methodology. Independent community testing has broadly confirmed the trend, particularly for portrait subjects and product-on-surface imagery where photorealism is the primary measure.

Open-Source and Free to Run

Kuaishou released Kolors with openly downloadable weights under a license that allows commercial use, though the terms attach conditions (such as registering commercial use with the licensor), so read the current license text before deploying. The full weights, VAE, and scheduler configurations are freely available on Hugging Face. Anyone with a suitable GPU can run it locally with no API costs. This was a deliberate strategic choice: releasing openly builds an ecosystem of fine-tuners, LoRA trainers, and downstream applications that extend the model's reach far beyond what Kuaishou could produce internally.

Kolors Architecture Deep Dive

Understanding why Kolors works requires a brief look at what is happening inside the model. If you are not interested in the technical specifics, you can skip this section without losing the thread of the article.

The UNet Backbone

Kolors uses a modified UNet as its denoising backbone, similar in structure to SDXL's UNet but trained from scratch on Kuaishou's data pipeline. The structural decisions that define its behavior:

Resolution support: Native 1024x1024 training with multi-aspect ratio conditioning support
Attention layers: Cross-attention conditions generation on text embeddings from ChatGLM at multiple UNet depths
Skip connections: Standard UNet skip connections preserve spatial structure during the upsampling path
Timestep conditioning: Sinusoidal timestep embeddings communicate denoising stage to the network throughout inference
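As an illustration of the timestep conditioning item, here is a minimal sinusoidal embedding in pure Python. It is a sketch of the general technique, not Kolors' actual code: the embedding width, the `max_period` value, and the sin-then-cos ordering are assumptions, and implementations differ on those details.

```python
import math


def timestep_embedding(timestep, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep as a list of `dim` floats."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [timestep * f for f in freqs]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]


emb = timestep_embedding(500, 8)
print(len(emb))                  # 8
print(timestep_embedding(0, 4))  # [0.0, 0.0, 1.0, 1.0]
```

Because nearby timesteps map to nearby vectors, the network can vary its behaviour smoothly between early, noisy steps and late, detail-refining steps.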

The UNet receives the noisy latent tensor and the projected text conditioning simultaneously, predicts the noise component to remove, and iterates. Common samplers including DDIM, DPM++ 2M Karras, and Euler are all compatible with Kolors. Typically 20 to 50 steps produce high-quality results, with 25 to 35 being the practical sweet spot for speed-quality balance.
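The predict-noise-then-iterate loop can be sketched with a single deterministic DDIM update, one of the samplers named above. This is a textbook form of the update operating on a flat list of numbers, with the noise prediction supplied by hand rather than by a UNet; the variable names and the schedule value are illustrative.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of latent values."""
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean latent implied by the predicted noise.
    x0_pred = [(x - sqrt_one_minus_a_t * e) / sqrt_a_t for x, e in zip(x_t, eps_pred)]
    # Move that estimate to the previous, less noisy level of the schedule.
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_a_prev * x0 + sqrt_one_minus_a_prev * e for x0, e in zip(x0_pred, eps_pred)]


# Sanity check with a "perfect" predictor: noise a known latent, then undo it.
x0 = [0.5, -1.0, 2.0]
noise = [0.1, 0.3, -0.2]
alpha_bar_t = 0.5
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * n for a, n in zip(x0, noise)]

recovered = ddim_step(x_t, noise, alpha_bar_t, alpha_bar_prev=1.0)
print([round(v, 6) for v in recovered])  # [0.5, -1.0, 2.0]
```

A real sampler repeats this update across the whole schedule, calling the UNet for a fresh noise prediction at each step, which is where the 20 to 50 step count comes from.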

How Tokens Flow Through the Model

The text conditioning pipeline in Kolors differs from SDXL's two-encoder CLIP setup: a single large language model produces all of the text conditioning. Here is the processing sequence from raw prompt to image conditioning:

Input prompt arrives in Chinese or English, up to 256 tokens
ChatGLM3-6B tokenizer encodes the text into token IDs with language-aware subword tokenization
ChatGLM encoder produces rich contextual embeddings with shape [batch, seq_len, 4096]
Projection layer maps the 4096-dimensional embeddings to the UNet's internal embedding dimension
Cross-attention in UNet layers at multiple depths conditions each generation step on these projections
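The last three steps of that sequence can be traced with a toy example. The sketch below uses tiny dimensions and random numbers in place of real ChatGLM embeddings and learned weights, and it simplifies cross-attention by reusing the projected text as both keys and values (a real UNet applies separate learned key and value projections). Every name and size here is illustrative.

```python
import math
import random


def linear(x, weight):
    """Multiply a [rows, d_in] matrix by a [d_in, d_out] weight matrix."""
    return [[sum(a * w for a, w in zip(row, col)) for col in zip(*weight)] for row in x]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (an image position) takes a weighted average of the text values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
SEQ_LEN, TEXT_DIM, UNET_DIM, POSITIONS = 5, 8, 4, 3  # toy sizes; the real text width is 4096

text_embeddings = random_matrix(SEQ_LEN, TEXT_DIM, rng)  # stand-in for ChatGLM output
projection = random_matrix(TEXT_DIM, UNET_DIM, rng)      # stand-in for the learned projection
context = linear(text_embeddings, projection)            # shape [SEQ_LEN, UNET_DIM]

image_queries = random_matrix(POSITIONS, UNET_DIM, rng)  # stand-in for UNet feature queries
attended = cross_attention(image_queries, context, context)

print(len(context), len(context[0]))    # 5 4
print(len(attended), len(attended[0]))  # 3 4
```

The shapes are the point: the text sequence length only appears inside the attention weights, so a longer prompt gives each image position more tokens to attend to without changing the size of the UNet's feature maps.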

The ChatGLM embeddings carry significantly more contextual information than CLIP embeddings because ChatGLM is a full autoregressive transformer, not a contrastive model trained on paired image-text data. This is why Kolors handles compositional prompts more reliably: "a woman in a red coat standing in front of a green building on a rainy evening" is understood as a compositional scene, not as a statistical cluster of image-text associations.

VAE and Latent Space

Kolors uses a standard KL-regularized VAE with 4 latent channels. The encoder maps pixel images to latents at 8x spatial compression for training, and the decoder reconstructs full images at inference time. The Kolors team used SDXL's released VAE weights as initialization, which is one reason the latent space characteristics are broadly compatible with some SDXL-adjacent tooling and communities.

Kolors vs Other Models

Where does Kolors actually sit in the broader model landscape?

Strengths and Weaknesses at a Glance

Model	Text Encoder	Chinese Support	Open Weights	Best At
Kolors	ChatGLM3-6B	Excellent	Yes	Portraits, photorealism, bilingual
SDXL	CLIP + OpenCLIP	Poor	Yes	Speed, LoRA ecosystem size
FLUX.1 Dev	T5-XXL + CLIP	Limited	Yes	Fine detail, prompt accuracy
GPT Image 2	GPT-4o vision	Good	No	Instruction following, text-in-image
Midjourney v6	Proprietary	Fair	No	Artistic style, aesthetics

Kolors occupies a specific niche: photorealistic generation with native bilingual support at open-source access levels. No other model in this comparison covers that combination. FLUX.1 Dev comes close on photorealism but lacks the Chinese language depth. GPT Image 2 handles Chinese better than FLUX.1 Dev but is closed-source and API-only.

Where Kolors is relatively weaker: its LoRA fine-tune ecosystem is smaller than SDXL's, inference speed is slower than FLUX.1 Schnell, and the model has fewer community-trained stylistic checkpoint variations available for download.

When to Choose Kolors

Choose Kolors when:

Your audience is Chinese-speaking and prompts will primarily be in Chinese
You need photorealistic portraits or commercial product imagery at publication quality
You want open-source weights with a commercial-friendly license for production deployment
You are building a bilingual product and want a single model to serve both language audiences

Reach for something else when:

Maximum inference speed is the priority (FLUX Schnell is meaningfully faster)
You need the largest fine-tune and LoRA ecosystem (SDXL still leads there)
Complex scene composition with exact instruction following is required (FLUX.1 Dev performs better)

Real-World Use Cases

Portrait and Fashion Photography

Kolors produces portrait-quality output that can substitute for stock photography in many commercial contexts. Skin tone rendering is natural across a wide range of ethnicities, lighting is coherent without artificial-looking specularity, and the model handles clothing fabric behavior with surprising accuracy. Fashion photographers and e-commerce production teams in China adopted it rapidly for generating product imagery and model reference frames at a scale that traditional photography cannot match economically.

Product Visualization

Chinese brands actively use Kolors to generate product-in-context imagery without expensive studio photo shoots. A skincare brand can produce a serum bottle on a marble bathroom shelf with accurate caustic lighting and shadow at a fraction of the cost of traditional photography. The native bilingual prompt support means the marketing team can work directly in Chinese without any translation preprocessing or language approximation.

The model's handling of reflective surfaces, material transparency, and surface-light interaction is particularly strong for product visualization work, which depends on material rendering accuracy.

Social Media and Content Creation

Kuaishou's own platform is a short-video product competing directly with Douyin, TikTok's sibling app in China. Kolors feeds directly into Kuaishou's in-platform creator tools. Content creators can generate thumbnail images, banner art, and promotional graphics entirely in Chinese, with results that match the visual expectations of Chinese social media aesthetics without any English-language intermediary step.

Running Kolors in Practice

Via API

The simplest way to run Kolors without local hardware is through the Replicate API. The model is hosted at kwai-kolors/kolors and accepts standard diffusion parameters: prompt, negative prompt, steps, guidance scale, and seed. A typical call costs a fraction of a cent per image and produces 1024x1024 output in roughly 10 to 20 seconds of inference time.

model: kwai-kolors/kolors
prompt: "一位穿着白色连衣裙的女子站在樱花树下"
num_inference_steps: 50
guidance_scale: 5.0
width: 1024
height: 1024

The Chinese prompt above translates to: "A woman in a white dress standing under cherry blossom trees." Kolors parses it natively without any translation step.
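In Python, the parameters listed above could be assembled and sent with the official `replicate` client roughly as follows. Treat this as a sketch under assumptions: the input key names mirror the listing in this article (with `negative_prompt` and `seed` guessed from the parameter names mentioned), the hosted model may expect different keys, and depending on the client version the model reference may need an explicit `owner/name:version` suffix.

```python
MODEL_REF = "kwai-kolors/kolors"  # model path as given in this article


def build_kolors_input(prompt, negative_prompt=None, steps=50,
                       guidance_scale=5.0, width=1024, height=1024, seed=None):
    """Assemble the request payload; optional fields are omitted when unset."""
    payload = {
        "prompt": prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "width": width,
        "height": height,
    }
    if negative_prompt is not None:
        payload["negative_prompt"] = negative_prompt  # assumed key name
    if seed is not None:
        payload["seed"] = seed  # assumed key name
    return payload


def generate(prompt, **options):
    """Run the hosted model. Needs `pip install replicate` and REPLICATE_API_TOKEN set."""
    import replicate  # imported lazily so the payload helper works without it

    return replicate.run(MODEL_REF, input=build_kolors_input(prompt, **options))


payload = build_kolors_input("一位穿着白色连衣裙的女子站在樱花树下", seed=42)
print(sorted(payload))
```

Keeping payload construction separate from the network call makes the request easy to inspect and log before any credits are spent.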

Local Setup Requirements

For local deployment, a practical baseline configuration is:

GPU: RTX 3090 with 24GB VRAM for comfortable inference at 1024x1024
Framework: Diffusers 0.30.0 or later from Hugging Face
Checkpoint: Kwai-Kolors/Kolors-diffusers on the Hugging Face Hub (the Diffusers-format export of the Kwai-Kolors/Kolors release)
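As a rough sanity check on the 24GB recommendation, the half-precision weight footprint can be estimated from parameter counts. The counts below are approximate public figures used as assumptions, not measurements of the Kolors checkpoint, and the estimate ignores activations, attention buffers, and framework overhead.

```python
BYTES_PER_PARAM_FP16 = 2

# Approximate parameter counts; assumptions for illustration only.
COMPONENT_PARAMS = {
    "chatglm3_6b_text_encoder": 6.2e9,
    "unet": 2.6e9,
    "vae": 0.08e9,
}


def weights_gib(param_count, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Memory needed to hold the weights alone, in GiB."""
    return param_count * bytes_per_param / 1024 ** 3


total_gib = sum(weights_gib(count) for count in COMPONENT_PARAMS.values())

for name, count in COMPONENT_PARAMS.items():
    print(f"{name}: {weights_gib(count):.1f} GiB")
print(f"total weights: {total_gib:.1f} GiB")  # total weights: 16.5 GiB
```

The text encoder dominates the budget, so offloading or quantizing it is usually the first thing to try on smaller cards.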

The model integrates with the KolorsPipeline class in Diffusers, which handles ChatGLM tokenization automatically. Loading with float16 precision fits within 18 to 20GB VRAM. On an RTX 4090, 30-step inference at 1024x1024 runs in approximately 8 to 12 seconds.
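A minimal loading sketch with Diffusers might look like the following. It assumes the Diffusers-format checkpoint `Kwai-Kolors/Kolors-diffusers` with an `fp16` weight variant, a CUDA GPU, and a recent Diffusers release; the helper names and default values are illustrative, so check the pipeline documentation for the arguments your installed version accepts.

```python
KOLORS_REPO = "Kwai-Kolors/Kolors-diffusers"  # assumed Diffusers-format checkpoint


def load_kolors(device="cuda"):
    """Load the pipeline in float16. Requires torch and diffusers to be installed."""
    import torch
    from diffusers import KolorsPipeline

    pipe = KolorsPipeline.from_pretrained(
        KOLORS_REPO,
        torch_dtype=torch.float16,
        variant="fp16",
    )
    return pipe.to(device)


def generate(pipe, prompt, steps=30, guidance_scale=5.0, seed=0):
    """Generate one 1024x1024 image with a fixed seed for reproducibility."""
    import torch

    generator = torch.Generator(device="cpu").manual_seed(seed)
    result = pipe(
        prompt=prompt,
        negative_prompt="",
        num_inference_steps=steps,
        guidance_scale=guidance_scale,
        height=1024,
        width=1024,
        generator=generator,
    )
    return result.images[0]


# Example usage (downloads several GB of weights on first run):
# pipe = load_kolors()
# generate(pipe, "一位穿着白色连衣裙的女子站在樱花树下").save("kolors.png")
```

The imports sit inside the functions so the module can be imported on a machine without a GPU stack, for example in a web service that only forwards requests to a worker.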

💡 Tip: Apply torch.compile() to the UNet for a 20 to 30% inference speed improvement on Ada and Ampere generation GPUs. Pair with SDPA attention for additional memory efficiency without quality loss.

LoRA Fine-Tuning

Kolors supports LoRA fine-tuning through the standard Diffusers training scripts, with minor modifications to accommodate the ChatGLM conditioning path. Training a style LoRA on Kolors requires roughly 20 to 50 reference images and 1,000 to 2,000 training steps on a 24GB GPU at reasonable quality. Full dreambooth-style fine-tuning is also supported for identity consistency tasks.
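Two bits of arithmetic help put those numbers in context: how few parameters a LoRA adapter trains compared with the layer it wraps, and how many passes over a small dataset the quoted step counts imply. The layer width, rank, and batch size below are hypothetical values chosen for illustration, not Kolors settings.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one adapter: A is [rank, d_in], B is [d_out, rank]."""
    return rank * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix the adapter sits beside."""
    return d_in * d_out


def passes_over_dataset(steps, batch_size, num_images):
    """How many times each image is seen, on average, during training."""
    return steps * batch_size / num_images


width, rank = 1280, 16  # hypothetical attention projection width and LoRA rank
print(lora_param_count(width, width, rank))  # 40960
print(full_param_count(width, width))        # 1638400
print(full_param_count(width, width) / lora_param_count(width, width, rank))  # 40.0

print(passes_over_dataset(steps=1500, batch_size=1, num_images=30))  # 50.0
```

With only a few dozen images, each one is seen tens of times, so watching for overfitting (outputs that copy the training images) matters more than adding steps.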

The Kuaishou team released companion models including Kolors-IP-Adapter for identity-consistent portrait generation and Kolors-ControlNet for pose, depth, and canny edge conditioning. These additions make the practical toolkit around Kolors considerably more complete than a standalone base model evaluation would suggest.

The Bigger Picture

Kolors represents something worth paying attention to beyond its individual benchmark numbers. It is part of a broader wave of high-quality AI models from Chinese research teams that have received relatively little visibility in Western discussions of frontier model development.

Kuaishou, like ByteDance and Alibaba's DAMO Academy, has the engineering depth, computational resources, and proprietary training data to produce models that compete with or exceed Western counterparts in specific domains. Kolors does not beat Midjourney v6 on raw aesthetic style. But it outperforms base SDXL on most photorealistic benchmarks while remaining open-source, supporting two languages natively, being commercially licensed, and being free to self-host.

The release also validated the approach of using large language models as text encoders for image generation. T5-XXL, used in Stable Diffusion 3 and FLUX.1 Dev, proved the concept from the Western side. Kolors proved it works at least as well, and likely better for non-English languages, using ChatGLM. The search for better text-conditioning architectures for diffusion models continues, and Kolors is a meaningful data point in that ongoing story.

For teams building products that serve Chinese-speaking users, Kolors is not a curiosity or an experimental alternative. It is currently the strongest open-source option for combining photorealistic image quality with native Chinese prompt support in a single commercially licensed model.

Start Creating Your Own Images

The fastest way to experience what modern AI image generation can produce, without local hardware, model downloads, or API credentials, is to use a platform that has already handled the infrastructure. PicassoIA hosts dozens of state-of-the-art text-to-image models you can run directly in your browser with no setup.

If you want photorealistic generation in the style that Kolors popularized, FLUX.1 Dev delivers highly detailed, prompt-accurate output on PicassoIA. For instruction-heavy creative work where text in images matters, GPT Image 2 is available with no configuration required. Both run on PicassoIA infrastructure, so you write a prompt and see results in seconds.

Whether you are producing portrait photography, product imagery, social media assets, or simply testing what these models can produce from a detailed description, PicassoIA puts frontier-level image generation in one place. Write your prompt, pick your model, and see what comes out.
