A user opens the Taobao app, picks a model photo, and drops in six reference images: a coat, an inner shirt, pants, shoes, a hat and a bag. They tap a button. Less than seven seconds later, a fresh photo appears — same face, same background, every garment placed correctly with the coat unzipped, revealing the inner shirt. Multiply this by tens of millions of requests per service window, and you get a sense of what Tstars-Tryon 1.0 is solving. This is not the lab-clean VITON-HD setting where one t-shirt gets pasted onto a fashion model in a studio. This is virtual try-on at e-commerce scale, on real-world photos, with stacked outfits and accessories — and it is running today.

Motivation

Virtual try-on (VTON) has been a popular research topic since 2017, and most academic systems treat it as masked inpainting: take a person photo, segment out the garment region, then condition a generative model on a flat reference garment to fill the masked area. CatVTON, FitDiT, FastFit, Leffa — they all live in this paradigm and are evaluated on VITON-HD or DressCode, which are studio-clean datasets with one garment at a time.

This breaks down the moment you point it at a real e-commerce user:

  • Their photos have complex backgrounds, motion blur, partial occlusion, sometimes anime avatars or 3D characters.
  • They want to try multiple items together — a jacket plus an inner shirt, plus shoes, plus a hat — with correct layering (“keep the jacket open, show the t-shirt”).
  • A reliable human-parsing mask is often impossible to obtain.

On the other hand, general-purpose image editing models (QwenEdit, FLUX.2, GPT-Image-1.5, Nano Banana Pro, Seedream5 lite) handle the editing flexibility but collapse in different ways: they lose facial identity, omit reference garments, hallucinate colors, and, critically, take roughly 200 seconds per image at the relevant quality level. For a consumer-facing (C-end) product, 200 seconds is a non-starter.

Tstars-Tryon 1.0 sits in the gap between these two camps. It is a commercial-scale system deployed on Taobao’s “AI Try-On” feature, built on a single 5B-parameter Multi-Modal Diffusion Transformer (MMDiT). It supports up to 6 reference garments across 8 fashion categories (tops, pants, skirts, dresses, coats, shoes, bags, hats), runs in 3.92 s for single-garment and 6.74 s for multi-garment on H200, and beats both academic VTON specialists and frontier closed-source editors on a multi-dimensional benchmark.

Key Idea

Forget masks. Forget warping modules. The central reframing of Tstars-Tryon is this:

Virtual try-on is not inpainting. It is multi-image instruction-following editing.

Imagine a stylist looking at a customer’s photo and a small pile of reference shots: a coat, a shirt, pants, shoes. The stylist does not literally cut and paste fabric. They mentally redress the customer from scratch, preserving the face, pose and background, while making the new outfit obey physics and layering rules. The diffusion-transformer analogue is straightforward: feed all of those images and a text instruction into one MMDiT context window and let the model denoise a fresh image.

Concretely, the input to the model is a token sequence built from:

  • text tokens (the prompt, possibly rewritten by a tailored prompt-rewriter),
  • image tokens of the person photo,
  • image tokens of 1 to 6 reference garments / accessories.

These tokens go through joint attention blocks. There are no inpainting masks, no separate warping network, no per-category adapters. Layering, occlusion, and accessory placement are learned implicitly from data.
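As a rough illustration of the single-context idea, the sketch below counts how many tokens such a joint sequence would hold. The VAE downsampling factor, patch size, image sizes, and the `image_tokens` / `sequence_length` helpers are illustrative assumptions, not values or names from the paper. The noisy target latent is included in the count because it shares the same attention context.

```python
from dataclasses import dataclass

# Assumed values for illustration only; the paper does not state them.
VAE_DOWNSAMPLE = 8   # pixels per latent cell along each side
PATCH_SIZE = 2       # latent cells per token along each side


@dataclass
class Image:
    height: int  # pixels
    width: int   # pixels


def image_tokens(img: Image) -> int:
    """Tokens contributed by one image after VAE encoding and patchifying."""
    cell = VAE_DOWNSAMPLE * PATCH_SIZE
    return (img.height // cell) * (img.width // cell)


def sequence_length(text_tokens: int, person: Image, refs: list[Image], target: Image) -> int:
    """Joint sequence length: text + person + references + noisy target."""
    if not 1 <= len(refs) <= 6:
        raise ValueError("expected 1 to 6 reference images")
    ref_total = sum(image_tokens(r) for r in refs)
    return text_tokens + image_tokens(person) + ref_total + image_tokens(target)


person = Image(1024, 768)
refs = [Image(512, 512)] * 3
n = sequence_length(text_tokens=64, person=person, refs=refs, target=person)
print(n)  # 9280 under the assumed sizes
```

Under these assumptions, each extra reference image only lengthens the sequence; nothing in the architecture has to change.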

This is a deceptively simple reframing — and it is what unlocks scaling to 8 categories, arbitrary reference counts, and in-the-wild backgrounds.

Architecture

The backbone is a unified MMDiT with about 5B parameters, in the spirit of Stable Diffusion 3 / FLUX rectified-flow transformers. Three things matter for understanding it:

  1. Joint text-image attention. Both modalities live in the same attention blocks; reference garment tokens can attend to the person tokens and to the (noisy) target latent simultaneously.
  2. Variable token count. The model accepts any number of reference images (1–6) without architectural changes; data packing inspired by NaViT (Patch n’ Pack) is adapted to the DiT setting to avoid the wasted compute that fixed-resolution bucketing imposes.
  3. No inpainting head. The output is a freshly generated image, not a paste-back into a masked region.
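To make the packing idea in item 2 concrete, here is a minimal sketch that packs variable-length samples into fixed token budgets. The paper does not describe its packing algorithm; greedy first-fit-decreasing, the `pack_sequences` helper, and the example lengths and budget are all assumptions chosen for illustration.

```python
def pack_sequences(lengths: list[int], budget: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing of sample indices into bins of at most `budget` tokens."""
    bins: list[list[int]] = []   # sample indices per packed sequence
    loads: list[int] = []        # tokens used per packed sequence
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        if lengths[idx] > budget:
            raise ValueError(f"sample {idx} exceeds the token budget")
        for b, load in enumerate(loads):
            if load + lengths[idx] <= budget:
                bins[b].append(idx)
                loads[b] += lengths[idx]
                break
        else:
            # No existing bin has room: open a new packed sequence.
            bins.append([idx])
            loads.append(lengths[idx])
    return bins


lengths = [9280, 5184, 4160, 7232, 3136]   # hypothetical per-sample token counts
budget = 12288                             # hypothetical tokens per packed sequence
packed = pack_sequences(lengths, budget)
utilization = sum(lengths) / (len(packed) * budget)
print(packed, f"utilization={utilization:.2f}")
```

In a real packed batch, the attention mask must also keep samples that share a packed sequence from attending to each other, as in NaViT.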

Rectified-flow generation

Tstars-Tryon inherits the rectified-flow formulation. Each noisy training input is built by linearly interpolating between a clean target latent $x_0$ and Gaussian noise:

$$ z_t = (1-t)\,x_0 + t\,\epsilon,\qquad \epsilon \sim \mathcal{N}(0,I),\quad t \in [0,1]. \tag{1} $$

where:

  • $x_0$ — clean target try-on image (in latent space),
  • $\epsilon$ — Gaussian noise sample,
  • $t$ — flow time / noise level,
  • $z_t$ — noisy latent on the straight line between data and noise.

The network learns a velocity field $v_\theta(z_t, t, c) \approx x_0 - \epsilon$, where the conditioning $c$ bundles all text and image tokens. Under Eq. (1) this target equals $-\mathrm{d}z_t/\mathrm{d}t$, so it points from noise toward data. Interpretation: the model is taught to “point home”. Given a noisy latent and the multimodal context, it predicts the direction back to the clean image. Sampling takes a few Euler steps along $v_\theta$, from $t=1$ down to $t=0$.
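A toy sketch of Eq. (1) and the Euler sampler, under loud assumptions: the latent is a three-element list, and the trained network is replaced by an oracle that returns the exact target $x_0 - \epsilon$. The helper names (`interpolate`, `target_velocity`, `euler_sample`) are made up for this illustration.

```python
import random


def interpolate(x0, eps, t):
    """Eq. (1): point on the straight line between data x0 (t=0) and noise eps (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, eps)]


def target_velocity(x0, eps):
    """Regression target used in the text: direction from noise back to data."""
    return [a - b for a, b in zip(x0, eps)]


def euler_sample(velocity_fn, eps, steps):
    """Integrate from t=1 (noise) to t=0 (data) by stepping along the predicted velocity."""
    z, dt = list(eps), 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = velocity_fn(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


random.seed(0)
x0 = [0.5, -1.0, 2.0]                     # toy "clean latent"
eps = [random.gauss(0, 1) for _ in x0]    # Gaussian noise sample


def oracle(z, t):
    """Stand-in for v_theta: a trained network would predict this from (z_t, t, c)."""
    return target_velocity(x0, eps)


z_half = interpolate(x0, eps, 0.5)        # a noisy training input at t = 0.5
x_hat = euler_sample(oracle, eps, steps=4)
print([round(v, 6) for v in x_hat])       # recovers x0 up to floating-point error
```

With a perfect velocity field the path is a straight line, so even a handful of Euler steps lands on the clean latent. A learned field is only approximately straight, which is why step count still matters in practice.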

This is the same backbone formulation the open-source DiT ecosystem has converged on; the novelty of Tstars-Tryon is not in this equation but in what gets fed into $c$ and how the model is post-trained.

Training Pipeline

The paper describes a five-stage training paradigm. Each stage exists for a reason — none of them are decorative.

Stage A — General editing pretraining

Before the model ever sees try-on data, it is pretrained on a large-scale general image-editing dataset, balanced across tasks and content. The data engine itself is a non-trivial pipeline:

  • Element decomposition — cut out garments, accessories, and persons from in-the-wild photos.
  • Retrieval-based recall — assemble multi-item pairs by retrieving compatible elements.
  • Professional captioning — produce precise editing instructions.
  • VLM-based knowledge-enhanced filtering — filter out poor pairs.
  • Perceptual-metric screening — final quality gate.

Why pretrain on general editing? Because VTON is a special case of editing. The model first learns the broad skill of “follow a multi-image edit instruction”, then specializes.

Stage B — Progressive resolution

After base pretraining, resolution is gradually increased. This is the standard trick for stabilizing high-resolution synthesis without paying the compute cost of training at full resolution from scratch.

Stage C — Vertical SFT on try-on data

Now the model is fine-tuned on carefully curated try-on-specific data, balanced across the 8 categories and across reference counts. Comprehensive metrics monitor progress per category to prevent collapse — for example, you do not want shoe quality to drift while improving coats.
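The per-category monitoring can be pictured as a simple regression check between checkpoints. This is a minimal sketch only: the `regressed_categories` helper, the tolerance, and the scores are hypothetical, and the paper does not say which metrics or thresholds it uses.

```python
CATEGORIES = ["tops", "pants", "skirts", "dresses", "coats", "shoes", "bags", "hats"]


def regressed_categories(prev: dict[str, float], curr: dict[str, float],
                         tolerance: float = 0.05) -> list[str]:
    """Categories whose score dropped by more than `tolerance` since the previous checkpoint."""
    return [c for c in CATEGORIES if prev[c] - curr[c] > tolerance]


# Hypothetical scores: coats improved, shoes drifted down.
prev = {c: 8.0 for c in CATEGORIES}
curr = dict(prev, coats=8.4, shoes=7.7)
print(regressed_categories(prev, curr))  # ['shoes']
```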

Stage D — RL with multi-dimensional rewards (DiffusionNFT)

This is where Tstars-Tryon does something genuinely different. After SFT, the model is post-trained with reinforcement learning using a multi-dimensional reward across four product-relevant axes:

  • Garment Fidelity,
  • Identity Consistency,
  • Background Preservation,
  • Physical & Structural Logic.

For each conditioning, the policy samples $G$ trajectories, scores their final images, and computes a group-relative advantage in GRPO style:

$$ A_i = \frac{R_i - \mathrm{mean}\bigl(\{R_j\}_{j=1}^{G}\bigr)}{\mathrm{std}\bigl(\{R_j\}_{j=1}^{G}\bigr)}. \tag{2} $$

where:

  • $R_i$ — scalar reward of the $i$-th sampled trajectory (aggregated across the four reward dimensions),
  • $G$ — group size,
  • $A_i$ — group-relative advantage used to weight policy updates.

Interpretation: within the group, the trajectories that score above average get reinforced, and those below get pushed down. There is no separate value network. The actual policy update is performed via DiffusionNFT (Zheng et al. 2025), an online diffusion-RL method that operates through the forward process and — usefully — also yields a model that produces high-quality samples without classifier-free guidance.
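Equation (2) is easy to compute directly. The sketch below does so for one group of four samples. How the four reward dimensions are aggregated into $R_i$ is not specified in the source, so the plain average in `aggregate_reward` is an assumption, as are the population standard deviation, the small `eps` stabilizer, and the example scores.

```python
from statistics import mean, pstdev

DIMENSIONS = ("garment_fidelity", "identity", "background", "physics")


def aggregate_reward(scores: dict[str, float]) -> float:
    """Collapse the four reward dimensions into one scalar (plain average: an assumption)."""
    return mean(scores[d] for d in DIMENSIONS)


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Eq. (2): standardize each reward against its own group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


group = [
    {"garment_fidelity": 9.0, "identity": 9.5, "background": 9.5, "physics": 9.0},
    {"garment_fidelity": 6.0, "identity": 9.5, "background": 9.5, "physics": 9.0},  # e.g. a missing hat
    {"garment_fidelity": 8.5, "identity": 9.0, "background": 9.5, "physics": 8.0},
    {"garment_fidelity": 9.0, "identity": 9.5, "background": 9.0, "physics": 9.5},
]
rewards = [aggregate_reward(s) for s in group]
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages])
```

The sample with the garment omission gets the most negative advantage, while its better-scoring siblings get positive ones. No learned value baseline is involved.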

Why this matters: SFT alone cannot reliably fix systematic failure modes like “the model omits the hat when there are five other reference items”. A reward signal aimed at garment fidelity directly pushes against omissions.

Stage E — Inference distillation

A 5B DiT with full CFG and many sampling steps is far too slow for production. Tstars-Tryon compresses inference along two axes.

CFG distillation. Standard classifier-free guidance computes

$$ v_{\text{cfg}}(z_t, c) = v_\theta(z_t, \emptyset) + w\,\bigl(v_\theta(z_t, c) - v_\theta(z_t, \emptyset)\bigr), \tag{3} $$

which requires two forward passes per step (conditioned and unconditioned). A student velocity field $v_\phi$ is trained to match $v_{\text{cfg}}$ in a single pass:

$$ v_\phi(z_t, c) \;\approx\; v_{\text{cfg}}(z_t, c). $$
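A minimal numeric sketch of Eq. (3) and of one possible distillation objective. The velocities are toy three-element lists, and the mean-squared-error loss in `distill_loss` is an assumption: the source only says the student is trained to match $v_{\text{cfg}}$, not which loss is used.

```python
def cfg_velocity(v_cond, v_uncond, w):
    """Eq. (3): extrapolate from the unconditional toward the conditional prediction."""
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]


def distill_loss(v_student, v_teacher_cfg):
    """Mean squared error between the one-pass student and the two-pass CFG teacher (assumed loss)."""
    return sum((s - t) ** 2 for s, t in zip(v_student, v_teacher_cfg)) / len(v_student)


v_cond = [1.0, 0.0, -2.0]      # teacher output with conditioning c
v_uncond = [0.5, 0.5, -1.0]    # teacher output with empty conditioning
teacher = cfg_velocity(v_cond, v_uncond, w=4.0)
print(teacher)                                    # [2.5, -1.5, -5.0]

v_student = [2.0, -1.0, -5.0]  # hypothetical single-pass student prediction
print(round(distill_loss(v_student, teacher), 4))
```

Setting `w=1` returns the conditional prediction unchanged and `w=0` returns the unconditional one, which is a quick sanity check on the formula.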

Step distillation. A DMD-style distribution-matching distillation (Yin et al. 2024) compresses many denoising steps into a few, by aligning the student’s output distribution with the teacher’s.

Combined, CFG distillation + step distillation take the 5B DiT from “research-grade slow” to 3.92 s single-garment / 6.74 s multi-garment on H200. Open-source editing baselines at comparable quality are around 200 s, so the speedup is roughly 30–50× (200 / 6.74 ≈ 30, 200 / 3.92 ≈ 51).

Prompt rewriter (inference-time component)

A tailored model rewrites raw user prompts into precise editing instructions that explicitly say which reference image goes to which body region and what the layering relations are (“keep open, revealing inner layer”). This step matters: ordinary users do not write instructions in the format the editing DiT expects.

The full inference path looks like:

user prompt + reference images
      |
      v
  prompt rewriter
      |
      v
  text encoder + image tokenizer
      |
      v
  unified MMDiT (5B, distilled)
      |
      v
  output try-on image
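The diagram maps naturally onto a three-stage function chain. The sketch below is purely structural: `try_on` and the stand-in stages `rewrite`, `encode`, and `generate` are hypothetical names, and in the real system each stage would be a learned model rather than a stub.

```python
from typing import Callable, Sequence


def try_on(
    user_prompt: str,
    person_image: bytes,
    reference_images: Sequence[bytes],
    rewrite: Callable[[str, int], str],
    encode: Callable[[str, bytes, Sequence[bytes]], list],
    generate: Callable[[list], bytes],
) -> bytes:
    """Chain the stages from the diagram; each stage is injected as a callable."""
    if not 1 <= len(reference_images) <= 6:
        raise ValueError("expected 1 to 6 reference images")
    instruction = rewrite(user_prompt, len(reference_images))      # prompt rewriter
    tokens = encode(instruction, person_image, reference_images)   # text encoder + image tokenizer
    return generate(tokens)                                        # distilled MMDiT


# Stand-in stages so the sketch runs end to end.
def rewrite(prompt, n):
    return f"{prompt} [using {n} reference image(s)]"


def encode(text, person, refs):
    return [text, person, *refs]


def generate(tokens):
    return b"image:" + str(len(tokens)).encode()


result = try_on("try these on", b"person", [b"coat", b"shirt"], rewrite, encode, generate)
print(result)  # b'image:4'
```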

Tstars-VTON Benchmark and VLM Evaluation

The paper introduces a new benchmark, Tstars-VTON Bench, designed to actually capture what matters for a production VTON system.

| Property | Value |
| --- | --- |
| Paired samples | 1780 |
| Garment categories | 5 |
| Accessory categories | 3 |
| Sub-styles | 465 |
| Layered items per sample | 1–6 |
| Evaluation | VLM Likert 1–10 across 4 dimensions |

The four dimensions are Identity Consistency, Garment Fidelity, Background Preservation, and Physical & Structural Logic.

Geometric mean as overall score

Why not just average the four dimensions? Because for a real product, a model that scores 9.8 / 9.8 / 9.8 / 4.0 is not “almost perfect” — it is broken on physics. Tstars-Tryon uses a geometric mean:

$$ \text{Overall} = \left(\prod_{k=1}^{4} s_k\right)^{1/4}. \tag{4} $$

where $s_k$ is the score in dimension $k$ (1–10 Likert).

Interpretation: equivalently $\log \text{Overall} = \tfrac{1}{4}\sum_k \log s_k$. Any weak link drags the whole score down — there is no compensating a 4 with a 10. This forces “balanced excellence”, which is exactly what an industrial VTON product needs.
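The effect is easy to verify numerically. The sketch below applies Eq. (4) to the lopsided example from the text and to a hypothetical balanced model; the `overall_score` helper and the balanced scores are illustrative.

```python
import math


def overall_score(scores):
    """Eq. (4): geometric mean of the per-dimension scores, computed in log space."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


lopsided = [9.8, 9.8, 9.8, 4.0]   # strong on three axes, broken on physics
balanced = [8.3, 8.3, 8.3, 8.3]   # hypothetical evenly solid model

arithmetic_lopsided = sum(lopsided) / len(lopsided)
arithmetic_balanced = sum(balanced) / len(balanced)

print(round(arithmetic_lopsided, 2), round(overall_score(lopsided), 2))  # 8.35 7.83
print(round(arithmetic_balanced, 2), round(overall_score(balanced), 2))  # 8.3 8.3
```

Under the arithmetic mean the lopsided model ranks first (8.35 vs. 8.30). Under the geometric mean the order flips (7.83 vs. 8.30).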

Two-stage VLM judging

Identity scoring needs to see the garments: a silhouette change should be attributed to a bulky coat, not to identity drift. Background and physics scoring, on the other hand, gets distracted by garment context. The benchmark splits the VLM call:

  • Stage 1 — VLM sees person + reference garments + output, scores Identity Consistency and Garment Fidelity.
  • Stage 2 — VLM sees only person + output, scores Background Preservation and Physical & Structural Logic.

Two independent API calls, less context bias.
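A sketch of the split protocol, with the VLM replaced by a stub. The `two_stage_scores` function, the prompt strings, the score keys, and `fake_judge` are all hypothetical. The point is only which images each call is allowed to see.

```python
from typing import Callable, Mapping

Judge = Callable[[str, Mapping[str, object]], Mapping[str, float]]


def two_stage_scores(person, garments, output, judge: Judge) -> dict[str, float]:
    """Run two independent judge calls with different image sets and merge the scores."""
    stage1 = judge(
        "Score identity_consistency and garment_fidelity (1-10).",
        {"person": person, "garments": garments, "output": output},
    )
    stage2 = judge(
        "Score background_preservation and physical_logic (1-10).",
        {"person": person, "output": output},  # garments deliberately withheld
    )
    return {
        "identity_consistency": stage1["identity_consistency"],
        "garment_fidelity": stage1["garment_fidelity"],
        "background_preservation": stage2["background_preservation"],
        "physical_logic": stage2["physical_logic"],
    }


calls = []


def fake_judge(prompt, images):
    """Stand-in for a VLM API: records what it was shown and returns fixed scores."""
    calls.append(sorted(images))
    return {"identity_consistency": 9.0, "garment_fidelity": 8.0,
            "background_preservation": 9.5, "physical_logic": 8.5}


scores = two_stage_scores("person.png", ["coat.png", "hat.png"], "out.png", fake_judge)
print(calls)   # [['garments', 'output', 'person'], ['output', 'person']]
print(scores)
```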

Results

Single-garment Tstars-VTON

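| Model | Overall | ID Consistency | Garment Fidelity | Background | Physics | Other score (dimension unlabeled) |
| --- | --- | --- | --- | --- | --- | --- |
| Tstars-Tryon | 9.372 | 9.889 | 8.833 | 9.863 | 9.241 | – |
| Seedream5 lite | 9.301 | – | – | – | 9.343 | 8.639 |
| Nano Banana Pro | 9.229 | – | – | – | – | 8.598 |
| GPT-Image-1.5 | 8.892 | – | – | – | – | 8.563 |
| FireRed-Edit-1.1 | 8.863 | – | – | – | – | – |
| FLUX.2-klein-9B | 8.797 | – | – | – | – | – |
| FLUX.2-dev | 8.764 | – | – | – | – | – |
| QwenEdit-2511 | 8.121 | – | – | – | – | – |
| FastFit | 6.448 | – | – | – | – | – |
| CatVTON | 6.663 | – | – | – | – | – |
| Leffa | 6.048 | – | – | – | – | – |
| FitDiT | 5.152 | – | – | – | – | – |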

Tstars-Tryon takes the overall lead and the top spot in three of the four dimensions, losing only Physics by a small margin to Seedream5 lite. Note the gap to academic VTON specialists: at least 2.7 points in overall score (9.372 vs. 6.663 for CatVTON, the best of them).

Multi-garment Tstars-VTON

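| Model | Overall | ID | Garment | Background | Physics | Other scores (dimension unlabeled) |
| --- | --- | --- | --- | --- | --- | --- |
| Tstars-Tryon | 9.171 | 9.619 | 8.955 | 9.620 | 8.883 | – |
| Seedream5 lite | 8.914 | – | – | – | – | 9.272, 8.623, 9.525 |
| Nano Banana | 8.540 | – | – | – | – | – |
| GPT-Image-1.5 | 8.391 | – | – | – | – | 9.070 |
| FLUX.2-klein-9B | 8.161 | – | – | – | – | – |
| FLUX.2-dev | 7.775 | – | – | – | – | – |
| QwenEdit-2511 | 6.441 | – | – | – | – | – |
| FastFit | 6.039 | – | – | – | – | – |
| FireRed-Edit-1.1 | 4.822 | – | – | – | – | – |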

Multi-garment is where the gap really shows up. FireRed-Edit-1.1 and QwenEdit-2511 collapse hard — multiple garments break their identity and layering — while Tstars-Tryon barely budges from its single-garment performance.

Zero-shot academic benchmarks

Tstars-Tryon was not trained on VITON-HD or DressCode, yet evaluates competitively or best on them:

| Benchmark | Tstars-Tryon | FastFit | FitDiT | CatVTON | Leffa |
| --- | --- | --- | --- | --- | --- |
| VITON-HD unpaired (FID / KID) | 8.485 / 0.528 | 8.629 / 0.665 | 9.979 / 1.478 | 10.552 / 2.272 | 10.446 / 2.640 |
| DressCode unpaired (FID / KID) | 4.541 / 0.458 | 4.397 / 0.553 | 4.805 / 0.712 | 5.872 / 1.606 | 20.099 / 13.506 |

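Tstars-Tryon has the lowest KID on both benchmarks: about 17–21% below FastFit, 36–64% below FitDiT, and more than 70% below CatVTON and Leffa. FID is best on VITON-HD and second to FastFit on DressCode (4.541 vs. 4.397).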
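The relative KID margins can be recomputed from the table values. The short check below does that; the `kid_reduction` helper is introduced only for this illustration.

```python
# KID values copied from the zero-shot table (lower is better).
kid = {
    "VITON-HD": {"Tstars-Tryon": 0.528, "FastFit": 0.665, "FitDiT": 1.478,
                 "CatVTON": 2.272, "Leffa": 2.640},
    "DressCode": {"Tstars-Tryon": 0.458, "FastFit": 0.553, "FitDiT": 0.712,
                  "CatVTON": 1.606, "Leffa": 13.506},
}


def kid_reduction(benchmark: str, baseline: str) -> float:
    """Fractional KID reduction of Tstars-Tryon relative to a baseline."""
    ours = kid[benchmark]["Tstars-Tryon"]
    return 1.0 - ours / kid[benchmark][baseline]


for bench, row in kid.items():
    for name in row:
        if name != "Tstars-Tryon":
            print(f"{bench:9s} vs {name:8s}: {kid_reduction(bench, name):.0%} lower KID")
```

The margin over FastFit is modest on both datasets, while the margins over CatVTON and Leffa are large.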

Latency

| Setting | Tstars-Tryon (H200) | QwenEdit-2511 / FLUX.2-dev |
| --- | --- | --- |
| Single garment | 3.92 s | ~200 s |
| Multi-garment (5 refs avg) | 6.74 s | ~200 s |

A roughly 30–50× speedup, which is the difference between a research demo and a button in a shopping app.

Human evaluation (GSB)

Side-by-side human evaluation tells a consistent story. Against Nano Banana Pro, Tstars-Tryon wins 41.1%, ties 41.6%, loses 17.3% — and the win rate grows with garment count: 33.6% at one garment up to 54.8% at five. Against Seedream5 lite, the picture is even cleaner: 54.4% wins vs 9.0% losses, with the win rate climbing from 46.1% (1 garment) to 70.2% (5 garments).

The pattern is hard to miss: the more complex the outfit, the larger the lead.

Production deployment

Tstars-Tryon is the engine behind AI Try-On in the Taobao app, serving millions of users with tens of millions of requests — the paper claims this is the largest production-scale VTON deployment to date.

Why It Works

A few intuitions for why this stack ends up dominating both VTON specialists and frontier editors.

  • Editing > inpainting for in-the-wild data. Once you stop relying on a clean human-parsing mask, the failure surface shrinks dramatically. Extreme poses, anime avatars, and complex backgrounds become solvable.
  • Multi-image joint attention learns layering for free. A reference token for a coat and a reference token for an inner shirt can attend to the same body region in the noisy target; the model learns to associate text directives like “keep open” with attention patterns.
  • Multi-reward RL targets the actual product axes. SFT on curated data improves average quality but does not eliminate omissions. A reward that explicitly penalizes “the hat is missing” closes that loop.
  • Distillation is mandatory, not optional. 200 s per image is academically interesting but commercially irrelevant. Together, the CFG-free property of DiffusionNFT and DMD-style step distillation are what make a 5B model serveable.

Limitations

Honest list, mostly drawn from the paper’s own framing:

  • Closed system. Weights, training data, and the prompt rewriter are not released. Only the Tstars-VTON benchmark is planned for release.
  • Sparse formal exposition. The actual RL update rule, distillation losses, and exact MMDiT specifics are described at a high level by reference to prior work (DiffusionNFT, DMD, MMDiT, NaViT) rather than written out in full.
  • Closed-source baseline latencies via API. The ~200 s figures for closed-source competitors are based on API calls and may include network overhead — not strictly comparable.
  • VLM judge dependence. Even with the two-stage protocol, the benchmark inherits VLM idiosyncrasies and prompt-design sensitivity.
  • Geometric mean saturation. When all four scores are above 9, even the geometric mean can blur differences.
  • Privacy and likeness. Strong identity preservation in a consumer product raises questions the paper does not deeply discuss; the benchmark uses face anonymization, but production does not.
  • No formal ablation table. Stage-wise contributions (SFT vs RL vs distillation) are described qualitatively. It is hard to attribute precise quality gains.

Takeaways

  • VTON as multi-image editing, not masked inpainting, is the right reframing. A single MMDiT with 1–6 reference images and text conditioning replaces masks, warpers, and category-specific adapters.
  • A five-stage training pipeline (general editing pretrain → progressive resolution → SFT on vertical data → multi-reward DiffusionNFT RL → CFG + step distillation) is what makes a 5B DiT both high-quality and serveable in seconds.
  • A geometric-mean, two-stage VLM benchmark (Tstars-VTON) captures balanced quality far better than FID/KID and arithmetic averages.
  • On Tstars-VTON, the system beats specialized VTON baselines, top open-source editors, and proprietary closed-source models. Zero-shot on VITON-HD / DressCode, it matches or beats the VTON specialists. It does so at roughly 30–50× lower latency than open-source editors.
  • The system runs in production on Taobao “AI Try-On” at tens of millions of requests, which is the strongest validation a VTON paper can have right now.

The big-picture message: a single, well-trained editing transformer with the right post-training story can outperform both specialized academic VTON and frontier editing models simultaneously — and the bottleneck is no longer architecture, it is data engines, RL rewards, and inference distillation.

