Page 1 of 36

Video Virtual Try-on Challenge

- CVPR 2020 -

Andrew Jong1, Gaurav Kuppa1, Xin Liu2, Teng-Sheng Moh1, Ziwei Liu2

San Jose State University1, The Chinese University of Hong Kong2


Challenge Description


Competition Data

VVT Dataset

791 total videos

661 train

130 test

~240 frames per video at 29.97 FPS (~8 seconds per video)

Each training sample

1 cloth product image

~240 image frames per video

~240 corresponding COCO-18 keypoints
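The per-sample structure above can be sketched as a small container type. Field names and layout here are hypothetical; the real loader depends on how the VVT archive is unpacked.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Sketch of one VVT training sample as described above: one cloth
# product image, ~240 video frames, and one COCO-18 keypoint set
# (18 x/y/confidence triples) per frame.
@dataclass
class VVTSample:
    cloth_image: str                                    # path to the cloth product image
    frames: List[str]                                   # ~240 frame image paths
    keypoints: List[List[Tuple[float, float, float]]]   # 18 triples per frame

    def __post_init__(self) -> None:
        # every frame needs a matching keypoint annotation
        assert len(self.frames) == len(self.keypoints)

# ~240 frames at 29.97 FPS is roughly 8 seconds, as stated
print(round(240 / 29.97, 1))  # → 8.0
```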


Inference Task

Input

Person image

Skeleton Pose Sequence

Cloth Product image

Desired Output

Video: Cloth Product on Person following Pose Sequence

Infer on arbitrary cloth product, person, and pose inputs.


Our Approach


Initial Hypotheses

Break down into 2 subtasks

Person Repose

Try-on

Multiple ways to order subtasks for final result

Repose input image to video frames → Try-on on all reposed frames

Try-on on input image→ Repose tried-on to video frames

[Figure: example of each subtask: Subtask 1 (Repose): person + pose = reposed frame; Subtask 2 (Try-on): cloth + person = tried-on frame]
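The two orderings are just different compositions of the same two stages. A toy sketch, where `repose` and `tryon` are stand-in string functions rather than the real networks:

```python
# Stand-in subtask functions for illustration only:
# repose: person image + pose sequence -> list of reposed frames
# tryon:  frame + cloth product image  -> tried-on frame
def repose(person, poses):
    return [f"{person}@{p}" for p in poses]

def tryon(frame, cloth):
    return f"{cloth}+{frame}"

poses = ["pose0", "pose1"]

# Ordering 1: repose first, then try-on on every reposed frame
video_a = [tryon(f, "shirt") for f in repose("person", poses)]

# Ordering 2: try-on on the input image, then repose the result
video_b = repose(tryon("person", "shirt"), poses)

print(video_a)  # ['shirt+person@pose0', 'shirt+person@pose1']
print(video_b)  # ['shirt+person@pose0', 'shirt+person@pose1']
```

With these lossless toy functions the two orders coincide; with real networks each stage introduces artifacts, which is why the ordering decision matters.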


Hypothesized Pipeline Overview

[Pipeline diagram]

SUBTASK 1: REPOSE uses the Global Flow Local Attention Module

SUBTASK 2: TRY-ON uses the CP-VTON Geometric Matching Module, then the CP-VTON Try-On Module
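Assuming the three modules run once per frame, the hypothesized pipeline chains them as sketched below; the stub functions stand in for the actual networks.

```python
# Stubs standing in for the three neural modules in the diagram;
# each returns a tagged tuple so the processing order stays visible.
def gfla_repose(person, pose):
    return ("reposed", person, pose)      # SUBTASK 1: repose one frame

def gmm_warp(cloth, frame):
    return ("warped", cloth, frame)       # SUBTASK 2a: warp cloth onto frame

def tom_compose(warped, frame):
    return ("tryon", warped, frame)       # SUBTASK 2b: compose the final frame

def pipeline(person, cloth, pose_sequence):
    frames = []
    for pose in pose_sequence:
        frame = gfla_repose(person, pose)
        warped = gmm_warp(cloth, frame)
        frames.append(tom_compose(warped, frame))
    return frames

video = pipeline("person.jpg", "cloth.jpg", ["p0", "p1", "p2"])
print(len(video))  # → 3
```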


SUBTASK 1: REPOSE

Input

Person image

Skeleton Pose Sequence

Desired Output

Video: Person following Pose Sequence


Repose Attempt 1: “Guided-Pix2Pix”

AlBahar et al. 2019 (ICCV)

Pretrained checkpoint; inference results on VVT test

Image-based, so no temporal coherence across frames


Repose Attempt 2: “Global Flow Local Attention”

Global Flow Local Attention (GFLA)

Ren et al. 2020 (CVPR)

Claim: strong temporal coherence

Animation of purported results featured on the GFLA GitHub repository.


“Global Flow Local Attention” (GFLA) pretrained checkpoint on VVT test

Poor generalization


Decide to Train a Custom GFLA Checkpoint

Observation

VVT’s test keypoints are sometimes noisy

Zoomed-in samples not present in original GFLA dataset

Therefore: Retrain Global Flow on VVT competition dataset

Author recommended minimum 10,000 steps

We trained to 30,570 steps, batch size 8

Train time: 7 days on four Tesla V100s
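As a rough scale check on this run, treating each frame as one sample (an approximation: GFLA actually trains on source/target pose pairs, so this undercounts the combinations available):

```python
# Back-of-envelope scale check for the custom GFLA training run.
train_frames = 661 * 240        # ~240 frames in each of 661 training videos
samples_seen = 30_570 * 8       # 30,570 steps at batch size 8
epochs = samples_seen / train_frames
print(train_frames, samples_seen, round(epochs, 2))  # → 158640 244560 1.54
```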


GFLA model trained to 17,000 steps, inference on VVT TRAIN data


GFLA model trained to 30,570 steps, inference on VVT TRAIN data


GFLA model trained to 17,000 steps, inference on VVT TEST data


GFLA model trained to 30,570 steps, inference on VVT TEST data


GFLA Observations

Good quality on train, poor quality on test

Ghost artifacts

Texture details (low and high frequency) missing

User identity not preserved; in practice, identity can be recovered by copying the user's face data back in

However, slightly better (subjective) temporal coherence than Guided-Pix2Pix

Ran out of time to refine further

Repose subtask: settled on the GFLA model custom-trained to 30,570 steps

[Figure: Global Flow (top) vs. Guided Pix2Pix (bottom)]


Subtask Ordering

Multiple ways to order subtasks for final result

Repose input image to video frames → Try-on on all reposed frames

Try-on on input image → Repose tried-on image to video frames

Because GFLA's Repose outputs were poor, we must run Repose → Try-on: applying Try-on last keeps the final cloth texture from being degraded by the repose stage


SUBTASK 2: TRY-ON

Input

Cloth Product image

Person image

Desired Output

Cloth Product on Person


SUBTASK 2: TRY-ON

Best existing public repository: CP-VTON (Wang et al. 2018, ECCV)

Two modules

Geometric Matching Module

Warp cloth to person

Try-on Module

Compose cloth and person
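CP-VTON's Try-on Module blends the warped cloth into a rendered person image through a predicted composition mask: output = mask * warped_cloth + (1 - mask) * rendered_person. A toy 1-D illustration of that blend (real inputs are image tensors):

```python
# Final composition step of CP-VTON's Try-on Module, with toy 1-D
# lists standing in for real image tensors.
def compose(warped_cloth, rendered_person, mask):
    return [m * c + (1.0 - m) * r
            for c, r, m in zip(warped_cloth, rendered_person, mask)]

warped_cloth    = [1.0, 1.0, 1.0, 1.0]  # cloth pixels (white)
rendered_person = [0.0, 0.0, 0.0, 0.0]  # rendered person (black)
mask            = [1.0, 1.0, 0.0, 0.0]  # cloth occupies the left half

print(compose(warped_cloth, rendered_person, mask))  # → [1.0, 1.0, 0.0, 0.0]
```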

[Figure: CP-VTON pipeline: Geometric Matching Module (GMM), then Try-on Module (TOM)]

Initial Idea: Train with More Annotations

Hypothesis: annotated guidance would improve try-on understanding

Provide more accurate human representation that is invariant to occlusions

VIBE: Video Inference for Human Body Pose and Shape Estimation (Kocabas et al. 2020, CVPR)

Plan: generate VIBE annotations on the pose-synthesized output

Ran out of time to complete


Train CP-VTON with More Examples

Combined dataset

VITON, VVT (track 4), MPV (track 3)

GMM

33,150 steps, batch 128

TOM

3,800 steps, batch 128
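Pooling the three datasets can be as simple as concatenating their sample lists and shuffling; a sketch with placeholder entries, where real entries would be (cloth image, person image) training pairs:

```python
import random
from itertools import chain

# Placeholder sample lists standing in for the three datasets.
viton = [("viton", i) for i in range(3)]
vvt   = [("vvt", i) for i in range(2)]
mpv   = [("mpv", i) for i in range(2)]

# One combined pool, shuffled so every batch mixes all three sources.
combined = list(chain(viton, vvt, mpv))
random.Random(0).shuffle(combined)
print(len(combined))  # → 7
```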


Final Results

[Qualitative results: generated try-on videos on VVT test clips, referenced by clip ID]


Results Observations

Strengths

CP-VTON was the strongest contributor to “good” quality

Texture details present, cloth placed in correct location

Followed pose sequence

Try-on module reduced flickering

Weaknesses

Custom-trained “Global Flow Local Attention” provided extremely poor results

Ghost artifacts

Face not preserved

Especially poor on zoomed views

CP-VTON Geometric Matching has no concept of person’s orientation

Possibly addressed by MG-VTON (Dong et al. 2019)

Noisy pose points may have contributed to poor quality


Future Work for the Community

Common problem: overfitting

Are published results cherry-picked? There is no way to know

Need representative results

Need for quantitative quality metrics

Challenging task, perhaps investigate perceptual metrics?

Many papers, little code, unreproducible

Latest claims in virtual try-on have no public code

Best repository (CP-VTON) is already 2 years old

Please add your code to PapersWithCode “Virtual Try-on” task

Our code for this challenge

https://github.com/andrewjong/Global-Flow-Local-Attention-VTryon

https://github.com/andrewjong/cp-vton-modded
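As one concrete starting point for the quantitative metrics called for above, a per-frame PSNR harness is sketched below; a perceptual metric such as LPIPS would swap in a learned distance but keep the same frame-by-frame structure.

```python
import math

# Per-frame PSNR of a generated frame against a reference frame:
# one simple quantitative metric to start from.
def psnr(reference, generated, peak=1.0):
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0.0:
        return float("inf")          # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)

ref = [0.0, 0.5, 1.0, 0.5]           # toy 1-D "frames"
gen = [0.1, 0.5, 0.9, 0.5]
print(round(psnr(ref, gen), 2))      # → 23.01
```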


Thank you to all the organizers!
