Page 1 of 36
Video Virtual Try-on Challenge
- CVPR 2020 -
Andrew Jong1, Gaurav Kuppa1, Xin Liu2, Teng-Sheng Moh1, Ziwei Liu2
San Jose State University1, The Chinese University of Hong Kong2
Page 2 of 36
Challenge Description
Page 3 of 36
Competition Data
VVT Dataset
791 total videos
661 train
130 test
~240 frames per video at 29.97 FPS (~8 seconds per video)
Each training sample
1 cloth product image
~240 image frames per video
~240 corresponding COCO-18 keypoints
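The per-sample structure above can be sketched as a small data class (a minimal sketch; `VVTSample` and its field names are hypothetical, not the competition's actual loader):

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class VVTSample:
    """One VVT training sample, per the slide's description."""
    cloth_image: np.ndarray  # (H, W, 3) product photo of the garment
    frames: list             # ~240 video frames, each (H, W, 3)
    keypoints: list          # ~240 COCO-18 pose arrays, each (18, 3) = (x, y, confidence)

    def __post_init__(self):
        # Every frame must have a matching pose annotation.
        assert len(self.frames) == len(self.keypoints)

# Dataset statistics from the slide.
TOTAL_VIDEOS = 791
TRAIN_VIDEOS, TEST_VIDEOS = 661, 130
FPS = 29.97
FRAMES_PER_VIDEO = 240
assert TRAIN_VIDEOS + TEST_VIDEOS == TOTAL_VIDEOS
print(f"~{FRAMES_PER_VIDEO / FPS:.1f} s per video")  # prints "~8.0 s per video"
```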
Page 4 of 36
Inference Task
Input
Person image
Skeleton Pose Sequence
Cloth Product image
Desired Output
Video: Cloth Product on Person following Pose Sequence
Infer on arbitrary cloth product, person, and pose inputs.
Page 5 of 36
Our Approach
Page 6 of 36
Initial Hypotheses
Break down into 2 subtasks
Person Repose
Try-on
Multiple ways to order subtasks for final result
Repose input image to video frames → Try-on on all reposed frames
Try-on on input image → Repose the tried-on image to video frames
[Diagram: Subtask 1 (Repose); Subtask 2 (Try-on)]
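The two candidate orderings amount to composing the two subtask models in opposite orders. A sketch with identity placeholders standing in for the real models (`repose` and `try_on` are hypothetical names; in our pipeline they would be GFLA and CP-VTON):

```python
import numpy as np

def repose(person: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Placeholder for the repose model (GFLA in our pipeline)."""
    return person

def try_on(person: np.ndarray, cloth: np.ndarray) -> np.ndarray:
    """Placeholder for the try-on model (CP-VTON in our pipeline)."""
    return person

def repose_then_tryon(person, cloth, pose_seq):
    # Ordering 1: repose to every target pose, then run try-on per frame.
    return [try_on(repose(person, p), cloth) for p in pose_seq]

def tryon_then_repose(person, cloth, pose_seq):
    # Ordering 2: try-on once on the input image, then repose the result.
    dressed = try_on(person, cloth)
    return [repose(dressed, p) for p in pose_seq]
```

Either ordering yields one output frame per target pose; they differ in which model must cope with the other's artifacts.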
Page 7 of 36
Hypothesized Pipeline Overview
Global Flow Local Attention Module
CP-VTON Geometric Matching Module
CP-VTON Try-On Module
SUBTASK 1: REPOSE
SUBTASK 2: TRY-ON
Page 8 of 36
SUBTASK 1: REPOSE
Input
Person image
Skeleton Pose Sequence
Desired Output
Video: Person following Pose Sequence
Page 9 of 36
Repose Attempt 1: “Guided-Pix2Pix”
AlBahar et al. 2019 (ICCV)
Pretrained checkpoint, inference results on VVT test →
Image-based; no temporal coherence
Page 10 of 36
Repose Attempt 2: “Global Flow Local Attention”
Global Flow Local Attention (GFLA)
Ren et al. 2020 (CVPR)
Claim: strong temporal coherence
Animation of purported results featured on Global Flow Local Attention (GFLA) GitHub.
Page 11 of 36
“Global Flow Local Attention” (GFLA) pretrained checkpoint on VVT test
Poor generalization
Page 12 of 36
Decide to Train a Custom GFLA Checkpoint
Observations
VVT's test keypoints are sometimes noisy
Zoomed-in samples are not present in the original GFLA dataset
Therefore: retrain GFLA on the VVT competition dataset
The authors recommend a minimum of 10,000 training steps
We trained to 30,570 steps, batch size 8
Train time: 7 days on four Tesla V100s
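For a sense of scale, a back-of-the-envelope sketch of what those numbers imply (assuming one training sample per frame, which may not match GFLA's actual pair-sampling scheme):

```python
# How many passes over the VVT training frames do 30,570 steps cover?
steps, batch_size = 30_570, 8
train_frames = 661 * 240            # ~158,640 frames across the training videos
samples_seen = steps * batch_size   # 244,560 samples drawn during training
epochs = samples_seen / train_frames
print(f"~{epochs:.1f} epochs")      # prints "~1.5 epochs"
```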
Page 17 of 36
GFLA Observations
Good quality on train, but poor quality on test
Ghost artifacts
Texture details (both low- and high-frequency) missing
User identity not preserved
In practice, identity can be restored by copying the user's face data back in
However, slightly better (subjective) temporal coherence than Guided-Pix2Pix
Ran out of time to refine further
Repose subtask: settled on GFLA custom-trained to 30,570 steps
[Figure: Global Flow (top) vs. Guided-Pix2Pix (bottom)]
Page 18 of 36
Subtask Ordering
Multiple ways to order subtasks for final result
Repose input image to video frames → Try-on on all reposed frames
Try-on on input image → Repose the tried-on image to video frames
Because GFLA's reposed outputs are low quality, we must run Repose first and Try-on second (Repose → Try-on), so CP-VTON can restore cloth detail on the degraded frames
Page 19 of 36
SUBTASK 2: TRY-ON
Input
Cloth Product image
Person image
Desired Output
Cloth Product on Person
Page 20 of 36
SUBTASK 2: TRY-ON
Best existing public repository: “CP-VTON” Wang et al. 2018 (ECCV)
Two modules
Geometric Matching Module
Warp cloth to person
Try-on Module
Compose cloth and person
[Diagram: Geometric Matching Module (GMM) warps the cloth; Try-on Module (TOM) composes it with the person]
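The TOM's composition step, as described in the CP-VTON paper, blends the GMM-warped cloth into a coarse rendered person using a predicted composition mask, I_o = M ⊙ ĉ + (1 − M) ⊙ I_r. A minimal numpy sketch:

```python
import numpy as np

def compose_tryon(warped_cloth: np.ndarray,
                  rendered_person: np.ndarray,
                  mask: np.ndarray) -> np.ndarray:
    """Blend the GMM-warped cloth into the TOM's coarse person rendering.

    mask is the TOM's predicted composition mask in [0, 1]:
    1 keeps the warped-cloth pixel, 0 keeps the rendered-person pixel.
    """
    return mask * warped_cloth + (1.0 - mask) * rendered_person
```

With the mask at 1 everywhere the output is exactly the warped cloth, and at 0 it is the coarse rendering, so the network learns to keep cloth texture only where the garment actually sits.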
Page 21 of 36
Initial Idea: Train with More Annotations
Hypothesis: richer annotated guidance would improve try-on understanding
Provide a more accurate human representation that is invariant to occlusions
VIBE: Video Inference for Human Body Pose and Shape Estimation, Kocabas et al. 2020 (CVPR)
Plan: generate VIBE annotations on the pose-synthesized output
Ran out of time to complete
Page 22 of 36
Train CP-VTON with More Examples
Combined dataset
VITON, VVT (track 4), MPV (track 3)
GMM
33,150 steps, batch 128
TOM
3,800 steps, batch 128
Page 23 of 36
Hypothesized Pipeline Overview
Global Flow Local Attention Module
CP-VTON Geometric Matching Module
CP-VTON Try-On Module
SUBTASK 1: REPOSE
SUBTASK 2: TRY-ON
Page 25 of 36
Final Results
Page 31 of 36
[Qualitative result frames on VVT test samples]
Page 34 of 36
Results Observations
Strengths
CP-VTON was the strongest contributor to “good” quality
Texture details present, cloth placed in correct location
Followed pose sequence
Try-on module reduced flickering
Weaknesses
Custom-trained “Global Flow Local Attention” provided extremely poor results
Ghost artifacts
Face not preserved
Especially poor on zoomed views
CP-VTON's Geometric Matching Module has no concept of the person's orientation
Possibly addressed by MG-VTON (Dong et al. 2019)
Noisy pose points may have contributed to poor quality
Page 35 of 36
Future Work for the Community
Common problem: overfitting
Are published results cherry-picked? There is no way to know
Need representative results
Need for quantitative quality metrics
A challenging task; perhaps investigate perceptual metrics?
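As one cheap starting point for video quality, flicker can be measured as the mean absolute difference between consecutive frames (a naive sketch, not a substitute for learned perceptual metrics such as LPIPS):

```python
import numpy as np

def flicker_score(frames: np.ndarray) -> float:
    """Naive temporal-coherence proxy for a generated try-on video.

    frames: (T, H, W, C) array of floats in [0, 1].
    Returns the mean absolute change between consecutive frames:
    0 for a perfectly static video, up to 1 for maximal flicker.
    Note it also penalizes legitimate motion, which is why
    perceptual / flow-compensated metrics are the real research need.
    """
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return float(diffs.mean())
```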
Many papers, little code, unreproducible
Latest claims in virtual try-on have no public code
Best repository (CP-VTON) is already 2 years old
Please add your code to PapersWithCode “Virtual Try-on” task
Our code for this challenge
https://github.com/andrewjong/Global-Flow-Local-Attention-VTryon
https://github.com/andrewjong/cp-vton-modded
Page 36 of 36
Thank you to all the organizers!