publications | Hao Chen

2026

CVPR

Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, and 5 more authors

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

Website
ICLR

Tinker: Diffusion’s Gift to 3D–Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, and 1 more author

The Fourteenth International Conference on Learning Representations, 2026

Website
ICLR

Time is a feature: Exploiting temporal dynamics in diffusion language models

Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, and 4 more authors

The Fourteenth International Conference on Learning Representations, 2026

Website
AAAI

Odyssey: Open-world quadrupeds exploration and manipulation for long-horizon tasks

Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li, and 5 more authors

Proceedings of the AAAI Conference on Artificial Intelligence, 2026

Website
IJCV

FreerCustom: Training-Free Multi-Concept Customization for Image and Video Generation

Canyu Zhao, Ganggui Ding, Wen Wang, Zhen Yang, Zide Liu, and 2 more authors

International Journal of Computer Vision, 2026

2025

TPAMI

Diffusion Models are Efficient Data Generators for Human Mesh Recovery

Yongtao Ge, Wenjia Wang, Yongfan Chen, Fanzhou Wang, Lei Yang, and 2 more authors

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

DOI
IJCV

Segment Anything in Context with Vision Foundation Models

Yang Liu, Muzhi Zhu, Hao Chen, Xinlong Wang, Bo Feng, and 4 more authors

International Journal of Computer Vision, 2025

HTML
NeurIPS

Diception: A generalist diffusion model for visual perceptual tasks

Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, and 3 more authors

Advances in Neural Information Processing Systems, 2025

HTML Code Website
NeurIPS

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, and 4 more authors

Advances in Neural Information Processing Systems, 2025

HTML Code Website
ICCV

POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction

Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, and 2 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Code
ICCV

SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting

Zihui Gao, Jia-Wang Bian, Guosheng Lin, Hao Chen, and Chunhua Shen

In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

HTML
ICCV

Unified Open-World Segmentation with Multi-Modal Prompts

Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, and 5 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

HTML
SIGGRAPH

Generative Video Matting

Yongtao Ge, Kangyang Xie, Guangkai Xu, Li Ke, Mingyu Liu, and 4 more authors

In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, 2025

DOI Code
CVPR

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, and 3 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

Code Website
ICLR

Revisiting Convolution Architecture in the Realm of DNA Foundation Models

Yu Bo, Weian Mao, Yanjun Shao, Weiqiang Bai, Peng Ye, and 4 more authors

In The Thirteenth International Conference on Learning Representations, 2025

HTML Code
ICLR

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, and 3 more authors

In The Thirteenth International Conference on Learning Representations, 2025

Code
ICLR

Boltzmann-Aligned Inverse Folding Model as a Predictor of Mutational Effects on Protein-Protein Interactions

Xiaoran Jiao, Weian Mao, Wengong Jin, Peiyuan Yang, Hao Chen, and 1 more author

In The Thirteenth International Conference on Learning Representations, 2025

Abs Code

Predicting the change in binding free energy (\(\Delta\Delta G\)) is crucial for understanding and modulating protein-protein interactions, which are critical in drug design. Due to the scarcity of experimental \(\Delta\Delta G\) data, existing methods focus on pre-training, while alignment receives less attention. In this work, we propose the Boltzmann Alignment technique to transfer knowledge from pre-trained inverse folding models to \(\Delta\Delta G\) prediction. We begin by analyzing the thermodynamic definition of \(\Delta\Delta G\) and introducing the Boltzmann distribution to connect energy with protein conformational distribution. However, the protein conformational distribution is intractable; therefore, we employ Bayes’ theorem to circumvent direct estimation and instead utilize the log-likelihood provided by protein inverse folding models for \(\Delta\Delta G\) estimation. Compared to previous inverse folding-based methods, our method explicitly accounts for the unbound state of the protein complex in the \(\Delta\Delta G\) thermodynamic cycle, introducing a physical inductive bias and achieving both supervised and unsupervised state-of-the-art (SoTA) performance. Experimental results on SKEMPI v2 indicate that our method achieves Spearman coefficients of 0.3201 (unsupervised) and 0.5134 (supervised), significantly surpassing the previously reported SoTA values of 0.2632 and 0.4324, respectively. Furthermore, we demonstrate the capability of our method on binding energy prediction, protein-protein docking and antibody optimization tasks.
ICLR

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences

Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, and 3 more authors

In The Thirteenth International Conference on Learning Representations, 2025

Code Website
ICLR

Framer: Interactive Frame Interpolation

Wen Wang, Qiuyu Wang, Kecheng Zheng, Hao Ouyang, Zhekai Chen, and 4 more authors

In The Thirteenth International Conference on Learning Representations, 2025

Video Code Website

2024

3DV

LSSInst: Improving Geometric Modeling in LSS-Based BEV Perception with Instance Representation

Weijie Ma, Jingwei Jiang, Yang Yang, Zehui Chen, and Hao Chen

2024
ICML

Generative Active Learning for Long-tailed Instance Segmentation

Muzhi Zhu, Chengxiang Fan, Hao Chen, Yang Liu, Weian Mao, and 2 more authors

In Forty-First International Conference on Machine Learning, 2024

Abs Code

Recently, large-scale language-image generative models have gained widespread attention, and many works have utilized generated data from these models to further enhance the performance of perception tasks. However, not all generated data can positively impact downstream models, and these methods do not thoroughly explore how to better select and utilize generated data. On the other hand, there is still a lack of research oriented towards active learning on generated data. In this paper, we explore how to perform active learning specifically for generated data in the long-tailed instance segmentation task. Subsequently, we propose BSGAL, a new algorithm that estimates the contribution of the current batch-generated data based on a gradient cache. BSGAL is meticulously designed to cater for unlimited generated data and complex downstream segmentation tasks. BSGAL outperforms the baseline approach and effectively improves the performance of long-tailed segmentation.
ICML

Floating Anchor Diffusion Model for Multi-motif Scaffolding

Ke Liu, Weian Mao, Shuaike Shen, Xiaoran Jiao, Zheng Sun, and 2 more authors

In Forty-First International Conference on Machine Learning, 2024

Code
TPAMI

Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, and 5 more authors

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Abs DOI Video Website

We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, which can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, it ranks 1st in multiple zero-shot and standard benchmarks for metric depth and surface normal prediction. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models.
AAAI

DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation

Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, and 1 more author

In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

Code
NeurIPS

A Simple Image Segmentation Framework via In-Context Examples

Yang Liu, Chenchen Jing, Hengtao Li, Muzhi Zhu, Hao Chen, and 2 more authors

Advances in Neural Information Processing Systems, 2024

Code
NeurIPS

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

Muzhi Zhu, Yang Liu, Zekai Luo, Chenchen Jing, Hao Chen, and 3 more authors

Advances in Neural Information Processing Systems, 2024

Code
AAAI

Revisiting Open-Set Panoptic Segmentation

Yufei Yin, Hao Chen, Wengang Zhou, Jiajun Deng, Haiming Xu, and 1 more author

Proceedings of the AAAI Conference on Artificial Intelligence, 2024

DOI
AAAI

Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning

Chenchen Jing, Yukun Li, Hao Chen, and Chunhua Shen

Proceedings of the AAAI Conference on Artificial Intelligence, 2024

DOI Code
LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning

Mingyang Zhang, Hao Chen, Chunhua Shen, Zhen Yang, Linlin Ou, and 2 more authors

In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024

Abs DOI

Large Language Models (LLMs), such as LLaMA and T5, have shown exceptional performance across various tasks through fine-tuning. Although low-rank adaptation (LoRA) has emerged to cheaply fine-tune these LLMs on downstream tasks, their deployment is still hindered by the vast model scale and computational costs. Post-training model pruning offers a way to compress LLMs. However, the current pruning methods designed for LLMs are not compatible with LoRA. This is due to their utilization of unstructured pruning on LLMs, impeding the merging of LoRA weights, or their dependence on the gradients of pre-trained weights to guide pruning, which can impose significant memory overhead. To this end, we propose LoRAPrune, a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner. Specifically, we first design a LoRA-guided pruning criterion, which uses the weights and gradients of LoRA, rather than the gradients of pre-trained weights, for importance estimation. We subsequently integrate this criterion into an iterative pruning process, effectively removing redundant channels and heads. Extensive experimental results demonstrate the superior performance of our LoRAPrune over existing approaches on the LLaMA series models. At a 50% compression rate, LoRAPrune demonstrates superior performance over LLM-Pruner, achieving a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%. Besides, LoRAPrune also matches semi-structural pruning across multiple LLMs, proving its wide applicability. The code is available at https://github.com/aim-uofa/LoRAPrune.
ECCV

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

Zhekai Chen, Wen Wang, Zhen Yang, Zeqing Yuan, Hao Chen, and 1 more author

In The 17th European Conference on Computer Vision ECCV 2024, Aug 2024

Code
CVPR

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, and 2 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2024

Code Website
CVPR

DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data

Chengxiang Fan, Muzhi Zhu, Hao Chen, Yang Liu, Weijia Wu, and 2 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2024

Code
ICLR

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, and 1 more author

In The Twelfth International Conference on Learning Representations, Aug 2024

Code
IJCV

AutoStory: Generating Diverse Storytelling Images with Minimal Human Efforts

Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, and 1 more author

International Journal of Computer Vision, Aug 2024

DOI Code Website
ICLR

De Novo Protein Design Using Geometric Vector Field Networks

Weian Mao, Muzhi Zhu, Zheng Sun, Shuaike Shen, Lin Yuanbo Wu, and 2 more authors

In The Twelfth International Conference on Learning Representations, Aug 2024

Abs

Advances like protein diffusion have marked revolutionary progress in de novo protein design, a central topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Only a few basic encoders, like IPA, have been proposed for this scenario, exposing the frame modeling as a bottleneck. In this work, we introduce the Vector Field Network (VFN), which enables network layers to perform learnable vector computations between coordinates of frame-anchored virtual atoms, thus achieving a higher capability for modeling frames. The vector computation operates in a manner similar to a linear layer, with each input channel receiving 3D virtual atom coordinates instead of scalar values. The multiple feature vectors output by the vector computation are then used to update the residue representations and virtual atom coordinates via attention aggregation. Remarkably, VFN also excels in modeling both frames and atoms, as the real atoms can be treated as the virtual atoms for modeling, positioning VFN as a potential universal encoder. In protein diffusion (frame modeling), VFN exhibits an impressive performance advantage over IPA, excelling in terms of both designability (67.04% vs. 53.58%) and diversity (66.54% vs. 51.98%). In inverse folding (frame and atom modeling), VFN outperforms the previous SoTA model, PiFold (54.7% vs. 51.66%), on sequence recovery rate; we also propose a method of equipping VFN with the ESM model, which surpasses the previous ESM-based SoTA, LM-Design, by a substantial margin (62.67% vs. 55.65%). Code is available at https://github.com/aim-uofa/VFN
ICLR

What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?

Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, and 3 more authors

In The Thirteenth International Conference on Learning Representations, Aug 2024

Code

2023

NeurIPS

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, and 4 more authors

Advances in Neural Information Processing Systems, Aug 2023

Code Website
CVPR

Learning To Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes

Rui Li, Dong Gong, Wei Yin, Hao Chen, Yu Zhu, and 4 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2023
CVPR

Learning Conditional Attributes for Compositional Zero-Shot Learning

Qingsheng Wang, Lingqiao Liu, Chenchen Jing, Hao Chen, Guoqiang Liang, and 2 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2023
IJCV

A Dynamic Feature Interaction Framework for Multi-task Visual Perception

Yuling Xi, Hao Chen, Ning Wang, Peng Wang, Yanning Zhang, and 2 more authors

International Journal of Computer Vision, Aug 2023

DOI
ICCV

FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models

Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai Cheng, and 1 more author

In Proceedings of the IEEE/CVF International Conference on Computer Vision, Aug 2023

Code Website
ICLR

Object-Aware Inversion and Reassembly for Image Editing

Zhen Yang, Ganggui Ding, Wen Wang, Hao Chen, Bohan Zhuang, and 1 more author

In The Twelfth International Conference on Learning Representations, Aug 2023

Video Code Website
ICCV

CTVIS: Consistent Training for Online Video Instance Segmentation

Kaining Ying, Qing Zhong, Weian Mao, Zhenhua Wang, Hao Chen, and 5 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision, Aug 2023
ICCV

Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image

Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, and 3 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision, Aug 2023

Awarded Code

Champion in the CVPR 2023 Monocular Depth Estimation Challenge
ICCV

SegPrompt: Boosting Open-World Segmentation via Category-Level Prompt Learning

Muzhi Zhu, Hengtao Li, Hao Chen, Chengxiang Fan, Weian Mao, and 3 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision, Aug 2023

2022

TPAMI

Instance and panoptic segmentation using conditional convolutions

Zhi Tian, Bowen Zhang, Hao Chen, and Chunhua Shen

IEEE Transactions on Pattern Analysis and Machine Intelligence, Aug 2022

2021

TPAMI

ABCNet v2: Adaptive Bezier-curve network for real-time end-to-end text spotting

Yuliang Liu, Chunhua Shen, Lianwen Jin, Tong He, Peng Chen, and 2 more authors

IEEE Transactions on Pattern Analysis and Machine Intelligence, Aug 2021
IJCV

Exploring the capacity of an orderless box discretization network for multi-orientation scene text detection

Yuliang Liu, Tong He, Hao Chen, Xinyu Wang, Canjie Luo, and 3 more authors

International Journal of Computer Vision, Aug 2021
CVPR

BoxInst: High-performance instance segmentation with box annotations

Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2021
CVPR

Generic perceptual loss for modeling structured output dependencies

Yifan Liu, Hao Chen, Yu Chen, Wei Yin, and Chunhua Shen

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2021

2020

TPAMI

FCOS: A simple and strong anchor-free object detector

Zhi Tian, Chunhua Shen, Hao Chen, and Tong He

IEEE Transactions on Pattern Analysis and Machine Intelligence, Aug 2020
ECCV

Conditional convolutions for instance segmentation

Zhi Tian, Chunhua Shen, and Hao Chen

In European Conference on Computer Vision, Aug 2020
CVPR

BlendMask: Top-down meets bottom-up for instance segmentation

Hao Chen, Kunyang Sun, Zhi Tian, Chunhua Shen, Yongming Huang, and 1 more author

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2020

Code
CVPR

ABCNet: Real-time scene text spotting with adaptive Bezier-curve network

Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, and 1 more author

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2020

Code
CVPR

NAS-FCOS: Fast neural architecture search for object detection

Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, and 2 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2020
CVPR

Memory-efficient hierarchical neural architecture search for image denoising

Haokui Zhang, Ying Li, Hao Chen, and Chunhua Shen

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2020
WACV

Architecture search of dynamic cells for semantic video segmentation

Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid

In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Aug 2020

2019

ICCV

FCOS: Fully Convolutional One-Stage Object Detection

Zhi Tian, Chunhua Shen, Hao Chen, and Tong He

In Proceedings of the IEEE/CVF International Conference on Computer Vision, Aug 2019

Code
IJCAI

Light-Weight Hybrid Convolutional Network for Liver Tumor Segmentation

Jianpeng Zhang, Yutong Xie, Pingping Zhang, Hao Chen, Yong Xia, and 1 more author

In IJCAI, Aug 2019
CVPR

Fast neural architecture search of compact semantic segmentation models via auxiliary cells

Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Aug 2019
TPAMI

Adversarial learning of structure-aware fully convolutional networks for landmark localization

Yu Chen, Chunhua Shen, Hao Chen, Xiu-Shen Wei, Lingqiao Liu, and 1 more author

IEEE Transactions on Pattern Analysis and Machine Intelligence, Aug 2019