UV-CoT
UV-CoT removes the need for manual region annotation. Given an input image, the model automatically generates initial (seed) bounding boxes and answers questions based on these regions. An evaluator multi-modal LLM (MLLM) then scores the answers, serving as an indirect measure of region quality. Finally, the target model is optimized via preference optimization, encouraging it to favor regions whose answers score higher.
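The pipeline above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: `generate_seed_boxes`, `evaluator_score` inputs, and `build_preference_pairs` are hypothetical names, the evaluator MLLM is replaced by a plain score list, and the preference-optimization step is shown as a generic DPO-style loss (one common instantiation of preference optimization).

```python
import math
import random

def generate_seed_boxes(image_size, num_boxes=4, seed=0):
    """Sample candidate (x, y, w, h) regions over the image.
    Stands in for the model's own seed bounding-box proposals."""
    rng = random.Random(seed)
    W, H = image_size
    boxes = []
    for _ in range(num_boxes):
        w, h = rng.randint(W // 4, W // 2), rng.randint(H // 4, H // 2)
        x, y = rng.randint(0, W - w), rng.randint(0, H - h)
        boxes.append((x, y, w, h))
    return boxes

def build_preference_pairs(boxes, scores):
    """Form one preference pair per question: the region whose answer
    the evaluator scored highest is 'preferred', the lowest-scoring
    region is 'dis-preferred'. `scores` stands in for the evaluator
    MLLM's ratings of each region's answer."""
    ranked = sorted(zip(scores, boxes), reverse=True)
    best_score, best_box = ranked[0]
    worst_score, worst_box = ranked[-1]
    return {"preferred": best_box,
            "dispreferred": worst_box,
            "margin": best_score - worst_score}

def dpo_loss(logp_pref, logp_dis, ref_logp_pref, ref_logp_dis, beta=0.1):
    """DPO-style preference objective: increase the policy's likelihood
    of the preferred region's answer relative to the dis-preferred one,
    measured against a frozen reference model."""
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_dis - ref_logp_dis))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

With dummy evaluator scores `[0.2, 0.9, 0.5]`, the second box becomes the preferred region and the first the dis-preferred one; the loss then rewards the policy for assigning higher likelihood to the answer grounded in the preferred box.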
Visualization of preference data generated by our method. Preferred bounding boxes are shown in red; dis-preferred bounding boxes are shown in blue.
Visualization of UV-CoT inference. Model-generated bounding boxes are shown in red.
@misc{zhao2025unsupervisedvisualchainofthoughtreasoning,
  title={Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization},
  author={Kesen Zhao and Beier Zhu and Qianru Sun and Hanwang Zhang},
  year={2025},
  eprint={2504.18397},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.18397},
}