InterSketch: Interleaved Visual–Textual Chain-of-Thought with Tool-Generated Sketches and Reinforcement Learning
Under review at the International Conference on Machine Learning (ICML), 2026
This work studies long-horizon interleaved visual–textual reasoning. It introduces InterSketch, an agentic vision–language framework that combines tool-generated intermediate sketches with reinforcement learning to support multi-step reasoning.
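To make the interleaved reasoning loop concrete, the sketch below shows one way such an agentic loop can be structured: the model alternates between textual reasoning steps and tool calls that render intermediate sketches, which are fed back into the visual context. All names here (`Step`, `propose_step`, `render_sketch`) are hypothetical stand-ins for illustration, not the InterSketch API.

```python
# Illustrative only: a minimal agentic loop for interleaved visual-textual
# chain-of-thought with tool-generated sketches. Names are hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str      # "text" for a reasoning step, "sketch" for a tool call
    content: str   # reasoning text, or a drawing command for the tool

def propose_step(context: list) -> Step:
    # Stand-in for the vision-language model: given the interleaved context
    # (text steps and sketch images so far), propose the next step.
    if not any(item[0] == "image" for item in context):
        return Step("sketch", "draw_auxiliary_line(A, midpoint(B, C))")
    return Step("text", "The auxiliary line bisects the angle, so ANSWER: 30")

def render_sketch(command: str) -> bytes:
    # Stand-in for the external sketch tool: command -> rendered image.
    return f"<png rendered from: {command}>".encode()

def solve(question: str, max_steps: int = 8) -> str:
    context = [("text", question)]
    for _ in range(max_steps):
        step = propose_step(context)
        if step.kind == "sketch":
            # Tool call: render the intermediate sketch and feed it back
            # into the context as a visual reasoning step.
            context.append(("image", render_sketch(step.content)))
        else:
            context.append(("text", step.content))
            if "ANSWER:" in step.content:
                return step.content
    return "no answer within budget"

print(solve("In triangle ABC, ..."))
```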
My contribution focused on unified understanding–generation models for interleaved image–text reasoning: I built scalable data-transformation pipelines that produce interleaved multimodal supervision, and ran controlled evaluation experiments to analyze multi-step reasoning behavior on visual reasoning benchmarks.
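As an illustration of the data side, here is a hedged sketch of one such transformation: flattening raw records (a question plus solution steps, some with attached figures) into interleaved image/text supervision sequences. The record schema and field names are assumptions for illustration, not the pipeline's actual format.

```python
# A minimal sketch, assuming a hypothetical record schema: converts raw
# solution records into interleaved image/text supervision sequences.
from typing import Iterator

def to_interleaved(record: dict) -> list:
    """Flatten one record into an ordered list of modality-tagged chunks."""
    sequence = [{"type": "text", "value": record["question"]}]
    for step in record["solution_steps"]:
        if step.get("figure_path"):
            # An intermediate figure becomes a visual span in the sequence.
            sequence.append({"type": "image", "value": step["figure_path"]})
        sequence.append({"type": "text", "value": step["text"]})
    return sequence

def pipeline(records: Iterator[dict]) -> Iterator[list]:
    for record in records:
        seq = to_interleaved(record)
        # Keep only genuinely interleaved examples (at least one image
        # among the text steps) for multi-step supervision.
        if any(chunk["type"] == "image" for chunk in seq):
            yield seq

example = {
    "question": "Fold the net into a cube; which face is opposite A?",
    "solution_steps": [
        {"text": "Sketch the fold:", "figure_path": "figs/fold_step1.png"},
        {"text": "Face A ends up opposite face D."},
    ],
}
for seq in pipeline(iter([example])):
    print(seq)
```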
Recommended citation: Z. Ning, W. Tong, X. Kong, S. Ma, Z. Shang, J. Ni, T. Hu, Y. X. Chng, J. Ying, Z. Wu, J. Yang, W. Liu, H. Deng, L. Lu. "InterSketch: Interleaved Visual–Textual Chain-of-Thought with Tool-Generated Sketches and Reinforcement Learning." ICML 2026, under review.
