About
I am Jixuan Ying, a junior undergraduate at Tsinghua University and a research intern in the Multimodal Large Model Group at SenseTime Research.
My research focuses on multimodal reasoning, especially interleaved image-text reasoning in unified understanding-generation models. More broadly, I am interested in vision-language models, multimodal Chain-of-Thought, and visual reasoning. I also work on efficient visual architectures and linear attention for vision and generation.
Recently, I have been working on interleaved multimodal reasoning, particularly multimodal Chain-of-Thought, context confusion in visual-text settings, and how higher-quality interleaved supervision and training strategies can improve reasoning robustness and generalization.
Links:
- Google Scholar
- Email: yingjx23@mails.tsinghua.edu.cn
