Research
My work sits at the intersection of reinforcement learning and large language models, and roughly alternates between two modes:
- Understanding how RL works for Foundation Models — studying empirical regularities behind RL post-training, such as how to spend compute well and when reasoning transfers across domains (IsoCompute, Guru).
- Methods for RLVR training — designing algorithms for reward shaping, credit assignment, and more stable optimization in reasoning RL (TIPS).
Going forward, I am also drawn to world models and planning: learning dynamics predictors from interaction and coupling them with RL for search, planning, and data-efficient control.
Publications and preprints
Papers sorted by recency. Representative papers are highlighted.
Zhoujun Cheng*, Yutao Xie*, Yuxiao Qu*, Amrith Setlur*, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor Killian, Aviral Kumar
International Conference on Machine Learning (ICML), 2026
blog / arXiv / bibtex
A compute-optimal workflow for scaling on-policy RL sampling in LLM post-training.
Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Erran Li, Xiaolong Wang
International Conference on Learning Representations (ICLR), 2026
arXiv / code / bibtex
A self-distillation method that uses a mutual-information-based reward for finer-grained credit assignment in RLVR.
Institute of Foundation Models, MBZUAI (31 authors, including Yutao Xie)
ArXiv Preprint, 2025 (Tech Report / Model Release)
project / models / arXiv / bibtex
K2-Think is a 32B model that rivals much larger ones by combining six technical pillars across training and inference.
Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
Conference on Neural Information Processing Systems (NeurIPS) Datasets & Benchmarks, 2025
arXiv / website / bibtex
A cross-domain RL dataset with analysis showing domain-dependent RL gains across six reasoning domains.
Yutao Xie, Jun Wang, Cheng Chen, Taixin Yin, Shiyu Yang, Zhiyuan Li, Ye Zhang, Juyang Ke, Le Song, Lin Gan
Information Processing & Management (IPM), 2024
journal / bibtex
A DDP inference system equipped with an AST-based detector to support monitoring in smart farms.
Misc
Outside research, the same habits follow me into everything else I spend time with. I am simply drawn to what people make—and what making things reveals about them.
Away from the desk I lift weights, boulder badly as a beginner, and spend more time than is reasonable dialing in pour-over coffee.
What follows is not a full list, and it is not in any special order. I just wrote down a few things that came to mind while I was putting this page together.
A few favorites
Magic Realism
One Hundred Years of Solitude and most of García Márquez's works
Existentialism
Steppenwolf
Lyrical Introspection
Klingsor's Last Summer
Historical Narrative
Hamilton
Prog-rock Folk
冀西南林路行
Post-rock
Lost in 21st Century and most of Wangwen (惘闻)
Atmospheric Black Metal
Gu Yan · Zuriaake
Psychedelic Concept
The Dark Side of the Moon
Environmental Storytelling ARPG
Dark Souls / Elden Ring
Philosophical CRPG
Disco Elysium / Planescape: Torment
Historical Mystery
Pentiment