I am currently an expert researcher on the Tencent Hunyuan LLM team, working on pretraining data reformulation. I received my Ph.D. from the University of Chinese Academy of Sciences (UCAS), advised by Prof. Songlin Hu, where my research focused on Natural Language Processing. Previously, I worked at XiaoHongShu, Kwai, and Baidu.
Research Interests
I am committed to long-term, in-depth research on pretraining data reformulation, including:
- Long-context Data Reformulation — Synthesizing long-context training data to enhance long-range modeling of LLMs.
- Knowledge Data Reformulation — Creating diverse entry points (narratives, contexts, co-occurring entities) for the same knowledge, so that long-tail facts are no longer long-tail in pretraining corpora.
- Long Agentic Data Reformulation — My latest focus. Agentic post-training struggles to generalize, so the key lies in agentic pretraining. Because environments (e.g., GPU clusters, distributed systems, commercial platforms) are the hardest component to scale, pretraining should lower the activation cost of post-training by making correct paths easier to sample.
I also explore scalable oversight — what do we do when humans can no longer effectively supervise AI? Feel free to reach out to discuss or collaborate (wuxing@iie.ac.cn).
Work Experience
- Baidu · PaddlePaddle — Text Pretraining (2019.08 — 2020.07)
- Kwai · MMU — Text & Multimodal Pretraining (2020.08 — 2023.07)
- XiaoHongShu · HiLab — Long-context Pretraining (2023.08 — 2025.11)
- Tencent · HunYuan — Data Reformulation (2025.12 — Now)
News
- 🎉🎉🎉 NextLong and EntropyLong validated on Tencent HY 3.0, Zhipu GLM-5, and XiaoHongShu Dots, demonstrating consistent improvements across production LLMs.
- 2026.01 EntropyLong accepted as an ICLR 2026 conference paper!
- 2025.11 LiteLong accepted as an AAAI 2026 conference paper!
- 2025.09 LongMagpie accepted as a NeurIPS 2025 conference paper!
- 2025.05 NextLong accepted as an ICML 2025 conference paper!
- 2025.01 Quest accepted as an ICLR 2025 conference paper!
Selected Publications
ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding
Conditional BERT Contextual Augmentation
For a full list of publications, please visit my Google Scholar profile.