I am currently an expert researcher on the Tencent Hunyuan LLM team, working on pretraining data reformulation. I received my Ph.D. from the University of Chinese Academy of Sciences (UCAS), advised by Prof. Songlin Hu, where my research focused on Natural Language Processing. Previously, I worked at XiaoHongShu, Kwai, and Baidu.

Research Interests

I am committed to long-term, in-depth research on pretraining data reformulation, including:

  • Long-context Data Reformulation — Synthesizing long-context training data to enhance the long-range dependency modeling of LLMs.
  • Knowledge Data Reformulation — Creating diverse entry points (narratives, contexts, co-occurring entities) for the same knowledge, so that long-tail facts are no longer long-tail in pretraining corpora.
  • Long Agentic Data Reformulation — My latest focus. Agentic post-training struggles to generalize; the key is agentic pretraining. Since environments (e.g., GPU clusters, distributed systems, commercial platforms) are the hardest component to scale, pretraining should lower the activation cost of post-training, making correct paths easier to sample.

I also explore scalable oversight: what do we do when humans can no longer effectively supervise AI? Feel free to reach out to discuss or collaborate (wuxing@iie.ac.cn).

Work Experience

  • Baidu · PaddlePaddle — Text Pretraining (2019.08 — 2020.07)
  • Kwai · MMU — Text & Multimodal Pretraining (2020.08 — 2023.07)
  • XiaoHongShu · HiLab — Long-context Pretraining (2023.08 — 2025.11)
  • Tencent · Hunyuan — Data Reformulation (2025.12 — Present)

News

  • 🎉🎉🎉 NextLong and EntropyLong validated on Tencent HY 3.0, Zhipu GLM-5, and XiaoHongShu Dots, demonstrating consistent improvements across production LLMs.
  • 2026.01 EntropyLong accepted as ICLR 2026 conference paper!
  • 2025.11 LiteLong accepted as AAAI 2026 conference paper!
  • 2025.09 LongMagpie accepted as NeurIPS 2025 conference paper!
  • 2025.05 NextLong accepted as ICML 2025 conference paper!
  • 2025.01 Quest accepted as ICLR 2025 conference paper!

Selected Publications

PolicyLong: Towards On-Policy Context Extension
Correspondence to Xing Wu · NEW · Under Review
EntropyLong: Effective Long-Context Training via Predictive Uncertainty
Correspondence to Xing Wu · ICLR 2026 · Validated on GLM-5
NextLong: Toward Effective Long-Context Training without Long Documents
Correspondence to Xing Wu · ICML 2025 · Validated on HY 3.0, GLM-5, Dots
ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding
Xing Wu, et al. · COLING 2022 · 200+ Citations
Conditional BERT Contextual Augmentation
Xing Wu, et al. · ICCS 2019 · 500+ Citations

For a full list of publications, please visit my Google Scholar profile.

Selected Blogs