I am an incoming Ph.D. student at Carnegie Mellon University (CMU), affiliated with the Language Technologies Institute (LTI) in the School of Computer Science (SCS), and the newly founded Institute for Computer-Aided Reasoning in Mathematics (ICARM). I am working on verifiable reasoning agents at Apodex, mentored by Dr. Kaiyu Yang.
I received my B.S. in Computer Science from California Institute of Technology (Caltech), with a minor in Robotics, advised by Prof. Steven Low and Prof. Günter Niemeyer.
I study machine reasoning, in both formal and informal worlds: how AI systems can derive conclusions, act on them, evolve as tasks drift, and remain reliable when settings become complex.
(1) Formal reasoning. In mathematics, code, and other formalizable domains, verifiability gives us a rare foundation for building dependable reasoning systems. I develop models and infrastructure [LeanDojo][Lean Copilot] that form a bidirectional loop between formal environments and ML systems, enabling efficient human-AI collaboration in theorem proving [Human-AI Formalization]. Building on this loop, I move beyond static models toward adaptive agents that evolve with growing libraries [LeanAgent] and expanding contexts [LeanProgress].
(2) Informal reasoning. In open-ended natural-language settings, I continue to study how agents adapt [Adaptation][Steer2Adapt] and generalize [Generalization][Idiom]. These goals, however, become harder to pursue with a lack of symbolic verifiers or reliable reward signals. One way forward is to develop agentic evaluations that provide rich behavioral signals while remaining meaningful across diverse scenarios [Interactive Evaluation][Personality Illusion][Rethinking Psychometrics]. Then at runtime, the complementary challenge is to equip reasoning systems to identify, mitigate, and resist failures as they inevitably arise [Reasoning Failures][A-Not-B].
Earlier, I also worked on making reasoning systems more efficient through novel architecture design [Delay Space][DelayNet].
My long-term research goal is to build AI systems whose reasoning is as creative as human intuition and as dependable as formal logic.
AI evaluation is increasingly moving beyond static responses toward systems that act through tools, environments, users, and other agents. But the field risks adding interaction faster than it develops the scientific foundations for evaluating interaction. We argue that interactive evaluation needs a design science, not just more agent benchmarks. We define interactive evaluation, propose a 2-axis taxonomy, and outline principles for designing evaluations that are interpretable, comparable, and scientifically useful.
Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports reliably predict behavior. We compare broad Big 5 traits with the Theory of Planned Behavior, which targets intentions toward specific behaviors and better predicts when self-reports translate into action.
We conduct a mixed-methods analysis into the initial impact of AI on people's formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice.
Steer2Adapt adapts LLMs by composing reusable steering vectors instead of learning new task-specific directions from scratch. With only a handful of examples, it dynamically discovers combinations of semantic basis vectors for reasoning and safety tasks, yielding data-efficient, stable, and transparent inference-time adaptation.
We present the first comprehensive survey dedicated to reasoning failures in LLMs. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities.
LLMs say they have personalities, but they don’t act like it. Alignment today shapes language, not behavior. This linguistic–behavioral dissociation cautions against equating coherent self-reports with cognitive depth.
We introduce Lean Copilot, a framework for running neural network inference directly in Lean. It enables various LLM-based proof automation tools that integrate seamlessly into the workflow of Lean users, including tools for suggesting proof steps (tactics), selecting premises, and searching for complete proofs using LLMs.
What operations can you perform efficiently when you use the “time of arrival” of a signal’s edge to represent a number? We present negative-logarithmic delay space arithmetic as a completely new approach to temporal coding. Under this approach, general purpose arithmetic is transformed to a “soft” version of the standard temporal operations in such a way that preserves all of the algebraic identities.
LeanAgent continuously learns and improves on ever-expanding mathematical knowledge without forgetting what it learned before. It has a curriculum learning strategy that optimizes the learning trajectory in terms of mathematical difficulty, a dynamic database for efficient management of evolving mathematical knowledge, and progressive training to balance stability and plasticity.
LLMs struggling with long proofs? We present LeanProgress, which uses a novel critic model where “distance” to the goal state acts as a key signal for step prediction, boosting performance on neural theorem proving in Lean. Our method achieves 75.1% prediction accuracy, with a 3.8% gain in proof search with step prediction.
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations.
LLMs show sharply different generalization behaviors: supervised fine-tuning often narrows capability, while reinforcement-learning tuning tends to preserve it. We introduce a benchmark that decomposes reasoning into atomic skills such as calculation, fact retrieval, simulation, enumeration, and diagnosis, enabling fine-grained analysis beyond coarse accuracy metrics.
Motivated by the crucial cognitive phenomenon of A-not-B errors, we present the first systematic evaluation on the surprisingly vulnerable inhibitory control abilities of LLMs. We reveal that this weakness undermines LLMs' trustworthy reasoning capabilities across diverse domains, and introduce various mitigations.
To compile a dictionary of East Asian idiom translations demands much time and creativity even for expert translators. To alleviate such burden, we automate high-quality data generation with GPT-4, and discover Pareto-optimal prompting strategies on both faithfulness and creativity, outperforming existing translation engines and human baseline.
We introduce energy-efficient convolution that improves the energy per pixel of each convolution frame by more than 2× compared to the state-of-the-art while improving the energy delay product by four orders of magnitude, by developing a new temporal arithmetic with a negative log transformation.
Can LLMs generate mathematical proofs that can be rigorously checked? We release LeanDojo: an open-source playground consisting of toolkits, benchmarks, and models for LLMs to prove formal theorems in the Lean proof assistant.
Selected Media
My research has been covered by many media. Some representative ones include:
I enjoy long walks, running, cycling, and badminton. I also love reading and writing, and sharing good meals with friends. When I’m on vacation, I like traveling, collecting fridge magnets, and doing nature photography — and I finally have time to dive into longer books.
Earlier on, I explored a range of spare-time activities, including math and algorithm contests, debate tournaments, and music production.
Last updated: June 2026. Website template credit: Jon Barron.