Machine-Generated Code Deserves Machine-Checked Proofs

For more than a hundred years, the dream of a “proof machine”, a mechanical procedure that automatically carries out mathematical reasoning according to formal rules, has haunted mathematics and computer science. In its strongest form, the dream proved to be theoretically impossible, but in the process, the theory of computation was born. And despite these limitations, the dream continued to inspire generations of mathematicians and computer scientists, who, along the way, built the tools to draw as close to it as possible: proof assistants and theorem provers. Yet mechanizing mathematical proofs has remained a daunting task, requiring enormous human effort.

Last February, I sat at my laptop and watched an AI coding agent write 7,800 lines of such a proof in four days. This post is about that experience, and about the longer arc it belongs to.

The Oldest Dream in Computer Science

In the early 20th century, a foundational crisis shook mathematics. Paradoxes in set theory exposed the shaky foundations of the informal reasoning mathematicians relied on. In 1921, David Hilbert, one of the most renowned mathematicians of his time, proposed a foundational program to the mathematical community, one that would change the course of mathematical logic and computer science. Hilbert’s ambition was a formal, unifying theory of mathematics, one that would put all of mathematics on solid ground. He proposed to express all of mathematics in a formal language and to derive all mathematical truths from a finite set of axioms and inference rules. Such an axiomatization had to be consistent, meaning it could never prove a false statement, and complete, meaning it could prove every true statement.

But Hilbert’s vision went even further: he wanted the system to be decidable, meaning that there would be a mechanical procedure, a “proof machine”, so to speak, that could determine whether any given statement is provable from the axioms.

The dream did not last long. In 1931, Kurt Gödel proved that any consistent formal system powerful enough to express arithmetic contains true statements that it cannot prove, and, moreover, cannot prove its own consistency. The complete, provably consistent axiomatization that Hilbert sought was mathematically impossible. A few years later, in 1936, Alonzo Church and Alan Turing independently settled the Entscheidungsproblem: there is no algorithm that can decide whether a given mathematical statement is provable. The mechanical procedure Hilbert dreamed of does not exist.

Gödel's grave at Princeton Cemetery, Easter 2017.

Hilbert’s vision for a proof machine failed. But the concept of computation itself was born from this quest. Turing’s proof that the decision problem is unsolvable required him to define, with mathematical precision, what a “mechanical procedure” actually was. The result was the Turing machine, and with it the theoretical foundations of computer science.

And within the boundaries that Gödel, Church, and Turing drew, the dream persisted. An enormous amount of mathematics and software can be formally verified; it simply cannot be done automatically for all statements. What is possible is machine-checked proof: a human (or a machine) constructs a proof in a formal language, and a small, trusted program, the proof checker, verifies that every step follows from the rules of logic. The checker only needs to implement the underlying logic correctly, and it is small enough to be audited with care. Starting in the 1960s, researchers began building the tools to do exactly this:
  - Automath (1967), Nicolaas de Bruijn’s pioneering system for writing mathematics in a form a computer can check
  - LCF (1972), Robin Milner’s proof assistant, whose tactic-based architecture shaped nearly every later system
  - Mizar (1973), built around a proof language close to mathematical vernacular
  - Isabelle (1986), a generic proof assistant best known for its higher-order logic instantiation
  - Coq, now Rocq (1989), based on the calculus of inductive constructions

The list is by no means exhaustive; many other proof assistants and theorem provers have been developed over the years, with Lean being a recent addition that has gained significant traction in the mathematical community.
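To make this concrete, here is what a tiny machine-checked proof looks like in Lean (a sketch in Lean 4 syntax, chosen only because Lean is mentioned above; any of these systems would serve equally well):

```lean
-- A proof is just a formal object. Here, a term of type `p ∧ q → q ∧ p`
-- is the proof; Lean's kernel accepts the theorem only if the term
-- really has that type.
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun ⟨hp, hq⟩ => ⟨hq, hp⟩

-- The same applies to concrete mathematics: a linear-arithmetic fact,
-- found by the `omega` decision procedure but still checked by the kernel.
example (a b : Nat) : a + b = b + a := by omega
```

Whether the proof term is written by hand or produced by a tactic, the kernel checks every step against the rules of the logic before accepting it.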

A separate but related line of work pursued a different goal: deciding mathematical validity automatically. Automated theorem proving, from the Davis-Putnam algorithm for propositional satisfiability (1960) to Robinson’s resolution principle (1965) to modern SAT and SMT solvers, aims to discharge proof obligations without human guidance, at least for restricted classes of problems.
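For a concrete feel of the propositional case, here is a miniature DPLL-style satisfiability check in Python (a toy illustration of the idea behind the Davis-Putnam line of work, nothing like a production SAT solver). Following the DIMACS convention, a formula is a list of clauses and a clause is a list of integer literals, where a negative integer denotes a negated variable:

```python
# Miniature DPLL-style SAT check: simplify under the current partial
# assignment, then branch on an unassigned literal and backtrack.

def dpll(clauses, assignment=()):
    """Return a satisfying tuple of literals, or None if unsatisfiable."""
    simplified = []
    for clause in clauses:
        if any(lit in assignment for lit in clause):
            continue                                  # clause already true
        reduced = [lit for lit in clause if -lit not in assignment]
        if not reduced:
            return None                               # clause falsified
        simplified.append(reduced)
    if not simplified:
        return assignment                             # every clause satisfied
    lit = simplified[0][0]                            # branch on a literal
    return (dpll(simplified, assignment + (lit,))
            or dpll(simplified, assignment + (-lit,)))
```

For example, `dpll([[1, 2], [-1, 2]])` finds a satisfying assignment, while `dpll([[1], [-1]])` returns `None`.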

These efforts narrowed the gap between Hilbert’s dream and reality and have led to remarkable results in formalizing complex mathematical theorems and verifying significant pieces of software. Some of the most notable results include:
  - the four-color theorem, formalized in Coq by Georges Gonthier (2005)
  - the Feit–Thompson odd-order theorem, formalized in Coq by Gonthier and his team (2012)
  - the Kepler conjecture, verified by the Flyspeck project in HOL Light and Isabelle (2014)
  - CompCert, a formally verified optimizing C compiler
  - seL4, an operating-system microkernel with a machine-checked proof of functional correctness

But the bottleneck has always been the human effort required to construct machine-checked proofs. For decades, formal verification has been a discipline of the devoted few, requiring crushing labor for results that most of the world never sees.

Enter AI

And now something is shifting.

We are at a point that most of us working in the field could not have anticipated. Large language models, systems trained on vast corpora of text and code, have developed the ability to reason mathematically and generate proofs in formal systems. These proofs can be mechanically checked for validity by the same proof assistants that researchers have been building for decades.

The results have been remarkable. In the past year, we’ve seen:
  - AI systems reaching medal-level performance on International Mathematical Olympiad problems, with systems such as AlphaProof producing solutions checked in Lean
  - LLM-based agents completing substantial formalization tasks in Lean, Rocq, and Agda with little human guidance
  - rapid progress on formal-proof benchmarks such as miniF2F

Together, these results suggest that the bottleneck that has constrained formal verification for decades may be breaking.

My Own Experience with AI-Assisted Verification

I started experimenting with LLMs for proof generation only recently. Until a few months ago, commercial models could not handle complex proofs, and I was skeptical that they ever would. Then a colleague told me about Opus 4.6’s remarkable ability to generate Agda proofs, and I decided to try it for myself. I bought a subscription to Claude Max and gave it a non-trivial problem: the correctness proof of a compiler transformation. I had attempted this proof manually a few years ago and set it aside: the reasoning was complex and the time it required too great.

The transformation is part of the CertiRocq verified compiler for pure functional programming languages, which my collaborators and I have been developing for several years. It translates functional programs into A-normal form (ANF), a low-level, functional intermediate representation. My plan was to reuse a proof technique that I had already used and published on a related transformation (translation to Continuation-Passing Style, CPS) and ask the LLM to adapt it to the new setting.
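To see what ANF does, here is a toy sketch in Python (illustrative only: the names `anf` and `to_anf` are mine, and CertiRocq's actual transformation operates on a full functional language, not this three-constructor toy). Every compound subexpression is bound to a fresh name, so evaluation order becomes explicit:

```python
# Toy A-normal form (ANF) conversion for a tiny expression language.
# Expressions: ("num", n) | ("var", x) | ("add", e1, e2)
import itertools

def anf(expr):
    """Flatten expr into a list of let-bindings plus a final atomic result."""
    counter = itertools.count()
    bindings = []

    def to_anf(e):
        if e[0] in ("num", "var"):
            return e                      # already atomic
        # Name each operand first, then name the compound result itself.
        left, right = to_anf(e[1]), to_anf(e[2])
        name = f"t{next(counter)}"
        bindings.append((name, ("add", left, right)))
        return ("var", name)

    return bindings, to_anf(expr)

# (1 + 2) + (3 + 4)  ~>  let t0 = 1 + 2 in let t1 = 3 + 4 in
#                        let t2 = t0 + t1 in t2
```

The correctness proof has to show that this renaming and sequencing preserves the meaning of the original program.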

The results were astonishing. In about four days, Claude Code was able to generate an end-to-end machine-checked proof for the transformation, a proof consisting of roughly 7,800 lines of code. My initial prompt was short and simple: I asked the LLM to look at the proof of the CPS transformation and use the same technique to prove the ANF transformation correct. Then the LLM started working: planning the proof, generating proof skeletons and helper lemmas, compiling them, diagnosing errors, and iterating.

My role was to guide the process: I would point it at the next proof obligation to tackle. In a few situations where the proof got stuck, I had to figure out the underlying reasoning and provide guidance on how to tackle the proof, in natural language. But I did not write any proof code myself. The LLM handled all the plumbing: setting up inductions, performing case analyses, applying lemmas from the codebase and the Rocq standard library.

The LLM’s performance was not perfect, of course. It made mistakes, and it got stuck on some cases. But the overall trajectory progressed steadily towards a complete proof. An essential ingredient is the feedback loop with the proof checker: the LLM writes a proof script, compiles it, and, if compilation fails, diagnoses the error and iterates, without human involvement. This loop keeps the LLM on track towards a correct proof and allows it to work through the night unsupervised. Sometimes I left it working unattended for many hours and came back to find entire lemmas and proof cases completed.
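The shape of that loop fits in a few lines of Python. Everything here is hypothetical scaffolding: `run_proof_checker` stands in for invoking the Rocq compiler on a proof script, and `call_llm` for the agent's revision step; the stubs below only mimic their behavior so the control flow is visible.

```python
# Hypothetical sketch of the generate-check-repair loop described above.
# The two helpers are stand-ins, not real APIs.

def run_proof_checker(script: str):
    """Stub checker: accept the script once it ends with 'Qed.'."""
    if script.rstrip().endswith("Qed."):
        return True, ""
    return False, "error: proof is unfinished"

def call_llm(script: str, error: str) -> str:
    """Stub agent: 'repair' the script in response to the error message."""
    return script + "\nQed."

def prove(script: str, max_iters: int = 10) -> str:
    for _ in range(max_iters):
        ok, error = run_proof_checker(script)
        if ok:
            return script                  # the checker accepted the proof
        script = call_llm(script, error)   # diagnose the error and revise
    raise RuntimeError("no proof found within the iteration budget")
```

The essential property is that the trusted side of the loop is the checker: however many times the model revises the script, nothing counts as done until the checker accepts it.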

Since this first experiment, I have been using LLMs for proof mechanization in other projects as well, with similar results. The experience has been transformative, and it has given me a glimpse of what the future of formal verification might look like.

The Vision: Verified AI-Generated Software

Why does this matter beyond the niche world of software verification?

The process of software development is rapidly changing. As AI generates more and more of the world’s software, the question of correctness becomes more pressing. But for the first time, we have a hint that formal verification may be applicable at scale, using the exact same tools that generate the code. So, why not use these tools to generate machine-checked proofs of correctness for the very software they produce?

The end-to-end vision is this:

  1. An LLM writes software in a proof-oriented language (Rocq, Lean, F*), along with a mathematical proof of its correctness.
  2. A proof assistant mechanically checks that the proof is valid. The LLM is not trusted. Only the mathematics matters.
  3. A verified compiler compiles the verified source to an executable. The compiler itself has been proved correct, so compilation cannot introduce bugs.

The result is software with end-to-end mathematical guarantees, from specification to executable. The user need not trust the LLM, the compiler, or the proof. The only things the user needs to trust are the specification, which is the mathematical statement of what the software is supposed to do, and the proof checker, which ensures that the proof is valid.
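As a toy illustration of step 1 (a sketch in Lean, far simpler than anything a real pipeline such as CertiRocq handles): a program, its specification, and a proof that the checker verifies.

```lean
-- A program: the larger of two natural numbers.
def max2 (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Its specification: the result is an upper bound of both inputs,
-- together with a machine-checked proof.
theorem max2_spec (a b : Nat) : a ≤ max2 a b ∧ b ≤ max2 a b := by
  unfold max2
  split <;> omega
```

If the proof were wrong, the checker would reject it, regardless of whether a human or a model wrote it.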

As the preliminary results suggest, this vision may not be far from reality. However, there are still challenges to overcome.

What comes next?

There is still work ahead. But for the first time in history, what once seemed out of reach is happening: machines are writing code together with machine-checked proofs of its correctness. The attempt to automate mathematical proof gave birth to the concept of computation. A century later, computation is returning the favor.