Two articles from late April and early May 2026 about AI in mathematics are best read together. Tim Gowers, a Cambridge mathematician and Fields medalist, wrote a detailed blog post describing how ChatGPT 5.5 Pro produced what he called “PhD-level research” in about an hour, with “no serious mathematical input” from him. Terence Tao, a UCLA mathematician and Fields medalist, said in a Nature interview called “The job description is changing” that mathematics now has to reconsider basic questions: what counts as a proof, what counts as a paper, and what the profession is for. As Tao said, “if we don’t ask these questions ourselves, then they will get answered for us by a technology company or decided by financial incentives.” Some people see the post as proof that advanced AI can now do independent research. Others think it’s just a Fields medalist being impressed by a clever trick. Both views miss what the post really shows, and what mathematicians should do in response.
A case study, not a sample
Gowers’s post is a serious case study by a mathematician who knows the field well. He explained his methods, asked Isaac Rajagopal—the MIT student whose earlier paper the model built on—to check the result, and was explicit that he considered the output non-trivial. The model took a result from that paper and pushed it further. Rajagopal described the main technique as “completely original” and said it was the kind of idea he would have been proud to come up with after a week or two of thinking. Gowers’s own view is cautious: he calls the result “a perfectly reasonable chapter in a combinatorics PhD,” not a major breakthrough, but “definitely a non-trivial extension.” That’s more careful than most commentary on AI and mathematics.
But it is still just one example.
What we have is one published success. There isn’t a matching blog post called “ChatGPT 5.5 Pro spent an hour producing a confident, plausible, but subtly wrong proof of a small open problem and I almost believed it.” Cases like that almost certainly exist, but we don’t see them because people rarely write up failures with the same care. The case selection is doing a lot of work here, and conclusions drawn from a single case should be discounted accordingly. This is the same issue with every viral “AI did X” story: we’re looking at the right tail of a distribution whose shape we’ve barely begun to measure.
We don’t know how many similar problems the model would fail on, or how much the success depended on how the problem was presented. One example can’t show us how often fluent proofs would fall apart under close review, or whether the problem or its main technique is similar to anything in the training data. These are real questions. Any claim about “AI doing research-level mathematics” needs to answer them to be meaningful.
The jagged frontier
FrontierMath is a benchmark of original research-level problems, put together with input from Fields medalists — Gowers among them. When it launched in late 2024, the best models solved under two percent of its problems. By early May 2026, several frontier models, GPT-5.5 Pro included, score above fifty percent on its main problem set. The numbers come from BenchLM’s leaderboard, a secondary tracker but a useful snapshot.
DeepMind’s math agent Aletheia shows a similar trend from another angle. On a curated set of ten research-level problems, it got six right. On seven hundred open problems from Thomas Bloom’s online database of Erdős conjectures, it solved four on its own. That’s sixty percent on the curated set, but less than one percent on the open problems. This is the jagged frontier described by Dell’Acqua and colleagues in a recent Organization Science study with 758 Boston Consulting Group consultants. AI help improved results on tasks within its strengths, but on tasks just outside its abilities, consultants using AI were nineteen percentage points less likely to get the right answer. Early results in math look similar. Gowers’s case is one good outcome, but we don’t know about the rest yet.
Workflow, not output
In Gowers’s case, a person posed the problem, the model generated arguments that built on earlier published work, and people, including that work’s original author, checked and evaluated the outcome. Calling this “AI-produced”—as one popular headline said, “with zero human help”—oversimplifies what actually happened. Any rules we create need to address the whole process, not just the AI’s part.
Credit shows who picked the question, whose work contributed to the answer, and who is responsible if something goes wrong. Gowers suggests that maybe nobody needs credit in the usual sense for an AI-assisted result. I think that’s too hasty. If a proof is wrong, or a claim of novelty turns out to be overstated, somebody has to be responsible—and it can’t be the model. Removing authorship from the picture doesn’t get rid of that responsibility. It just makes it harder to find.
The cost of checking proofs doesn’t get enough attention. In theory, math has an advantage: a careful reader can spot a wrong proof. That helps protect against mistakes from AI. But checking proofs takes expert time, and that doesn’t scale as quickly as AI can generate new work. If thousands of AI-assisted papers start showing up on arXiv, the main challenge will shift from creating proofs to checking them. The informal trust systems math has relied on may not keep up either.
Equity
Most of the discussion so far has ignored the issue of access. Gowers says he was “fortunate to have been given access” to ChatGPT 5.5 Pro before it was widely available. Now, the model is only available through ChatGPT’s Pro, Business, and Enterprise plans, or a separate paid API. Other AI labs also have internal tools that only some researchers can use. If important research depends more and more on expensive subscriptions and special access, then who gets into PhD programs, who gets hired, and who gets published will start to depend on who can afford the tools, making existing inequalities worse. One early commenter on Gowers’s post brought this up directly. Gowers, to his credit, replied that it’s “potentially a very bad aspect” of the current situation and suggested some ways to address it.
Pedagogy
Learning math often means struggling with a tough problem and finally solving it. In the Nature interview, Tao says graduate students who avoid using AI may put themselves at a disadvantage, and he’s probably right. Gowers adds that people who have solved hard problems themselves are usually better at using AI for them, “just as very good coders are better at vibe coding than not such good coders.” If graduate students use AI to skip the slow process of making mistakes and learning from them, the field might see a short-term boost in productivity but lose the deep expertise needed to guide and check AI’s work. No one has a clear answer yet for how to train mathematicians in this new environment, and that question matters more than any benchmark score.
Who gets edged out
When AI takes over work that mathematicians used to do, who gets left out, and what do we lose when that happens?
Mathematics isn’t just about results. It’s a practice—something people do, much like making music or writing poetry. In his 1994 essay “On Proof and Progress in Mathematics,” William Thurston said the real purpose of mathematics is human understanding, not just proving things. For Thurston, the goal is to know why something is true, in a way that others can also understand. Francis Su, in Mathematics for Human Flourishing, takes this further. He says that doing math builds patience, focus, and the ability to stick with hard problems—qualities that are not just extras in life, but essential parts of it.
If AI takes over more of the work, the question isn’t just whether the answers are right, but whether anyone is still practicing mathematics in the sense Thurston and Su describe. Creating a verified theorem is one thing. Creating a mathematician is something else. A machine can do the first, but only a person can become the second.
If a graduate student uses AI to skip a problem they would have spent a month struggling with, they get the answer but lose a month of the work that helps make them a mathematician. Whether that trade is worth it depends on what we think the struggle is for. If we see it as just slow theorem-making, AI is an improvement. If we see it as how someone becomes a mathematician, AI may be taking away something important.
Whose institutions?
Tao puts the main point clearly: the rules for AI-assisted math are being set right now. DeepMind, in presenting Aletheia, has already proposed a five-level system for classifying AI-assisted math results, from “negligible novelty” to “landmark breakthrough,” combined with three levels of AI involvement. To DeepMind’s credit, they say this system came from “extensive discussions with the mathematical community” and is meant to inform the wider debate. It’s a good first try, but it’s still a system proposed by an AI company—not by a math society, not approved by a journal, and not created through a community process. That’s exactly what Tao warned about. Whether mathematicians end up with good practices depends on the choices they make now, not on what AI can do. If mathematicians don’t make those choices on purpose, others will make them by default—tech companies, journals, or hiring committees reacting to whatever appears on arXiv.
Mathematicians need institutions for evaluating AI-assisted claims — for disclosure, verification, credit, and access — and they have to build them, deliberately and soon. They also have to stay honest about what could be lost as the change takes hold: the people who get edged out, the kinds of training that disappear, and what little equity the field already has.