Sometimes I forget there's a whole other world out there where AI models aren't just used for basic tasks such as simple research and quick content summaries. Out in the land of bigwigs, they're instead being used to help with everything from financial analysis to scientific research. That's why their mathematical capabilities are so important—plus it's a general marker of reasoning capabilities.
Which is why mathematical benchmarks exist. Benchmarks such as FrontierMath, which its maker, Epoch AI, has just dropped and which is putting LLMs through their paces with "hundreds of original, expert-crafted mathematics problems designed to evaluate advanced reasoning capabilities in AI systems" (via Ars Technica).
While today's AI models don't tend to struggle with other mathematical benchmarks such as GSM-8k and MATH, according to Epoch AI, "they solve less than 2% of FrontierMath problems, revealing a substantial gap between current AI capabilities and the collective prowess of the mathematics community".
To be clear, these are hard problems. As in, so hard that they "typically require hours or days for expert mathematicians to solve", ranging "from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory".
What's so different about this benchmark is that solving these mathematical problems requires "extended chains of precise reasoning, with each step building exactly on what came before".
AI models have traditionally not been great at extended reasoning in general, let alone for super-advanced math. This makes sense when you consider what AI models are, at bottom, doing. LLMs, for example, are trained on mountains of text to predict which word is most likely to come next. There's plenty of room to steer the model towards different words, of course, but the process is essentially probabilistic.
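To make that concrete, here's a toy sketch (in Python) of what probabilistic next-word prediction looks like. The candidate words and their scores are invented purely for illustration, and the temperature knob stands in for the kind of "steering" mentioned above; real LLMs do the same sort of thing over vocabularies of tens of thousands of tokens.

```python
import math
import random

def sample_next_word(logits, temperature=1.0):
    """Pick the next word by sampling from a softmax over the scores.

    Lower temperature concentrates probability on the top-scoring word;
    higher temperature spreads it out, steering the model towards
    less-likely words more often.
    """
    words = list(logits.keys())
    scaled = [logits[w] / temperature for w in words]
    max_s = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(words, weights=probs, k=1)[0]

# Hypothetical scores for the word following "The proof follows by ..."
candidate_logits = {"induction": 2.1, "contradiction": 1.7, "symmetry": 0.4, "magic": -1.0}
print(sample_next_word(candidate_logits, temperature=0.7))
```

The point of the sketch: at every step the model is rolling weighted dice over plausible continuations, which is exactly why long chains of reasoning, where one slip early on derails everything after it, are so hard.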
Of late, however, we've seen AI models apply their