OpenAI’s new AI agent, Deep Research, has shattered the previous record on “Humanity’s Last Exam”, the world’s hardest AI reasoning benchmark, jumping from 9.4% to 26.6% accuracy (a 183% improvement) in just two weeks. Humanity’s Last Exam is designed to be nearly impossible, testing AI with some of the hardest reasoning and problem-solving challenges ever devised; even the best human minds would struggle with it. While OpenAI’s ChatGPT o3-mini also made strides, scoring up to 13% depending on capacity, Deep Research’s web-search capabilities give it an advantage over other models. Even so, these scores underline the immense difficulty of true complex reasoning for advanced AI models, and without a clear human baseline it remains unclear how the model truly compares to human expert-level reasoning.
My Take
While AI’s ability to tackle reasoning challenges is improving at an astonishing pace, true intelligence isn’t just about scoring higher: it’s about understanding context, applying knowledge across domains, and making sound judgments. Expert predictions on AGI vary widely; Sam Altman, for one, has said OpenAI knows how to build it and expects superintelligence within a decade.
#ArtificialIntelligence #MachineLearning #AIResearch #TechInnovation #FutureOfAI #DeepLearning #AIbenchmarks
Link to article:
Credit: Techradar
This post reflects my own thoughts and analysis, whether informed by media reports, personal insights, or professional experience. While enhanced with AI assistance, it has been thoroughly reviewed and edited to ensure clarity and relevance.