Best Human Benchmark Tests

With AI models clobbering every benchmark, it's time for human evaluation

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...

TechCrunch

A new AI benchmark tests whether chatbots protect human well-being

AI chatbots have been linked to serious mental health harms in heavy users, but there have been few standards for measuring whether they safeguard human well-being or just maximize for engagement. A ...

Android Police

OpenAI's simulated reasoning AI models matched human levels on ARC-AGI benchmark — Here's what that means for you

Android

OpenAI Tests GPT-5 on Human Jobs: Benchmark Shows AI Matching Experts

OpenAI’s new GDPval benchmark tested GPT-5 on real-world jobs across nine industries, revealing that the AI matched or outperformed experts 40% of the time. While not a full replacement, OpenAI ...

ExtremeTech

OpenAI’s New GPT‑5.4 Surpasses Human Benchmark in Desktop Navigation and Reasoning Tests

The Atlantic

Chatbots Are Cheating on Their Benchmark Tests

AI programs train on questions they’re later tested on. So how do we know if they’re getting smarter? Illustration by The Atlantic. Source: Getty. Unlike conventional computer programs, generative AI ...

Morning Overview on MSN

Human scientists still trounce the best AI agents on complex tasks, Nature study finds

The latest generation of AI agents can draft code, summarize papers, and churn through datasets at speeds no human can match.

Nature

How should we test AI for human-level intelligence? OpenAI’s o3 electrifies quest

The technology firm OpenAI made headlines last month when its latest experimental chatbot model, o3, achieved a high score on a test that marks progress towards artificial general intelligence (AGI).

The Spectrum

AIMomentz Launches Open AI Image Evaluation Platform With Human Preference Benchmark and Provenance Tracking

Text-based AI models have LMArena, which reached a $1.7 billion valuation by letting humans compare GPT, Claude, and Gemini in blind A/B tests. The resulting human preference data became the industry ...

MIT Technology Review

AI benchmarks are broken. Here’s what we need instead.

One-off tests don’t measure AI’s true impact. We’re better off shifting to more human-centered, context-specific methods. For decades, artificial intelligence has been evaluated through the question ...

Morning Overview on MSN

Nature study: Human scientists still crush the best AI agents on complex, multi-step tasks

The best AI agents money can buy still cannot do what a trained scientist does every day: read several research papers, spot ...
