Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...
AI chatbots have been linked to serious mental health harms in heavy users, but there have been few standards for measuring whether they safeguard human well-being or just maximize engagement. A ...
Benjamin is a business consultant, coach, designer, musician, artist, and writer, living in the remote mountains of Vermont. He has 20+ years of experience in tech, an educational background in the arts, ...
OpenAI’s new GDPval benchmark tested GPT-5 on real-world jobs across nine industries, revealing that the AI matched or outperformed experts 40% of the time. While not a full replacement, OpenAI ...
AI programs train on questions they’re later tested on. So how do we know if they’re getting smarter? Unlike conventional computer programs, generative AI ...
The latest generation of AI agents can draft code, summarize papers, and churn through datasets at speeds no human can match.
The technology firm OpenAI made headlines last month when its latest experimental chatbot model, o3, achieved a high score on a test that marks progress towards artificial general intelligence (AGI).
Text-based AI models have LMArena, which reached a $1.7 billion valuation by letting humans compare GPT, Claude, and Gemini in blind A/B tests. The resulting human preference data became the industry ...
One-off tests don’t measure AI’s true impact. We’re better off shifting to more human-centered, context-specific methods. For decades, artificial intelligence has been evaluated through the question ...
The best AI agents money can buy still cannot do what a trained scientist does every day: read several research papers, spot ...