GPT-5.4 Just Beat Humans at Real Computer Work — Scoring 75% Where Experts Scored 72.4%

OpenAI's GPT-5.4 scored 75% on the OSWorld-Verified benchmark — a test that measures the ability to complete real desktop productivity tasks like navigating software, filling out forms, and managing files. Human experts scored 72.4%.

On this benchmark, that makes GPT-5.4 the first AI model to outscore the human expert baseline at autonomous computer work.

What makes this different from previous benchmarks:

It's not trivia or math — it's real-world desktop productivity
GPT-5.4 can see your screen, move the mouse, type, and complete multi-step workflows
The score rose from GPT-5.2's 47.3% to 75%, a gain of 27.7 percentage points over that earlier model
A 1-million-token context window lets it process large codebases or long documents in a single pass
Three variants: Standard, Thinking (deep reasoning), and Pro
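
To make the "see the screen, move the mouse, type" point concrete, here is a minimal sketch of the observe, decide, act loop that computer-use agents are generally built around. Everything in it is an assumption for illustration: FakeDesktop, Action, scripted_policy, and run_agent are hypothetical names, the "screen" is a plain dictionary rather than a screenshot, and the policy is a hard-coded rule standing in for the model. It is not OpenAI's API and not the OSWorld-Verified harness.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One low-level UI action (hypothetical schema, not a real API)."""
    kind: str          # "click", "type", or "done"
    target: str = ""   # name of the UI element to act on
    text: str = ""     # text to enter for "type" actions


@dataclass
class FakeDesktop:
    """Toy stand-in for a desktop: a form with two text fields and a submit button."""
    values: dict = field(default_factory=lambda: {"name": "", "email": ""})
    focused: str = ""
    submitted: bool = False

    def observe(self) -> dict:
        # A real agent would receive a screenshot; here we return plain state.
        return {"values": dict(self.values), "focused": self.focused,
                "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "click":
            if action.target == "submit":
                self.submitted = True
            else:
                self.focused = action.target
        elif action.kind == "type" and self.focused in self.values:
            self.values[self.focused] = action.text


def scripted_policy(observation: dict, goal: dict) -> Action:
    """Placeholder for the model: fix the first field that differs from the goal."""
    if observation["submitted"]:
        return Action("done")
    for name, wanted in goal.items():
        if observation["values"][name] != wanted:
            if observation["focused"] != name:
                return Action("click", target=name)
            return Action("type", text=wanted)
    return Action("click", target="submit")


def run_agent(desktop: FakeDesktop, goal: dict, max_steps: int = 20) -> int:
    """Observe, decide, act until the policy reports done. Returns steps used."""
    for step in range(max_steps):
        action = scripted_policy(desktop.observe(), goal)
        if action.kind == "done":
            return step
        desktop.apply(action)
    raise RuntimeError("step budget exhausted before the task finished")


if __name__ == "__main__":
    desktop = FakeDesktop()
    steps = run_agent(desktop, {"name": "Ada", "email": "ada@example.com"})
    print(steps, desktop.values, desktop.submitted)
```

The step budget matters in practice: a multi-step workflow can stall if the agent misreads the screen, so the loop stops after max_steps instead of running forever. A benchmark of this kind would then score a task by checking the final state of the environment, much as the toy form can be checked for the expected field values.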

This isn't about replacing people. It's about what becomes possible when AI can operate a computer as well as you can. Think automated testing, data entry at scale, customer onboarding, report generation — all running autonomously.
