OpenAI's GPT-5.4 scored 75% on OSWorld-Verified, a benchmark that measures the ability to complete real desktop productivity tasks such as navigating software, filling out forms, and managing files. Human experts scored 72.4%.
This is the first time any AI model has outperformed the human baseline on this benchmark of autonomous computer work, here by a margin of 2.6 percentage points.
What makes this different from previous benchmarks:
- It's not trivia or math; it's real-world desktop productivity
- GPT-5.4 can see your screen, move the mouse, type, and complete multi-step workflows
- It jumped from GPT-5.2's 47.3% to 75%, a gain of 27.7 percentage points in one generation
- A 1-million-token context window, so it can process entire codebases or long documents at once
- Three variants: Standard, Thinking (deep reasoning), and Pro
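
The "see the screen, move the mouse, type" behavior in the list above is usually structured as an observe-decide-act loop. The sketch below illustrates only that loop shape. It is not OpenAI's API or the OSWorld harness: `FakeDesktop`, `Action`, `scripted_policy`, and `run_episode` are all hypothetical names, the "screen" is a toy form held in memory, and a hard-coded policy stands in for the model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # name of the UI element to click
    text: str = ""     # text to enter for "type" actions


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a real screen: two form fields and a button."""
    fields: dict = field(default_factory=lambda: {"name": "", "email": ""})
    focused: Optional[str] = None
    submitted: bool = False

    def observe(self) -> dict:
        # A real agent would receive a screenshot; here we return plain state.
        return {"fields": dict(self.fields), "focused": self.focused,
                "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "click" and action.target in self.fields:
            self.focused = action.target
        elif action.kind == "click" and action.target == "submit":
            # The form only submits once every field has a value.
            self.submitted = all(self.fields.values())
        elif action.kind == "type" and self.focused is not None:
            self.fields[self.focused] = action.text


def scripted_policy(obs: dict) -> Action:
    """Placeholder for the model: chooses the next step from the observation."""
    wanted = {"name": "Ada Lovelace", "email": "ada@example.com"}
    if obs["submitted"]:
        return Action("done")
    for name, value in wanted.items():
        if obs["fields"][name] != value:
            if obs["focused"] != name:
                return Action("click", target=name)
            return Action("type", text=value)
    return Action("click", target="submit")


def run_episode(env, policy, max_steps: int = 20) -> int:
    """Observe, decide, act, repeat. Returns the number of actions applied."""
    for step in range(max_steps):
        action = policy(env.observe())
        if action.kind == "done":
            return step
        env.apply(action)
    return max_steps


if __name__ == "__main__":
    desktop = FakeDesktop()
    steps = run_episode(desktop, scripted_policy)
    print(steps, desktop.submitted, desktop.fields)
```

The `max_steps` cap is a design choice worth keeping in any real version of this loop: a multi-step workflow can stall on an unexpected dialog, and a step budget turns that into a clean failure rather than an endless run.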
This isn't about replacing people. It's about what becomes possible when AI can operate a computer as well as you can. Think automated testing, data entry at scale, customer onboarding, and report generation, all running autonomously.