GPT-4o Can't Even Solve CAPTCHAs? The Best SOTA Models Succeed Only 40% of the Time

36kr
06-04

Can't Even the Strongest Multimodal Agents Solve CAPTCHAs?

The MetaAgentX team has launched Open CaptchaWorld, an open research platform focused on the problem of "Multimodal Interactive Agent × CAPTCHA (Human-Machine Verification)".

This platform is specifically designed to test Agents' ability to solve CAPTCHAs.

Test results show that humans succeed 93.3% of the time on average, while SOTA multimodal models reach success rates of only 5% to 40%.

Even GPT-4o was stumped.
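
The success rates quoted above are the share of puzzles that end up solved. A minimal sketch of that arithmetic follows, assuming each puzzle is attempted once and scored as solved or not solved. The function name and the sample outcomes are hypothetical and are not data from the benchmark.

```python
def success_rate(outcomes):
    """Return the fraction of attempts that ended with the puzzle solved."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    # True counts as 1 and False as 0, so the sum is the number of solved puzzles.
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for five puzzles (True = solved); illustrative only.
example_outcomes = [True, False, False, True, False]
print(f"{success_rate(example_outcomes):.1%}")  # prints 40.0%
```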

CAPTCHAs Are a Major Bottleneck for Agent Deployment Today

When deploying multimodal Agents in real webpage scenarios, have you also been blocked by human-machine verification (CAPTCHA)?

The project team found that many major benchmarks (including AgentBench and VisualWebArena) deliberately skipped webpages containing CAPTCHAs during construction, as if this roadblock didn't exist.

But reality is harsh: a CAPTCHA is not a "special case" but something real tasks cannot avoid, and it is especially common on high-value webpages such as e-commerce, login, and ticketing pages.

Thus, Open CaptchaWorld and its benchmark were born: a CAPTCHA-solving platform and evaluation benchmark for multimodal large model Agents, designed specifically for vision-language-action interaction tasks.

The latest multimodal large model Agents, whether OpenAI's o3, Anthropic's Claude-3.7-sonnet, or Gemini-2.5-pro, perform well on static perception tasks (such as image-text Q&A and UI understanding), yet they often get stuck at the CAPTCHA stage in real interactive environments:

  • Web agents often get stuck on CAPTCHAs when executing end-to-end tasks;
  • Mainstream evaluation sets like AgentBench and VisualWebArena generally filter out webpages containing CAPTCHAs;
  • Past CAPTCHA research (such as reCAPTCHA and DeepCAPTCHA) focused mostly on static recognition and did little to assess interaction, multi-step planning, and state tracking capabilities.

To systematically evaluate how Agents really perform on CAPTCHAs, the research team designed a brand new open benchmark and platform: Open CaptchaWorld.

The platform covers 20 diverse types of modern CAPTCHAs, all of which run in a real web browsing environment, so it reproduces the challenges Agents actually encounter:

"Image decoding + Understanding rules + Planning actions + Step-by-step interaction" = A real test of Agent capabilities.
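
The loop that this formula describes can be sketched with a toy example. The code below is only an illustration under stated assumptions: `ToySequenceCaptcha`, `plan_next_click`, and `run_episode` are hypothetical names, the puzzle ("click the numbered tiles in ascending order", with unique tile values) is invented for the sketch, and structured state stands in for the screenshot a real Agent would have to decode. It is not the platform's API or the authors' method.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToySequenceCaptcha:
    """Hypothetical puzzle: click the numbered tiles in ascending order."""
    tiles: List[int]                       # assumed to hold unique values
    clicked: List[int] = field(default_factory=list)

    def observe(self) -> dict:
        # A real Agent would receive a screenshot; the toy returns plain state.
        return {
            "rule": "click tiles in ascending order",
            "tiles": list(self.tiles),
            "clicked": list(self.clicked),
        }

    def click(self, index: int) -> None:
        # Each click changes the puzzle state, so the Agent must re-observe.
        self.clicked.append(self.tiles[index])

    def solved(self) -> bool:
        return self.clicked == sorted(self.tiles)


def plan_next_click(observation: dict) -> Optional[int]:
    """Return the index of the smallest tile not yet clicked, or None if done."""
    remaining = [
        (value, index)
        for index, value in enumerate(observation["tiles"])
        if value not in observation["clicked"]
    ]
    if not remaining:
        return None
    return min(remaining)[1]


def run_episode(env: ToySequenceCaptcha, max_steps: int = 10) -> bool:
    """Observe, plan, and act one step at a time until done or out of steps."""
    for _ in range(max_steps):
        observation = env.observe()            # decode the current state
        action = plan_next_click(observation)  # apply the rule and plan
        if action is None:
            break
        env.click(action)                      # interact, one step at a time
    return env.solved()


print(run_episode(ToySequenceCaptcha([3, 1, 2])))               # prints True
print(run_episode(ToySequenceCaptcha([3, 1, 2]), max_steps=2))  # prints False
```

The second call shows why step budgets and state tracking matter: with only two steps allowed, the toy Agent clicks two of the three tiles and the puzzle stays unsolved.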

[The translation continues in the same manner for the rest of the text, maintaining the specified translation rules for specific terms.]

Overall, the figure shows that, in real interactive tasks, a more expensive multimodal Agent is not always a stronger one, and it highlights the value of the Open CaptchaWorld platform for analyzing Agent utility and deployability.

Future model design should focus more on jointly optimizing efficiency and performance.

The Open CaptchaWorld platform provides new insights for Agent developers and benchmark designers.

It also reveals:

  • The real current shortcomings of Agents: dynamic interaction and planning in long-sequence tasks;
  • Blind spots in existing benchmark assessments: they largely omit the "human-machine verification" step that real deployments cannot do without;
  • New directions for model design: how to improve Agent automation and robustness in real-world web tasks;
  • New CAPTCHA design for the Agent era: current CAPTCHAs will eventually be broken as Agent capabilities grow, so new CAPTCHAs must be continuously designed and updated to keep pace with the technology.

Open CaptchaWorld aims to encourage researchers to stop avoiding the CAPTCHA issue when training and evaluating Agents and to face it head-on, because in the real world an Agent that cannot even pass a CAPTCHA cannot be deployed.

For more details, see the original paper.

Paper link: https://arxiv.org/abs/2505.24878

Hugging Face Spaces: https://huggingface.co/spaces/YaxinLuo/Open_CaptchaWorld

Code repository & Data link: https://github.com/MetaAgentX/OpenCaptchaWorld

This article is from the WeChat public account "Quantum Bit", authored by the MetaAgentX team, published with authorization from 36kr.
