Google's Gemini 3.5 Flash Nearly Ties GPT-5.5 — for One-Third the Price

On June 25, Google embedded "computer use" as a native capability in Gemini 3.5 Flash. The AI can view a screen, click, fill out forms, open maps, and consult search results to decide its next step, all without a human in the loop. On the OSWorld-Verified benchmark, which measures computer-control ability, the model scored 78.4. GPT-5.5 scored 78.7, a gap of 0.3 points. The price is roughly one-third (30%, to be exact). Google charges $1.50 per million input tokens; OpenAI charges $5. The same ratio holds on output: $9 from Google versus $30 from OpenAI.
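To make the price gap concrete, here is a minimal sketch of the arithmetic in Python. The per-million-token prices are the ones quoted above; the model keys, the function name, and the monthly token volumes are hypothetical and chosen only for illustration.

```python
# Per-million-token list prices quoted in the article (USD).
PRICES = {
    "gemini-3.5-flash": {"input": 1.50, "output": 9.00},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the quoted list prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical monthly workload: 200M input tokens, 20M output tokens.
flash = run_cost("gemini-3.5-flash", 200_000_000, 20_000_000)
gpt = run_cost("gpt-5.5", 200_000_000, 20_000_000)
print(f"Flash: ${flash:,.2f}  GPT-5.5: ${gpt:,.2f}  ratio: {flash / gpt:.2f}")
```

Because the 30% ratio is the same for input and output tokens, the overall ratio stays at 0.30 whatever the mix of the two.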

Taken at face value, the story is simple: comparable performance at a fraction of the cost. But deploying AI agents in real workflows involves considerations that benchmarks alone can't settle.

The complexity of chaining multiple models has collapsed into one

Previously, using computer-use capabilities required routing tasks through separate models. One handled screen perception while another processed search results and yet another managed map data — each handoff a potential point of failure. Developers had to stitch these pipelines together, and every seam introduced risk.

This integration simplifies that architecture. A single Gemini 3.5 Flash instance now handles screen recognition, search grounding, and maps integration together. Agents can sustain context across consecutive tasks within the same workflow, reducing the context loss and error risk that came from bouncing between models.

Google's move has to be read in competitive context. OpenAI's Operator and Anthropic's computer-control features for Claude were already on the market. With Gemini 3.5 Pro's release pushed to July, Google chose to deepen the Flash lineup's capabilities first, winning developer mindshare before the flagship arrives. The strategy: match accuracy closely, then win on price.

The downstream effect on the developer ecosystem goes beyond benchmark numbers. Developers already on Google's API don't need to sign new contracts or swap out frameworks. Adding computer use is, in many cases, a single additional call in an existing pipeline. The lower the switching cost, the faster the migration.

What happens when an AI agent makes a mistake

An agent that directly views and clicks a screen opens up repetitive tasks — web form entry, internal system operations, copy-paste data work — to automation. Legacy systems that were hard to reach via API suddenly become automation candidates once an agent can simply read the screen. For small teams stretched thin, that shift carries real weight.

But there's meaningful skepticism too. Because agents operate on live systems, their errors carry different consequences than a bad piece of generated text. A wrong sentence in a draft gets deleted; a wrong button click or a field filled with bad data changes system state. During early testing of OpenAI's Operator, unexpected screen transitions and unintended form submissions were reported. The more capable the agent, the deeper a single mistake can reach.

Security researchers flag additional concerns. An agent that reads and manipulates screens becomes a pathway to authentication credentials, personal data, and internal systems. Prompt injection — where an attacker plants content on screen to steer the agent's actions — is among the more prominent threat vectors. Choosing to deploy based solely on benchmark scores and pricing is, effectively, a choice not to examine these risks.

As the range of tasks an agent can handle grows, so does the blast radius of a malfunction. The unintended form submissions in Operator's early tests illustrate the point: the scope of permissions granted to an agent determines how far a single mistake can travel. The gap between teams that build control procedures in advance and those that don't becomes visible the moment something goes wrong.

Draw the boundaries before you hand the agent the mouse

Since computer use is currently available through APIs and enterprise platforms first, most teams can't deploy it to a live workflow today without a development environment. But the point at which this capability reaches no-code services and workflow automation platforms may arrive sooner than expected. There are decisions worth making before the technology is in reach.

Start by separating tasks where errors can be undone from those where they can't. Reversible work is a good fit for automation: drafting invoices, cleaning data sets, anything designed so a human reviews the result before it is finalized. For tasks where execution sends something into the world, such as contract delivery or customer notifications, build in a human confirmation step so the agent never has the final say.
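One way to picture this split is a small gate in front of the agent's actions. The sketch below is illustrative only: the `Action` type, the `execute` function, and the `approve` callback are hypothetical names, not part of any vendor's API, and a real approval step would prompt a person rather than call a lambda.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    name: str
    reversible: bool  # can a human undo this after the fact?


def execute(
    action: Action,
    run: Callable[[], str],
    approve: Callable[[Action], bool],
) -> str:
    """Run reversible actions directly; require human approval for the rest."""
    if not action.reversible and not approve(action):
        return f"blocked: {action.name}"
    return run()


# Stand-in for a human reviewer who rejects every request.
deny_all = lambda action: False

draft_result = execute(Action("draft_invoice", True), lambda: "draft saved", deny_all)
send_result = execute(Action("send_contract", False), lambda: "contract sent", deny_all)
print(draft_result)
print(send_result)
```

The draft goes through without review because it can be undone; the contract delivery is held until a person says yes.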

Decide in advance how much access the agent gets. Giving an agent full account permissions versus restricting it to a specific folder or system produces very different blast radii when something goes wrong. Narrower permissions make it easier to contain the damage from unexpected behavior.
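As a rough illustration of narrow scoping, the sketch below checks every path an agent wants to touch against an allowlist. The folder names and the `is_allowed` helper are hypothetical; in practice the restriction would more likely be enforced by account permissions than by application code.

```python
from pathlib import PurePosixPath

# Hypothetical: the only folder this agent is permitted to touch.
ALLOWED_ROOTS = (PurePosixPath("/shared/invoices/drafts"),)


def is_allowed(path: str) -> bool:
    """Return True only if the path sits inside an allowed root."""
    candidate = PurePosixPath(path)
    # Reject parent-directory hops outright instead of trying to resolve them.
    if ".." in candidate.parts:
        return False
    return any(
        candidate == root or root in candidate.parents
        for root in ALLOWED_ROOTS
    )


print(is_allowed("/shared/invoices/drafts/q3.csv"))
print(is_allowed("/shared/payroll/salaries.csv"))
```

With a check like this, a misdirected agent can damage at most the drafts folder rather than everything the account can reach.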

Know where you can review agent logs. If the tool you deploy doesn't record which screen the agent was on and what it clicked, tracing the root cause of a problem becomes difficult. Audit trails and log access are things to ask vendors about before signing any deployment agreement.
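For a sense of what a usable audit trail might contain, here is a minimal sketch that writes one JSON line per agent action. The field names, the `log_action` helper, and the sample screen and target values are assumptions for illustration; actual vendor log formats will differ.

```python
import io
import json
from datetime import datetime, timezone


def log_action(stream, screen: str, action: str, target: str) -> dict:
    """Append one JSON line recording where the agent was and what it did."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "screen": screen,
        "action": action,
        "target": target,
    }
    stream.write(json.dumps(record) + "\n")
    return record


# An in-memory buffer stands in for a log file here.
buf = io.StringIO()
log_action(buf, "Vendor portal / New invoice", "click", "button#save-draft")
log_action(buf, "Vendor portal / New invoice", "type", "input#amount")
print(buf.getvalue(), end="")
```

A record like this answers the two questions that matter after an incident: which screen the agent was on, and what it clicked or typed.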

These three checks don't depend on which model you choose. Google's computer use, OpenAI's Operator, Anthropic's screen-control feature — the same principles apply the moment an agent is touching a live system.

A price of one-third lowers the cost barrier to agent deployment. As that barrier drops, more teams run more experiments. More experiments mean more failure cases. The outcomes diverge for teams that start with procedures versus those that don't.

Before the agent picks up the mouse, decide where it's allowed to click.