FDM-1: The Model That Learns to Use Computers by Watching Humans

Standard Intelligence has unveiled FDM-1, a new “computer-action” model designed to learn how to use computers by watching video. Early demonstrations show it performing complex CAD modeling, identifying software bugs, and even driving a real car through San Francisco.

Key details

FDM-1 was trained on 11 million hours of screen recordings, a corpus roughly 550,000× larger than the biggest open dataset of its kind. Its core innovation is reverse-engineering user intent: the model infers which actions must have produced each visual frame, effectively learning workflows directly from observation.
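The idea of inferring actions from frames can be illustrated with a toy sketch. Everything below is a hypothetical simplification for intuition, not Standard Intelligence's actual pipeline: frames are reduced to a cursor position and a click flag, and a hand-written rule stands in for the learned model that would infer the action between each pair of frames.

```python
def infer_action(frame_before, frame_after):
    """Infer the mouse action that turns frame_before into frame_after.

    Frames are dicts with a 'cursor' (x, y) tuple and a 'clicked' flag.
    A real system would learn this mapping from pixels; this rule just
    illustrates the inverse-dynamics idea.
    """
    (x0, y0), (x1, y1) = frame_before["cursor"], frame_after["cursor"]
    if frame_after["clicked"] and not frame_before["clicked"]:
        return ("click", (x1, y1))
    if (x0, y0) != (x1, y1):
        return ("move", (x1 - x0, y1 - y0))
    return ("no_op", None)

# Labeling consecutive frame pairs turns a raw recording into
# (observation, action) pairs -- the supervision an action model needs.
recording = [
    {"cursor": (10, 10), "clicked": False},
    {"cursor": (40, 25), "clicked": False},
    {"cursor": (40, 25), "clicked": True},
]
labels = [infer_action(a, b) for a, b in zip(recording, recording[1:])]
# labels -> [("move", (30, 15)), ("click", (40, 25))]
```

The payoff is that passive video, which carries no action labels of its own, becomes training data once a model can fill in the missing actions.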

Unlike most current systems, FDM-1 can process nearly two hours of continuous screen activity in a single context window, giving it around 50× more visual context than existing models. This extended memory enables it to follow complex, multi-step processes without losing continuity.

Demonstrations highlight its versatility: the model can build mechanical gears in Blender, debug software environments, and even control a real vehicle using arrow-key inputs and live sensor feeds, all with under an hour of task-specific training data.

Why it matters

Language models learned to write by training on the internet’s text. FDM-1 applies the same principle to action: learning how humans work, design, and operate systems by training on video. If scalable, this approach dramatically expands the usable training corpus for computer-use agents, raising the ceiling on what autonomous software systems can learn to do.
