Friday, March 6, 2026

D4RT: Teaching artificial intelligence to see the world in four dimensions


We present D4RT, a unified AI model for 4D scene reconstruction and tracking across space and time.

Every time we look at the world, we perform a remarkable feat of memory and anticipation. We see and understand things as they are at this moment, as they were a moment ago, and as they will be in the next moment. Our mental model of the world maintains a persistent representation of reality, and we employ this model to draw intuitive conclusions about the causal relationship between the past, present, and future.

To help machines see the world more like we do, we can equip them with cameras, but this only solves the input problem. To understand the signal, computers must solve a difficult inverse problem: take a video – a sequence of flat 2D projections – and recover the rich, volumetric 3D world in motion.

Today we present D4RT (Dynamic 4D Reconstruction and Tracking), a new AI model that unifies dynamic scene reconstruction and tracking in a single, efficient framework, bringing us closer to the next frontier of AI: complete perception of our dynamic reality.

The challenge of the fourth dimension

To understand a dynamic scene captured in 2D video, an AI model must track every pixel of every object moving through three dimensions of space and a fourth dimension of time. It must also decouple this motion from the camera’s own movement, maintaining a consistent representation even when objects move behind each other or leave the frame entirely. Traditionally, capturing this level of geometry and motion from 2D video requires computationally intensive processes or a collection of specialized AI models – some for depth, others for motion or camera pose – resulting in slow and fragmented reconstructions.
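To make the decoupling step concrete, here is a minimal sketch (not D4RT’s actual method) of why a shared world frame matters: a point that is static in the world still appears to move in camera coordinates whenever the camera moves, and applying the known camera pose removes that apparent motion. The function and pose values below are illustrative assumptions.

```python
import numpy as np

def cam_to_world(pose_w_c: np.ndarray, p_cam: np.ndarray) -> np.ndarray:
    """Apply a 4x4 world-from-camera pose to a 3D point in camera coordinates."""
    p_h = np.append(p_cam, 1.0)        # homogeneous coordinates
    return (pose_w_c @ p_h)[:3]

# Camera at frame 0: identity pose (sitting at the world origin).
pose0 = np.eye(4)
# Camera at frame 1: translated 1 m along +x.
pose1 = np.eye(4)
pose1[:3, 3] = [1.0, 0.0, 0.0]

# The same static world point observed from both camera positions.
p_cam0 = np.array([0.0, 0.0, 5.0])     # straight ahead at frame 0
p_cam1 = np.array([-1.0, 0.0, 5.0])    # shifted in camera coords at frame 1

w0 = cam_to_world(pose0, p_cam0)
w1 = cam_to_world(pose1, p_cam1)
assert np.allclose(w0, w1)             # world-frame motion is zero: only the camera moved
```

Any residual difference between the world-frame positions is genuine object motion, which is the quantity a 4D tracker must recover.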

D4RT’s streamlined architecture and novel query mechanism put it at the forefront of 4D reconstruction, while being up to 300 times more efficient than previous methods – fast enough for real-time applications in robotics, augmented reality and more.

How D4RT works: a query-driven approach

D4RT uses a unified encoder-decoder transformer architecture. The encoder first converts the input video into a compressed representation of the scene’s geometry and motion. Unlike older systems that used separate modules for different tasks, D4RT computes only what it needs, using a flexible query mechanism centered on one basic question:

“Where is a given pixel from the input video in 3D space, at any point in time, as seen from any chosen camera?”

Building on our previous work, the lightweight decoder then interrogates this representation to answer specific instances of that question. Because queries are independent, they can be processed in parallel on modern AI hardware. This makes D4RT extremely fast and scalable, whether it’s tracking just a few points or reconstructing an entire scene.
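The query interface described above can be sketched as follows. This is a toy stand-in, not D4RT’s actual API: the `Query` fields, the `decode` function, and its fixed linear map are all assumptions for illustration. The point is structural: each query names a pixel, a source frame, a query time, and a target camera, and because queries do not depend on one another they can be stacked into a batch and decoded in one parallel pass.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Query:
    u: float        # pixel column in the source frame
    v: float        # pixel row in the source frame
    t_src: int      # frame the pixel was observed in
    t_query: int    # time at which we want its 3D position
    cam_id: int     # camera whose viewpoint the answer is expressed in

def decode(latent: np.ndarray, queries: list[Query]) -> np.ndarray:
    """Toy decoder: maps a batch of independent queries to 3D points
    in a single vectorized call (one row per query)."""
    q = np.array([[q_.u, q_.v, q_.t_src, q_.t_query, q_.cam_id]
                  for q_ in queries], dtype=np.float32)    # (N, 5)
    # A real decoder would attend over `latent`; here we just push the
    # query features through a fixed linear map for illustration.
    w = np.ones((5, 3), dtype=np.float32)
    return q @ w                                            # (N, 3) points

latent = np.zeros((128,), dtype=np.float32)   # placeholder scene encoding
queries = [Query(10, 20, 0, 5, 0), Query(30, 40, 2, 5, 0)]
points = decode(latent, queries)
assert points.shape == (2, 3)   # one 3D point per query, decoded together
```

Because nothing in `decode` couples one row to another, tracking a handful of points and densely reconstructing every pixel are the same operation at different batch sizes, which is what makes the design scale.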
