My research has focused on describing and learning the dynamics of the 3D world through the problem of Scene Flow. I strongly believe in The Bitter Lesson, but to me it’s clear that general embodied AI systems need a deep intuition for the 3D world in order to be robust and sample efficient. My focus on describing dynamics was driven by the lack of scalable, data-driven methods to learn this: while we had the self-supervised problem of next-frame prediction to learn structure, when I started there were no data-driven methods to learn motion, and I felt this was a critical gap.
We need scalable Scene Flow methods, i.e. methods that improve by adding more raw data and more parameters. When I started this project, scene flow methods were either feed-forward networks supervised with human annotations (or with labels from a synthetic dataset generator), or very expensive test-time optimization methods. Worse, almost all of these methods did not run on full-size point clouds; they would downsample the point cloud to 8,192 points instead of the 50,000+ points in the (ground removed!) full point clouds. This is a critical limitation, as it meant they were fundamentally unsuitable for detecting motion on all but the largest objects. This left us with only a couple of optimization and feed-forward baseline methods that even tried to seriously solve the full scene flow problem.
ZeroFlow is a very simple idea: distill one of the few (very) expensive optimization methods (Neural Scene Flow Prior) into one of the few feed-forward networks that could handle full-size point clouds (FastFlow3D). This was far more successful than we expected: ZeroFlow was state-of-the-art on the Argoverse 2 Self-Supervised Scene Flow Leaderboard (beating out its own optimization teacher!). It was also 1000x faster than the best optimization methods and 1000x cheaper to train than the human-supervised methods.
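To make the recipe concrete, here is a minimal sketch of the distillation loop. The names (`nsfp_optimize`, `student`) are illustrative, not the actual ZeroFlow code: the expensive teacher is run offline once over the raw data, and the fast feed-forward student is then trained on its pseudo-labels.

```python
# Minimal sketch of ZeroFlow-style distillation (illustrative names, not the
# actual ZeroFlow API). The expensive teacher runs offline once to produce
# pseudo-labels; the feed-forward student is then trained on them.
import torch

def generate_pseudolabels(dataset, nsfp_optimize):
    """Offline pass: run the slow optimization teacher (e.g. NSFP) per frame pair."""
    pseudolabels = []
    for pc_t, pc_t1 in dataset:                 # consecutive full-size point clouds
        flow = nsfp_optimize(pc_t, pc_t1)       # (N, 3) per-point flow; slow but unsupervised
        pseudolabels.append((pc_t, pc_t1, flow))
    return pseudolabels

def train_student(student, pseudolabels, epochs=50, lr=1e-4):
    """Supervise the fast feed-forward student (a FastFlow3D-style net) on pseudo-labels."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for pc_t, pc_t1, teacher_flow in pseudolabels:
            pred_flow = student(pc_t, pc_t1)    # (N, 3) predicted per-point flow
            loss = (pred_flow - teacher_flow).norm(dim=-1).mean()  # mean endpoint error
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```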
While conceptually simple, ZeroFlow had several important take-home messages:
After publishing ZeroFlow, we spent a long time looking at visualizations of its flow results to get a deeper understanding of its shortcomings. We realized that it (and all of the baselines) systematically failed to describe most small object motion (e.g. pedestrians). Worse, the standard metrics never revealed these systematic failures because, by construction, small objects make up a very small fraction of the total points in a point cloud, so their error contribution was reduced to a rounding error compared to that of large objects.
In order to properly quantify this failure, we proposed a new metric, Bucket Normalized Scene Flow, which reports error per class and normalizes these errors by point speed to give a percentage of motion described: a 0.5m/s error on a 0.5m/s walking pedestrian is clearly far worse than a 0.6m/s error on a 25m/s driving car.
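To illustrate the core idea (this is not the actual BucketedSceneFlowEval implementation, which also buckets by speed and handles static points separately), here is a simplified numpy sketch that reports, per class, the average fraction of each dynamic point's motion that the prediction got wrong:

```python
# Simplified sketch of the idea behind Bucket Normalized Scene Flow:
# report error per class, and normalize dynamic-point error by ground-truth
# speed so slow-moving small objects are not drowned out by cars.
import numpy as np

def bucket_normalized_errors(pred_flow, gt_flow, class_ids, dynamic_threshold=0.05):
    """pred_flow, gt_flow: (N, 3) per-point flow in meters per frame.
    class_ids: (N,) integer semantic class per point."""
    epe = np.linalg.norm(pred_flow - gt_flow, axis=-1)   # per-point endpoint error
    gt_speed = np.linalg.norm(gt_flow, axis=-1)
    results = {}
    for cls in np.unique(class_ids):
        mask = (class_ids == cls) & (gt_speed > dynamic_threshold)
        if not mask.any():
            continue
        # Normalize each dynamic point's error by its ground-truth speed,
        # i.e. "what fraction of this point's motion did we get wrong?"
        results[cls] = float(np.mean(epe[mask] / gt_speed[mask]))
    return results  # averaging over classes gives a mean normalized dynamic error
```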
To show that this wasn’t an impossible gap to close, we proposed a very simple and crude supervised baseline, TrackFlow, constructed by running an off-the-shelf 3D detector on each point cloud and then associating boxes across frames with a 3D Kalman filter to produce flow. Despite this crude construction, without any scene-flow-specific training, it was state-of-the-art by a slim margin on the old metrics and by an enormous margin on our new metric; it was the first method to describe more than 50% of pedestrian motion correctly (hence the name, I Can’t Believe It’s Not Scene Flow!).
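Here is a rough sketch of that construction, with the detector and the Kalman-filter tracker abstracted away and with an assumed `points_in_box` helper; it is illustrative, not the paper's code:

```python
# Rough sketch of the TrackFlow idea: given boxes already associated across
# frames, assign each point inside a box the rigid motion of that box.
import numpy as np

def flow_from_tracked_boxes(points_t, boxes_t, boxes_t1):
    """points_t: (N, 3) points at time t (ego-motion compensated).
    boxes_t / boxes_t1: dicts of track_id -> 4x4 box pose at t and t+1."""
    flow = np.zeros_like(points_t)              # points outside any box are treated as static
    for track_id, pose_t in boxes_t.items():
        if track_id not in boxes_t1:
            continue
        # Rigid transform that carries the box from its pose at t to its pose at t+1.
        motion = boxes_t1[track_id] @ np.linalg.inv(pose_t)
        inside = points_in_box(points_t, pose_t)            # boolean mask; helper assumed
        pts_h = np.c_[points_t[inside], np.ones(inside.sum())]  # homogeneous coordinates
        flow[inside] = (pts_h @ motion.T)[:, :3] - points_t[inside]
    return flow
```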
The key take-home messages were:
In order to push the field to close this gap, we hosted the Argoverse 2 2024 Scene Flow Challenge as part of the CVPR 2024 Workshop on Autonomous Driving. The goal was to minimize the mean normalized dynamic error of our new metric, Bucket Normalized Scene Flow, and the challenge featured both a supervised and an unsupervised track. The most surprising result was that the winning supervised method, Flow4D, halved the error compared to the next best method, our baseline TrackFlow, and it did so with a novel feed-forward architecture that was better able to learn general 3D motion cues, without any fancy training tricks like class rebalancing.
Our key take-home message was that feed-forward architecture choice was a critically underexplored aspect of scene flow, and ZeroFlow and other prior work clearly suffered from inferior network design.
Under our new metric from I Can’t Believe It’s Not Scene Flow!, it became clear that ZeroFlow’s poor performance was at least partially inherited from the systematic limitations of its teacher. This motivated the need for a high-quality offline optimization method that, even if expensive, could describe the motion of small objects well.
To do this, we proposed EulerFlow, a simple, unsupervised test-time optimization method that fits a neural flow volume to the entire sequence of point clouds. This full-sequence formulation, combined with multi-step optimization losses, results in extremely high quality unsupervised flow, allowing EulerFlow to take state-of-the-art on the Argoverse 2 2024 Scene Flow Challenge leaderboard, beating out all prior art, including all prior supervised methods. EulerFlow also displayed a number of emergent capabilities: it is able to extract long-tail, small object motion such as birds flying, and it is able to do 3D point tracking for objects across arbitrary time horizons using Euler integration.
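As a simplified illustration of the core idea (this is not EulerFlow's actual architecture or losses), a single coordinate network can represent flow for the whole sequence as a function of position and time, and Euler integration of that field yields long-horizon point tracks:

```python
# Simplified sketch: a coordinate network maps (position, time) -> flow for the
# whole sequence, and repeated Euler steps through it track points across frames.
import torch
import torch.nn as nn

class FlowField(nn.Module):
    """Maps (x, y, z, t) -> (dx, dy, dz), the estimated motion over one frame step."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, points, t):
        t_col = torch.full_like(points[:, :1], float(t))        # broadcast time to every point
        return self.net(torch.cat([points, t_col], dim=-1))

def euler_track(field, points, t_start, n_steps):
    """Track points across n_steps frames by repeatedly applying the flow field."""
    tracks = [points]
    for k in range(n_steps):
        points = points + field(points, t_start + k)            # one Euler step per frame
        tracks.append(points)
    return torch.stack(tracks)                                   # (n_steps + 1, N, 3)
```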
The key take-home messages:
When I started, there were no model zoos, and the open source codebases that were available were a mess. I sat down and wrote the ZeroFlow codebase from scratch, which then grew into SceneFlowZoo with several other baseline implementations.
As part of I Can’t Believe It’s Not Scene Flow! we also released a standalone dataloader and evaluation package, which we used as the basis of the Argoverse 2 2024 Scene Flow Challenge. This codebase, BucketedSceneFlowEval, is used by the model zoo, but it is deep learning library agnostic (it produces everything as numpy arrays) and is thinly wrapped by the SceneFlowZoo codebase.