Autonomous vehicles, robots, and other machine-learning systems have to sense their surroundings to navigate and operate in the real world. They are often guided by 3D sensors such as lidar, radar, and depth-sensing cameras, and use scene understanding technology to process the data these devices capture.
3D scene understanding is critical for object detection, human-centric understanding, and graphics. Though computer vision has made significant progress in areas such as mobile 3D object detection and transparent object detection, the number of tools that can be applied to 3D data is still limited.
To improve 3D scene understanding, Google has now developed TensorFlow 3D, a highly modular library that brings 3D deep learning capabilities to TensorFlow.
What is 3D Scene Understanding?
Current computer vision systems tell us only a little about an object's location in 3D space or about how agents such as robots interact with it. This is not enough to understand an environment fully. Recent research has therefore focused on obtaining a geometric understanding of the scene. Representing objects as they exist in the 3D world, rather than as a projection onto the image plane, helps in applications such as human-centric understanding, graphics, and object detection.
The newly introduced library provides a set of operations, loss functions, data processing tools, metrics, and models for developing, training, and deploying state-of-the-art 3D scene understanding models.
- For the training and evaluation of standard 3D scene understanding data sets, TF 3D offers unified dataset specification and configuration.
- It supports datasets such as Waymo Open, ScanNet, and Rio. Users can also convert other popular datasets, such as KITTI and nuScenes, to this specification and use them in the same way.
- TF 3D can be leveraged for different 3D deep learning research types like quick prototyping and deploying real-time inference systems.
Currently, TF 3D supports three pipelines:
3D Semantic Segmentation: The captured 3D data contains open space apart from the set of objects of interest. Since most of the 3D data is sparse, applying standard implementation of convolutions is computationally intensive and requires large memory space.
To overcome this, TF 3D uses submanifold sparse convolution to process sparse 3D data more efficiently. It uses a U-Net architecture to extract features for each voxel; the network consists of sparse convolution blocks with pooling and un-pooling operations. The implementation also uses various CUDA techniques, such as hashing, partitioning, and bit operations, to speed up computation.
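The key idea of submanifold sparse convolution can be illustrated with a toy NumPy sketch (this is a hypothetical helper for intuition, not the TF 3D or CUDA implementation): outputs are computed only at active voxels, and each neighborhood sum skips empty sites, so the sparsity pattern is preserved through the layer.

```python
import numpy as np

def submanifold_sparse_conv3d(coords, feats, weights):
    """Toy submanifold sparse 3D convolution over a 3x3x3 neighborhood.
    coords: (N, 3) int indices of active voxels; feats: (N, C_in);
    weights: (3, 3, 3, C_in, C_out) kernel."""
    index = {tuple(c): i for i, c in enumerate(coords)}  # hash map of active sites
    out = np.zeros((len(coords), weights.shape[-1]))
    for i, c in enumerate(coords):          # output only at active voxels
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    j = index.get((c[0] + dx, c[1] + dy, c[2] + dz))
                    if j is not None:       # only active neighbors contribute
                        out[i] += feats[j] @ weights[dx + 1, dy + 1, dz + 1]
    return out

# Two active voxels in an otherwise empty grid.
coords = np.array([[0, 0, 0], [0, 0, 1]])
feats = np.ones((2, 4))
weights = np.ones((3, 3, 3, 4, 2))
out = submanifold_sparse_conv3d(coords, feats, weights)
print(out.shape)  # (2, 2): one output per active voxel, never densified
```

A dense convolution over the same grid would touch every empty cell; here the cost scales with the number of occupied voxels, which is the point of the sparse formulation.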
The submanifold sparse convolutional network is applied in a 3D semantic segmentation model, which outputs a per-voxel semantic score. These scores can then be mapped back to the points to predict a semantic label for each point.
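The voxel-to-point mapping can be sketched as follows (a simplified illustration with made-up scores, not the TF 3D API): points are voxelized, the network scores each voxel, and the voxel's predicted label is copied back to every point that fell into it.

```python
import numpy as np

# Three points; the first two fall into the same voxel.
points = np.array([[0.10, 0.20, 0.10],
                   [0.15, 0.25, 0.12],
                   [0.90, 0.80, 0.70]])
voxel_size = 0.5
voxel_ids = np.floor(points / voxel_size).astype(int)           # (N, 3)
uniq, inverse = np.unique(voxel_ids, axis=0, return_inverse=True)

# Pretend per-voxel semantic scores from the segmentation head (2 classes).
voxel_scores = np.array([[0.9, 0.1],    # voxel holding the first two points
                         [0.2, 0.8]])   # voxel holding the last point
voxel_labels = voxel_scores.argmax(axis=1)

point_labels = voxel_labels[inverse]    # map voxel predictions back to points
print(point_labels)  # [0 0 1]
```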
3D Instance Segmentation: In addition to predicting semantics, it is crucial to group the voxels that belong to the same object. Here, instance embedding vectors map the voxels to an embedding space in which voxels from the same object instance are placed close together, whereas those from different objects are kept far apart. During inference, the model uses a greedy algorithm to pick one instance at a time and group voxels into segments based on the distance between their embeddings.
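The greedy grouping step can be illustrated with a minimal sketch (a hypothetical, simplified version of the inference procedure, with an assumed distance threshold): repeatedly take an unassigned voxel as a seed and assign every voxel whose embedding lies within the threshold to the same instance.

```python
import numpy as np

def greedy_group(embeddings, threshold=0.5):
    """Greedily cluster embedding vectors: each pass seeds a new instance
    and absorbs all unassigned voxels within `threshold` of the seed."""
    n = len(embeddings)
    labels = -np.ones(n, dtype=int)          # -1 means unassigned
    instance = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        dist = np.linalg.norm(embeddings - embeddings[seed], axis=1)
        labels[(dist < threshold) & (labels == -1)] = instance
        instance += 1
    return labels

# Two tight clusters in embedding space -> two predicted instances.
emb = np.array([[0.00, 0.00], [0.05, 0.00],
                [1.00, 1.00], [1.02, 0.98]])
print(greedy_group(emb))  # [0 0 1 1]
```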
3D Object Detection: The model for 3D object detection predicts per-voxel box size, center, and rotation matrices, along with object semantic scores. A box proposal mechanism then reduces the hundreds of thousands of per-voxel box predictions to a few accurate proposals. During training, box prediction and classification losses are applied to the per-voxel predictions. In particular, a dynamic box classification loss classifies a predicted box that strongly overlaps with the ground truth as positive and the others as negative.
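The dynamic positive/negative labeling can be sketched with a toy overlap check (a simplified illustration, not the TF 3D loss: it uses axis-aligned 3D boxes and an assumed IoU threshold of 0.5, ignoring the predicted rotations).

```python
import numpy as np

def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (center xyz, size xyz)."""
    a_min, a_max = box_a[:3] - box_a[3:] / 2, box_a[:3] + box_a[3:] / 2
    b_min, b_max = box_b[:3] - box_b[3:] / 2, box_b[:3] + box_b[3:] / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = np.prod(overlap)
    union = np.prod(box_a[3:]) + np.prod(box_b[3:]) - inter
    return inter / union

gt = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0])        # ground-truth box
preds = np.array([[0.1, 0.0, 0.0, 2.0, 2.0, 2.0],    # near-perfect overlap
                  [5.0, 5.0, 5.0, 2.0, 2.0, 2.0]])   # no overlap

# Dynamic labeling: boxes overlapping ground truth strongly enough are
# positives for the classification loss; everything else is negative.
labels = np.array([1 if iou_3d_axis_aligned(p, gt) > 0.5 else 0 for p in preds])
print(labels)  # [1 0]
```

Because the labels are recomputed from the current box predictions, which positives the classification loss sees shifts as the box regression improves during training.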